
Benchmark: azula sampler comparison on two_moons #1825

Open
JiwaniZakir wants to merge 2 commits into sbi-dev:main from JiwaniZakir:benchmark/azula-sampler-comparison

Conversation

@JiwaniZakir

Summary

Benchmark comparing azula's diffusion samplers (DDIM, DDPM, Euler, Heun) against sbi's default Euler-Maruyama on the two_moons task from mini-sbibm, for both VE and VP SDE types.

Building on the architectural design proposed by @satwiksps in #1468, this implements the score-to-denoiser bridge and addresses noise schedule alignment. Responds to @janfb's request for performance benchmarks before proceeding with the full azula integration.

Related to #1468

Approach

  • SBISchedule: Custom noise schedule adapter that delegates to sbi's mean_t_fn/std_fn, resolving the parameterization mismatch between sbi's VP-SDE (beta_min/beta_max) and azula's VPSchedule (alpha_min). Includes numerical validation showing the schedules match exactly.
  • SBIDenoiser: Adapter converting sbi's score output to azula's denoiser protocol via Tweedie's formula: θ̂₀ = (θₜ + σₜ² · score) / αₜ
  • Evaluation: C2ST + MMD against mini-sbibm reference posteriors, across 4 step counts, 3 seeds, and 3 observations
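The score-to-denoiser conversion in SBIDenoiser reduces to one application of Tweedie's formula. A minimal sketch of that step (the function name and signature are illustrative, not sbi's or azula's actual API):

```python
import torch

def tweedie_denoise(theta_t, score, mean_t, std_t):
    """Convert a score estimate into a denoised-sample estimate via
    Tweedie's formula: theta_hat_0 = (theta_t + std_t**2 * score) / mean_t,
    where mean_t is the signal scaling alpha_t and std_t the noise level sigma_t.
    """
    return (theta_t + std_t**2 * score) / mean_t
```

As a sanity check, plugging in the exact score of a known Gaussian perturbation recovers the clean sample exactly, which is the property the adapter relies on.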

Technical Note: Noise Schedule Mismatch

A key finding during implementation: sbi's VP-SDE and azula's VPSchedule use mathematically incompatible formulas for α(t):

  • sbi: α(t) = exp(−0.25·t²·(β_max − β_min) − 0.5·t·β_min), giving α(1) = 0.082
  • azula: α(t) = exp(t²·log(α_min)), giving α(1) = 0.001

Using azula's built-in VPSchedule directly would produce incorrect samples. The SBISchedule adapter resolves this by delegating to the score estimator's own mean_t_fn and std_fn, which is also the approach I'd recommend for the full integration. (For VE-SDE, both libraries use the same geometric sigma schedule, so this mismatch only affects VP/SubVP.)

Note: mean_t_fn (the raw signal scaling) must be used rather than approx_marginal_mean (which multiplies by mean_0). After z-scoring, mean_0 ≈ 0, so the latter collapses to zero and breaks the Tweedie division.
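To make the mismatch concrete, here is a small sketch of the two α(t) parameterizations side by side. The β_min/β_max/α_min values are placeholders for illustration (not sbi's or azula's actual defaults, so the α(1) numbers differ from those quoted above); the point is that the curves disagree for every t > 0:

```python
import math

def alpha_sbi_style(t, beta_min=0.1, beta_max=20.0):
    # sbi-style VP signal scaling: exp(-0.25 t^2 (beta_max - beta_min) - 0.5 t beta_min)
    # beta_min/beta_max are placeholder values for illustration.
    return math.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)

def alpha_azula_style(t, alpha_min=0.001):
    # azula-style VPSchedule: exp(t^2 log(alpha_min)), so alpha(1) = alpha_min exactly.
    return math.exp(t**2 * math.log(alpha_min))
```

Both start at α(0) = 1, but they decay at different rates, so swapping one schedule for the other silently changes the noise level the denoiser sees at every intermediate t.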

Results

VE-SDE (default)

| Sampler | Steps | NFE | C2ST (mean ± std) | Time (s) |
| --- | --- | --- | --- | --- |
| Euler-Maruyama (sbi) | 50 | 50 | 0.637 ± 0.018 | 0.16 |
| DDIM (η=0) | 50 | 50 | 0.627 ± 0.023 | 0.16 |
| DDPM | 50 | 50 | 0.614 ± 0.016 | 0.16 |
| Heun | 50 | 100 | 0.624 ± 0.022 | 0.32 |
| Euler-Maruyama (sbi) | 500 | 500 | 0.630 ± 0.015 | 1.32 |
| DDIM (η=0) | 500 | 500 | 0.620 ± 0.024 | 1.41 |
| DDPM | 500 | 500 | 0.628 ± 0.012 | 1.40 |
| Heun | 500 | 1000 | 0.625 ± 0.024 | 2.75 |

All samplers converge to similar quality for VE, confirming the adapter works correctly.

VP-SDE

| Sampler | Steps | NFE | C2ST (mean ± std) | Time (s) |
| --- | --- | --- | --- | --- |
| Euler-Maruyama (sbi) | 50 | 50 | 0.855 ± 0.007 | 0.15 |
| DDIM (η=0) | 50 | 50 | 0.801 ± 0.008 | 0.24 |
| DDPM | 50 | 50 | 0.787 ± 0.006 | 0.15 |
| Heun | 50 | 100 | 0.794 ± 0.021 | 0.36 |
| Euler-Maruyama (sbi) | 500 | 500 | 0.782 ± 0.007 | 1.88 |
| DDIM (η=0) | 500 | 500 | 0.787 ± 0.013 | 1.92 |
| DDPM | 500 | 500 | 0.774 ± 0.010 | 1.76 |
| Heun | 500 | 1000 | 0.792 ± 0.019 | 3.19 |

Key observations:

  • At low step counts (50), azula's DDPM and DDIM outperform Euler-Maruyama — C2ST of 0.787/0.801 vs 0.855 (lower is better, 0.5 = perfect).
  • Heun converges fastest per step (2nd-order accuracy) but at 2x NFE cost — at 50 steps (100 NFE) it matches DDIM at 100 steps (100 NFE).
  • All methods converge at high step counts, confirming schedule alignment is correct.
  • VP overall has higher C2ST than VE, consistent with VP being a harder optimization landscape for this task.

NFE = number of function (score) evaluations. Heun uses 2 per step; all others use 1. For fair efficiency comparisons, NFE is the relevant cost metric.

Reproducibility

```shell
pip install azula
python benchmarks/azula_sampler_benchmark.py --task two_moons --sde-types ve vp

# Quick run:
python benchmarks/azula_sampler_benchmark.py --sde-types ve --steps 50 100 --seeds 0
```

Test plan

  • Script runs end-to-end on CPU for both VE and VP
  • Schedule validation assertions pass (α/σ alignment < 1e-7)
  • All sampler configurations produce finite, correctly-shaped output
  • C2ST values converge at high step counts (confirms correctness)
  • ruff check + ruff format + pre-commit hooks all pass
  • No modifications to sbi core files — benchmark is fully self-contained
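The schedule-validation bullet above can be sketched as a simple grid comparison. The function name and the schedule callables are placeholders; the 1e-7 tolerance is the one stated in the test plan:

```python
import torch

def assert_schedules_aligned(alpha_a, alpha_b, sigma_a, sigma_b, tol=1e-7):
    """Check that two schedules agree on alpha(t) and sigma(t) over a time grid.

    alpha_* and sigma_* are callables mapping a tensor of times in [0, 1]
    to the corresponding signal scaling / noise level.
    """
    t = torch.linspace(0.0, 1.0, 101)
    assert torch.max(torch.abs(alpha_a(t) - alpha_b(t))) < tol, "alpha(t) mismatch"
    assert torch.max(torch.abs(sigma_a(t) - sigma_b(t))) < tol, "sigma(t) mismatch"
```

A check like this catches the VP parameterization mismatch described above immediately, since the two α(t) curves diverge well beyond 1e-7 at intermediate t.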

Zakir Jiwani added 2 commits March 24, 2026 16:17
Add benchmark script that trains NPSE on two_moons (mini-sbibm) and compares
sampling performance between sbi's default Euler-Maruyama predictor and azula's
DDIM, DDPM, Euler, and Heun samplers via a score-to-denoiser adapter.

Key components:
- SBISchedule: custom noise schedule adapter resolving the parameterization
  mismatch between sbi's VP-SDE (beta_min/beta_max) and azula's VPSchedule
  (alpha_min), with numerical validation
- SBIDenoiser: score-to-denoiser conversion via Tweedie's formula
- Evaluation: C2ST + MMD against mini-sbibm reference posteriors across
  multiple step counts, seeds, and observations

Related to sbi-dev#1468

- Replace sys.path.insert hack with importlib-based loader for mini_sbibm,
  registering the module in sys.modules so relative imports work without
  polluting the global path
- Add smoke test suite (tests/test_azula_benchmark.py) verifying schedule
  alignment, denoiser output shape, and finite samples from all 4 azula
  samplers — all 6 tests pass
@JiwaniZakir
Author

Just pushed a follow-up addressing a couple things I noticed after the initial commit:

Import cleanup — Replaced the sys.path.insert hack for loading mini_sbibm with an importlib-based loader that registers the module in sys.modules without polluting the global path. Feels less brittle if the repo structure changes.

Smoke tests — Added tests/test_azula_benchmark.py with 6 tests covering schedule alignment validation, denoiser output shape, and finite sample generation from all 4 azula samplers (DDIM, DDPM, Euler, Heun). All pass in ~1.5s. Marked @pytest.mark.slow so they don't run in standard CI unless opted in, and @pytest.mark.skipif so they're skipped gracefully if azula isn't installed.
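The marker setup described above would look roughly like this (the test name is illustrative; only the two markers and the import guard are the point):

```python
import importlib.util

import pytest

# Detect azula without importing it, so collection works when it's absent.
azula_installed = importlib.util.find_spec("azula") is not None

@pytest.mark.slow
@pytest.mark.skipif(not azula_installed, reason="azula is not installed")
def test_azula_samplers_produce_finite_samples():
    # Placeholder body: the real test would draw samples from each azula
    # sampler and assert they are finite and correctly shaped.
    ...
```

With this guard, standard CI skips the suite cleanly instead of failing on a missing optional dependency.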

A few notes on scope:

  • Results shown are for two_moons since that's what was suggested in #1468 (Adding interface to azula samplers). The script also supports --task slcp if results on a higher-dimensional task would be useful — happy to add those.
  • azula is only imported inside the benchmark script and test file (behind try/except), not added to pyproject.toml — keeping it out of core dependencies. If this lands, it could be added as an optional extra (sbi[azula]) during the full integration.
