Add interim resource settings for UMAP 1.3.0, tsne 1.2.0, fast_tSNE 0.1.6 #334

Open
fnaji wants to merge 1 commit into master from add-umap-1.3.0-and-tsne-1.2.0

fnaji commented Apr 27, 2026

Summary

Add three rows to operator_resource_settings_auto.json for operator versions that have shipped but were never registered, so the platform stops falling back to stale opEnvs.last matches and starts giving them appropriate memory estimates.

Motivation

Investigation of an OOM on a zliu_sartorius UMAP run today (2026-04-25) showed the operator container launched with --memory 603M against a 5.56 M-row dataset and was killed with exit 137 ~11s in. Tracing the path:

  1. The user's UMAP step is on v1.3.0 (tagged 2026-01-14).
  2. The auto registry has no row for 1.3.0, so getOperatorEnv() falls back to opEnvs.last for the URI, which is v1.2.5 (base=500 MB, ratio=7.4).
  3. Sized against the input's 21 MB logical size: 500 MB + 21 × 7.4 ≈ 655 MB with these rounded figures, in the same range as the 603M limit we saw.
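
The trace above can be reproduced with a small sketch. This is not the platform's code: `OpEnv`, `get_operator_env` and `legacy_estimate_mb` are hypothetical Python stand-ins for getOperatorEnv() and the legacy formula, and it assumes the registered rows are ordered oldest to newest so that the last element plays the role of opEnvs.last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpEnv:
    """One registry row (hypothetical shape; only the fields used here)."""
    version: str
    base_memory_mb: float
    ratio: float


def get_operator_env(op_envs, version):
    """Return the exact version match if registered, else the last row."""
    for env in op_envs:
        if env.version == version:
            return env
    # Assumption: rows are ordered oldest to newest, so this mirrors opEnvs.last.
    return op_envs[-1]


def legacy_estimate_mb(env, size_mb):
    """Legacy linear estimate: base_memory + size * ratio, all in MB."""
    return env.base_memory_mb + size_mb * env.ratio


# Only v1.2.5 is registered, so a v1.3.0 step silently gets its settings.
umap_envs = [OpEnv("1.2.5", 500, 7.4)]
env = get_operator_env(umap_envs, "1.3.0")
# Rounded inputs quoted in this PR; the container's actual limit was 603M.
print(env.version, legacy_estimate_mb(env, 21))
```

Appending `OpEnv("1.3.0", 1000, 10)` to `umap_envs` makes the lookup exact-match and, under the same assumptions, raises the estimate for this input to roughly 1.2 GB.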

t-SNE has the same issue: the user is on 1.2.0 (tagged 2025-05-28), which has no registry row, so it falls back to 1.1.0's applied ratio=5. That version's fitted ratio_upper was 16.25, but it was flagged and never promoted because r²=0.25. The result is a similar OOM pattern.

fast_tSNE_operator 0.1.6 is already in operator_resource_settings_custom.json (base=5 GB, ratio=10); custom takes precedence over auto, so the platform behavior is already correct for it. The auto entry here is purely for discoverability/consistency.

Why interim values, not data-fitted

The daily fitter (tercen/tirmo resource-settings.yaml workflow) has been disabled_manually since 2025-04-28, after 17 consecutive failures (last successful run: 2025-04-11). Until that's revived, no data-driven values can be produced. These rows are placeholders — ratio_est / ratio_upper / r² / typical_duration / ratio_rounded are all 0 to make it obvious nothing was fit; only ratio and base_memory are meaningful.

Values chosen

| operator | version | base_memory | ratio | rationale |
|---|---|---|---|---|
| umap_operator | 1.3.0 | 1 GB | 10 | small bump from 1.2.5's ratio=7.4; 1.3.0 added operatorSpec and feature changes that may increase memory |
| tsne_operator | 1.2.0 | 1 GB | 15 | between 1.1.0's floored applied-5 and its ratio_upper=16.25 (the fit said this was needed, but was r²=0.25-blocked from promotion) |
| fast_tSNE_operator | 0.1.6 | 5 GB | 10 | matches existing custom override; this row purely mirrors it |

Numbers will be overwritten on the next tirmo run once the workflow is revived.
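
As an illustration of the placeholder convention, the sketch below builds the three rows and shows how a consumer could tell an interim row from a fitted one. The field names come from this description; the `operator` and `version` keys, the helper names, and expressing base_memory in GB are assumptions, since the real file layout is not shown here.

```python
import json

# Fitted fields that are zeroed in interim rows (names taken from this PR).
FITTED_FIELDS = ("ratio_est", "ratio_upper", "r2", "typical_duration", "ratio_rounded")


def interim_row(operator, version, base_memory_gb, ratio):
    """Build a placeholder row: only ratio and base_memory carry meaning."""
    row = {
        "operator": operator,           # hypothetical key name
        "version": version,             # hypothetical key name
        "base_memory": base_memory_gb,  # assumption: GB, for readability only
        "ratio": ratio,
    }
    row.update({field: 0 for field in FITTED_FIELDS})
    return row


def is_placeholder(row):
    """True when every fitted field is zero, i.e. nothing was fit."""
    return all(row.get(field, 0) == 0 for field in FITTED_FIELDS)


rows = [
    interim_row("umap_operator", "1.3.0", 1, 10),
    interim_row("tsne_operator", "1.2.0", 1, 15),
    interim_row("fast_tSNE_operator", "0.1.6", 5, 10),
]
print(json.dumps(rows, indent=2))
```

A check like `is_placeholder` would let a dashboard or the revived fitter list the rows that still need real data.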

Limits of this fix

The legacy formula base_memory + sizeInBytes × ratio is fundamentally wrong-shape for UMAP and t-SNE — both grow super-linearly in n (kNN graph + embedding optimization with copies). Bumping ratio and base_memory raises the floor, but on a sufficiently large dataset both will OOM again. The full fix requires either:

  • The model-based tier (memory_model.json or resource_model.json per operator with the power-law formula), or
  • The new sarno-based pipeline (already implements power-law fits with ontology-aware features in sarno_executor::resource_estimator / model_fit)

In the meantime, these rows reduce harm.
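
To make the "raises the floor" point concrete, here is a rough sketch comparing a linear budget with a super-linear requirement. The power-law coefficient and exponent are invented for illustration and are not fitted values; the bytes-per-row figure is simply derived from the 21 MB / 5.56 M-row input quoted above.

```python
# Logical MB per row, derived from the input quoted in this PR.
MB_PER_ROW = 21 / 5.56e6


def linear_budget_mb(n_rows, base_mb, ratio):
    """Legacy budget: base_memory + size * ratio, with size proportional to rows."""
    return base_mb + n_rows * MB_PER_ROW * ratio


def power_law_need_mb(n_rows, coeff=1e-5, exponent=1.3):
    """Hypothetical super-linear requirement; coeff and exponent are made up."""
    return coeff * n_rows ** exponent


def crossover_rows(base_mb, ratio, start=1_000, limit=10**12):
    """Smallest n on a doubling grid where the need exceeds the budget."""
    n = start
    while n <= limit:
        if power_law_need_mb(n) > linear_budget_mb(n, base_mb, ratio):
            return n
        n *= 2
    return None


old = crossover_rows(base_mb=500, ratio=7.4)   # v1.2.5 settings
new = crossover_rows(base_mb=1000, ratio=10)   # interim v1.3.0 settings
print(old, new)
```

Under these assumptions the interim settings only push the crossover to a larger row count; they do not remove it, which is why a power-law model is still needed.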

Tests / verification

  • JSON schema preserved (added 4 fields per row matching existing rows; one cosmetic -00 adjustment in the file from JSON re-emission).
  • getOperatorEnv() will now exact-match these versions instead of falling back to the wrong row.
  • Per-step memory overrides still take precedence over the registry; this PR only changes the default if no override is set.
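
The precedence described above can be sketched as follows. This is a simplified model rather than the platform's implementation: the dictionary layout, the MB units, and `resolve_memory_mb` are assumptions, and the opEnvs.last fallback for unregistered versions is deliberately left out.

```python
# Hypothetical in-memory registries keyed by (operator, version); values in MB.
CUSTOM = {("fast_tSNE_operator", "0.1.6"): {"base_memory": 5000, "ratio": 10}}
AUTO = {
    ("umap_operator", "1.3.0"): {"base_memory": 1000, "ratio": 10},
    ("tsne_operator", "1.2.0"): {"base_memory": 1000, "ratio": 15},
    ("fast_tSNE_operator", "0.1.6"): {"base_memory": 5000, "ratio": 10},
}


def resolve_memory_mb(step_override_mb, custom, auto, operator, version, size_mb):
    """Return (source, memory_mb): per-step override, then custom, then auto."""
    if step_override_mb is not None:
        return "override", step_override_mb
    for source, registry in (("custom", custom), ("auto", auto)):
        row = registry.get((operator, version))
        if row is not None:
            return source, row["base_memory"] + size_mb * row["ratio"]
    # The real platform falls back to an older row here; omitted in this sketch.
    raise LookupError(f"no settings for {operator} {version}")


print(resolve_memory_mb(None, CUSTOM, AUTO, "umap_operator", "1.3.0", 21))
print(resolve_memory_mb(None, CUSTOM, AUTO, "fast_tSNE_operator", "0.1.6", 21))
print(resolve_memory_mb(2048, CUSTOM, AUTO, "umap_operator", "1.3.0", 21))
```

In this model the new auto rows only matter when neither a per-step override nor a custom row exists, which matches the claim that fast_tSNE_operator 0.1.6 behaves the same before and after this PR.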

….1.6

Three operator versions are missing from the auto registry, which
causes the platform to fall back to opEnvs.last (the most recent
older version with a fit) and produce too-low memory estimates:

- umap_operator 1.3.0   tagged 2026-01-14, fell back to v1.2.5
- tsne_operator 1.2.0   tagged 2025-05-28, fell back to v1.1.0
- fast_tSNE_operator 0.1.6 — already in operator_resource_settings_custom.json
                            (custom takes precedence; this row mirrors it
                            for discoverability in the auto file)

These are interim values, not data-fitted (the daily fitter in
tirmo's resource-settings.yaml workflow has been disabled_manually
since 2025-04-28 — last successful run 2025-04-11). The rows here
use placeholder zeros for ratio_est / ratio_upper / r2 /
typical_duration / ratio_rounded; only the applied 'ratio' and
'base_memory' are meaningful. They will be overwritten on the next
fitter run.

Numbers chosen conservatively:
- UMAP 1.3.0:  base 1 GB, ratio 10  (vs 1.2.5: base 500MB, ratio 7.4)
- tsne 1.2.0:  base 1 GB, ratio 15  (1.1.0 had ratio_upper 16.25 but
                                     was floored to 5 due to r2=0.25)
- fast_tSNE 0.1.6: matches existing custom override (5 GB, 10)

Linear formula limits the help these values can give: UMAP and
t-SNE both have super-linear memory profiles (kNN graph + embedding)
that a base+size*ratio model can't capture. Real fix is reviving
the fitter and/or moving to the model-based tier with a power-law
formula.