Add interim resource settings for UMAP 1.3.0, tsne 1.2.0, fast_tSNE 0.1.6 #334
Open
fnaji wants to merge 1 commit into
….1.6
Three operator versions are missing from the auto registry, which
causes the platform to fall back to opEnvs.last (the most recent
older version with a fit) and produce too-low memory estimates:
- umap_operator 1.3.0 tagged 2026-01-14, fell back to v1.2.5
- tsne_operator 1.2.0 tagged 2025-05-28, fell back to v1.1.0
- fast_tSNE_operator 0.1.6 — already in operator_resource_settings_custom.json
(custom takes precedence; this row mirrors it
for discoverability in the auto file)
These are interim values, not data-fitted (the daily fitter in
tirmo's resource-settings.yaml workflow has been disabled_manually
since 2025-04-28 — last successful run 2025-04-11). The rows here
use placeholder zeros for ratio_est / ratio_upper / r2 /
typical_duration / ratio_rounded; only the applied 'ratio' and
'base_memory' are meaningful. They will be overwritten on the next
fitter run.
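
To make the placeholder convention concrete, here is a minimal sketch of how an interim row could be told apart from a fitted one. The field names come from the text above; the dict layout, the unit of base_memory (bytes), and the helper name is_interim_row are assumptions for illustration, not the actual schema of the auto file.

```python
# Fit-derived fields that the interim rows set to zero (names from the commit text).
FIT_FIELDS = ("ratio_est", "ratio_upper", "r2", "typical_duration", "ratio_rounded")


def is_interim_row(row: dict) -> bool:
    """Return True when every fit-derived field is zero, i.e. nothing was fitted.

    Hypothetical helper: a missing field is treated the same as zero.
    """
    return all(row.get(field, 0) == 0 for field in FIT_FIELDS)


# Hypothetical layout of the interim UMAP 1.3.0 row; only 'ratio' and
# 'base_memory' carry meaning, everything else is a placeholder zero.
umap_1_3_0 = {
    "ratio": 10,
    "base_memory": 1_000_000_000,  # assumed unit: bytes (1 GB)
    "ratio_est": 0,
    "ratio_upper": 0,
    "r2": 0,
    "typical_duration": 0,
    "ratio_rounded": 0,
}

# For contrast, the tsne 1.1.0 values quoted in this PR (other fields omitted).
tsne_1_1_0 = {"ratio": 5, "ratio_upper": 16.25, "r2": 0.25}

print(is_interim_row(umap_1_3_0), is_interim_row(tsne_1_1_0))
```

A check like this would let the fitter, once revived, find and overwrite exactly the rows that were never fitted.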
Numbers chosen conservatively:
- UMAP 1.3.0: base 1 GB, ratio 10 (vs 1.2.5: base 500MB, ratio 7.4)
- tsne 1.2.0: base 1 GB, ratio 15 (1.1.0 had ratio_upper 16.25 but
was floored to 5 due to r2=0.25)
- fast_tSNE 0.1.6: matches existing custom override (5 GB, 10)
The linear formula limits how much these values can help: UMAP and
t-SNE both have super-linear memory profiles (kNN graph + embedding)
that a base+size*ratio model can't capture. The real fix is reviving
the fitter and/or moving to the model-based tier with a power-law
formula.
Summary
Add three rows to operator_resource_settings_auto.json for operator versions that have shipped but never been registered, so the platform stops falling back to stale opEnvs.last matches and starts giving them appropriate memory estimates.

Motivation
Investigation of an OOM on a zliu_sartorius UMAP run today (2026-04-25) showed the operator container launched with --memory 603M against a 5.56 M-row dataset and was killed with exit 137 ~11s in. Tracing the path:
- getOperatorEnv() falls back to opEnvs.last for the URI = v1.2.5 (base=500 MB, ratio=7.4).
- 500 MB + 21 × 7.4 ≈ 603 MB → exactly what we saw.

t-SNE has the same issue: user is on 1.2.0 (tagged 2025-05-28), no registry row, falls back to 1.1.0's applied ratio=5 (despite that version's fitted ratio_upper=16.25, which was flagged but never promoted because r²=0.25). Result: similar OOM pattern.

fast_tSNE_operator 0.1.6 is already in operator_resource_settings_custom.json (base=5 GB, ratio=10); custom takes precedence over auto, so the platform behavior is already correct for it. The auto entry here is purely for discoverability/consistency.

Why interim values, not data-fitted
The daily fitter (tercen/tirmo resource-settings.yaml workflow) has been disabled_manually since 2025-04-28, after 17 consecutive failures (last successful run: 2025-04-11). Until that's revived, no data-driven values can be produced. These rows are placeholders: ratio_est / ratio_upper / r² / typical_duration / ratio_rounded are all 0 to make it obvious nothing was fit; only ratio and base_memory are meaningful.

Values chosen
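- umap_operator 1.3.0: base_memory 1 GB, ratio 10. Version 1.2.5 had base 500 MB, ratio=7.4; 1.3.0 added operatorSpec and feature changes that may increase memory.
- tsne_operator 1.2.0: base_memory 1 GB, ratio 15. Version 1.1.0 had ratio_upper=16.25 (the fit said this was needed, but it was blocked from promotion because r²=0.25).
- fast_tSNE_operator 0.1.6: base_memory 5 GB, ratio 10, matching the existing custom override.

Numbers will be overwritten on the next tirmo run once the workflow is revived.

Limits of this fix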
The legacy formula base_memory + sizeInBytes × ratio is fundamentally the wrong shape for UMAP and t-SNE: both grow super-linearly in n (kNN graph + embedding optimization with copies). Bumping ratio and base_memory raises the floor, but on a sufficiently large dataset both will OOM again. The full fix requires either:
- moving to the model-based tier (memory_model.json or resource_model.json per operator with the power-law formula), or
- the sarno-based pipeline (already implements power-law fits with ontology-aware features in sarno_executor::resource_estimator/model_fit)

In the meantime, these rows reduce harm.
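
The shape mismatch can be illustrated with a small sketch. It compares the legacy linear estimate, using the old and the interim UMAP values from this PR, against a generic power-law curve. The power-law coefficients (a, b), the input sizes, and the assumption that base_memory is expressed in bytes are made up for illustration; they are not fitted results.

```python
GB = 1_000_000_000


def linear_estimate(size_in_bytes: float, base_memory: float, ratio: float) -> float:
    """Legacy formula: base_memory + sizeInBytes * ratio."""
    return base_memory + size_in_bytes * ratio


def power_law_estimate(size_in_bytes: float, a: float, b: float) -> float:
    """Generic power law a * size**b; grows super-linearly when b > 1."""
    return a * size_in_bytes ** b


# Hypothetical input sizes, from 1 MB to 1 GB.
sizes = [10**6, 10**7, 10**8, 10**9]

for size in sizes:
    old = linear_estimate(size, base_memory=0.5 * GB, ratio=7.4)  # v1.2.5 row
    new = linear_estimate(size, base_memory=1.0 * GB, ratio=10)   # interim 1.3.0 row
    need = power_law_estimate(size, a=0.5, b=1.2)                 # invented coefficients
    verdict = "covered" if new >= need else "would OOM"
    print(f"size={size:>10} old={old:.3e} new={new:.3e} need={need:.3e} {verdict}")
```

With any exponent above 1, the power-law curve eventually crosses the linear estimate no matter how large ratio and base_memory are set, which is why the interim rows only move the point at which an OOM occurs.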
Tests / verification
- … -0 → 0 adjustment in the file from JSON re-emission).
- getOperatorEnv() will now exact-match these versions instead of falling back to the wrong row.
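
The lookup behavior described above can be sketched as follows. This is a simplified model, not the platform's actual getOperatorEnv() implementation: the function name resolve_env, the row layout, the base_memory_mb key, and the assumption that rows are ordered oldest to newest are all hypothetical.

```python
from typing import Optional


def resolve_env(op_envs: list[dict], version: str) -> Optional[dict]:
    """Return the row matching `version` exactly, else fall back to the last row.

    Mirrors the described behavior: with no exact match, the most recent
    registered row (op_envs[-1]) is used, even if it belongs to an older version.
    """
    for env in op_envs:
        if env["version"] == version:
            return env
    return op_envs[-1] if op_envs else None


# Registry before this PR: umap_operator 1.3.0 has no row of its own.
before = [{"version": "1.2.5", "base_memory_mb": 500, "ratio": 7.4}]

# Registry after this PR: the interim 1.3.0 row is present.
after = before + [{"version": "1.3.0", "base_memory_mb": 1000, "ratio": 10}]

print(resolve_env(before, "1.3.0")["version"])  # falls back to the older row
print(resolve_env(after, "1.3.0")["version"])   # exact match
```

A check along these lines, run against the real registry, would confirm that each of the three versions resolves to its own row rather than to a fallback.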