Agents caching by podkidyshev · Pull Request #837 · NVIDIA/cloudai

podkidyshev · 2026-03-16T18:49:55Z

Summary

Implement agents caching to not repeat the same sweeps

Test Plan

Automated CI
Manual runs with conf/common/test_scenario/dse_nccl_all_gather.toml on EOS

Additional Notes

Closes CLOUDAI-25

coderabbitai · 2026-03-16T18:50:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Added a TrajectoryEntry dataclass and per-iteration in-memory + CSV trajectory cache to CloudAIGymEnv; step now checks the cache for exact-action matches and write_trajectory accepts a TrajectoryEntry. Related imports, exports, SLURM integration, and tests updated to the new API.

Changes

Cohort / File(s)	Summary
Core Trajectory Caching System `src/cloudai/configurator/cloudai_gym.py`	Added public `TrajectoryEntry` dataclass; introduced `self.trajectory: dict[int, list[TrajectoryEntry]]`, `trajectory_file_path` property, `current_trajectory` property, `get_cached_trajectory_result(action)`, and `_values_match_exact(...)`; `step` now returns cached result when matched; `write_trajectory(self, entry: TrajectoryEntry)` appends to in-memory trajectory and serializes CSV.
Package Exports `src/cloudai/configurator/__init__.py`	Imported and exported `TrajectoryEntry`; updated `__all__` and copyright year.
SLURM Integration `src/cloudai/systems/slurm/single_sbatch_runner.py`	Imported `TrajectoryEntry` and updated DSE handling to construct and pass a `TrajectoryEntry` to `CloudAIGymEnv.write_trajectory()` (signature changed).
Tests — Unit & Integration `tests/test_cloudaigym.py`, `tests/test_handlers.py`	Added `test_get_cached_trajectory_result` (parametric cases) and `test_dse_run_cache` to verify caching behavior, CSV output, directory layout, and logging; updated imports to include `TrajectoryEntry` and added test scaffolding/mocks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibbled traces, row by row,
Stored each hop so repeats I know,
If action twinned, I skip the run,
Append the crumbs when work is done,
A tiny rabbit cache — quick as a bun.

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Agents caching' directly describes the main feature being implemented—agents caching to avoid repeating sweeps.
Description check	✅ Passed	The description explains the purpose (implement agents caching), includes a test plan, and references the closed issue, all relating to the changeset.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ipod/cloudai-25/agents-cache

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 120-134: The cache-hit branch in cloudai_gym.py uses
cached_result["source_step"] when calling write_trajectory, but for executed
entries the original executing step is stored as cached_result["step"], so
update the call in the cache path to pass cached_result["step"] as source_step;
locate the logic around get_cached_trajectory_result(...) and the
write_trajectory(...) invocation in the same block and replace the source_step
argument accordingly so cached entries correctly record the original executed
step.

In `@tests/test_cloudaigym.py`:
- Around line 304-323: The parametrized test data for the "trajectory" parameter
is missing the required TypedDict fields "step" and "source_step" from
TrajectoryEntry; update each trajectory dict in the pytest.mark.parametrize
block (the entries paired with variables "trajectory", "action", "expected") to
include valid "step" and "source_step" keys (use appropriate example values
consistent with other fields, e.g., integer step indices and a matching
source_step) so the test data matches the TrajectoryEntry TypedDict and avoids
type inconsistency.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9fc1dc33-ba1a-4ca7-a34d-d78581dda9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 61e16c7 and 2269e77.

📒 Files selected for processing (8)

src/cloudai/configurator/cloudai_gym.py
src/cloudai/reporter.py
src/cloudai/systems/slurm/single_sbatch_runner.py
tests/conftest.py
tests/test_cloudaigym.py
tests/test_handlers.py
tests/test_reporter.py
tests/test_single_sbatch_runner.py

💤 Files with no reviewable changes (1)

tests/test_reporter.py

src/cloudai/configurator/cloudai_gym.py

tests/test_cloudaigym.py

greptile-apps · 2026-03-16T19:33:09Z

Greptile Summary

This PR implements an in-memory trajectory cache for CloudAIGymEnv to avoid re-executing DSE sweep steps whose action has already been evaluated in the current iteration. When step() is called with an action that matches a previous entry in self.trajectory[current_iteration], the cached observation and reward are returned immediately, skipping the Slurm job submission entirely.

Key changes:

Introduces a frozen TrajectoryEntry dataclass to hold (step, action, reward, observation) tuples.
CloudAIGymEnv gains a trajectory: dict[int, list[TrajectoryEntry]] field, partitioned by current_iteration.
get_cached_trajectory_result() does a linear scan over the current iteration's entries using _values_match_exact(), a strict recursive type+value comparator that intentionally treats 1 (int) and 1.0 (float) as distinct values.
write_trajectory() is refactored to accept a TrajectoryEntry and updates both the in-memory list and the CSV file atomically.
single_sbatch_runner.handle_dse() is updated to construct TrajectoryEntry objects for the post-processing trajectory writes.
A new integration test (test_dse_run_cache) and unit tests for get_cached_trajectory_result are added, with good coverage of type-mismatch and iteration-isolation edge cases.

The cache is in-memory only and is not restored from the existing trajectory CSV on disk, so a restarted process will re-execute all steps. No issues were found with the cache lookup ordering or step-counter advancement logic.

Confidence Score: 4/5

Safe to merge; the caching logic is correct and well-tested, with only minor style concerns.
The core cache lookup, exact-match comparison, per-iteration partitioning, and CSV write path are all correct and are covered by the new tests. The two flagged items are style/design concerns (frozen=True with unhashable fields, and lack of disk persistence) rather than functional bugs. Existing tests continue to pass.
src/cloudai/configurator/cloudai_gym.py — frozen=True with mutable/unhashable dict and list fields on TrajectoryEntry.

Important Files Changed

Filename	Overview
src/cloudai/configurator/cloudai_gym.py	Core caching logic: introduces TrajectoryEntry dataclass and per-iteration in-memory cache; cache lookup happens before constraint check and job execution in step(); _values_match_exact performs strict type-aware recursive comparison. Minor design smell: frozen=True with unhashable dict/list fields.
src/cloudai/systems/slurm/single_sbatch_runner.py	Updated handle_dse() to construct a TrajectoryEntry object instead of passing individual arguments to write_trajectory(); straightforward refactor with no logic changes.
tests/test_handlers.py	Adds test_dse_run_cache integration test that verifies: duplicate actions in a sweep trigger the cache (runner.run called only twice for 3 steps), the trajectory CSV contains only the 2 unique steps, and the reporter surfaces only those 2 steps.
tests/test_cloudaigym.py	Adds parametrised unit tests for get_cached_trajectory_result covering: empty trajectory, exact match, type mismatch (int vs float), multiple entries, and iteration isolation.
src/cloudai/configurator/init.py	Exports TrajectoryEntry from the configurator package; trivial change.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant CloudAIGymEnv
    participant TrajectoryCache as trajectory dict[iter→list]
    participant Runner
    participant CSV as trajectory.csv

    Agent->>CloudAIGymEnv: step(action)
    CloudAIGymEnv->>CloudAIGymEnv: apply_params_set(action)
    CloudAIGymEnv->>TrajectoryCache: get_cached_trajectory_result(action)
    alt Cache HIT (same action already in current_trajectory)
        TrajectoryCache-->>CloudAIGymEnv: TrajectoryEntry(obs, reward)
        CloudAIGymEnv-->>Agent: (obs, reward, done=False, {}) — no job run
    else Cache MISS
        CloudAIGymEnv->>CloudAIGymEnv: constraint_check()
        alt Constraint FAIL
            CloudAIGymEnv-->>Agent: ([-1.0], -1.0, done=True, {})
        else Constraint PASS
            CloudAIGymEnv->>Runner: run()
            Runner-->>CloudAIGymEnv: job completed
            CloudAIGymEnv->>CloudAIGymEnv: get_observation() + compute_reward()
            CloudAIGymEnv->>TrajectoryCache: append TrajectoryEntry
            CloudAIGymEnv->>CSV: write row (step, action, reward, obs)
            CloudAIGymEnv-->>Agent: (obs, reward, done=False, {})
        end
    end

_{Last reviewed commit: "support iteration fe..."}

src/cloudai/configurator/cloudai_gym.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 263-267: The CSV writing is fragile because the header uses
TrajectoryEntry.__annotations__.keys() while the row uses entry.values(),
relying on dict ordering; change the write logic to compute the header_keys =
list(TrajectoryEntry.__annotations__.keys()) and always write rows by iterating
those keys (e.g., writer.writerow([entry.get(k, "") for k in header_keys]))
instead of writer.writerow(list(entry.values())), ensuring consistent column
order when writing in the method that opens self.trajectory_file_path.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dbd29a05-b26c-486e-9078-e2e76c5a1963

📥 Commits

Reviewing files that changed from the base of the PR and between 2269e77 and e10bbe9.

📒 Files selected for processing (2)

src/cloudai/configurator/cloudai_gym.py
tests/test_handlers.py

src/cloudai/configurator/cloudai_gym.py

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Line 60: The in-memory cache self.trajectory is global across the object while
trajectory_file_path is per-iteration (keyed by
self.test_run.current_iteration), allowing cross-iteration reuse and mixing of
CSVs; update get_cached_trajectory_result and related logic to scope cached
entries by iteration (e.g., map from current_iteration -> list[TrajectoryEntry])
or clear/replace self.trajectory at iteration boundary in reset(), and ensure
append logic that writes to trajectory_file_path uses the iteration-scoped
cache; also apply the same scoping/clearing changes to the code handling lines
around the other occurrence (the 233-242 block) so no stale entries are reused
across iterations.
- Around line 120-126: The cached TrajectoryEntry exposes mutable fields
(action, observation) that callers may mutate; update
get_cached_trajectory_result usage in the code path that returns cached_result
to return a deep copy of cached_result.observation (use copy.deepcopy) and
similarly modify write_trajectory to deepcopy the entry.action and
entry.observation before storing into self.trajectory so stored objects are
immutable from callers; reference the TrajectoryEntry type, the
get_cached_trajectory_result return site, and the write_trajectory storage logic
and import/use copy.deepcopy accordingly.

In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 215-222: handle_dse() is writing trajectory rows for every
combination in tr.all_combinations even if they were filtered out by
unroll_dse(); update the loop that calls
gym.write_trajectory(TrajectoryEntry(...)) to skip combinations that fail the
same constraint_check() used in unroll_dse() (or check membership against the
filtered list produced by unroll_dse()), so only combinations that actually ran
produce observation/reward rows; reference gym.write_trajectory,
TrajectoryEntry, unroll_dse(), handle_dse(), tr.all_combinations, and
constraint_check() when locating and applying the guard.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a3de5944-b2fd-465c-9fdd-94ff858e26ab

📥 Commits

Reviewing files that changed from the base of the PR and between e10bbe9 and 79410b0.

📒 Files selected for processing (5)

src/cloudai/configurator/__init__.py
src/cloudai/configurator/cloudai_gym.py
src/cloudai/systems/slurm/single_sbatch_runner.py
tests/test_cloudaigym.py
tests/test_handlers.py

src/cloudai/configurator/cloudai_gym.py

src/cloudai/systems/slurm/single_sbatch_runner.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 248-272: The method _values_match_exact has unnecessary elifs
after earlier returns; simplify control flow by replacing those elif branches
with plain if statements (keep the same type checks and logic for dict handling,
list/tuple handling—including the zip(..., strict=True) loop—and the final
equality fallback) so behavior does not change but the function flow is cleaner
and satisfies the RET505 hint.
- Around line 30-38: The TrajectoryEntry dataclass's observation field is
annotated as a bare list; change it to a parametrized type (e.g., list[Any] or
Sequence[Any]) to improve typing and IDE support: update the annotation for
observation in TrajectoryEntry and add/import the appropriate typing symbol (Any
or Sequence) from typing at the top of the module so the code compiles and
type-checks correctly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0b13d3b8-abbd-4615-af08-c4c510ee71e3

📥 Commits

Reviewing files that changed from the base of the PR and between 199385c and 0f55238.

📒 Files selected for processing (2)

src/cloudai/configurator/cloudai_gym.py
tests/test_cloudaigym.py

src/cloudai/configurator/cloudai_gym.py

podkidyshev added 3 commits March 14, 2026 20:53

agents cache

4691d9d

better float comparison

0e37b6e

manage trajectory better and in reports

4174ab5

simplify code

2269e77

podkidyshev changed the title ~~Ipod/cloudai 25/agents cache~~ Agents caching Mar 16, 2026

podkidyshev marked this pull request as ready for review March 16, 2026 19:25

podkidyshev requested review from amaslenn, jeffnvidia and srivatsankrishnan as code owners March 16, 2026 19:25

coderabbitai bot reviewed Mar 16, 2026

View reviewed changes

src/cloudai/configurator/cloudai_gym.py Outdated Show resolved Hide resolved

tests/test_cloudaigym.py Show resolved Hide resolved

greptile-apps bot reviewed Mar 16, 2026

View reviewed changes

src/cloudai/configurator/cloudai_gym.py Outdated Show resolved Hide resolved

fix bug with source_step

e10bbe9

coderabbitai bot reviewed Mar 17, 2026

View reviewed changes

src/cloudai/configurator/cloudai_gym.py Outdated Show resolved Hide resolved

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

src/cloudai/configurator/cloudai_gym.py Show resolved Hide resolved

podkidyshev added 2 commits March 17, 2026 20:06

revert certain changes

79410b0

fix copyright

199385c

coderabbitai bot reviewed Mar 17, 2026

View reviewed changes

src/cloudai/configurator/cloudai_gym.py Outdated Show resolved Hide resolved

src/cloudai/configurator/cloudai_gym.py Show resolved Hide resolved

src/cloudai/systems/slurm/single_sbatch_runner.py Show resolved Hide resolved

amaslenn previously approved these changes Mar 18, 2026

View reviewed changes

support iteration feature

0f55238

podkidyshev dismissed amaslenn’s stale review via 0f55238 March 18, 2026 10:50

coderabbitai bot reviewed Mar 18, 2026

View reviewed changes

src/cloudai/configurator/cloudai_gym.py Show resolved Hide resolved

src/cloudai/configurator/cloudai_gym.py Show resolved Hide resolved

amaslenn approved these changes Mar 18, 2026

View reviewed changes

podkidyshev merged commit 1f063ab into main Mar 18, 2026
5 checks passed

podkidyshev deleted the ipod/cloudai-25/agents-cache branch March 18, 2026 11:22

Conversation

podkidyshev commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

podkidyshev commented Mar 16, 2026 •

edited

Loading

coderabbitai bot commented Mar 16, 2026 •

edited

Loading

greptile-apps bot commented Mar 16, 2026 •

edited

Loading