Skip to content

Agents caching#837

Merged
podkidyshev merged 8 commits intomainfrom
ipod/cloudai-25/agents-cache
Mar 18, 2026
Merged

Agents caching#837
podkidyshev merged 8 commits intomainfrom
ipod/cloudai-25/agents-cache

Conversation

@podkidyshev
Copy link
Contributor

@podkidyshev podkidyshev commented Mar 16, 2026

Summary

Implement agents caching to not repeat the same sweeps

Test Plan

  • Automated CI
  • Manual runs with conf/common/test_scenario/dse_nccl_all_gather.toml on EOS

Additional Notes

Closes CLOUDAI-25

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Added a TrajectoryEntry dataclass and per-iteration in-memory + CSV trajectory cache to CloudAIGymEnv; step now checks the cache for exact-action matches and write_trajectory accepts a TrajectoryEntry. Related imports, exports, SLURM integration, and tests updated to the new API.

Changes

Cohort / File(s) Summary
Core Trajectory Caching System
src/cloudai/configurator/cloudai_gym.py
Added public TrajectoryEntry dataclass; introduced self.trajectory: dict[int, list[TrajectoryEntry]], trajectory_file_path property, current_trajectory property, get_cached_trajectory_result(action), and _values_match_exact(...); step now returns cached result when matched; write_trajectory(self, entry: TrajectoryEntry) appends to in-memory trajectory and serializes CSV.
Package Exports
src/cloudai/configurator/__init__.py
Imported and exported TrajectoryEntry; updated __all__ and copyright year.
SLURM Integration
src/cloudai/systems/slurm/single_sbatch_runner.py
Imported TrajectoryEntry and updated DSE handling to construct and pass a TrajectoryEntry to CloudAIGymEnv.write_trajectory() (signature changed).
Tests — Unit & Integration
tests/test_cloudaigym.py, tests/test_handlers.py
Added test_get_cached_trajectory_result (parametric cases) and test_dse_run_cache to verify caching behavior, CSV output, directory layout, and logging; updated imports to include TrajectoryEntry and added test scaffolding/mocks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibbled traces, row by row,
Stored each hop so repeats I know,
If action twinned, I skip the run,
Append the crumbs when work is done,
A tiny rabbit cache — quick as a bun.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Agents caching' directly describes the main feature being implemented—agents caching to avoid repeating sweeps.
Description check ✅ Passed The description explains the purpose (implement agents caching), includes a test plan, and references the closed issue, all relating to the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch ipod/cloudai-25/agents-cache
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@podkidyshev podkidyshev changed the title Ipod/cloudai 25/agents cache Agents caching Mar 16, 2026
@podkidyshev podkidyshev marked this pull request as ready for review March 16, 2026 19:25
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 120-134: The cache-hit branch in cloudai_gym.py uses
cached_result["source_step"] when calling write_trajectory, but for executed
entries the original executing step is stored as cached_result["step"], so
update the call in the cache path to pass cached_result["step"] as source_step;
locate the logic around get_cached_trajectory_result(...) and the
write_trajectory(...) invocation in the same block and replace the source_step
argument accordingly so cached entries correctly record the original executed
step.

In `@tests/test_cloudaigym.py`:
- Around line 304-323: The parametrized test data for the "trajectory" parameter
is missing the required TypedDict fields "step" and "source_step" from
TrajectoryEntry; update each trajectory dict in the pytest.mark.parametrize
block (the entries paired with variables "trajectory", "action", "expected") to
include valid "step" and "source_step" keys (use appropriate example values
consistent with other fields, e.g., integer step indices and a matching
source_step) so the test data matches the TrajectoryEntry TypedDict and avoids
type inconsistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9fc1dc33-ba1a-4ca7-a34d-d78581dda9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 61e16c7 and 2269e77.

📒 Files selected for processing (8)
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/reporter.py
  • src/cloudai/systems/slurm/single_sbatch_runner.py
  • tests/conftest.py
  • tests/test_cloudaigym.py
  • tests/test_handlers.py
  • tests/test_reporter.py
  • tests/test_single_sbatch_runner.py
💤 Files with no reviewable changes (1)
  • tests/test_reporter.py

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 16, 2026

Greptile Summary

This PR implements an in-memory trajectory cache for CloudAIGymEnv to avoid re-executing DSE sweep steps whose action has already been evaluated in the current iteration. When step() is called with an action that matches a previous entry in self.trajectory[current_iteration], the cached observation and reward are returned immediately, skipping the Slurm job submission entirely.

Key changes:

  • Introduces a frozen TrajectoryEntry dataclass to hold (step, action, reward, observation) tuples.
  • CloudAIGymEnv gains a trajectory: dict[int, list[TrajectoryEntry]] field, partitioned by current_iteration.
  • get_cached_trajectory_result() does a linear scan over the current iteration's entries using _values_match_exact(), a strict recursive type+value comparator that intentionally treats 1 (int) and 1.0 (float) as distinct values.
  • write_trajectory() is refactored to accept a TrajectoryEntry and updates both the in-memory list and the CSV file atomically.
  • single_sbatch_runner.handle_dse() is updated to construct TrajectoryEntry objects for the post-processing trajectory writes.
  • A new integration test (test_dse_run_cache) and unit tests for get_cached_trajectory_result are added, with good coverage of type-mismatch and iteration-isolation edge cases.

The cache is in-memory only and is not restored from the existing trajectory CSV on disk, so a restarted process will re-execute all steps. No issues were found with the cache lookup ordering or step-counter advancement logic.

Confidence Score: 4/5

  • Safe to merge; the caching logic is correct and well-tested, with only minor style concerns.
  • The core cache lookup, exact-match comparison, per-iteration partitioning, and CSV write path are all correct and are covered by the new tests. The two flagged items are style/design concerns (frozen=True with unhashable fields, and lack of disk persistence) rather than functional bugs. Existing tests continue to pass.
  • src/cloudai/configurator/cloudai_gym.py — frozen=True with mutable/unhashable dict and list fields on TrajectoryEntry.

Important Files Changed

Filename Overview
src/cloudai/configurator/cloudai_gym.py Core caching logic: introduces TrajectoryEntry dataclass and per-iteration in-memory cache; cache lookup happens before constraint check and job execution in step(); _values_match_exact performs strict type-aware recursive comparison. Minor design smell: frozen=True with unhashable dict/list fields.
src/cloudai/systems/slurm/single_sbatch_runner.py Updated handle_dse() to construct a TrajectoryEntry object instead of passing individual arguments to write_trajectory(); straightforward refactor with no logic changes.
tests/test_handlers.py Adds test_dse_run_cache integration test that verifies: duplicate actions in a sweep trigger the cache (runner.run called only twice for 3 steps), the trajectory CSV contains only the 2 unique steps, and the reporter surfaces only those 2 steps.
tests/test_cloudaigym.py Adds parametrised unit tests for get_cached_trajectory_result covering: empty trajectory, exact match, type mismatch (int vs float), multiple entries, and iteration isolation.
src/cloudai/configurator/init.py Exports TrajectoryEntry from the configurator package; trivial change.

Sequence Diagram

sequenceDiagram
    participant Agent
    participant CloudAIGymEnv
    participant TrajectoryCache as trajectory dict[iter→list]
    participant Runner
    participant CSV as trajectory.csv

    Agent->>CloudAIGymEnv: step(action)
    CloudAIGymEnv->>CloudAIGymEnv: apply_params_set(action)
    CloudAIGymEnv->>TrajectoryCache: get_cached_trajectory_result(action)
    alt Cache HIT (same action already in current_trajectory)
        TrajectoryCache-->>CloudAIGymEnv: TrajectoryEntry(obs, reward)
        CloudAIGymEnv-->>Agent: (obs, reward, done=False, {}) — no job run
    else Cache MISS
        CloudAIGymEnv->>CloudAIGymEnv: constraint_check()
        alt Constraint FAIL
            CloudAIGymEnv-->>Agent: ([-1.0], -1.0, done=True, {})
        else Constraint PASS
            CloudAIGymEnv->>Runner: run()
            Runner-->>CloudAIGymEnv: job completed
            CloudAIGymEnv->>CloudAIGymEnv: get_observation() + compute_reward()
            CloudAIGymEnv->>TrajectoryCache: append TrajectoryEntry
            CloudAIGymEnv->>CSV: write row (step, action, reward, obs)
            CloudAIGymEnv-->>Agent: (obs, reward, done=False, {})
        end
    end
Loading

Last reviewed commit: "support iteration fe..."

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 263-267: The CSV writing is fragile because the header uses
TrajectoryEntry.__annotations__.keys() while the row uses entry.values(),
relying on dict ordering; change the write logic to compute the header_keys =
list(TrajectoryEntry.__annotations__.keys()) and always write rows by iterating
those keys (e.g., writer.writerow([entry.get(k, "") for k in header_keys]))
instead of writer.writerow(list(entry.values())), ensuring consistent column
order when writing in the method that opens self.trajectory_file_path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dbd29a05-b26c-486e-9078-e2e76c5a1963

📥 Commits

Reviewing files that changed from the base of the PR and between 2269e77 and e10bbe9.

📒 Files selected for processing (2)
  • src/cloudai/configurator/cloudai_gym.py
  • tests/test_handlers.py

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Line 60: The in-memory cache self.trajectory is global across the object while
trajectory_file_path is per-iteration (keyed by
self.test_run.current_iteration), allowing cross-iteration reuse and mixing of
CSVs; update get_cached_trajectory_result and related logic to scope cached
entries by iteration (e.g., map from current_iteration -> list[TrajectoryEntry])
or clear/replace self.trajectory at iteration boundary in reset(), and ensure
append logic that writes to trajectory_file_path uses the iteration-scoped
cache; also apply the same scoping/clearing changes to the code handling lines
around the other occurrence (the 233-242 block) so no stale entries are reused
across iterations.
- Around line 120-126: The cached TrajectoryEntry exposes mutable fields
(action, observation) that callers may mutate; update
get_cached_trajectory_result usage in the code path that returns cached_result
to return a deep copy of cached_result.observation (use copy.deepcopy) and
similarly modify write_trajectory to deepcopy the entry.action and
entry.observation before storing into self.trajectory so stored objects are
immutable from callers; reference the TrajectoryEntry type, the
get_cached_trajectory_result return site, and the write_trajectory storage logic
and import/use copy.deepcopy accordingly.

In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 215-222: handle_dse() is writing trajectory rows for every
combination in tr.all_combinations even if they were filtered out by
unroll_dse(); update the loop that calls
gym.write_trajectory(TrajectoryEntry(...)) to skip combinations that fail the
same constraint_check() used in unroll_dse() (or check membership against the
filtered list produced by unroll_dse()), so only combinations that actually ran
produce observation/reward rows; reference gym.write_trajectory,
TrajectoryEntry, unroll_dse(), handle_dse(), tr.all_combinations, and
constraint_check() when locating and applying the guard.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a3de5944-b2fd-465c-9fdd-94ff858e26ab

📥 Commits

Reviewing files that changed from the base of the PR and between e10bbe9 and 79410b0.

📒 Files selected for processing (5)
  • src/cloudai/configurator/__init__.py
  • src/cloudai/configurator/cloudai_gym.py
  • src/cloudai/systems/slurm/single_sbatch_runner.py
  • tests/test_cloudaigym.py
  • tests/test_handlers.py

amaslenn
amaslenn previously approved these changes Mar 18, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/configurator/cloudai_gym.py`:
- Around line 248-272: The method _values_match_exact has unnecessary elifs
after earlier returns; simplify control flow by replacing those elif branches
with plain if statements (keep the same type checks and logic for dict handling,
list/tuple handling—including the zip(..., strict=True) loop—and the final
equality fallback) so behavior does not change but the function flow is cleaner
and satisfies the RET505 hint.
- Around line 30-38: The TrajectoryEntry dataclass's observation field is
annotated as a bare list; change it to a parametrized type (e.g., list[Any] or
Sequence[Any]) to improve typing and IDE support: update the annotation for
observation in TrajectoryEntry and add/import the appropriate typing symbol (Any
or Sequence) from typing at the top of the module so the code compiles and
type-checks correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0b13d3b8-abbd-4615-af08-c4c510ee71e3

📥 Commits

Reviewing files that changed from the base of the PR and between 199385c and 0f55238.

📒 Files selected for processing (2)
  • src/cloudai/configurator/cloudai_gym.py
  • tests/test_cloudaigym.py

@podkidyshev podkidyshev merged commit 1f063ab into main Mar 18, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/cloudai-25/agents-cache branch March 18, 2026 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants