Improve the performance of creating FairSchedulingAlgoContext #4807
JamesMurkin merged 10 commits into master
Conversation
When creating the fairshare scheduling context, we loop through all jobs in the jobdb and then filter out the ones relevant to the current pool.

We then use these jobs to calculate relevant info about the pool:
- Which jobs are on which nodes
- The demand of each queue (and subsequently the fairshare/adjusted fairshare)

This is simple but quite inefficient if you have a lot (millions) of jobs in the system and many pools (most of which only a small fraction of jobs interact with). As a result, most of the scheduling time can be taken up by this data shuffling.

Now, instead of loading all jobs, we'll load only the relevant jobs:
- All leased jobs (needed to calculate which jobs are on which nodes)
  - Possibly this could be further improved by storing them by node, but for now this is simple
- All queued jobs against the pools we are processing (current pool + away pools)

This should have a significant impact, especially in the case where we have one pool with most jobs queued against it and many smaller pools.

There are larger reworks we could do here to make it even more efficient, but for now this should give us most of the benefit for a minor change.
- An example would be to calculate the demand for each pool and the jobs on each node upfront, then incrementally update these as pools are processed rather than recalculating them entirely.

Signed-off-by: JamesMurkin <jamesmurkin@hotmail.com>
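The description above mentions combining the leased jobs with each pool's queued jobs and deduplicating the result. Below is a minimal sketch of that dedupe step in Go; the `Job` type and `uniqueByID` helper are hypothetical stand-ins, not the actual jobdb types:

```go
package main

import "fmt"

// Job is a minimal stand-in for the jobdb's job type (fields are assumptions).
type Job struct {
	ID     string
	Pool   string
	Queued bool
}

// uniqueByID mirrors the deduplication step: a job can appear both in the
// leased set and in a pool's queued set, so we keep the first occurrence
// of each job ID.
func uniqueByID(jobs []Job) []Job {
	seen := make(map[string]bool, len(jobs))
	out := make([]Job, 0, len(jobs))
	for _, j := range jobs {
		if seen[j.ID] {
			continue
		}
		seen[j.ID] = true
		out = append(out, j)
	}
	return out
}

func main() {
	leased := []Job{{ID: "a", Pool: "gpu"}, {ID: "b", Pool: "cpu"}}
	queued := []Job{{ID: "b", Pool: "cpu", Queued: true}, {ID: "c", Pool: "gpu", Queued: true}}
	all := uniqueByID(append(leased, queued...))
	fmt.Println(len(all)) // prints 3: a, b, c (duplicate b dropped)
}
```

Because the leased set is loaded once for all pools while queued jobs are loaded per pool, a leased-but-requeued job can show up in both lists, which is why this pass is needed.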
Greptile Summary

This PR significantly improves the performance of creating FairSchedulingAlgoContext.

Confidence Score: 5/5 — safe to merge. Both previously reported P0/P1 issues are resolved and only minor P2 suggestions remain. The two critical issues flagged in prior review threads (a `delete()` copy-paste bug assigning to the wrong field, and the oversized pre-allocation in `GetQueuedJobsByPool`) are both corrected in the current HEAD. The remaining findings are P2: a comment/implementation mismatch on terminal job scope, and a minor Go idiom preference for the seen-map type. Neither affects correctness.

Important Files Changed
- `scheduling_algo.go` — verify the `GetAllTerminalJobs()` scope is acceptable for your largest deployments
- `jobdb.go` — no issues remaining after prior fixes
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[newFairSchedulingAlgoContext] --> B[Build allPools\ncurrentPool + AwayPools + awayAllocationPools]
    B --> C[GetAllLeasedJobs\nAll pools - needed for node population]
    B --> D[GetAllTerminalJobs\nAll pools - filtered to currentPool downstream]
    B --> E[For each pool in allPools:\nGetQueuedJobsByPool]
    C --> F[Combine into allJobs]
    D --> F
    E --> F
    F --> G[UniqueBy job.Id\nDeduplicate]
    G --> H[calculateJobSchedulingInfo]
    H --> I{job.InTerminalState?}
    I -- Yes, apply shortJobPenalty\nif jobPool==currentPool --> J[shortJobPenaltyByQueue]
    I -- No --> K{job.Queued?}
    K -- No, has run --> L[jobsByPool / allocation]
    K -- Yes --> M[demand calculation\nif pool in job.Pools]
```
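The per-job branching in the flowchart (terminal vs. queued vs. leased) can be sketched as a small classifier. This is an illustrative Go sketch only; the `Job` fields and the `classify` helper are assumptions, not the scheduler's actual API:

```go
package main

import "fmt"

// Job is a minimal stand-in for the scheduler's job type (fields are assumptions).
type Job struct {
	ID              string
	Pool            string   // pool of the job's run, for leased/terminal jobs
	Pools           []string // pools a queued job may schedule into
	Queued          bool
	InTerminalState bool
}

// classify mirrors the flowchart's calculateJobSchedulingInfo branches for a
// single job, relative to the pool currently being processed.
func classify(j Job, currentPool string) string {
	switch {
	case j.InTerminalState:
		// Terminal jobs only matter for the short-job penalty of the current pool.
		if j.Pool == currentPool {
			return "shortJobPenalty"
		}
		return "ignored"
	case !j.Queued:
		// Leased job with a run: contributes to jobsByPool / node allocation.
		return "allocation"
	default:
		// Queued job: counts toward queue demand only if it targets this pool.
		for _, p := range j.Pools {
			if p == currentPool {
				return "demand"
			}
		}
		return "ignored"
	}
}

func main() {
	fmt.Println(classify(Job{ID: "a", Queued: false, Pool: "cpu"}, "cpu")) // prints allocation
}
```

The `ignored` branches are where the PR's savings come from: queued jobs against other pools are no longer loaded at all, rather than loaded and then filtered out.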