feat(scheduler): optimize worker scheduling to O(1) using Valkey/Redi… by MushuEE · Pull Request #13 · agent-substrate/substrate

Grant McCloskey (MushuEE) · 2026-05-20T18:55:30Z

Description

Addresses #12

This PR optimizes Substrate's critical scheduling path by replacing the legacy $O(N)$ worker range scan bottleneck with an $O(1)$ constant-time idle worker Set queue backed by Valkey/Redis.

Architectural Rationale & Changes

Previously, when an actor was resumed, the scheduler step (AssignWorkerStep.Execute in workflow_resume.go) would fetch all registered workers in the pool via store.ListWorkers(ctx). For a pool of 10,000 workers, this required serial keyspace scans over all master shards, loading and unmarshaling 10,000 JSON blocks on every single wakeup request.

This PR completely eliminates that bottleneck by shifting worker state queue management into the database layer:

1. Interface Enhancements (`store.Interface`)

Added ClaimIdleWorker(ctx, namespace, pool, actorID, actorNamespace, actorTemplate) to the database contract.

2. Set-Based Idle Pool Management (`ateredis.go`)

We leverage Valkey/Redis's high-performance in-memory Set indexing to track available capacity:

Atomic Selection ($O(1)$): The new ClaimIdleWorker executes a single, atomic SPOP operation on the pool:<namespace>:<pool>:idle_workers Set to claim a random free worker in constant time.
Zero Scheduling Collisions: Because SPOP is an atomic server-side operation, concurrent scheduler instances are guaranteed to pop unique worker IDs. This completely eliminates optimistic lock collisions and database retries on concurrent wakeups.
Automated Lifecycle Hooks:
- Newly registered workers are added to the idle set during CreateWorker.
- Deleted worker pods are cleanly purged from the set during DeleteWorker.
- Suspended workers are returned to the set when their ActorId transitions to empty in UpdateWorker.

3. Handling Redis Cluster Slot Restrictions (`CROSSSLOT` fix)

During integration testing, multi-key transaction pipelines (trying to write a worker record and mutate the idle set in a single transaction block) failed with CROSSSLOT errors because the keys hashed to different clustered slots.

We solved this by splitting the operations sequentially outside the transactions:

The worker metadata record remains safely protected by optimistic locking version checks (WATCH transactions).
The indexing mutations (SAdd/SRem on the idle Set) are executed as separate, independent commands immediately upon transaction success. This ensures 100% compatibility with clustered production Valkey.

Verification & Tests Completed

Compilation: Fully verified with go vet (0 errors/warnings).

Unit & Integration Tests: Re-ran the entire store package test suite; 100% of tests passed flawlessly:

go test -v ./cmd/servers/ateapi/store/...
PASS
ok      github.com/agent-substrate/substrate/cmd/servers/ateapi/store/ateredis  0.168s

…s Sets

Grant McCloskey (MushuEE) · 2026-05-20T22:10:06Z

Fixed format issue using make fmt

Taahir Ahmed (ahmedtd) · 2026-05-21T20:37:41Z

-	for _, worker := range workers {
-		if worker.GetActorId() == input.ActorID && worker.GetWorkerPool() == state.ActorTemplate.Spec.WorkerPoolRef.Name && worker.GetWorkerNamespace() == state.ActorTemplate.Spec.WorkerPoolRef.Namespace {
+	// Re-use previously assigned worker if available.
+	if state.Actor.GetAteomPodName() != "" {


This will never be non-empty during resume (except maybe on retry of a failed resume?)

Taahir Ahmed (ahmedtd) · 2026-05-21T20:42:19Z


+	// Run SAdd sequentially outside the transaction to avoid cluster slot restrictions.
+	if shouldAddToIdle {
+		setKey := fmt.Sprintf("pool:%s:%s:idle_workers", worker.GetWorkerNamespace(), worker.GetWorkerPool())


One of our design goals is to let all ActorTemplates share the same WorkerPool to maximize efficiency. In that case, this will mean that we are sending all restore traffic through a single hash bucket (and thus, one single valkey node).

We are going to need to solve this problem, and I'm having difficulty seeing how it could be solved within redis.

However, this is heaps better than the current strategy of "read every worker all the time".

100% I have a much more advanced change that handles locality but because it is much more ossifying I felt this was a decent intermediate step while the datamodel and DB choices were finalized.

Taahir Ahmed (ahmedtd) · 2026-05-21T20:44:07Z

 	}

+	// Run SAdd sequentially outside the transaction to avoid cluster slot restrictions.
+	if shouldAddToIdle {


I think this needs to be a three phase process.

Mark the worker as free, but not yet returned to the free set.

Return to the free set

Make the worker as fully free.

Otherwise, a crash between steps 1 and 2 will permanently leak the worker from consideration.

With the three-phase approach, we can have an additional background thread sweeping workers that are "free, but not yet returned to the free set" back to the free set.

I think we do have a re-sync process? Will need to double check. But I will make this change. Thanks Taahir.

feat(scheduler): optimize worker scheduling to O(1) using Valkey/Redi…

a570c3c

…s Sets

Grant McCloskey (MushuEE) force-pushed the feature/redis-scheduler-opt branch from 3a51efa to a570c3c Compare May 20, 2026 22:09

Benjamin Elder (BenTheElder) requested a review from Julian Gutierrez Oschmann (juli4n) May 20, 2026 22:17

Benjamin Elder (BenTheElder) added the feature An enhancement / feature request or implementation label May 21, 2026

Benjamin Elder (BenTheElder) assigned Julian Gutierrez Oschmann (juli4n) May 21, 2026

Taahir Ahmed (ahmedtd) reviewed May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(scheduler): optimize worker scheduling to O(1) using Valkey/Redi…#13

feat(scheduler): optimize worker scheduling to O(1) using Valkey/Redi…#13
Grant McCloskey (MushuEE) wants to merge 1 commit into
agent-substrate:mainfrom
MushuEE:feature/redis-scheduler-opt

Grant McCloskey (MushuEE) commented May 20, 2026

Uh oh!

Grant McCloskey (MushuEE) commented May 20, 2026

Uh oh!

Taahir Ahmed (ahmedtd) May 21, 2026

Uh oh!

Taahir Ahmed (ahmedtd) May 21, 2026

Uh oh!

Grant McCloskey (MushuEE) May 22, 2026

Uh oh!

Taahir Ahmed (ahmedtd) May 21, 2026

Uh oh!

Grant McCloskey (MushuEE) May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Grant McCloskey (MushuEE) commented May 20, 2026

Description

Architectural Rationale & Changes

1. Interface Enhancements (store.Interface)

2. Set-Based Idle Pool Management (ateredis.go)

3. Handling Redis Cluster Slot Restrictions (CROSSSLOT fix)

Verification & Tests Completed

Uh oh!

Grant McCloskey (MushuEE) commented May 20, 2026

Uh oh!

Taahir Ahmed (ahmedtd) May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Taahir Ahmed (ahmedtd) May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Grant McCloskey (MushuEE) May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Taahir Ahmed (ahmedtd) May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Grant McCloskey (MushuEE) May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. Interface Enhancements (`store.Interface`)

2. Set-Based Idle Pool Management (`ateredis.go`)

3. Handling Redis Cluster Slot Restrictions (`CROSSSLOT` fix)