feat: scheduled (cron) triggers for agents, closes #186 (#196)

Open
ishaan-berri wants to merge 13 commits into main from litellm_add-cron-triggers

ishaan-berri commented May 19, 2026

Why

Closes #186. Today every LAP session is started by an external trigger (human chat, Slack/Linear webhook, POST /agents/{id}/session). There's no native way to say "run this agent every day at 9am PT on its own." This blocks daily stargazer outreach, weekly digests, hourly health checks, periodic data syncs — every one of them today has to be poked by an external GitHub Action / Zapier / k8s CronJob.

What

Native scheduled triggers. Add a 5-field cron + IANA timezone to Agent; the worker fires a Session at each scheduled instant.

Schema

  • Agent.cron_schedule (nullable 5-field cron)
  • Agent.cron_timezone (IANA, default UTC)
  • Agent.cron_enabled (toggle, default true)
  • Agent.cron_overlap_policy ("skip" only in v1; reserved for "queue" / "parallel")
  • Agent.cron_last_fired_at, Agent.cron_next_fire_at (server-managed)
  • Index (cron_enabled, cron_next_fire_at) so the scheduler is a single index seek
  • Session.trigger ("api" default, "cron" for scheduled runs) so the UI can badge them
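As a rough illustration of the schema list above, the new fields might be typed as follows. The type names and the `withCronDefaults` helper are hypothetical; only the column names, defaults, and allowed values come from the list.

```typescript
// Hypothetical TypeScript view of the new columns; names mirror the list above.
type CronOverlapPolicy = "skip"; // "queue" and "parallel" are reserved, not accepted in v1
type SessionTrigger = "api" | "cron";

interface AgentCronFields {
  cron_schedule: string | null; // 5-field cron, null when unscheduled
  cron_timezone: string; // IANA name, defaults to "UTC"
  cron_enabled: boolean; // defaults to true
  cron_overlap_policy: CronOverlapPolicy;
  cron_last_fired_at: Date | null; // server-managed
  cron_next_fire_at: Date | null; // server-managed
}

// Illustrative helper (not from the PR): fill in the documented defaults.
function withCronDefaults(partial: Partial<AgentCronFields> = {}): AgentCronFields {
  return {
    cron_schedule: null,
    cron_timezone: "UTC",
    cron_enabled: true,
    cron_overlap_policy: "skip",
    cron_last_fired_at: null,
    cron_next_fire_at: null,
    ...partial,
  };
}

// Sessions default to "api"; the scheduler would pass "cron" explicitly.
function defaultSessionTrigger(): SessionTrigger {
  return "api";
}
```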

Scheduler

src/server/cron.ts: tickCron() runs alongside the existing reconcile + warm-pool ticks in the worker. Each tick:

  1. Inside a transaction, claim due agents via raw SELECT … FOR UPDATE SKIP LOCKED
  2. Advance each agent's cron_next_fire_at inside the same tx (so a sibling pod won't re-claim on its next tick)
  3. Commit the tx — release locks fast
  4. Outside the tx, fire each agent's bring-up (warm claim → fall back to cold), tagged trigger="cron"

Bring-up is the same runBringUp the HTTP route already uses — extracted to src/server/session-bringup.ts so both callers share one implementation.
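The claim in step 1 might be expressed with a query along these lines. This is a sketch, not the query from the PR: the `"Agent"` table quoting, the selected columns, and the `buildClaimQuery` helper are assumptions; the `FOR UPDATE SKIP LOCKED` clause, the filter columns, and the bound of 50 per tick are taken from the description and commit messages.

```typescript
// Sketch only: table/column quoting and the helper name are assumptions.
const MAX_DUE_PER_TICK = 50; // the commit message mentions a bound of 50 due agents per tick

interface ClaimQuery {
  text: string;
  values: [Date, number];
}

function buildClaimQuery(now: Date, limit: number = MAX_DUE_PER_TICK): ClaimQuery {
  const text = [
    "SELECT agent_id, cron_schedule, cron_timezone",
    'FROM "Agent"',
    "WHERE cron_enabled = true",
    "  AND cron_schedule IS NOT NULL",
    "  AND cron_next_fire_at <= $1",
    "ORDER BY cron_next_fire_at",
    "LIMIT $2",
    "FOR UPDATE SKIP LOCKED", // rows locked by a sibling pod are skipped, not waited on
  ].join("\n");
  return { text, values: [now, limit] };
}
```

The row locks only last until the surrounding transaction ends, so the query and the `cron_next_fire_at` update have to run in the same transaction for the claim to hold.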

Answering "how does this work with multiple pods?" (the comment on the issue)

This was the open question. State lives in the DB (cron_next_fire_at column), not in any in-memory APScheduler instance. Concurrency is enforced by Postgres row locks:

  • Two pods wake up at the same instant → both run SELECT … FOR UPDATE SKIP LOCKED
  • Pod A grabs Agent X's lock; Pod B sees Agent X is locked and skips it (that's what SKIP LOCKED does — return rows that aren't currently locked)
  • Pod A advances cron_next_fire_at and commits → Pod B's next tick won't see Agent X because the next-fire instant is now in the future

No leader election, no Redis, no Zookeeper. Just Postgres. Verified locally with a race test (Promise.all([tickCron(), tickCron()]) against a due row): exactly one pod fires, the other returns fired: 0.
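The race described above can be mimicked without a database. The following is an in-memory stand-in for Postgres row locks, written only to illustrate the reasoning; `FakeAgentTable` and `simulateRace` are hypothetical names, and timestamps are simplified to epoch milliseconds.

```typescript
// In-memory stand-in for Postgres row locks; not the PR's code.
interface FakeAgentRow {
  agent_id: string;
  cron_next_fire_at: number; // epoch ms, simplified
}

class FakeAgentTable {
  private lockedBy = new Map<string, string>(); // agent_id -> transaction id

  constructor(private rows: FakeAgentRow[]) {}

  // Mimics SELECT ... FOR UPDATE SKIP LOCKED: due rows that no other tx holds.
  claimDue(txId: string, now: number): FakeAgentRow[] {
    const claimed = this.rows.filter(
      (r) => r.cron_next_fire_at <= now && !this.lockedBy.has(r.agent_id),
    );
    for (const r of claimed) this.lockedBy.set(r.agent_id, txId);
    return claimed;
  }

  // Mimics UPDATE + COMMIT: advance the next-fire instant, then release locks.
  advanceAndCommit(txId: string, claimed: FakeAgentRow[], next: number): void {
    for (const r of claimed) {
      r.cron_next_fire_at = next;
      if (this.lockedBy.get(r.agent_id) === txId) this.lockedBy.delete(r.agent_id);
    }
  }
}

function simulateRace(): { podA: number; podB: number; podBNextTick: number } {
  const now = 1000;
  const table = new FakeAgentTable([{ agent_id: "agent-x", cron_next_fire_at: now }]);

  const a = table.claimDue("pod-a", now); // Pod A locks agent-x
  const b = table.claimDue("pod-b", now); // Pod B sees the lock and skips the row
  table.advanceAndCommit("pod-a", a, now + 60000);
  table.advanceAndCommit("pod-b", b, now + 60000);

  // On Pod B's next tick the row is unlocked but no longer due.
  const later = table.claimDue("pod-b", now + 30000);
  return { podA: a.length, podB: b.length, podBNextTick: later.length };
}
```

In this model Pod A claims one row while Pod B claims none on both its first and its next tick, which matches the "exactly one pod fires" outcome reported for the local race test.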

API

PATCH /api/v1/managed_agents/agents/{id} accepts cron_schedule, cron_timezone, cron_enabled, cron_overlap_policy.

  • Empty string cron_schedule clears the schedule (sets DB to NULL)
  • Invalid cron string → 400 with the parser's error message
  • Invalid timezone → 400 with a hint about IANA names
  • Schedule or timezone change recomputes cron_next_fire_at server-side so the new cadence takes effect immediately
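The validation rules above can be sketched as a small pure function. This is a simplified stand-in rather than the route's code: the PR delegates cron parsing to a library, whereas this sketch only counts fields, and `validateCronPatch`, `isValidIanaTimezone`, and the error wording are hypothetical.

```typescript
// Simplified stand-in for the server-side checks; names and messages are illustrative.
type PatchResult =
  | { status: 200; cron_schedule: string | null; cron_timezone: string }
  | { status: 400; error: string };

function isValidIanaTimezone(tz: string): boolean {
  try {
    // Intl throws a RangeError for time zone names it does not recognise.
    new Intl.DateTimeFormat("en-US", { timeZone: tz });
    return true;
  } catch {
    return false;
  }
}

function validateCronPatch(schedule: string, timezone: string = "UTC"): PatchResult {
  if (!isValidIanaTimezone(timezone)) {
    return {
      status: 400,
      error: `invalid timezone "${timezone}"; use an IANA name such as America/Los_Angeles`,
    };
  }
  const trimmed = schedule.trim();
  if (trimmed === "") {
    // Empty string clears the schedule (stored as NULL).
    return { status: 200, cron_schedule: null, cron_timezone: timezone };
  }
  const fields = trimmed.split(/\s+/);
  if (fields.length !== 5) {
    return { status: 400, error: `expected 5 cron fields, got ${fields.length}` };
  }
  return { status: 200, cron_schedule: trimmed, cron_timezone: timezone };
}
```

Note that `Intl.DateTimeFormat` may accept a few legacy aliases in addition to canonical IANA names, so a stricter server-side check could still reject values this sketch lets through.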

UI

Inline editor on the agent settings page — Schedule row next to Env vars.

Collapsed (read-only):

Schedule    Weekdays at 9am (America/Los_Angeles)  [enabled]   ✏ Edit    last: 5/18/2026, 9:24:50 AM
                                                                          next: 5/19/2026, 9:24:50 AM

Expanded (editing):

Schedule  ┌──────────────────────────────────────────────────────────┐
          │ Cron expression                                           │
          │ [ 0 9 * * 1-5                                          ]  │
          │ Weekdays at 9am                                           │
          │                                                           │
          │ [Every 5 minutes] [Hourly] [Daily at 9am]                 │
          │ [Weekdays at 9am] [Weekly (Monday 9am)]                   │
          │                                                           │
          │ Timezone                                                  │
          │ [ America/Los_Angeles                                  ▾] │
          │                                                           │
          │ [✓] Enabled                                               │
          │                                                           │
          │ [ Save ]   ✕ Cancel                                       │
          └──────────────────────────────────────────────────────────┘

Sessions list badges cron-driven runs with a cron pill alongside the status.

(Screenshots saved locally at /tmp/cron-collapsed.png and /tmp/cron-expanded.png — happy to drag-and-drop them onto the PR after merge or via a follow-up comment.)

Test plan

  • Unit tests for parse + next-fire (node --import tsx --test scripts/cron-parse.test.mjs) — 9 assertions, all pass
  • npx tsc --noEmit clean
  • npx eslint clean on changed files
  • npx next build succeeds (no breaking changes to existing routes)
  • Migration applies cleanly against fresh Postgres
  • PATCH /agents/{id} happy path: schedule set → cron_next_fire_at populated
  • PATCH invalid cron string → 400
  • PATCH invalid timezone → 400
  • PATCH empty cron_schedule → both fields nulled
  • tickCron with a due agent → fires + Session row created with trigger="cron"
  • Two concurrent tickCron calls → exactly one fires (FOR UPDATE SKIP LOCKED verified)
  • Overlap policy "skip" → second fire skipped while previous still active
  • End-to-end test against a live k8s sandbox (deferred to staging)
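To make the parse and next-fire behaviour in the first test-plan item concrete, here is a brute-force sketch that steps minute by minute and matches local time in the target timezone. It is not the library the PR uses: `computeNextFireAtSketch` and `matchesField` are hypothetical, only `*`, single values, ranges, steps, and comma lists are handled, and all five fields are ANDed together, which is simpler than classic cron's day-of-month / day-of-week OR rule.

```typescript
// Brute-force sketch, not the library the PR depends on.
function matchesField(field: string, value: number, min: number, max: number): boolean {
  return field.split(",").some((part) => {
    const [range, stepText] = part.split("/");
    const step = stepText === undefined ? 1 : Number(stepText);
    let lo = min;
    let hi = max;
    if (range !== "*") {
      const [a, b] = range.split("-");
      lo = Number(a);
      // "a" alone means exactly a; "a/n" means a through max in steps of n.
      hi = b === undefined ? (stepText === undefined ? lo : max) : Number(b);
    }
    if (!Number.isInteger(step) || step < 1 || !Number.isInteger(lo) || !Number.isInteger(hi)) {
      throw new Error(`invalid cron field "${field}"`);
    }
    return value >= lo && value <= hi && (value - lo) % step === 0;
  });
}

const WEEKDAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

function computeNextFireAtSketch(schedule: string, timezone: string, after: Date): Date {
  const fields = schedule.trim().split(/\s+/);
  if (fields.length !== 5) throw new Error(`expected 5 cron fields, got ${fields.length}`);
  const [min, hour, dom, mon, dow] = fields;
  // The constructor throws a RangeError for an unknown timezone name.
  const fmt = new Intl.DateTimeFormat("en-US", {
    timeZone: timezone, hour12: false, weekday: "short",
    month: "numeric", day: "numeric", hour: "numeric", minute: "numeric",
  });
  // Start at the next whole minute strictly after `after`.
  let t = Math.floor(after.getTime() / 60000) * 60000 + 60000;
  const limit = t + 366 * 24 * 60 * 60000; // give up after about a year
  for (; t < limit; t += 60000) {
    const p: Record<string, string> = {};
    for (const part of fmt.formatToParts(new Date(t))) p[part.type] = part.value;
    const wd = WEEKDAYS.indexOf(p.weekday);
    const localHour = Number(p.hour) % 24; // some runtimes report midnight as "24"
    if (
      matchesField(min, Number(p.minute), 0, 59) &&
      matchesField(hour, localHour, 0, 23) &&
      matchesField(dom, Number(p.day), 1, 31) &&
      matchesField(mon, Number(p.month), 1, 12) &&
      // Day-of-week accepts both 0 and 7 for Sunday.
      (matchesField(dow, wd, 0, 7) || (wd === 0 && matchesField(dow, 7, 0, 7)))
    ) {
      return new Date(t);
    }
  }
  throw new Error(`no fire time within a year for "${schedule}"`);
}
```

For example, `computeNextFireAtSketch("0 9 * * *", "America/Los_Angeles", new Date("2026-05-19T00:00:00Z"))` yields `2026-05-19T16:00:00.000Z`, since 9am Pacific Daylight Time is 16:00 UTC. A production scheduler would want a real parser instead of minute stepping, which is slow for sparse schedules.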

Out of scope for v1

  • Per-run retry policy on bring-up failure
  • Backfill on cron_enabled = false → true resume
  • Manual "run now" button (separate issue)
  • cron_overlap_policy = "queue" | "parallel" (column exists, only "skip" accepted)

Adds nullable cron_schedule (standard 5-field), cron_timezone (IANA),
cron_enabled, cron_overlap_policy, plus server-managed cron_last_fired_at
and cron_next_fire_at. Indexed on (cron_enabled, cron_next_fire_at) so
the scheduler hot path is a single index seek.

Session gains a trigger column ("api" default, "cron" for scheduled
runs) so the UI can badge cron-driven sessions.

Used by src/server/cron.ts to compute next-fire instants. Standard
library that handles 5-field crons, IANA timezones, and DST correctly.

The session POST route inlined ~250 lines of warm/cold bring-up
orchestration. Moves runBringUp + helpers into a standalone module so
non-HTTP callers (the worker's cron tick, future integrations) can
reuse exactly the same dance. Nothing inside reads request-scoped
state, only prisma, k8s, and harness primitives.

Drops ~370 lines of inline bring-up logic. Route is now a thin shell:
auth + body parse + warm claim + Session row create + delegate to
runBringUp. Behavior unchanged.

Implements parseCronSpec, computeNextFireAt, and tickCron.

Multi-pod safety: tickCron claims due agents via raw
"SELECT … FOR UPDATE SKIP LOCKED" inside a transaction, advances each
agent's cron_next_fire_at within the same tx, then fires bring-up
*outside* the transaction so the lock window stays small. Two pods
racing the same tick: one wins each row, the loser sees it skipped.

Overlap policy: when a previous cron-tagged session is still in
{creating, ready}, the new fire is skipped. The Agent.cron_overlap_policy
column leaves room for "queue" and "parallel" later.

Bounded at 50 due agents per tick so a stuck run can't swamp the
worker; anything beyond gets picked up next tick (default 30s).

Lives in its own module so cron.ts → cron-bringup.ts →
session-bringup.ts doesn't pull API-route code into the worker bundle.
Also keeps the synthetic-prompt format for scheduled runs in one
place ("[cron] scheduled run at <iso>").

One more line in the existing tick loop. When no agent has a schedule
the (cron_enabled, cron_next_fire_at) index makes the lookup
essentially free, so the cost on a no-schedule deploy is ~one index
seek per worker tick. Counters surface in the heartbeat log so
operators can see cron_considered / cron_fired / cron_skipped_overlap
without a separate dashboard.

UpdateAgentBody accepts cron_schedule, cron_timezone, cron_enabled,
cron_overlap_policy. ApiAgent / ApiSession surface the server-managed
cron timestamps and the trigger discriminator. Read-only fields are
omitted from the request body so clients can't lie about
cron_next_fire_at.

When cron_schedule or cron_timezone changes (validated together so the
400 names the bad pair, not the field evaluation order), recompute
cron_next_fire_at server-side. Empty string clears the schedule; the
scheduler's WHERE cron_schedule IS NOT NULL then masks the row.

Flipping cron_enabled doesn't touch cron_next_fire_at — toggling off
and back on resumes from the existing cadence without losing it.

Client-side types now mirror the server shapes. cron_schedule accepts
null on the request side so the UI can clear via PATCH without
ambiguity. SessionRow.trigger is optional + nullable since older rows
predate the column.

Drop-in editor for the agent settings page's <dl> grid. Reads the
agent's cron config, lets the user edit cron expression / timezone /
enabled, and submits via the shared updateAgent API. Renders the
common shapes ("Weekdays at 9am", "Every 5 minutes") as a live
preview; falls back to "Custom schedule" for anything novel.
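The live preview described here could be as simple as a lookup table. In this sketch `describeCron` and `KNOWN_SHAPES` are hypothetical names; the labels come from the preset chips in the PR description, but apart from `0 9 * * 1-5` (shown in the editor mockup) the expressions paired with them are assumptions based on conventional cron usage.

```typescript
// Hypothetical preview helper; only "0 9 * * 1-5" is confirmed by the PR mockup.
const KNOWN_SHAPES = new Map<string, string>([
  ["*/5 * * * *", "Every 5 minutes"],
  ["0 * * * *", "Hourly"],
  ["0 9 * * *", "Daily at 9am"],
  ["0 9 * * 1-5", "Weekdays at 9am"],
  ["0 9 * * 1", "Weekly (Monday 9am)"],
]);

function describeCron(expression: string): string {
  // Collapse repeated whitespace so "0  9 * * 1-5" still matches.
  const normalized = expression.trim().split(/\s+/).join(" ");
  return KNOWN_SHAPES.get(normalized) ?? "Custom schedule";
}
```

Because the preview never throws, an unrecognised or even malformed expression simply reads "Custom schedule" and validation is left to the server.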

Preset chips cover the cases the original issue called out — daily
stargazer outreach, weekly digests, hourly checks. Server is the
ultimate validator: client preview is cosmetic and never blocks save.

Adds a Schedule row to the Configuration block and a small "cron"
badge alongside the status pill on cron-triggered sessions in the
Sessions list so the user can tell at a glance which runs were
auto-started.

Runs with "node --import tsx --test scripts/cron-parse.test.mjs".
9 assertions covering: empty/null schedules → next=null; daily, every-N-
minute, weekday-only cadence advances correctly; LA timezone produces
expected UTC offset; invalid cron + invalid tz throw with actionable
messages.

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR adds native scheduled (cron) triggers for agents: new schema columns on Agent and Session, a Postgres-based multi-pod-safe scheduler (tickCron with SELECT … FOR UPDATE SKIP LOCKED), bring-up logic extracted to a shared module, and a UI inline editor. The architecture is sound (state lives in the DB, no leader election required), but there are two logic bugs in the Phase 1 / Phase 2 handoff inside tickCron.

  • cron_last_fired_at is committed inside the claim transaction before the overlap check runs outside it; when a run is skipped due to cron_overlap_policy = "skip", the column still shows now with no corresponding Session row, misleading the UI "last fired" display.
  • When computeNextFireAt throws for an agent with a corrupted schedule (the catch branch sets cron_enabled = false), the agent remains in the claimed list returned from the transaction and is still passed to fireCronRun in Phase 2, spawning one unintended session despite the explicit "disabling" log message.

Confidence Score: 3/5

The core scheduler has two logic gaps in its Phase 1 / Phase 2 handoff that produce incorrect DB state and unintended session creation; safe to merge only after those are addressed.

Both bugs live in the hot path of the new cron tick. The first means every overlap-skipped firing writes a stale 'last fired' timestamp users can see in the UI. The second means an agent that the code explicitly logs as 'disabling' due to a bad schedule still gets a session spawned for it in the same tick.

src/server/cron.ts — specifically the Phase 1 transaction block (lines ~204-222) and the Phase 2 loop (lines ~238-268) need attention before merging.

Important Files Changed

| Filename | Overview |
|---|---|
| src/server/cron.ts | New scheduler with two logic bugs: cron_last_fired_at is committed before the overlap check (skipped agents get a false timestamp), and agents disabled mid-tick for invalid schedules still fire once in Phase 2. |
| prisma/migrations/0006_agent_cron_triggers/migration.sql | Additive-only migration (new columns with safe defaults + one index). No column deletions, no data loss risk. |
| src/server/session-bringup.ts | Logic extracted from the session POST route into a shared module; functionally identical, just moved to enable reuse by the cron tick. |
| src/app/api/v1/managed_agents/agents/[agent_id]/route.ts | PATCH handler extended with cron field validation; parseCronSpec called server-side for immediate 400 on bad cron/tz pair, and cron_next_fire_at recomputed on schedule/timezone change. |
| src/components/cron-editor.tsx | New UI component for inline schedule editing; validation is server-side only with errors surfaced via onError prop. Clean implementation. |
| src/worker/index.ts | Worker tick wired to call tickCron() alongside existing reconcile and warm-pool ticks; result logged in heartbeat. |
| src/server/cron-bringup.ts | Thin shim that sets a cron-specific title and delegates to runBringUp; keeps cron.ts free of API-route imports. |
| prisma/schema.prisma | Six new nullable/defaulted cron columns on Agent plus trigger on Session; composite index on (cron_enabled, cron_next_fire_at) matches the scheduler query exactly. |
| src/server/types.ts | ApiAgent and ApiSession extended with cron fields; extensive cast-through-unknown pattern due to pre-generate Prisma client compatibility. |

Reviews (1): Last reviewed commit: "test(cron): unit cover parser, next-fire..."

Comment thread src/server/cron.ts
Comment on lines +204 to +255
    for (const a of claimed) {
      try {
        const next = computeNextFireAt(a.cron_schedule, a.cron_timezone, now);
        await tx.agent.update({
          where: { agent_id: a.agent_id },
          data: { cron_next_fire_at: next, cron_last_fired_at: now },
        });
      } catch (e) {
        console.error(
          `cron: disabling agent ${a.agent_id} — invalid schedule at fire time: ${
            e instanceof Error ? e.message : String(e)
          }`,
        );
        await tx.agent.update({
          where: { agent_id: a.agent_id },
          data: { cron_enabled: false, cron_next_fire_at: null },
        });
      }
    }
    return claimed;
  });
} catch (e) {
  // Transaction-level failure — log and bail. Next tick will retry.
  console.error(
    `cron: claim transaction failed: ${e instanceof Error ? e.message : String(e)}`,
  );
  result.errors += 1;
  return result;
}

result.considered = dueAgents.length;

// Phase 2: fire each agent outside the transaction. Failures are
// per-agent — one bad agent must not block the rest of the tick.
for (const due of dueAgents) {
  try {
    const agent = await prisma.agent.findUnique({
      where: { agent_id: due.agent_id },
    });
    if (agent === null) continue;

    // Overlap policy. "skip" is the only supported value in v1; the
    // column is keyed for future "queue" / "parallel" semantics.
    const policy =
      (agent as unknown as { cron_overlap_policy?: string })
        .cron_overlap_policy ?? "skip";
    if (policy === "skip" && (await hasActiveCronRun(agent.agent_id))) {
      result.skipped_overlap += 1;
      console.log(
        `cron: skipping agent_id=${agent.agent_id} — previous cron run still active`,
      );
      continue;

P1 cron_last_fired_at set before overlap check, so skipped runs show a false "last fired" timestamp

cron_last_fired_at = now is committed inside the claim transaction (line 209) for every claimed agent. The overlap check (hasActiveCronRun) runs after the transaction commits (line 250). When the check fires and a previous run is still active, the agent is skipped via continue — no session is created — but cron_last_fired_at already reflects now in the database. The UI then shows "last: X minutes ago" for a run that never happened, with no matching Session row users can inspect.
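One possible way to address this, assuming the claim transaction is changed to advance only cron_next_fire_at, is to stamp cron_last_fired_at in Phase 2 after the overlap check has passed. The sketch below models that ordering in memory; `firePhase`, `CronAgentState`, and the `hasActiveCronRun` boolean (standing in for the real query) are hypothetical and this is not the fix adopted in the PR.

```typescript
// In-memory sketch of one possible fix; shapes and helper names are hypothetical.
interface CronAgentState {
  agent_id: string;
  cron_last_fired_at: Date | null;
  hasActiveCronRun: boolean; // stands in for the hasActiveCronRun() database query
}

interface FireResult {
  fired: number;
  skipped_overlap: number;
}

function firePhase(
  claimed: CronAgentState[],
  now: Date,
  fire: (agent: CronAgentState) => void,
): FireResult {
  const result: FireResult = { fired: 0, skipped_overlap: 0 };
  for (const agent of claimed) {
    if (agent.hasActiveCronRun) {
      // Skipped: cron_last_fired_at is deliberately left untouched.
      result.skipped_overlap += 1;
      continue;
    }
    fire(agent);
    // Stamped only once a run was actually started.
    agent.cron_last_fired_at = now;
    result.fired += 1;
  }
  return result;
}
```

With this ordering the "last" value shown in the UI always corresponds to a Session row, at the cost of one extra write per fired agent outside the claim transaction.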

Comment thread src/server/cron.ts
Comment on lines +204 to +258
    for (const a of claimed) {
      try {
        const next = computeNextFireAt(a.cron_schedule, a.cron_timezone, now);
        await tx.agent.update({
          where: { agent_id: a.agent_id },
          data: { cron_next_fire_at: next, cron_last_fired_at: now },
        });
      } catch (e) {
        console.error(
          `cron: disabling agent ${a.agent_id} — invalid schedule at fire time: ${
            e instanceof Error ? e.message : String(e)
          }`,
        );
        await tx.agent.update({
          where: { agent_id: a.agent_id },
          data: { cron_enabled: false, cron_next_fire_at: null },
        });
      }
    }
    return claimed;
  });
} catch (e) {
  // Transaction-level failure — log and bail. Next tick will retry.
  console.error(
    `cron: claim transaction failed: ${e instanceof Error ? e.message : String(e)}`,
  );
  result.errors += 1;
  return result;
}

result.considered = dueAgents.length;

// Phase 2: fire each agent outside the transaction. Failures are
// per-agent — one bad agent must not block the rest of the tick.
for (const due of dueAgents) {
  try {
    const agent = await prisma.agent.findUnique({
      where: { agent_id: due.agent_id },
    });
    if (agent === null) continue;

    // Overlap policy. "skip" is the only supported value in v1; the
    // column is keyed for future "queue" / "parallel" semantics.
    const policy =
      (agent as unknown as { cron_overlap_policy?: string })
        .cron_overlap_policy ?? "skip";
    if (policy === "skip" && (await hasActiveCronRun(agent.agent_id))) {
      result.skipped_overlap += 1;
      console.log(
        `cron: skipping agent_id=${agent.agent_id} — previous cron run still active`,
      );
      continue;
    }

    await fireCronRun(agent as AgentRow, now);

P1 Agents disabled for invalid schedule are still fired in Phase 2

When computeNextFireAt throws for an agent (invalid schedule mutated between last fire and this tick), the catch block sets cron_enabled = false and cron_next_fire_at = null inside the transaction — correctly preventing future fires. However, the agent is still present in the claimed array returned from the transaction, and Phase 2 iterates over dueAgents without checking cron_enabled. The result is that the "now disabled" agent still gets fireCronRun called, spawning one unintended session. The error log says "disabling" but the code fires anyway.
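A minimal way to close this gap, sketched here under the assumption that Phase 1 can return a filtered list, is to hand Phase 2 only the agents whose next-fire computation succeeded. `advanceClaimed`, `ClaimedAgent`, and `ClaimOutcome` are hypothetical names, and the database writes are reduced to comments.

```typescript
// Sketch of one possible fix: Phase 1 returns only agents that were advanced successfully.
interface ClaimedAgent {
  agent_id: string;
  cron_schedule: string;
  cron_timezone: string;
}

interface ClaimOutcome {
  toFire: ClaimedAgent[];
  disabled: string[];
}

function advanceClaimed(
  claimed: ClaimedAgent[],
  computeNext: (agent: ClaimedAgent) => Date, // may throw on a corrupted schedule
): ClaimOutcome {
  const outcome: ClaimOutcome = { toFire: [], disabled: [] };
  for (const a of claimed) {
    try {
      computeNext(a); // the real code would persist this as cron_next_fire_at
      outcome.toFire.push(a);
    } catch {
      // The real code sets cron_enabled = false here; the agent is not queued to fire.
      outcome.disabled.push(a.agent_id);
    }
  }
  return outcome;
}
```

Phase 2 would then iterate `toFire` instead of the raw claimed list, so an agent logged as "disabling" is never passed to fireCronRun in the same tick.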
