The admin UI needs a backend-agnostic control plane. Direct Redis access is a dead-end once multiple durable backends exist. We will expose a gRPC AdminService implemented by the worker, and a HTTP/JSON gateway for the admin UI. The gateway uses mutual TLS (mTLS) for transport authentication.
- Provide a stable admin contract independent of the durable backend implementation.
- Allow the Next.js admin UI to read operational data and trigger safe actions.
- Support mTLS authentication between UI and the gateway.
- Keep AdminService deployable alongside the worker in the same process.
- Public multi-tenant auth/authorization (beyond mTLS).
- Full RBAC and audit logging (future work).
- WebSocket/bi-directional streaming beyond SSE (future work).
- Overview data - Return queue counts, processing counts, and DLQ size. - Return coordination state (global rate limit, leader lock, pause state).
- Queue data - Return per-queue ready/processing/dead counts and weights. - Return queue detail by name (single-queue view). - Allow updating queue weight and resetting to default. - Allow pausing/resuming a queue (durable dequeue only).
- Schedules - List cron schedules with spec and next/previous run. - Create/update schedules by name (requires a registered factory). - Pause/resume schedules without unregistering the factory. - Delete schedules by name. - Run schedule immediately on demand. - Pause/resume all schedules at once.
- Jobs (containerized runner) - Persist job definitions (repo + tag + optional path/dockerfile). - Support sources: Git tag, HTTPS tarball URL, or local tarball path. - Validate tarball SHA256 when provided and enforce allowlists. - Run jobs on demand (enqueue durable task). - Register job factories to allow scheduling via existing cron APIs. - Enforce tag-only Git fetch (no branches). - Truncate output to a configured max bytes for safety.
- Health - Report service health and build information (version, commit, Go version).
- DLQ - List DLQ entries (bounded, paginated by limit/offset, with optional filters). - Fetch DLQ entry detail by ID (metadata, error, payload size, timestamps). - Replay DLQ entries (bounded limit).
- Actions - Pause durable dequeue. - Resume durable dequeue.
- Events - Stream overview + schedule + job events to power the UI event log. - Persist job events across restarts when a store is configured.
- Backend abstraction - Admin API should not depend on Redis directly; use backend interface.
- Security: mTLS required for gateway; server certificate validated by client.
- Reliability: idempotent actions and bounded timeouts.
- Performance: use batched Redis commands (if Redis backend is used).
- Compatibility: API versioned as
v1.
Service: worker.v1.AdminService
GetOverview(GetOverviewRequest) returns (GetOverviewResponse)GetHealth(GetHealthRequest) returns (GetHealthResponse)ListQueues(ListQueuesRequest) returns (ListQueuesResponse)GetQueue(GetQueueRequest) returns (GetQueueResponse)PauseQueue(PauseQueueRequest) returns (PauseQueueResponse)ListScheduleFactories(ListScheduleFactoriesRequest) returns (ListScheduleFactoriesResponse)ListSchedules(ListSchedulesRequest) returns (ListSchedulesResponse)CreateSchedule(CreateScheduleRequest) returns (CreateScheduleResponse)DeleteSchedule(DeleteScheduleRequest) returns (DeleteScheduleResponse)PauseSchedule(PauseScheduleRequest) returns (PauseScheduleResponse)RunSchedule(RunScheduleRequest) returns (RunScheduleResponse)PauseSchedules(PauseSchedulesRequest) returns (PauseSchedulesResponse)ListJobs(ListJobsRequest) returns (ListJobsResponse)GetJob(GetJobRequest) returns (GetJobResponse)UpsertJob(UpsertJobRequest) returns (UpsertJobResponse)DeleteJob(DeleteJobRequest) returns (DeleteJobResponse)RunJob(RunJobRequest) returns (RunJobResponse)ListDLQ(ListDLQRequest) returns (ListDLQResponse)GetDLQEntry(GetDLQEntryRequest) returns (GetDLQEntryResponse)PauseDequeue(PauseDequeueRequest) returns (PauseDequeueResponse)ResumeDequeue(ResumeDequeueRequest) returns (ResumeDequeueResponse)ReplayDLQ(ReplayDLQRequest) returns (ReplayDLQResponse)ReplayDLQByID(ReplayDLQByIDRequest) returns (ReplayDLQByIDResponse)StreamEvents(StreamEventsRequest) returns (stream AdminEvent)(gateway uses SSE)
GET /admin/v1/healthGET /admin/v1/overviewGET /admin/v1/queuesGET /admin/v1/queues/{name}POST /admin/v1/queues/{name}/weight(body:{ "weight": 3 })DELETE /admin/v1/queues/{name}/weightPOST /admin/v1/queues/{name}/pause(body:{ "paused": true })GET /admin/v1/schedules/factoriesGET /admin/v1/schedulesGET /admin/v1/schedules/events?name=schedule_name&limit=25POST /admin/v1/schedules(body:{ "name": "...", "spec": "...", "durable": false })DELETE /admin/v1/schedules/{name}POST /admin/v1/schedules/{name}/pause(body:{ "paused": true })POST /admin/v1/schedules/{name}/runPOST /admin/v1/schedules/pause(body:{ "paused": true })GET /admin/v1/jobsGET /admin/v1/jobs/{name}POST /admin/v1/jobs(body:{ "name": "...", "description": "...", "source": "git_tag|tarball_url|tarball_path", "repo": "...", "tag": "v1.2.3", "tarballUrl": "...", "tarballPath": "...", "tarballSha256": "...", "path": "subdir", "dockerfile": "Dockerfile", "command": ["..."], "env": ["KEY=VALUE"], "queue": "default", "retries": 2, "timeoutSeconds": 600 })DELETE /admin/v1/jobs/{name}POST /admin/v1/jobs/{name}/runGET /admin/v1/jobs/events?name=job_name&limit=25GET /admin/v1/dlq?limit=100&offset=0&queue=default&handler=send_email&query=oopsGET /admin/v1/dlq/{id}GET /admin/v1/audit?limit=100&action=queue.pause&target=defaultGET /admin/v1/audit/export?format=jsonl|json|csv&limit=1000&action=...&target=...POST /admin/v1/pausePOST /admin/v1/resumePOST /admin/v1/dlq/replay(body:{ "limit": 100 })POST /admin/v1/dlq/replay/ids(body:{ "ids": ["..."] })GET /admin/v1/events(SSE)
- mTLS required on the gateway listener.
- Certificates provided via env:
-
WORKER_ADMIN_TLS_CERT-WORKER_ADMIN_TLS_KEY-WORKER_ADMIN_TLS_CA - Gateway does not accept plaintext connections.
WORKER_ADMIN_GRPC_ADDR(default127.0.0.1:50052)WORKER_ADMIN_GRPC_TARGET(optional; when set, the gateway dials this address instead of starting a local AdminService)WORKER_ADMIN_HTTP_ADDR(default127.0.0.1:8081)WORKER_ADMIN_TLS_CERT,WORKER_ADMIN_TLS_KEY,WORKER_ADMIN_TLS_CAWORKER_ADMIN_AUDIT_EVENT_LIMIT(max audit events retained by service/backend list operations)WORKER_ADMIN_AUDIT_RETENTION(optional max age, e.g.168h; events older than cutoff are pruned/filtered)WORKER_ADMIN_AUDIT_ARCHIVE_DIR(optional directory for archived audit JSONL files)WORKER_ADMIN_AUDIT_ARCHIVE_INTERVAL(optional flush interval for archival writer)WORKER_ADMIN_AUDIT_EXPORT_LIMIT_MAX(gateway cap for export querylimit)WORKER_ADMIN_DEFAULT_QUEUE,WORKER_ADMIN_QUEUE_WEIGHTS(backend defaults)
Worker service cron presets (when using the bundled worker-service):
WORKER_DURABLE_CRON_HANDLERS(comma-separated handler names to register)WORKER_DURABLE_CRON_PRESETS(comma-separatedname=specoverrides)WORKER_CRON_HANDLERS(comma-separated in-memory handlers)WORKER_CRON_PRESETS(comma-separatedname=specoverrides)
Worker service job runner (containerized):
WORKER_JOB_REPO_ALLOWLIST(comma-separated;*to allow all)WORKER_JOB_GIT_BIN(defaultgit)WORKER_JOB_DOCKER_BIN(defaultdocker)WORKER_JOB_NETWORK(optional docker network)WORKER_JOB_WORKDIR(optional temp workspace root)WORKER_JOB_OUTPUT_BYTES(max combined stdout+stderr)WORKER_JOB_TARBALL_ALLOWLIST(comma-separated hostnames for HTTPS tarballs)WORKER_JOB_TARBALL_DIR(root for local tarball paths)WORKER_JOB_TARBALL_MAX_BYTES(default 64MiB)WORKER_JOB_TARBALL_TIMEOUT(default 30s)WORKER_JOB_EVENT_DIR(file-backed job event store root; enables persistence)WORKER_JOB_EVENT_MAX_ENTRIES(per key; default 10000)WORKER_JOB_EVENT_CACHE_TTL(default 10s)
- Log errors and action outcomes at INFO/WARN levels.
- Add counters for admin actions (pause/resume/replay).
- Implemented
- Overview, health, queues (list/get/weight/pause), schedules (list/factories/create/delete/pause/run/pause-all), jobs (list/get/upsert/delete/run), DLQ (list/detail/replay/replay-by-id), pause/resume dequeue, SSE events.
- Containerized job runner supports
git_tag,tarball_url,tarball_path, allowlists, SHA256 validation, output truncation. - Job event persistence is available via file-backed store (WORKER_JOB_EVENT_DIR). - Admin observability collector is available with gateway middleware and gRPC unary interceptors; snapshot endpoint:GET /admin/v1/metricsand Prometheus endpoint:GET /admin/v1/metrics/prometheus(alsoGET /admin/v1/metrics?format=prometheus). - Guardrails are configurable via env for replay caps, schedule run caps, and optional approval token checks (WORKER_ADMIN_REPLAY_LIMIT_MAX,WORKER_ADMIN_REPLAY_IDS_MAX,WORKER_ADMIN_SCHEDULE_RUN_MAX,WORKER_ADMIN_SCHEDULE_RUN_WINDOW,WORKER_ADMIN_REQUIRE_APPROVAL,WORKER_ADMIN_APPROVAL_TOKEN). - Artifact API parity: gateway now exposes job artifact metadata and download endpoints (GET /admin/v1/jobs/{name}/artifact/meta,GET /admin/v1/jobs/{name}/artifact) so UI no longer needs direct filesystem access for tarball-path jobs. - Audit export is available atGET /admin/v1/audit/export(jsonl,json,csv) with action/target filtering and bounded limits. - Audit retention is configurable via count (WORKER_ADMIN_AUDIT_EVENT_LIMIT) and optional age cutoff (WORKER_ADMIN_AUDIT_RETENTION); archived JSONL files can be enabled withWORKER_ADMIN_AUDIT_ARCHIVE_DIR(+ optional flush interval), and gateway export limits are capped byWORKER_ADMIN_AUDIT_EXPORT_LIMIT_MAX. - Gateway artifact handling now runs behind an internal artifact-store abstraction (filesystem implementation), reducing direct coupling in handlers. - Partial / gaps
- Event stream now supports
Last-Event-IDwith bounded in-memory replay; replay does not survive gateway restarts. - Metrics are in-memory snapshots with JSON + Prometheus text export; OTel/remote aggregation is still pending for cluster-wide long-term retention. - Central audit log supports count+age retention, bounded export, and file-based scheduled archival; external archive shipping/lifecycle policies are still pending. - No RBAC/authorization model beyond mTLS + perimeter controls. - Artifact download fortarball_pathdepends on gateway runtime mounts/config (WORKER_ADMIN_JOB_TARBALL_DIR) when worker and gateway are split.
- Implemented
- Operational pages for Overview, Queues, Schedules, DLQ, Jobs, Settings, and Docs.
- Action flows for queue weight/pause, schedule management, DLQ replay, job creation/run/edit/delete.
- Run-level page for jobs and event panes for schedules/jobs.
- SSE + polling fallback for live updates.
- Overview now includes a unified operations timeline (jobs + schedules + audit/queue/DLQ actions) with filters, paging, and run deep-links.
- Timeline filter state persists locally between page visits.
- Section state persistence is implemented for Queues, Schedules, Jobs, and DLQ (DLQ via URL query params).
- Action/API failures now surface structured diagnostics in-page (mapped cause hints + request ID when available).
- Gateway-facing Next API routes now emit a normalized error envelope (
error+errorDetail) to keep diagnostics consistent across UI sections. - Partial / gaps - Overview analytics are snapshot-focused; no built-in trend charts across day/week/month windows for all domains. - Timeline is implemented, but correlation is still basic (no causal graph/grouping by request/task chain). - Some high-value admin workflows are still split across pages (no global command palette / cross-page quick actions). - Artifact handling is source-aware but still operationally coupled to environment mounts for local tarballs.
- Admin observability: implemented as in-memory service metrics + gateway JSON/Prometheus endpoints; pending OTel export and durable retention backend.
- Policy enforcement: implemented for DLQ replay caps, schedule run caps, and optional approval token checks in admin mutations.
- Artifact API parity: implemented in gateway with an artifact-store abstraction; remaining work is non-filesystem providers (S3/object store) and signed URL policies.
- Audit retention/export: implemented for count+time-window retention + export endpoints, plus file-based scheduled archival; remaining work is external archival lifecycle (e.g., object storage + retention policies).
- Unified run visibility: show queued/running/completed transitions consistently with per-run immutable detail pages.
- Operational timelines: implemented as a single filtered timeline; remaining work is richer correlation/grouping UX.
- Safety UX: richer confirmation modals for risky actions with impact preview (counts, targets, scope).
- Error diagnostics: structured error panels for gateway/API failures (request ID, mapped cause, recovery hint).
- State persistence: partially implemented (timeline); extend to all sections for operator continuity.
- S3/object-store event backend (in addition to file store) with retention lifecycle controls.
- OpenAPI/JSON schema export for gateway endpoints.
- WebSocket transport as optional alternative to SSE.
- Multi-tenant authn/authz integration points (OIDC headers, signed tokens, policy hooks).
- Trend cards and small charts for throughput/failure/replay over configurable windows.
- Bulk action workbench for queues/schedules/jobs with dry-run mode.
- Advanced docs integration: contextual “why this failed” links from runtime errors to docs sections.
- Exportable reports (CSV/JSON) for run history and DLQ operations.
- Add AdminService proto + gateway routes.
- Implement Redis-backed AdminBackend.
- Update admin UI to call gateway (remove Redis direct access).
- Provide Dockerfile + compose for worker + gateway + admin-ui.
- Gateway auth complexity: use mTLS with clear docs and example certs.
- Backend diversity: isolate in
AdminBackendinterface. - Breaking changes: version HTTP routes under
/admin/v1/.