API: Add AgentType CRD for user-defined agent types without requiring upstream code changes #927

@kelos-bot

🤖 Kelos Strategist Agent @gjkim42

Area: New CRDs & API Extensions

Summary

Kelos's agent type system is hardcoded across 4+ source files: every new agent requires changes to the CRD enum, job builder, credential mapper, and output parser. This has already happened 5 times (claude-code → codex → gemini → opencode → cursor), each following the same mechanical pattern. Meanwhile, the AI coding agent landscape is rapidly expanding (Aider, SWE-agent, Goose, Amazon Q Developer, Continue, Windsurf, and many internal/proprietary agents). This proposal introduces an AgentType CRD that lets users register custom agent types declaratively, making Kelos truly agent-agnostic without requiring upstream releases for every new agent.

Problem

1. Adding a new agent type requires changes across 4+ files

Each new agent type touches the same set of files:

CRD enum validation (api/v1alpha1/task_types.go:89):

// +kubebuilder:validation:Enum=claude-code;codex;gemini;opencode;cursor
Type string `json:"type"`

Job builder switch (internal/controller/job_builder.go:111-125):

func (b *JobBuilder) Build(...) (*batchv1.Job, error) {
    switch task.Spec.Type {
    case AgentTypeClaudeCode:
        return b.buildAgentJob(task, workspace, agentConfig, b.ClaudeCodeImage, ...)
    case AgentTypeCodex:
        return b.buildAgentJob(task, workspace, agentConfig, b.CodexImage, ...)
    // ... one case per agent type
    default:
        return nil, fmt.Errorf("unsupported agent type: %s", task.Spec.Type)
    }
}

Credential env var mapping (internal/controller/job_builder.go:129-169):

func apiKeyEnvVar(agentType string) string {
    switch agentType {
    case AgentTypeCodex:
        return "CODEX_API_KEY"
    case AgentTypeGemini:
        return "GEMINI_API_KEY"
    // ... one case per agent type
    default:
        return "ANTHROPIC_API_KEY"
    }
}

Output usage parser (internal/capture/usage.go:33-46):

func ParseUsage(agentType, filePath string) map[string]string {
    switch agentType {
    case "claude-code":
        return parseClaudeCode(lines)
    case "codex":
        return parseCodex(lines)
    // ... one case per agent type
    default:
        return nil  // Unknown types get NO token/cost tracking
    }
}

Plus: image constants, image flags in the controller binary, JobBuilder struct fields, and a new Dockerfile + entrypoint script per agent.

2. The existing escape hatch has real limitations

Kelos already supports custom images via spec.image override and credentials.type: none for BYO credentials. But this workaround has three concrete problems:

a) Must declare one of 5 types even for custom agents:
Users running Aider or an internal agent must pick claude-code or another built-in type, which is semantically wrong:

spec:
  type: claude-code  # Actually running Aider; misleading
  image: ghcr.io/myorg/kelos-aider:latest
  credentials:
    type: none

b) KELOS_AGENT_TYPE is set to the wrong value:
The job builder injects KELOS_AGENT_TYPE=claude-code into the container environment. This breaks kelos-capture: it tries to parse the output as Claude Code's JSON format, which won't match Aider's output. Result: no token usage or cost tracking for custom agents, even if the agent provides this data.

c) No way to register a default image globally:
With built-in types, the controller provides a default image (set via flags). Custom agents must set spec.image on every Task or TaskTemplate; there's no way to say "when type is aider, always use ghcr.io/myorg/kelos-aider:latest."

3. The growth trajectory demands extensibility

The AI coding agent space is expanding rapidly. Agents that exist today or are emerging:

  • Aider: Open-source, supports any LLM backend
  • SWE-agent: Research-grade from Princeton
  • Amazon Q Developer: AWS-native
  • Goose: Block's open-source agent
  • Continue: Open-source IDE agent with CLI mode
  • Windsurf: Codeium's agent
  • Bolt: StackBlitz's agent
  • Internal/proprietary agents: Many enterprises build their own

Requiring a Kelos upstream release for each new agent creates a bottleneck. The agent image interface (docs/agent-image-interface.md) is already well-defined: any image implementing it can work with Kelos. The type system is the only thing preventing truly pluggable agents.

Proposed Solution: AgentType CRD

New CRD: AgentType

apiVersion: kelos.dev/v1alpha1
kind: AgentType
metadata:
  name: aider
spec:
  # Default container image for this agent type.
  # Can still be overridden per-Task via spec.image.
  image: ghcr.io/myorg/kelos-aider:v0.82.0

  # Credential environment variable mappings.
  # Maps credential type → env var name used to inject the secret value.
  credentialEnvVars:
    api-key: OPENAI_API_KEY
    oauth: OPENAI_AUTH_TOKEN

  # Output format for kelos-capture token/cost extraction.
  # "jsonl" uses a configurable, path-based JSON parser.
  # Omit for agents that emit KELOS_OUTPUTS markers directly.
  outputFormat:
    type: jsonl              # jsonl (one JSON object per line) or none
    eventType: "result"      # JSON objects with this "type" field value
    tokenPaths:
      inputTokens: "usage.input_tokens"   # Dot-separated path within the event
      outputTokens: "usage.output_tokens"
      costUSD: "total_cost_usd"           # Optional
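
The manifest above implies a Go API type. As a rough sketch only, the spec could be modeled as below; the struct and field names are assumptions derived from the YAML keys, not existing Kelos types, and a real definition would live in api/v1alpha1 with the usual Kubernetes object metadata and kubebuilder markers. The JSON constant is the same aider example, used here to check that the tags line up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TokenPaths holds the paths used to pull usage values out of an event.
// Hypothetical type: names follow the YAML keys in the proposal.
type TokenPaths struct {
	InputTokens  string `json:"inputTokens,omitempty"`
	OutputTokens string `json:"outputTokens,omitempty"`
	CostUSD      string `json:"costUSD,omitempty"`
}

// OutputFormat describes how kelos-capture would read agent output.
type OutputFormat struct {
	Type       string     `json:"type"`
	EventType  string     `json:"eventType,omitempty"`
	TokenPaths TokenPaths `json:"tokenPaths"`
}

// AgentTypeSpec mirrors the proposed spec: default image, credential
// env var mappings, and an optional output format.
type AgentTypeSpec struct {
	Image             string            `json:"image"`
	CredentialEnvVars map[string]string `json:"credentialEnvVars,omitempty"`
	OutputFormat      *OutputFormat     `json:"outputFormat,omitempty"`
}

// aiderSpecJSON is the JSON form of the aider spec shown above.
const aiderSpecJSON = `{
  "image": "ghcr.io/myorg/kelos-aider:v0.82.0",
  "credentialEnvVars": {"api-key": "OPENAI_API_KEY", "oauth": "OPENAI_AUTH_TOKEN"},
  "outputFormat": {
    "type": "jsonl",
    "eventType": "result",
    "tokenPaths": {
      "inputTokens": "usage.input_tokens",
      "outputTokens": "usage.output_tokens",
      "costUSD": "total_cost_usd"
    }
  }
}`

// decodeSpec parses the JSON form of an AgentType spec.
func decodeSpec(raw string) (AgentTypeSpec, error) {
	var spec AgentTypeSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	spec, err := decodeSpec(aiderSpecJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.Image)
	fmt.Println(spec.CredentialEnvVars["api-key"])
	fmt.Println(spec.OutputFormat.TokenPaths.InputTokens)
}
```

Making OutputFormat a pointer keeps "omitted" distinguishable from "present but empty", which matters for agents that emit KELOS_OUTPUTS markers directly.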

Usage in Task/TaskSpawner

apiVersion: kelos.dev/v1alpha1
kind: Task
metadata:
  name: fix-bug-with-aider
spec:
  type: aider                    # Resolved against AgentType CRD
  prompt: "Fix the failing test in pkg/auth/handler_test.go"
  credentials:
    type: api-key
    secretRef:
      name: openai-key           # Secret key name: OPENAI_API_KEY (from AgentType)
  workspaceRef:
    name: my-workspace

Complete example: Internal agent with custom output format

# Register the custom agent type
apiVersion: kelos.dev/v1alpha1
kind: AgentType
metadata:
  name: internal-coder
spec:
  image: registry.internal.co/ai/coder-agent:2.0.0
  credentialEnvVars:
    api-key: INTERNAL_API_KEY
  outputFormat:
    type: jsonl
    eventType: "completion"
    tokenPaths:
      inputTokens: "metrics.prompt_tokens"
      outputTokens: "metrics.completion_tokens"
      costUSD: "metrics.cost"
---
# TaskSpawner using the custom type
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: internal-bug-fixer
spec:
  when:
    githubIssues:
      labels: [bug, ai-eligible]
  taskTemplate:
    type: internal-coder         # References AgentType CRD
    credentials:
      type: api-key
      secretRef:
        name: internal-api-key
    workspaceRef:
      name: my-workspace
    branch: "kelos-{{.Number}}"
    promptTemplate: |
      Fix issue #{{.Number}}: {{.Title}}
      {{.Body}}
    ttlSecondsAfterFinished: 3600

Implementation Path

Phase 1: Relax type validation (minimal, backward-compatible)

  1. Remove the kubebuilder enum from TaskSpec.Type and TaskTemplate.Type. Replace it with webhook validation that accepts built-in types plus names matching existing AgentType resources.
  2. Fall through gracefully in job_builder.go:Build(): if type is not built-in, require spec.image to be set (return a clear error if missing), and set KELOS_AGENT_TYPE to the actual type string.
  3. Fall through gracefully in capture/usage.go:ParseUsage(): for unknown types, attempt a generic JSONL parser or return nil (same as today, but with the correct type name logged).

This alone unblocks custom agents with correct semantics: users declare their actual type name, set their image, and set credentials.type: none to supply their own credentials.
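
The Phase 1 fall-through in the job builder could look roughly like the sketch below. It is not the existing Build() code: resolveJobInputs and jobInputs are hypothetical names, and the built-in image values are placeholders standing in for whatever the controller's image flags supply.

```go
package main

import (
	"errors"
	"fmt"
)

// builtinImages stands in for the controller's per-agent default images.
// The values are placeholders, not the real image references.
var builtinImages = map[string]string{
	"claude-code": "example.invalid/kelos/claude-code:latest",
	"codex":       "example.invalid/kelos/codex:latest",
	"gemini":      "example.invalid/kelos/gemini:latest",
	"opencode":    "example.invalid/kelos/opencode:latest",
	"cursor":      "example.invalid/kelos/cursor:latest",
}

// jobInputs is the subset of Job settings that depends on the agent type.
type jobInputs struct {
	Image     string
	AgentType string // value injected as KELOS_AGENT_TYPE
}

// resolveJobInputs picks the image for a task. A per-task override always
// wins; built-in types fall back to their default image; any other type
// must bring its own image.
func resolveJobInputs(agentType, imageOverride string) (jobInputs, error) {
	if agentType == "" {
		return jobInputs{}, errors.New("spec.type must not be empty")
	}
	image := imageOverride
	if image == "" {
		def, ok := builtinImages[agentType]
		if !ok {
			return jobInputs{}, fmt.Errorf("agent type %q is not built in: spec.image is required", agentType)
		}
		image = def
	}
	// The actual type string is passed through, never a stand-in built-in.
	return jobInputs{Image: image, AgentType: agentType}, nil
}

func main() {
	in, err := resolveJobInputs("aider", "ghcr.io/myorg/kelos-aider:latest")
	fmt.Println(in.Image, in.AgentType, err)

	_, err = resolveJobInputs("aider", "")
	fmt.Println(err)
}
```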

Phase 2: AgentType CRD (full solution)

  1. Add the AgentType CRD with image, credentialEnvVars, and outputFormat fields.
  2. Task controller resolves AgentType before building the Job: fetch the AgentType resource, use its image as default, map credential types to env var names, and configure the capture parser.
  3. Webhook validation checks that spec.type is either a built-in type or matches an existing AgentType resource name.
  4. Add a kelos create agenttype CLI command.
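
Step 2 of this phase, resolving an AgentType before building the Job, can be sketched as follows. This is an illustration under assumptions: the in-memory map stands in for a lookup against the Kubernetes API, and AgentType, resolved, and resolve are hypothetical names. Treating credentials.type: none as "inject nothing" is also an assumption about the intended behavior.

```go
package main

import "fmt"

// AgentType is a trimmed-down stand-in for the proposed resource.
type AgentType struct {
	Image             string
	CredentialEnvVars map[string]string // credential type -> env var name
}

// resolved is what the job builder would need from an AgentType.
type resolved struct {
	Image        string
	SecretEnvVar string // empty when credentials.type is "none"
}

// resolve looks up typeName in the registry (a stand-in for an API lookup),
// applies the per-task image override, and maps the credential type to the
// env var that should carry the secret value.
func resolve(registry map[string]AgentType, typeName, imageOverride, credType string) (resolved, error) {
	at, ok := registry[typeName]
	if !ok {
		return resolved{}, fmt.Errorf("no AgentType named %q", typeName)
	}
	image := at.Image
	if imageOverride != "" {
		image = imageOverride // spec.image on the Task still wins
	}
	if image == "" {
		return resolved{}, fmt.Errorf("AgentType %q has no image and the task sets none", typeName)
	}
	out := resolved{Image: image}
	if credType != "none" {
		envVar, ok := at.CredentialEnvVars[credType]
		if !ok {
			return resolved{}, fmt.Errorf("AgentType %q does not map credential type %q", typeName, credType)
		}
		out.SecretEnvVar = envVar
	}
	return out, nil
}

func main() {
	registry := map[string]AgentType{
		"aider": {
			Image:             "ghcr.io/myorg/kelos-aider:v0.82.0",
			CredentialEnvVars: map[string]string{"api-key": "OPENAI_API_KEY", "oauth": "OPENAI_AUTH_TOKEN"},
		},
	}
	r, err := resolve(registry, "aider", "", "api-key")
	fmt.Println(r.Image, r.SecretEnvVar, err)
}
```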

Phase 3: Output format extensibility

  1. Generic JSONL parser in kelos-capture: configurable via KELOS_OUTPUT_FORMAT env var (set from AgentType.spec.outputFormat).
  2. Built-in parsers remain for the 5 first-class types (no regression).
  3. Custom parsers can be added by mounting a parser script in the agent image.
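
A generic JSONL parser along these lines might be enough for step 1. The sketch below is an assumption-laden illustration, not the kelos-capture implementation: parseGeneric, lookup, and the OutputFormat shape are hypothetical, paths are treated as plain dot-separated keys rather than full JSONPath, and the result keys simply reuse the tokenPaths names.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// OutputFormat is the parser configuration taken from an AgentType.
type OutputFormat struct {
	EventType  string
	TokenPaths map[string]string // result key -> dot-separated path
}

// lookup walks a dot-separated path through nested JSON objects.
func lookup(obj map[string]any, path string) (any, bool) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = m[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// parseGeneric scans JSONL output and extracts the configured values from
// every event whose "type" field equals f.EventType. Later events overwrite
// earlier ones. Lines that are not JSON objects are skipped.
func parseGeneric(output string, f OutputFormat) map[string]string {
	result := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(output))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow lines up to 1 MiB
	for sc.Scan() {
		var event map[string]any
		if err := json.Unmarshal(sc.Bytes(), &event); err != nil {
			continue
		}
		if t, _ := event["type"].(string); t != f.EventType {
			continue
		}
		for name, path := range f.TokenPaths {
			v, ok := lookup(event, path)
			if !ok {
				continue
			}
			switch val := v.(type) {
			case float64: // JSON numbers decode to float64
				result[name] = strconv.FormatFloat(val, 'f', -1, 64)
			case string:
				result[name] = val
			}
		}
	}
	return result
}

func main() {
	format := OutputFormat{
		EventType: "result",
		TokenPaths: map[string]string{
			"inputTokens":  "usage.input_tokens",
			"outputTokens": "usage.output_tokens",
			"costUSD":      "total_cost_usd",
		},
	}
	out := "starting agent\n" +
		`{"type":"progress","usage":{"input_tokens":1}}` + "\n" +
		`{"type":"result","usage":{"input_tokens":1200,"output_tokens":340},"total_cost_usd":0.0421}` + "\n"
	fmt.Println(parseGeneric(out, format))
}
```

Formatting numbers with strconv.FormatFloat avoids the exponent notation that fmt's default float formatting produces for large token counts.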

Backward Compatibility

  • Existing tasks: All 5 built-in types continue to work exactly as today. They use hardcoded images, credential mappings, and output parsers. No migration needed.
  • Existing TaskSpawners: No changes required. The type field continues to accept all current values.
  • CRD upgrade: Phase 1 (removing the enum) is a CRD schema relaxation, not a breaking change. Kubernetes allows widening validation.
  • AgentType is additive: It's a new CRD that doesn't modify existing resources.
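
The compatibility claim reduces to one validation rule: the five current values are always accepted, and anything else must match a registered AgentType. A minimal sketch of that check, assuming a hypothetical validateType helper and a set of registered names supplied by the caller rather than a real webhook:

```go
package main

import "fmt"

// builtinTypes lists the five values the current enum accepts.
var builtinTypes = map[string]bool{
	"claude-code": true,
	"codex":       true,
	"gemini":      true,
	"opencode":    true,
	"cursor":      true,
}

// validateType accepts every built-in type unconditionally, so existing
// Tasks and TaskSpawners keep validating, and accepts other names only
// when a matching AgentType is registered.
func validateType(name string, registered map[string]bool) error {
	if builtinTypes[name] || registered[name] {
		return nil
	}
	return fmt.Errorf("spec.type %q is neither a built-in type nor a registered AgentType", name)
}

func main() {
	registered := map[string]bool{"aider": true}
	fmt.Println(validateType("claude-code", nil))
	fmt.Println(validateType("aider", registered))
	fmt.Println(validateType("goose", registered))
}
```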

Why this matters for adoption

  1. Enterprise teams with internal agents can adopt Kelos without forking it
  2. Open-source agent builders can provide Kelos-compatible images + AgentType manifests
  3. Reduces maintenance burden: the Kelos team doesn't need to add and maintain agent-specific code for every new agent
  4. Aligns with the agent image interface: the interface is already well-defined and agent-agnostic; the type system should match

Related

  • docs/agent-image-interface.md: Already defines the contract custom images must implement
  • internal/capture/usage.go: Per-agent output parsers that would benefit from extensibility
  • internal/controller/job_builder.go:173 comment: "new providers (e.g. Vertex) only need to add a case here" confirms the team expects more agent types
