API: Add AgentType CRD for user-defined agent types without requiring upstream code changes #927

🤖 **Kelos Strategist Agent** @gjkim42

## Area: New CRDs & API Extensions

## Summary

Kelos's agent type system is hardcoded across 4+ source files: every new agent requires changes to the CRD enum, job builder, credential mapper, and output parser. The project now ships 5 built-in types (claude-code → codex → gemini → opencode → cursor), and each addition followed the same mechanical pattern. Meanwhile, the AI coding agent landscape is rapidly expanding (Aider, SWE-agent, Goose, Amazon Q Developer, Continue, Windsurf, and many internal/proprietary agents). This proposal introduces an `AgentType` CRD that lets users register custom agent types declaratively, making Kelos agent-agnostic without requiring an upstream release for every new agent.

## Problem

### 1. Adding a new agent type requires changes across 4+ files

Each new agent type touches the same set of files:

**CRD enum validation** (`api/v1alpha1/task_types.go:89`):
```go
// +kubebuilder:validation:Enum=claude-code;codex;gemini;opencode;cursor
Type string `json:"type"`
```

**Job builder switch** (`internal/controller/job_builder.go:111-125`):
```go
func (b *JobBuilder) Build(...) (*batchv1.Job, error) {
    switch task.Spec.Type {
    case AgentTypeClaudeCode:
        return b.buildAgentJob(task, workspace, agentConfig, b.ClaudeCodeImage, ...)
    case AgentTypeCodex:
        return b.buildAgentJob(task, workspace, agentConfig, b.CodexImage, ...)
    // ... one case per agent type
    default:
        return nil, fmt.Errorf("unsupported agent type: %s", task.Spec.Type)
    }
}
```

**Credential env var mapping** (`internal/controller/job_builder.go:129-169`):
```go
func apiKeyEnvVar(agentType string) string {
    switch agentType {
    case AgentTypeCodex:
        return "CODEX_API_KEY"
    case AgentTypeGemini:
        return "GEMINI_API_KEY"
    // ... one case per agent type
    default:
        return "ANTHROPIC_API_KEY"
    }
}
```

**Output usage parser** (`internal/capture/usage.go:33-46`):
```go
func ParseUsage(agentType, filePath string) map[string]string {
    switch agentType {
    case "claude-code":
        return parseClaudeCode(lines)
    case "codex":
        return parseCodex(lines)
    // ... one case per agent type
    default:
        return nil  // Unknown types get NO token/cost tracking
    }
}
```

Plus: image constants, image flags in the controller binary, JobBuilder struct fields, and a new Dockerfile + entrypoint script per agent.

### 2. The existing escape hatch has real limitations

Kelos already supports custom images via `spec.image` override and `credentials.type: none` for BYO credentials. But this workaround has three concrete problems:

**a) Must declare one of 5 types even for custom agents:**
Users running Aider or an internal agent must pick `claude-code` or another built-in type, which is semantically wrong:
```yaml
spec:
  type: claude-code  # Actually running Aider — misleading
  image: ghcr.io/myorg/kelos-aider:latest
  credentials:
    type: none
```

**b) `KELOS_AGENT_TYPE` is set to the wrong value:**
The job builder injects `KELOS_AGENT_TYPE=claude-code` into the container environment. This breaks `kelos-capture` — it tries to parse the output as Claude Code's JSON format, which won't match Aider's output. Result: **no token usage or cost tracking** for custom agents, even if the agent provides this data.

**c) No way to register a default image globally:**
With built-in types, the controller provides a default image (set via flags). Custom agents must set `spec.image` on every Task or TaskTemplate — there's no way to say "when type is `aider`, always use `ghcr.io/myorg/kelos-aider:latest`."

### 3. The growth trajectory demands extensibility

The AI coding agent space is expanding rapidly. Agents that exist today or are emerging:
- **Aider** — Open-source, supports any LLM backend
- **SWE-agent** — Research-grade from Princeton
- **Amazon Q Developer** — AWS-native
- **Goose** — Block's open-source agent
- **Continue** — Open-source IDE agent with CLI mode
- **Windsurf** — Codeium's agent
- **Bolt** — StackBlitz's agent
- **Internal/proprietary agents** — Many enterprises build their own

Requiring a Kelos upstream release for each new agent creates a bottleneck. The agent image interface (`docs/agent-image-interface.md`) is already well-defined — any image implementing it can work with Kelos. The type system is the only thing preventing truly pluggable agents.

## Proposed Solution: AgentType CRD

### New CRD: AgentType

```yaml
apiVersion: kelos.dev/v1alpha1
kind: AgentType
metadata:
  name: aider
spec:
  # Default container image for this agent type.
  # Can still be overridden per-Task via spec.image.
  image: ghcr.io/myorg/kelos-aider:v0.82.0

  # Credential environment variable mappings.
  # Maps credential type → env var name used to inject the secret value.
  credentialEnvVars:
    api-key: OPENAI_API_KEY
    oauth: OPENAI_AUTH_TOKEN

  # Output format for kelos-capture token/cost extraction.
  # "jsonl" uses a configurable path-based parser over JSON Lines output.
  # Omit for agents that emit KELOS_OUTPUTS markers directly.
  outputFormat:
    type: jsonl              # jsonl (one JSON object per line) or none
    eventType: "result"      # JSON objects with this "type" field value
    tokenPaths:
      inputTokens: "usage.input_tokens"   # Dot-separated path within the event
      outputTokens: "usage.output_tokens"
      costUSD: "total_cost_usd"           # Optional
```
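
For discussion, the manifest above could map onto Go API types roughly as follows. This is a minimal sketch, not the proposed implementation: the type names (`AgentTypeSpec`, `OutputFormat`, `TokenPaths`) and the `decodeSpec` helper are hypothetical, and only the JSON field names are taken from the YAML. A real CRD type would also carry kubebuilder markers and `metav1` metadata, which are omitted here so the example stays standard-library only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentTypeSpec mirrors the spec block of the manifest above.
// The Go names are hypothetical; the JSON tags follow the YAML keys.
type AgentTypeSpec struct {
	// Image is the default container image; a Task's spec.image may override it.
	Image string `json:"image"`
	// CredentialEnvVars maps a credential type (e.g. "api-key") to an env var name.
	CredentialEnvVars map[string]string `json:"credentialEnvVars,omitempty"`
	// OutputFormat is optional; nil means the agent emits KELOS_OUTPUTS markers itself.
	OutputFormat *OutputFormat `json:"outputFormat,omitempty"`
}

// OutputFormat describes how kelos-capture would find usage data.
type OutputFormat struct {
	Type       string     `json:"type"` // "jsonl" or "none"
	EventType  string     `json:"eventType,omitempty"`
	TokenPaths TokenPaths `json:"tokenPaths,omitempty"`
}

// TokenPaths holds dot-separated paths into a matching event.
type TokenPaths struct {
	InputTokens  string `json:"inputTokens,omitempty"`
	OutputTokens string `json:"outputTokens,omitempty"`
	CostUSD      string `json:"costUSD,omitempty"`
}

// decodeSpec parses the JSON form of an AgentType spec.
func decodeSpec(raw []byte) (AgentTypeSpec, error) {
	var spec AgentTypeSpec
	err := json.Unmarshal(raw, &spec)
	return spec, err
}

func main() {
	raw := []byte(`{
	  "image": "ghcr.io/myorg/kelos-aider:v0.82.0",
	  "credentialEnvVars": {"api-key": "OPENAI_API_KEY", "oauth": "OPENAI_AUTH_TOKEN"},
	  "outputFormat": {
	    "type": "jsonl",
	    "eventType": "result",
	    "tokenPaths": {"inputTokens": "usage.input_tokens", "outputTokens": "usage.output_tokens"}
	  }
	}`)
	spec, err := decodeSpec(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(spec.Image, spec.CredentialEnvVars["api-key"], spec.OutputFormat.TokenPaths.InputTokens)
}
```

Making `OutputFormat` a pointer keeps "omitted" distinguishable from "present but empty", which matches the manifest comment about agents that emit markers directly.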

### Usage in Task/TaskSpawner

```yaml
apiVersion: kelos.dev/v1alpha1
kind: Task
metadata:
  name: fix-bug-with-aider
spec:
  type: aider                    # Resolved against AgentType CRD
  prompt: "Fix the failing test in pkg/auth/handler_test.go"
  credentials:
    type: api-key
    secretRef:
      name: openai-key           # Secret key name: OPENAI_API_KEY (from AgentType)
  workspaceRef:
    name: my-workspace
```
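
The lookup behind the `OPENAI_API_KEY` comment above might look like the sketch below. It assumes the controller already holds the `credentialEnvVars` map from the referenced AgentType; the function name `credentialEnvVar` and the error wording are hypothetical. Treating `credentials.type: none` as "inject nothing" follows the BYO-credentials behavior described earlier in this proposal.

```go
package main

import "fmt"

// credentialEnvVar returns the env var name that should receive the secret
// value for the given credential type. mapping is the AgentType's
// credentialEnvVars field. An empty result with a nil error means nothing
// should be injected.
func credentialEnvVar(agentType, credType string, mapping map[string]string) (string, error) {
	if credType == "none" {
		// BYO credentials: the image handles auth on its own.
		return "", nil
	}
	name, ok := mapping[credType]
	if !ok || name == "" {
		return "", fmt.Errorf("agent type %q declares no env var for credential type %q", agentType, credType)
	}
	return name, nil
}

func main() {
	aider := map[string]string{
		"api-key": "OPENAI_API_KEY",
		"oauth":   "OPENAI_AUTH_TOKEN",
	}
	name, err := credentialEnvVar("aider", "api-key", aider)
	fmt.Println(name, err)

	// A credential type the AgentType does not declare is rejected.
	_, err = credentialEnvVar("aider", "token", aider)
	fmt.Println(err)
}
```

Returning an error for an undeclared credential type, rather than falling back to a default such as `ANTHROPIC_API_KEY`, is a design choice in this sketch; it avoids silently injecting a secret under a name the agent never reads.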

### Complete example: Internal agent with custom output format

```yaml
# Register the custom agent type
apiVersion: kelos.dev/v1alpha1
kind: AgentType
metadata:
  name: internal-coder
spec:
  image: registry.internal.co/ai/coder-agent:2.0.0
  credentialEnvVars:
    api-key: INTERNAL_API_KEY
  outputFormat:
    type: jsonl
    eventType: "completion"
    tokenPaths:
      inputTokens: "metrics.prompt_tokens"
      outputTokens: "metrics.completion_tokens"
      costUSD: "metrics.cost"
---
# TaskSpawner using the custom type
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: internal-bug-fixer
spec:
  when:
    githubIssues:
      labels: [bug, ai-eligible]
  taskTemplate:
    type: internal-coder         # References AgentType CRD
    credentials:
      type: api-key
      secretRef:
        name: internal-api-key
    workspaceRef:
      name: my-workspace
    branch: "kelos-{{.Number}}"
    promptTemplate: |
      Fix issue #{{.Number}}: {{.Title}}
      {{.Body}}
    ttlSecondsAfterFinished: 3600
```

## Implementation Path

### Phase 1: Relax type validation (minimal, backward-compatible)

1. **Remove the kubebuilder enum** from `TaskSpec.Type` and `TaskTemplate.Type`. Replace with a webhook validation that accepts built-in types + names matching existing `AgentType` resources.
2. **Fall through gracefully** in `job_builder.go:Build()`: if type is not built-in, require `spec.image` to be set (return a clear error if missing), and set `KELOS_AGENT_TYPE` to the actual type string.
3. **Fall through gracefully** in `capture/usage.go:ParseUsage()`: for unknown types, attempt a generic JSONL parser or return nil (same as today, but with the correct type name logged).

This alone unblocks custom agents with correct semantics: users declare their actual type name, set their image, and use `credentials.type: none` for credentials.
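
The Phase 1 fall-through in the job builder can be sketched as below. This is an illustration under stated assumptions, not the actual `Build()` code: `resolveImage` and `agentTypeEnv` are hypothetical helpers, and the `builtinImages` map with its `example.invalid` image references is a placeholder for the defaults the controller receives via flags.

```go
package main

import "fmt"

// builtinImages stands in for the controller's image flags.
// The image references are placeholders, not the real defaults.
var builtinImages = map[string]string{
	"claude-code": "example.invalid/claude-code:latest",
	"codex":       "example.invalid/codex:latest",
}

// resolveImage picks the image for a Task. A per-Task spec.image always wins,
// built-in types fall back to the controller default, and any other type must
// bring its own image.
func resolveImage(agentType, specImage string) (string, error) {
	if specImage != "" {
		return specImage, nil
	}
	if img, ok := builtinImages[agentType]; ok {
		return img, nil
	}
	return "", fmt.Errorf("agent type %q is not built in: spec.image must be set", agentType)
}

// agentTypeEnv returns the env entry the job builder would inject, using the
// declared type verbatim instead of a built-in stand-in.
func agentTypeEnv(agentType string) string {
	return "KELOS_AGENT_TYPE=" + agentType
}

func main() {
	img, err := resolveImage("aider", "ghcr.io/myorg/kelos-aider:latest")
	fmt.Println(img, err, agentTypeEnv("aider"))

	// A custom type without an image produces a clear error.
	_, err = resolveImage("aider", "")
	fmt.Println(err)
}
```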

### Phase 2: AgentType CRD (full solution)

1. **Add the AgentType CRD** with `image`, `credentialEnvVars`, and `outputFormat` fields.
2. **Task controller resolves AgentType** before building the Job: fetch the AgentType resource, use its image as default, map credential types to env var names, and configure the capture parser.
3. **Webhook validation** checks that `spec.type` is either a built-in type or matches an existing `AgentType` resource name.
4. **Add a `kelos create agenttype` CLI command** for creating `AgentType` resources.
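
The webhook check in step 3 reduces to a set-membership test. The sketch below assumes the set of registered AgentType names has already been fetched; a real webhook would query the API server, and whether that lookup is cluster-wide or per-namespace is left open here. `validateAgentType` and `builtinTypes` are hypothetical names.

```go
package main

import "fmt"

// builtinTypes lists the five first-class agent types from the current enum.
var builtinTypes = map[string]struct{}{
	"claude-code": {},
	"codex":       {},
	"gemini":      {},
	"opencode":    {},
	"cursor":      {},
}

// validateAgentType accepts a type name if it is built in or matches the name
// of a registered AgentType resource.
func validateAgentType(name string, registered map[string]struct{}) error {
	if name == "" {
		return fmt.Errorf("spec.type must not be empty")
	}
	if _, ok := builtinTypes[name]; ok {
		return nil
	}
	if _, ok := registered[name]; ok {
		return nil
	}
	return fmt.Errorf("spec.type %q is neither a built-in type nor a registered AgentType", name)
}

func main() {
	registered := map[string]struct{}{"aider": {}, "internal-coder": {}}
	for _, name := range []string{"codex", "aider", "goose"} {
		if err := validateAgentType(name, registered); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", name)
	}
}
```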

### Phase 3: Output format extensibility

1. **Generic JSONL parser** in `kelos-capture`: configurable via `KELOS_OUTPUT_FORMAT` env var (set from AgentType.spec.outputFormat).
2. **Built-in parsers remain** for the 5 first-class types (no regression).
3. **Custom parsers** can be added by mounting a parser script in the agent image.

## Backward Compatibility

- **Existing tasks**: All 5 built-in types continue to work exactly as today. They use hardcoded images, credential mappings, and output parsers. No migration needed.
- **Existing TaskSpawners**: No changes required. The `type` field continues to accept all current values.
- **CRD upgrade**: Phase 1 (removing the enum) is a CRD schema relaxation, not a breaking change. Kubernetes allows widening validation.
- **AgentType is additive**: It's a new CRD that doesn't modify existing resources.

## Why this matters for adoption

1. **Enterprise teams** with internal agents can adopt Kelos without forking it
2. **Open-source agent builders** can provide Kelos-compatible images + AgentType manifests
3. **Reduces maintenance burden** — the Kelos team doesn't need to add and maintain agent-specific code for every new agent
4. **Aligns with the agent image interface** — the interface is already well-defined and agent-agnostic; the type system should match

## Related

- `docs/agent-image-interface.md` — Already defines the contract custom images must implement
- `internal/capture/usage.go` — Per-agent output parsers that would benefit from extensibility
- `internal/controller/job_builder.go:173` comment — `"new providers (e.g. Vertex) only need to add a case here"` confirms the team expects more agent types
