
Session logs: classify environment/tool failures before blaming the model #2022

@Hmbown

Problem

Some turns look like the model is lost when the surrounding evidence points to the environment, tool layer, or session lifecycle instead. Users and maintainers need a way to separate model-quality failures from tool/runtime failures, using only redacted data, before a report is triaged.

Evidence from maintainer-private local CodeWhale session logs, scanned 2026-05-24:

  • 32 CodeWhale JSONL session files across 26 session ids were inspected.
  • 207 tool calls exited non-zero in the inspected logs.
  • 43 failures matched network or remote-service symptoms.
  • 34 failures matched permission, sandbox, or approval symptoms.
  • 36 failures matched missing-path or missing-binary symptoms.
  • 16 started turns had no matching task_complete event in the inspected logs.

No prompts, raw tool outputs, secrets, absolute local paths, or user text are copied here. The point is the failure shape, not the private conversation content.
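
As an illustration of the last count above, unclosed turns could be detected roughly as follows. This is a minimal sketch, not CodeWhale's actual log schema: the `type` and `turn_id` fields and the `turn_started` event name are assumptions made for the example; only `task_complete` comes from the description above.

```python
import json
from typing import Iterable, Set


def unclosed_turns(lines: Iterable[str]) -> Set[str]:
    """Return ids of turns that started but never logged task_complete.

    Assumed (hypothetical) event shape: {"type": ..., "turn_id": ...}.
    """
    open_turns: Set[str] = set()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate truncated or partially written lines
        if not isinstance(event, dict):
            continue
        turn_id = event.get("turn_id")
        if not isinstance(turn_id, str):
            continue  # skip events that carry no usable turn id
        kind = event.get("type")
        if kind == "turn_started":
            open_turns.add(turn_id)
        elif kind == "task_complete":
            open_turns.discard(turn_id)
    return open_turns
```

Only turn ids leave the function; prompts and tool output are never read beyond the two assumed fields.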

Desired Behavior

CodeWhale should make this distinction visible and reusable:

  • A redacted session-log analyzer can summarize failure categories from local JSONL logs.
  • Tool receipts classify the likely source of each failure: model, tool schema, command exit, network, sandbox/approval, missing dependency, timeout, background job, or unknown.
  • /status, Activity Detail, handoff, or bug-report helpers can show a short "environment suspect" summary without exposing sensitive content.
  • Failure summaries preserve enough source metadata for maintainers to find the private local evidence when they have access.
  • Default public issue text must never include prompts, secrets, raw command output, full local paths, or conversation transcripts.
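
One possible shape for the classifier described above, as a hedged sketch. The symptom patterns, the `classify` and `summarize` names, and the rule order are all assumptions for illustration. Only a subset of the listed sources is covered, because `model`, `tool schema`, and `background job` probably cannot be inferred from an exit code and error text alone.

```python
import re
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, Optional, Tuple


class FailureSource(str, Enum):
    NETWORK = "network"
    SANDBOX_APPROVAL = "sandbox/approval"
    MISSING_DEPENDENCY = "missing dependency"
    TIMEOUT = "timeout"
    COMMAND_EXIT = "command exit"
    UNKNOWN = "unknown"


# Hypothetical symptom patterns; the first matching rule wins.
_RULES = [
    (FailureSource.NETWORK, re.compile(
        r"connection (refused|reset|timed out)|could not resolve host"
        r"|network is unreachable", re.I)),
    (FailureSource.TIMEOUT, re.compile(r"timed out|timeout", re.I)),
    (FailureSource.SANDBOX_APPROVAL, re.compile(
        r"permission denied|operation not permitted|sandbox|approval", re.I)),
    (FailureSource.MISSING_DEPENDENCY, re.compile(
        r"command not found|no such file or directory", re.I)),
]


def classify(exit_code: Optional[int], stderr: str) -> FailureSource:
    """Guess the likely source of one failed tool call."""
    for source, pattern in _RULES:
        if pattern.search(stderr):
            return source
    if exit_code not in (None, 0):
        return FailureSource.COMMAND_EXIT  # failed, but no known symptom
    return FailureSource.UNKNOWN


def summarize(failures: Iterable[Tuple[Optional[int], str]]) -> Dict[str, int]:
    """Aggregate counts per category; the raw error text is never returned."""
    return dict(Counter(classify(code, err).value for code, err in failures))
```

The summary contains category names and counts only, which matches the requirement that default public text never includes raw command output.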

Acceptance Criteria

  • Synthetic session logs with non-zero tool exits, network errors, sandbox denials, missing binaries, and unclosed turn spans classify correctly.
  • The classifier emits aggregate counts and redacted source handles by default.
  • Activity Detail or an adjacent diagnostic surface can explain "this likely failed in the environment/tool layer" before the model is blamed.
  • Bug-report export has a privacy-first mode that includes categories and timestamps but not raw content.
  • Existing logs remain readable; no migration should be required for older JSONL sessions.
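
The "redacted source handle" and privacy-first export criteria above could be met along these lines. This is a sketch under stated assumptions: the `source_handle` and `redacted_record` helpers, the salted-hash scheme, and the record fields are hypothetical, not an existing CodeWhale API.

```python
import hashlib
from typing import Dict


def source_handle(session_id: str, line_no: int, salt: str) -> str:
    """Build an opaque handle for one log line.

    A maintainer who holds the local logs and the salt can recompute the
    handle to find the evidence; the session id itself is not exposed.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).hexdigest()
    return f"{digest[:12]}#L{line_no}"


def redacted_record(session_id: str, line_no: int, category: str,
                    timestamp: str, salt: str) -> Dict[str, str]:
    """Export only the category, timestamp, and handle for one failure."""
    return {
        "handle": source_handle(session_id, line_no, salt),
        "category": category,
        "timestamp": timestamp,
    }
```

Keeping the salt local means a handle published in an issue cannot be matched against guessed session ids by someone without access to the private logs.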

Metadata

Labels

  • bug: Something isn't working
  • context: Context management / context
  • enhancement: New feature or request
