Skip to content

feat(web-ui): per-gate evidence display and run history (#567)#575

Merged
frankbria merged 3 commits intomainfrom
feat/567-proof-evidence-run-history
Apr 9, 2026
Merged

feat(web-ui): per-gate evidence display and run history (#567)#575
frankbria merged 3 commits intomainfrom
feat/567-proof-evidence-run-history

Conversation

@frankbria
Copy link
Copy Markdown
Owner

@frankbria frankbria commented Apr 9, 2026

Summary

Closes #567

  • Backend: proof_runs table added to SQLite ledger; save_run, list_runs, and get_run_evidence functions populate and query it. runner.py now records start/end time, duration, and pass/fail for each run.
  • Two new API endpoints: GET /api/v2/proof/runs (list last N runs) and GET /api/v2/proof/runs/{run_id}/evidence (per-gate evidence with artifact file content).
  • GateEvidencePanel: expandable per-gate row showing raw artifact text in a scrollable <pre> block; truncates at 200 lines with "Show full output" toggle.
  • RunHistoryPanel: "Recent Runs" table showing last 5 runs with timestamp, pass/fail badge, duration, triggered-by; clicking a row sets the selected run.
  • Both new components are wired into the PROOF9 page (RunHistoryPanel) and the requirement detail page (GateEvidencePanel for latest run evidence).

Test plan

  • uv run pytest tests/ui/test_proof_v2.py — 47 tests, all pass (14 new)
  • cd web-ui && npm test — 703 tests, all pass (16 new across 2 new files)
  • cd web-ui && npm run build — clean build, no TypeScript errors
  • uv run ruff check — no lint issues

Summary by CodeRabbit

  • New Features

    • Proof run history view: recent executions with timestamps, status, duration, trigger source, and pagination/limit
    • Run details view: per-gate evidence with captured artifact text (truncation + "Show full output")
    • UI components: collapsible Gate Evidence panel and Run History panel; select runs to load evidence
  • API

    • Added read endpoints for listing runs and fetching per-run evidence
  • Tests

    • Added integration and UI tests covering endpoints and new components
  • Docs

    • Updated roadmap and architecture docs to reflect shipped UI and API features

)

- Backend: add proof_runs table to ledger (save_run, list_runs, get_run_evidence)
- Backend: populate proof_runs on every run in runner.py
- Backend: two new endpoints — GET /api/v2/proof/runs and GET /api/v2/proof/runs/{id}/evidence
- Frontend: ProofRunSummary, ProofEvidenceWithContent, ProofRunDetail types
- Frontend: GateEvidencePanel — expandable per-gate artifact text, 200-line truncation with toggle
- Frontend: RunHistoryPanel — last 5 runs table with pass/fail badge, click to select
- Frontend: wire RunHistoryPanel into /proof page, GateEvidencePanel into /proof/[req_id] page
- Tests: 14 new backend tests (47 total, all pass); 2 new frontend test files (703 total, all pass)
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 9, 2026

Walkthrough

Adds persistent run records and per-run evidence viewing: new ProofRun model, proof_runs table and ledger APIs; runner persists run metadata (timestamps, duration, outcome); new API endpoints to list runs and fetch run evidence; UI components/pages to show recent runs and expandable per-gate artifact text.

Changes

Cohort / File(s) Summary
Backend models & ledger
codeframe/core/proof/models.py, codeframe/core/proof/ledger.py
Add ProofRun dataclass; create proof_runs table; adjust lazy-init to include proof_runs; add save_run, get_run, list_runs, get_run_evidence.
Backend runner
codeframe/core/proof/runner.py
Capture timezone-aware started_at/completed_at, compute duration_ms, compute overall_passed, and persist run via ledger.save_run.
API router & tests
codeframe/ui/routers/proof_v2.py, tests/ui/test_proof_v2.py
Add GET /api/v2/proof/runs and GET /api/v2/proof/runs/{run_id}/evidence, new response models and artifact-text helper; add integration tests for listing runs and run-evidence responses.
Frontend types & client
web-ui/src/types/index.ts, web-ui/src/lib/api.ts
Add ProofRunSummary, ProofRunDetail, ProofEvidenceWithContent types; add proofApi.listRuns() and proofApi.getRunDetail().
Frontend components & exports
web-ui/src/components/proof/GateEvidencePanel.tsx, web-ui/src/components/proof/RunHistoryPanel.tsx, web-ui/src/components/proof/index.ts
Add GateEvidencePanel (expandable per-gate artifact text with 200-line truncation) and RunHistoryPanel (recent runs list, selection/highlight); re-export both.
Frontend pages & wiring
web-ui/src/app/proof/page.tsx, web-ui/src/app/proof/[req_id]/page.tsx
Integrate RunHistoryPanel, add selected-run state and SWR fetch for run detail; fetch latest-run evidence for per-req display and render GateEvidencePanel.
Frontend tests & mocks
web-ui/src/__tests__/components/proof/*.test.tsx, web-ui/src/__tests__/components/proof/ProofPage.test.tsx
Add tests for GateEvidencePanel, RunHistoryPanel; extend Jest mocks to include added proof components.
Docs & roadmap
CLAUDE.md, docs/PHASE_2_CLI_API_MAPPING.md, docs/PHASE_3_UI_ARCHITECTURE.md, docs/PRODUCT_ROADMAP.md
Document new PROOF9 commands, UI views, and mark Phase 3.5B items as shipped / update roadmap and architecture notes.

Sequence Diagram

sequenceDiagram
    participant User
    participant Client
    participant API
    participant Runner
    participant Ledger
    participant DB

    User->>Client: Trigger run (POST /api/v2/proof/run)
    Client->>API: POST /api/v2/proof/run
    API->>Runner: start run
    Runner->>Runner: execute gates (capture started_at)
    Runner->>Runner: compute completed_at, duration_ms, overall_passed
    Runner->>Ledger: save_run(workspace, ProofRun)
    Ledger->>DB: INSERT/REPLACE proof_runs
    DB-->>Ledger: OK
    Ledger-->>Runner: OK
    Runner-->>API: run complete
    API-->>Client: run-complete response

    User->>Client: Request run history
    Client->>API: GET /api/v2/proof/runs?limit=5
    API->>Ledger: list_runs(workspace, limit)
    Ledger->>DB: SELECT ... ORDER BY started_at DESC
    DB-->>Ledger: runs
    Ledger-->>API: runs
    API-->>Client: runs

    User->>Client: Select past run
    Client->>API: GET /api/v2/proof/runs/{run_id}/evidence
    API->>Ledger: get_run_evidence(workspace, run_id)
    Ledger->>DB: SELECT from proof_evidence WHERE run_id, workspace_id
    DB-->>Ledger: evidence rows
    Ledger-->>API: evidence
    API-->>Client: run detail (with artifact_text)
    Client->>UI: render GateEvidencePanel
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 I hop, I stamp each run with care,

Five bright histories kept in my lair,
Gates unfold logs in collapsible rows,
Truncated lines peek where curiosity goes,
Click to reveal — the full output glows!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat(web-ui): per-gate evidence display and run history (#567)' directly and clearly summarizes the main changes, matching the core objectives of adding evidence display and run history features.
Linked Issues check ✅ Passed All requirements from #567 are met: per-gate evidence display with 200-line truncation implemented in GateEvidencePanel; run history panel showing last 5 runs with pass/fail badges and timestamps implemented in RunHistoryPanel; backend endpoints GET /api/v2/proof/runs and GET /api/v2/proof/runs/{run_id}/evidence implemented with proper database persistence; integration into PROOF9 page completed.
Out of Scope Changes check ✅ Passed All changes directly support the #567 requirements: backend ledger APIs for run persistence, ProofRun model, runner modifications for tracking, frontend components for evidence and history display, integration into pages, and documentation updates. No extraneous changes detected outside the stated objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/567-proof-evidence-run-history

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Review: feat(web-ui): per-gate evidence display and run history (#567)

This is a solid Phase 3.5B delivery. The backend/frontend split is clean, test coverage is thorough (14 new backend + 16 new frontend tests), and the SWR-based data fetching integrates well with the existing proof page patterns. A few issues worth addressing before merge:


Security — Path traversal in _read_artifact_text (blocker)

proof_v2.py:

def _read_artifact_text(artifact_path: str) -> Optional[str]:
    from pathlib import Path
    try:
        p = Path(artifact_path)
        if p.exists():
            return p.read_text(errors="replace")
        return None
    except Exception:
        return None

The artifact_path comes from the database (not directly from user input), but an attacker with write access to the evidence table could store a path like /etc/passwd or any sensitive file. At minimum, validate the path is inside the workspace directory before reading:

def _read_artifact_text(artifact_path: str, workspace_root: Path) -> Optional[str]:
    try:
        p = Path(artifact_path).resolve()
        if not str(p).startswith(str(workspace_root.resolve())):
            return None
        if p.exists():
            return p.read_text(errors="replace")
        return None
    except Exception:
        return None

Also: no size cap. A 100MB artifact file will be read entirely into memory and sent to the browser. Consider capping at e.g. 500KB.


Missing get_run ledger function

get_run_evidence_endpoint does this to look up a single run:

runs = list_runs(workspace, limit=100)
run = next((r for r in runs if r.run_id == run_id), None)

This is an O(N) scan capped at 100. A run outside that window returns 404 even if it exists. Add a get_run(workspace, run_id) -> Optional[ProofRun] to ledger.py with a direct WHERE run_id = ? AND workspace_id = ? query.


workspacePath type mismatch in RunHistoryPanel

The prop is typed as string:

interface RunHistoryPanelProps {
  workspacePath: string;

But the caller passes string | null:

const [workspacePath, setWorkspacePath] = useState<string | null>(null);
// ...
<RunHistoryPanel workspacePath={workspacePath} .../>

The SWR null-key guard handles the runtime case, but the prop type should be string | null to match reality. The build probably passes only because TypeScript isn't strict enough here, or the null check is elsewhere in the parent's render path.


Inline imports

from pathlib import Path is inside _read_artifact_text's function body. The cache-fallback block also has:

import time as _time
from datetime import datetime as _dt, timezone as _tz

Move these to the module-level imports. The aliased names (_dt, _tz) suggest they were added defensively to avoid shadowing — not needed at module level since datetime and timezone aren't shadowed elsewhere in the file.


triggered_by always hardcoded to "human"

runner.py always passes triggered_by="human". The roadmap mentions auto-triggered runs (4B: post-merge glitch loop). Consider making run_proof() accept a triggered_by: str = "human" parameter now so the API and batch runner can set it correctly when that time comes. Low cost to do it right upfront.


Minor: _run_cache fallback adds complexity for an edge case that shouldn't occur

save_run is called synchronously at the end of run_proof() before returning. By the time the client receives the run_id from POST /api/v2/proof/run and makes a follow-up GET /runs/{run_id}/evidence, the record is already in the DB. The cache fallback path:

  • Returns started_at == completed_at (misleading — duration would be 0)
  • Has artifact_path="", artifact_checksum="", timestamp="" for all evidence items

If this fallback was needed for a real race condition, the empty strings would cause downstream issues. If it's not needed, removing it simplifies the endpoint considerably. Worth clarifying the intent.


Nit: test_list_runs_empty_initially may be fragile

This test asserts the list is empty before any run. If the test DB is shared across test methods within the class (depends on fixture scope), a prior test that triggered a run would cause this to fail. Worth verifying the fixture gives a fresh DB per test, or using @pytest.mark.usefixtures("clean_workspace") explicitly.


What's working well

  • _ensure_tables logic fix is correct — always closes the connection, checks both tables
  • ProofRunDetailResponse inheritance from ProofRunSummaryResponse avoids duplication
  • Frontend components handle all states cleanly (loading, error, empty, data)
  • Keyboard navigation on run rows (role="button", tabIndex, onKeyDown) is solid
  • truncateLines with "Show full output" toggle is a good UX pattern
  • Test coverage hits all the important paths including empty state, shape validation, 404 for unknown run

The security issue with _read_artifact_text is the only hard blocker. The missing get_run function is worth fixing before merge to avoid the 100-run window limitation.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
codeframe/core/proof/ledger.py (1)

74-85: Index the new run-history tables.

The new UI calls into list_runs() and get_run_evidence() for interactive views, but this migration only creates the tables. Without supporting indexes, SQLite will eventually full-scan and sort proof_runs / proof_evidence on every history or evidence request. Adding indexes on the lookup keys now will keep these panels responsive as data grows.

Suggested migration additions
     cursor.execute("""
         CREATE TABLE IF NOT EXISTS proof_runs (
             run_id TEXT NOT NULL,
             workspace_id TEXT NOT NULL,
             started_at TEXT NOT NULL,
             completed_at TEXT,
             triggered_by TEXT NOT NULL DEFAULT 'human',
             overall_passed INTEGER NOT NULL DEFAULT 0,
             duration_ms INTEGER,
             PRIMARY KEY (run_id, workspace_id)
         )
     """)
+
+    cursor.execute(
+        "CREATE INDEX IF NOT EXISTS idx_proof_runs_workspace_started_at "
+        "ON proof_runs (workspace_id, started_at DESC)"
+    )
+    cursor.execute(
+        "CREATE INDEX IF NOT EXISTS idx_proof_evidence_workspace_run "
+        "ON proof_evidence (workspace_id, run_id)"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/proof/ledger.py` around lines 74 - 85, The migration creates
proof_runs but lacks indexes, so update the migration SQL to add CREATE INDEX IF
NOT EXISTS statements for the lookup/sort keys used by list_runs() and
get_run_evidence(): add indexes on proof_runs(workspace_id),
proof_runs(workspace_id, started_at) (or started_at DESC for sort), and
proof_runs(run_id, workspace_id) as well as matching indexes on proof_evidence
(e.g., proof_evidence(workspace_id) and proof_evidence(run_id, workspace_id)).
Put these CREATE INDEX IF NOT EXISTS statements right after the table creations
so queries in list_runs() and get_run_evidence() avoid full-table scans as data
grows.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@codeframe/core/proof/ledger.py`:
- Around line 385-410: Add a precise run lookup helper get_run(workspace,
run_id) that mirrors list_runs' behavior but queries proof_runs WHERE run_id = ?
AND workspace_id = ? using get_db_connection and _ensure_tables; fetchone, close
the connection, and if a row exists return a ProofRun constructed the same way
as in list_runs (parsing started_at/completed_at with datetime.fromisoformat,
casting overall_passed to bool, preserving duration_ms and triggered_by),
otherwise return None to allow exact historical lookups without paging.

In `@codeframe/core/proof/runner.py`:
- Around line 135-137: overall_passed currently uses "bool(results) and
all(...)" which treats runs with no checks as failed; change the logic so
empty-result runs are considered passed by computing overall_passed = all(passed
for gate_results in results.values() for _, passed in gate_results) if results
else True, updating the variable assignment in runner.py where overall_passed is
defined so it treats an empty results dict as True.
- Around line 72-73: The early return when there are no open requirements skips
persisting the run, so ensure ledger.save_run(...) is invoked for user-triggered
runs before that return; move or add a call to ledger.save_run using the same
run metadata (e.g., started_at, finished_at, run_id/identifier, user trigger
flag and outcome) immediately prior to the "no open requirements" return in the
function that defines started_at, so every execution (including the no-run case)
is recorded in history.

In `@codeframe/ui/routers/proof_v2.py`:
- Around line 526-535: The router currently reads entire artifact files via
_read_artifact_text and serializes full content in get_run_evidence_endpoint,
which can blow memory for large logs; move the artifact resolution and preview
logic into a core function (e.g., create a core API like
core.artifacts.get_artifact_preview(path, max_lines=200) and
core.artifacts.get_artifact_full(path)) and change the router to call those thin
core functions so the router no longer reads files directly; ensure the core
function returns a bounded preview (e.g., first N lines or bytes) by default and
provide a separate router endpoint (e.g., /evidence/{id}/artifact/full) that
calls the core get_artifact_full to fetch the complete content on explicit user
action.

In `@web-ui/src/app/proof/`[req_id]/page.tsx:
- Around line 112-115: latestEvidence is currently set to the entire run-wide
latestRunDetail.evidence, which can include gates from other requirements;
update the useMemo for latestEvidence to filter latestRunDetail.evidence by the
route param req_id so only evidence for this requirement is shown (e.g., change
the computation in the useMemo that references latestRunDetail and
latestRunDetail?.evidence to filter items where the evidence's requirement
identifier matches req_id — check properties like requirementId, requirement_id
or req_id on the evidence object and use the correct one).

In `@web-ui/src/app/proof/page.tsx`:
- Around line 106-107: The main proof view isn’t using selectedRunId — wire
selectedRunId (and setSelectedRunId/RunHistoryPanel) into the main gate
rendering flow: add a useEffect that listens to selectedRunId and when non-null
calls the existing run-results loader (or create a fetchLoadRunResults function)
to retrieve historical gate results, store them in the state the gate renderer
uses (e.g., historicalRunResults or currentDisplayedRun), and have the
gate/gate-progress components render that state instead of the live run; also
support clearing selectedRunId to switch back to live run rendering.

---

Nitpick comments:
In `@codeframe/core/proof/ledger.py`:
- Around line 74-85: The migration creates proof_runs but lacks indexes, so
update the migration SQL to add CREATE INDEX IF NOT EXISTS statements for the
lookup/sort keys used by list_runs() and get_run_evidence(): add indexes on
proof_runs(workspace_id), proof_runs(workspace_id, started_at) (or started_at
DESC for sort), and proof_runs(run_id, workspace_id) as well as matching indexes
on proof_evidence (e.g., proof_evidence(workspace_id) and proof_evidence(run_id,
workspace_id)). Put these CREATE INDEX IF NOT EXISTS statements right after the
table creations so queries in list_runs() and get_run_evidence() avoid
full-table scans as data grows.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eca6840a-2ed3-4e00-b8b0-c5ff559aa49d

📥 Commits

Reviewing files that changed from the base of the PR and between c4d9697 and 695d3ab.

📒 Files selected for processing (15)
  • codeframe/core/proof/ledger.py
  • codeframe/core/proof/models.py
  • codeframe/core/proof/runner.py
  • codeframe/ui/routers/proof_v2.py
  • tests/ui/test_proof_v2.py
  • web-ui/src/__tests__/components/proof/GateEvidencePanel.test.tsx
  • web-ui/src/__tests__/components/proof/ProofPage.test.tsx
  • web-ui/src/__tests__/components/proof/RunHistoryPanel.test.tsx
  • web-ui/src/app/proof/[req_id]/page.tsx
  • web-ui/src/app/proof/page.tsx
  • web-ui/src/components/proof/GateEvidencePanel.tsx
  • web-ui/src/components/proof/RunHistoryPanel.tsx
  • web-ui/src/components/proof/index.ts
  • web-ui/src/lib/api.ts
  • web-ui/src/types/index.ts

Comment on lines +526 to +535
def _read_artifact_text(artifact_path: str) -> Optional[str]:
"""Read artifact file content, returning None if the file is missing."""
from pathlib import Path
try:
p = Path(artifact_path)
if p.exists():
return p.read_text(errors="replace")
return None
except Exception:
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid inlining full artifact files into this response.

_read_artifact_text() reads every artifact completely, and get_run_evidence_endpoint() serializes that whole payload for each evidence row. Large pytest/coverage logs will make this endpoint slow and memory-heavy even though the UI initially shows only a 200-line preview. Move artifact-content resolution into core and return a bounded preview by default, with an explicit follow-up fetch for “Show full output”.

As per coding guidelines, "FastAPI routers in codeframe/ui/routers/ must be thin adapters over core functionality; logic must live in core modules, not routers".

Also applies to: 590-600

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/ui/routers/proof_v2.py` around lines 526 - 535, The router
currently reads entire artifact files via _read_artifact_text and serializes
full content in get_run_evidence_endpoint, which can blow memory for large logs;
move the artifact resolution and preview logic into a core function (e.g.,
create a core API like core.artifacts.get_artifact_preview(path, max_lines=200)
and core.artifacts.get_artifact_full(path)) and change the router to call those
thin core functions so the router no longer reads files directly; ensure the
core function returns a bounded preview (e.g., first N lines or bytes) by
default and provide a separate router endpoint (e.g.,
/evidence/{id}/artifact/full) that calls the core get_artifact_full to fetch the
complete content on explicit user action.

- ledger.py: add get_run() for O(1) run lookup by run_id (avoids scan)
- runner.py: persist run record even when no open requirements exist
- runner.py: fix overall_passed logic (empty runs → True, not False)
- proof_v2.py: truncate artifact text to 200 lines server-side
- proof_v2.py: use get_run() instead of list_runs() scan in evidence endpoint
- [req_id]/page.tsx: filter latestEvidence by req_id (was returning all gates)
- proof/page.tsx: fetch and render selected run evidence via GateEvidencePanel
@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Follow-up Review

Two items from my prior review are resolved:

  • get_run ledger function added with a direct WHERE run_id = ? AND workspace_id = ? query. Good.
  • Line truncation _read_artifact_text now caps at 200 lines. Good.

The main blocker and a couple of smaller issues remain.


Security: Path traversal still unaddressed (blocker)

_read_artifact_text still has no workspace root check. An artifact path stored in the DB (e.g. /etc/shadow) can be read and returned to the browser:

def _read_artifact_text(artifact_path: str, workspace: Workspace, max_lines: int = _ARTIFACT_LINE_LIMIT) -> Optional[str]:
    try:
        p = Path(artifact_path).resolve()
        root = Path(workspace.repo_path).resolve()
        if not str(p).startswith(str(root)):
            return None
        ...

Pass workspace into the function and resolve both paths before comparing. The callers in the endpoint already have the workspace object available.

Also: p.read_text() still loads the full file before slicing. For a large artifact the line cap does not help memory usage. Reading iteratively with p.open() and stopping after max_lines is safer.


Inline imports

These remain inside function bodies:

  • from pathlib import Path inside _read_artifact_text
  • import time as _time and from datetime import datetime as _dt, timezone as _tz inside the cache fallback block

Move to module-level. The _dt/_tz aliases are not needed since datetime/timezone are not shadowed elsewhere in the file.


workspacePath type mismatch in RunHistoryPanel

The prop interface declares workspacePath: string but the parent passes string | null. The SWR null-key guard handles the runtime case, but the declared type should be string | null so TypeScript catches future misuse at the call site.


Security fix is the only hard blocker. The other two are quick cleanups worth doing before merge.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@codeframe/core/proof/runner.py`:
- Around line 83-93: The run history is misattributed because triggered_by is
hardcoded to "human"; update the run_proof function signature to accept a
triggered_by parameter (default "human") and use that parameter when
constructing both ProofRun instances (the one created in run_proof and the
second one later around the other ProofRun(...) construction), ensuring the
passed-in value is threaded into ledger.save_run calls; also update all call
sites of run_proof to pass an appropriate triggered_by value where automation
may invoke it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 39d37b1f-49b1-4180-8517-508c408c735f

📥 Commits

Reviewing files that changed from the base of the PR and between 695d3ab and 6f76a30.

📒 Files selected for processing (5)
  • codeframe/core/proof/ledger.py
  • codeframe/core/proof/runner.py
  • codeframe/ui/routers/proof_v2.py
  • web-ui/src/app/proof/[req_id]/page.tsx
  • web-ui/src/app/proof/page.tsx
✅ Files skipped from review due to trivial changes (1)
  • codeframe/ui/routers/proof_v2.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • web-ui/src/app/proof/page.tsx
  • codeframe/core/proof/ledger.py

Comment on lines +83 to +93
ledger.save_run(
workspace,
ProofRun(
run_id=run_id,
workspace_id=workspace.id,
started_at=started_at,
completed_at=completed_at,
triggered_by="human",
overall_passed=True,
duration_ms=int((completed_at - started_at).total_seconds() * 1000),
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

triggered_by is hardcoded to "human" in both persisted run paths.

If run_proof is ever invoked by automation, run history will be misattributed. Consider threading a triggered_by parameter through run_proof and using it in both ProofRun(...) constructions (Line 90 and Line 157).

Suggested fix
 def run_proof(
     workspace: Workspace,
     *,
     full: bool = False,
     gate_filter: Optional[Gate] = None,
     run_id: Optional[str] = None,
+    triggered_by: str = "human",
 ) -> dict[str, list[tuple[Gate, bool]]]:
@@
             ProofRun(
                 run_id=run_id,
                 workspace_id=workspace.id,
                 started_at=started_at,
                 completed_at=completed_at,
-                triggered_by="human",
+                triggered_by=triggered_by,
                 overall_passed=True,
                 duration_ms=int((completed_at - started_at).total_seconds() * 1000),
             ),
@@
         ProofRun(
             run_id=run_id,
             workspace_id=workspace.id,
             started_at=started_at,
             completed_at=completed_at,
-            triggered_by="human",
+            triggered_by=triggered_by,
             overall_passed=overall_passed,
             duration_ms=duration_ms,
         ),

Also applies to: 150-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@codeframe/core/proof/runner.py` around lines 83 - 93, The run history is
misattributed because triggered_by is hardcoded to "human"; update the run_proof
function signature to accept a triggered_by parameter (default "human") and use
that parameter when constructing both ProofRun instances (the one created in
run_proof and the second one later around the other ProofRun(...) construction),
ensuring the passed-in value is threaded into ledger.save_run calls; also update
all call sites of run_proof to pass an appropriate triggered_by value where
automation may invoke it.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/PHASE_3_UI_ARCHITECTURE.md`:
- Around line 362-364: Update the docs so GateEvidencePanel is described as
rendering per-gate evidence rows (pass/fail + artifact text) rather than hosting
run-state progress; remove or relocate mention of GateProgressRow (pending →
running → passed/failed) from GateEvidencePanel and assign that responsibility
to the run-progress components (e.g., GateRunPanel and GateRunBanner), ensuring
the doc now states GateRunPanel/GateRunBanner own progress-state UI while
GateEvidencePanel owns evidence/artifact display.
- Line 353: The fenced code block that documents the component tree (starting
with "ProofPage (/proof)" and containing nodes like ProofHeader and
RunGatesButton) is missing a language tag and triggers markdownlint MD040;
update the opening fence to include a language (e.g., change "```" to "```text")
so the tree block is recognized as plain text, leaving the block content and
closing "```" unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4ac59335-64db-4149-af0d-961a02b180d4

📥 Commits

Reviewing files that changed from the base of the PR and between 6f76a30 and 70accfb.

📒 Files selected for processing (4)
  • CLAUDE.md
  • docs/PHASE_2_CLI_API_MAPPING.md
  • docs/PHASE_3_UI_ARCHITECTURE.md
  • docs/PRODUCT_ROADMAP.md
✅ Files skipped from review due to trivial changes (3)
  • docs/PRODUCT_ROADMAP.md
  • docs/PHASE_2_CLI_API_MAPPING.md
  • CLAUDE.md

- Run history panel with outcome and duration

**Component Hierarchy:**
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced block to satisfy markdownlint.

Line 353 opens a fenced code block without a language, triggering MD040. Use text (or another appropriate language) for the tree block.

Proposed fix
-```
+```text
 ProofPage (/proof)
 ├── ProofHeader
 │   ├── RunGatesButton → POST /api/v2/proof/run
 ...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 353-353: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/PHASE_3_UI_ARCHITECTURE.md` at line 353, The fenced code block that
documents the component tree (starting with "ProofPage (/proof)" and containing
nodes like ProofHeader and RunGatesButton) is missing a language tag and
triggers markdownlint MD040; update the opening fence to include a language
(e.g., change "```" to "```text") so the tree block is recognized as plain text,
leaving the block content and closing "```" unchanged.

Comment on lines +362 to +364
├── GateEvidencePanel ← new (Phase 3.5B)
│ ├── GateProgressRow[] (pending → running → passed/failed)
│ └── EvidenceArtifactDisplay (artifact text, scrollable)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Align GateEvidencePanel docs with actual component responsibilities.

Line 362–364 describes GateEvidencePanel as hosting GateProgressRow[] with pending → running → passed/failed, but current implementation of web-ui/src/components/proof/GateEvidencePanel.tsx renders per-gate evidence rows (pass/fail + artifact text) and not running-state progress. Please move progress-row ownership to the run-progress components (e.g., GateRunPanel/GateRunBanner) in this doc to avoid architectural drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/PHASE_3_UI_ARCHITECTURE.md` around lines 362 - 364, Update the docs so
GateEvidencePanel is described as rendering per-gate evidence rows (pass/fail +
artifact text) rather than hosting run-state progress; remove or relocate
mention of GateProgressRow (pending → running → passed/failed) from
GateEvidencePanel and assign that responsibility to the run-progress components
(e.g., GateRunPanel and GateRunBanner), ensuring the doc now states
GateRunPanel/GateRunBanner own progress-state UI while GateEvidencePanel owns
evidence/artifact display.

@claude
Copy link
Copy Markdown

claude bot commented Apr 9, 2026

Review: feat(web-ui): per-gate evidence display and run history (#567)

This closes Phase 3.5B cleanly. The full stack is coherent — DB persistence, two new API endpoints, two new React components, and solid test coverage (14 new backend tests, 16 new frontend tests). The component design for GateEvidencePanel and RunHistoryPanel is well-scoped and the SWR integration is consistent with the rest of the web UI.

I've reviewed the latest commit against both the code and CodeRabbit's prior passes. Here's what I think is still worth addressing.


Security — path traversal in _read_artifact_text

def _read_artifact_text(artifact_path: str, ...) -> Optional[str]:
    p = Path(artifact_path)
    if not p.exists():
        return None
    lines = p.read_text(errors="replace").splitlines(keepends=True)
    return "".join(lines[:max_lines])

artifact_path is read from the SQLite ledger and served over HTTP. If the database were ever populated with a crafted path (e.g., /etc/passwd or ~/.ssh/id_rsa), this endpoint would serve that content to anyone with API access. The fix is a single containment check:

workspace_root = Path(workspace.repo_path).resolve()
p = Path(artifact_path).resolve()
if not p.is_relative_to(workspace_root):
    return None

CodeRabbit also noted this helper belongs in core rather than the router layer — both concerns are worth addressing together.


Still open from CodeRabbit's second review — triggered_by hardcoded

runner.py still passes triggered_by="human" unconditionally, including when the run is started via the API (which can be automated). The fix is a parameter with a sensible default:

def run_proof(..., triggered_by: str = "human") -> dict:
    ...
    ProofRun(..., triggered_by=triggered_by, ...)

Call sites that don't care get the default; the API router can pass "api" or surface it as a query param if needed. This is a low-effort change that makes the run history accurate.


Still open from CodeRabbit's first review — missing indexes

The proof_runs table has a composite PK on (run_id, workspace_id) but list_runs orders by started_at DESC and filters on workspace_id — a full-scan + sort on every history panel load. Same issue for get_run_evidence, which filters proof_evidence by (run_id, workspace_id). These two indexes cover both queries:

CREATE INDEX IF NOT EXISTS idx_proof_runs_ws_started
    ON proof_runs (workspace_id, started_at DESC);

CREATE INDEX IF NOT EXISTS idx_proof_evidence_ws_run
    ON proof_evidence (workspace_id, run_id);

Inline imports inside the cache fallback path

In get_run_evidence_endpoint, the DB-miss fallback contains:

import time as _time
ts = cached.get("_ts", _time.time())
from datetime import datetime as _dt, timezone as _tz
ts_str = _dt.fromtimestamp(ts, tz=_tz.utc).isoformat()

datetime and timezone are already importable from module-level imports — they don't need to be re-imported inline. time can similarly be moved to a top-level import. These deferred imports add noise and are flagged by some linters.


Test assertion gap in ordering test

def test_list_runs_ordered_newest_first(self, test_client):
    ...
    runs = test_client.get("/api/v2/proof/runs").json()
    if len(runs) >= 2:           # silently no-ops if fixture only produces 1 run
        assert runs[0]["started_at"] >= runs[1]["started_at"]

The conditional if makes the ordering assertion a no-op when the fixture produces only one run. Use assert len(runs) >= 2 instead so the check is guaranteed to execute.


Minor: GateEvidenceRow key includes list index

<GateEvidenceRow key={`${ev.run_id}:${ev.gate}:${i}`} ev={ev} />

Using i (position) means React will re-mount rows on any reorder. ev.req_id is already available and makes the key stable: ${ev.run_id}:${ev.req_id}:${ev.gate}.


Minor: JSX indentation in proof/page.tsx

RunHistoryPanel is placed one indent level shallower than its sibling elements. Not a functional issue, but the formatter will catch it.


Summary: The implementation is solid and Phase 3.5B is complete. The path traversal issue in _read_artifact_text is the only item I would call a blocker — the rest are improvements or cleanup. The triggered_by parameter and the missing indexes are the most valuable follow-ups.

@frankbria frankbria merged commit 0c12d51 into main Apr 9, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Phase 3.5B] Run Gates: per-gate evidence display and run history

1 participant