feat(training): JSONL exporter for ATIF trajectories #95

Open: anderskev wants to merge 12 commits into main from feat/jsonl-exporter-r1

@anderskev
Member

Summary

  • Adds daydream export-jsonl — a new CLI subcommand that turns archived ATIF trajectories into a versioned, schema-validated JSONL corpus suitable for training/eval pipelines.
  • Introduces the daydream/training/ package: schema v1 definition, exclusion + copyleft filtering, record/span builders, deterministic stack-stratification, and an end-to-end export orchestrator.
  • Extends the archive manifest with a code_context block (base_sha, changed_files) so newly archived runs carry the git context the exporter needs end-to-end.

Motivation

The trajectory archive captures per-run ATIF data, but there was no supported way to turn it into a training-ready dataset. Downstream consumers need:

  • A stable, versioned JSONL schema (v1) they can validate against.
  • Deterministic, byte-identical output across reruns so corpora can be diffed and cached.
  • Built-in filtering for license-incompatible (copyleft) and explicitly excluded sources.
  • Stack-aware stratification so a single dominant stack does not swamp the corpus.

This PR delivers that pipeline in one focused milestone so the training side can consume archived runs without bespoke glue.

Changes

Added

  • daydream/training/ package: schema.py, exclusion.py, export.py, schema artifacts (schema/v1.json, schema/copyleft.txt, schema/exclusion.txt).
  • daydream export-jsonl --out <path> CLI subcommand with the full filter, stratification, opt-in, and diagnostic flag surface from the plan (validates --max-stack-share in (0, 1] and --min-grounding in [0, 1]).
  • ExportConfig + run_export orchestrator: schema-only short-circuit, filter+stratify pipeline, dry-run summary, atomic JSONL write via tempfile + Path.replace, and schema.json side-car emission.
  • GitContext + Manifest carry base_sha and changed_files; git_ops.diff_name_only populates them at archive time; manifest.to_dict() emits a code_context block.
  • archive.index.count_runs() for cheap unfiltered totals (no row materialization).
  • Test fixtures under tests/fixtures/training/ plus suites covering exclusion, query/filter, record/span builders, stratification, and the end-to-end export.
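
The range checks on --max-stack-share and --min-grounding could be expressed as argparse type callbacks. The sketch below is an illustration only: the helper name unit_interval, the defaults, and the choice to validate inside argparse (rather than in a separate step before ExportConfig is built, as the PR describes) are assumptions, not the PR's actual code.

```python
import argparse


def unit_interval(*, include_zero: bool):
    """Build an argparse type that accepts floats in [0, 1] or (0, 1].

    Hypothetical helper; the PR only states which ranges are enforced.
    """

    def parse(raw: str) -> float:
        try:
            value = float(raw)
        except ValueError:
            raise argparse.ArgumentTypeError(f"not a number: {raw!r}") from None
        # NaN fails both comparisons, so it is rejected as well.
        lower_ok = value >= 0.0 if include_zero else value > 0.0
        if not (lower_ok and value <= 1.0):
            bounds = "[0, 1]" if include_zero else "(0, 1]"
            raise argparse.ArgumentTypeError(f"{value} is outside {bounds}")
        return value

    return parse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="export-jsonl")
    parser.add_argument("--out", required=True)
    # Defaults here are placeholders chosen for the sketch.
    parser.add_argument(
        "--max-stack-share", type=unit_interval(include_zero=False), default=1.0
    )
    parser.add_argument(
        "--min-grounding", type=unit_interval(include_zero=True), default=0.0
    )
    return parser
```

With this shape, an out-of-range value is reported through argparse's normal usage error, so the process exits before any export work starts.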

Changed

  • Exporter sources head_sha / branch / base_branch from the manifest git block (was inadvertently reading from code_context); falls back to manifest_row when older archives have no code_context block.
  • is_copyleft() accepts a pre-loaded list so the per-row hot loop in _query_index doesn't reopen the file N times.
  • _build_query() accepts an exclusion kwarg (typed frozenset | set | None) so callers can inject the set without re-reading from disk.
  • Step IDs are cast to int to match the v1 schema integer constraint.
  • JSONL output uses sort_keys=True and compact JSON separators for deterministic, byte-identical reruns.
  • code_context field order in the manifest now matches the schema definition (head_sha then base_sha).
  • Skip-record warning copy clarified from "missing" to "corrupt or unreadable"; stack=None warning is now gated behind stratify_by == "stack".
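
To illustrate why sort_keys=True plus compact separators gives byte-identical output, here is a minimal sketch using only the standard library. The function name to_jsonl_line is hypothetical; the PR only states which json.dumps options are used.

```python
import json


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single compact, key-sorted JSON line.

    sort_keys=True removes any dependence on dict insertion order, and
    separators=(",", ":") drops the optional whitespace json.dumps adds
    by default, so equal records always produce equal bytes.
    """
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


first = to_jsonl_line({"b": 1, "a": {"d": 2, "c": 3}})
second = to_jsonl_line({"a": {"c": 3, "d": 2}, "b": 1})
print(first)  # {"a":{"c":3,"d":2},"b":1}
print(first == second)  # True
```

Key sorting applies recursively to nested objects, which is what lets two exports of the same archive be compared with a plain byte diff.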

Fixed

  • Atomic tempfile writes are wrapped in try/except so a failed write doesn't leave an orphan .tmp file beside the output.
  • Multi-outcome sessions go through _single_outcome_label(), which warns instead of silently dropping labels.
  • Schema v1 drops the unused test_outcome field.
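
The atomic-write fix can be pictured roughly as follows. This is a sketch under assumptions: the function name write_jsonl_atomic, the temp-file naming, and catching BaseException are illustrative choices, not the code in daydream/training/export.py.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl_atomic(records: Iterable[dict], out_path: Path) -> None:
    """Write records to out_path so readers never see a partial file."""
    out_path = Path(out_path)
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=out_path.parent, prefix=out_path.name + ".", suffix=".tmp"
    )
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                line = json.dumps(record, sort_keys=True, separators=(",", ":"))
                handle.write(line + "\n")
        # Path.replace overwrites an existing destination in one step.
        tmp_path.replace(out_path)
    except BaseException:
        # Any failure (serialization, disk, interrupt) removes the temp
        # file so no orphan .tmp is left beside the output.
        tmp_path.unlink(missing_ok=True)
        raise
```

If serialization fails partway through, the destination is left untouched and the temporary file is deleted before the exception propagates.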

Test Plan

  • uv run pytest — full suite passes locally (343 existing + new training/archive/git_ops coverage).
  • uv run ruff check and uv run mypy daydream pass; the _build_query exclusion-param typing fix was added specifically to unblock mypy.
  • Exporter determinism verified via a fixture-driven byte-identical rerun test.
  • Schema-only short-circuit, dry-run no-op, and missing-trajectory skip paths each covered by a dedicated test.
  • git_ops.diff_name_only covered for happy path, multi-file output, empty-line filtering, and bad-ref soft-failure.

Checklist

  • Tests pass locally (uv run pytest)
  • Linting passes (uv run ruff check)
  • Type checking passes (uv run mypy daydream)
  • Documentation updated (if applicable) — CLI surface documented inline; no user-facing docs site to update for this milestone.

Additional Context

Diff size: ~2,100 LOC added across 22 files; the bulk lives in daydream/training/export.py and its test suite. The code_context rollout is forward-compatible: older archives that predate the block surface base_sha=None / changed_files=[], which the v1 schema explicitly allows.
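
As a rough picture of the forward-compatible read path, the sketch below resolves the git-related fields for one record. The function name, the dictionary layout of the manifest and of manifest_row, and the exact fallback rule are assumptions made for illustration; only the field names and the None / [] defaults come from the description above.

```python
from typing import Any


def resolve_code_context(
    manifest: dict[str, Any], manifest_row: dict[str, Any]
) -> dict[str, Any]:
    """Collect git fields for one record, tolerating older manifests."""
    git = manifest.get("git") or {}
    code_context = manifest.get("code_context") or {}

    def scalar(key: str) -> Any:
        # Prefer the manifest's git block; fall back to the index row.
        value = git.get(key)
        return value if value is not None else manifest_row.get(key)

    return {
        "head_sha": scalar("head_sha"),
        # Older archives have no code_context block, so these default
        # to None and [] respectively.
        "base_sha": code_context.get("base_sha"),
        "branch": scalar("branch"),
        "base_branch": scalar("base_branch"),
        "changed_files": list(code_context.get("changed_files") or []),
    }
```

An archive written before this PR would therefore still export, with base_sha set to None and changed_files set to an empty list.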


Generated with Claude Code

anderskev and others added 11 commits May 17, 2026 12:38
Adds ExportConfig + run_export — the orchestration layer that ties Waves
2–5 together. Implements plan §10 step 6: schema-only short-circuit,
filter+stratify pipeline, dry-run summary, atomic JSONL write via
tempfile + Path.replace, and schema.json side-car emission. Records are
serialized with compact JSON separators so output stays small and
deterministic across runs (covers AC #1, #2, #7, #8).

Skips rows whose archive directory is missing manifest.json or
trajectory.json with a warning, rather than crashing the whole export.

Tests use the §9 fixture matrix unchanged — JSONL validity against the
schema, required-field presence, byte-identical re-runs, schema.json
emission, dry-run no-op, missing-trajectory skip, and emit_schema_only
short-circuit.
Adds `daydream export-jsonl --out <path>` with the full filter,
stratification, opt-in, and diagnostic flag surface from plan §4.
Validates --max-stack-share in (0, 1] and --min-grounding in [0, 1]
before constructing ExportConfig.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fests

Extends GitContext + Manifest to carry base_sha and changed_files at
archive time, adds git_ops.diff_name_only, emits a code_context block
in manifest.to_dict(), and updates the JSONL exporter to source those
fields from the on-disk manifest. Older archives that predate the
code_context block surface base_sha=None and changed_files=[] — schema
allows nullable values there.

Also warns once per unknown skill encountered during query_index so
stratify=None buckets are visible rather than silent.
…anup

- export: fall back to manifest_row for head_sha/branch/base_branch when
  the manifest dict has no code_context block (fixes a v1-schema gap
  where scalar fields surfaced as None for older archives).
- export: cast step_id to int so non-int trajectory values match the v1
  schema's integer type constraint.
- archive.index: add count_runs() and use it for the unfiltered summary
  count instead of materialising every row via query_runs.
- training.exclusion: let is_copyleft() accept a pre-loaded copyleft_list
  so a per-row loop in _query_index doesn't reopen the file N times.
- export: warn when records have stack=None (unmapped skill) before
  stratification so silent grouping doesn't surprise callers.
- schema/v1.json + export: drop the unused test_outcome field.
- stratify: document that max_stack_share is an input-corpus cap, not an
  output-share guarantee, with a worked example.
- test_git_ops: add coverage for diff_name_only (happy path, multi-file,
  empty-line filtering, bad-ref soft-failure).
- uv.lock: sync to 0.17.0.

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Read head_sha, base_branch, branch from manifest `git` block instead
  of `code_context` (fixes regression introduced when the two blocks
  were split)
- Reorder `code_context` serialisation so base_sha follows head_sha for
  readability; field order now matches the schema definition
- Extract `_single_outcome_label()` helper that warns (rather than
  silently drops) when a session carries multiple outcome labels
- Accept an `exclusion` kwarg on `_build_query()` so callers can inject
  the set without re-loading from disk (improves testability)
- Guard the stack=None warning behind the `stratify_by == "stack"` branch
  so it only fires when stratification is actually requested
- Add `sort_keys=True` to `json.dumps` for deterministic JSONL output
- Wrap atomic tempfile write in try/except to clean up the .tmp file on
  any write error
- Clarify "missing" → "corrupt or unreadable" in the skip-record warning
- Expand noqa comments in tests to explain why subprocess args are safe

Daydream-Run: 20260517230342-fb31586b
Daydream-Version: 0.17.0
load_exclusion_list() returns frozenset[str]; the injected kwarg was
typed set[str] | None which mypy rejected on assignment.

Daydream-Run: 20260517230342-fb31586b
Daydream-Version: 0.17.0
@anderskev added the enhancement (New feature or request) and area:training (Training pipeline and data preparation) labels May 18, 2026
@anderskev self-assigned this May 18, 2026
@coderabbitai

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f8c5f62f-886e-4f86-a5d1-ff678ba276cb

📥 Commits

Reviewing files that changed from the base of the PR and between 2d1bc51 and e12fcce.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • daydream/cli.py
  • daydream/git_ops.py
  • daydream/training/export.py
  • pyproject.toml
  • tests/fixtures/training/build_archive.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • daydream/git_ops.py
  • daydream/cli.py
  • tests/fixtures/training/build_archive.py
  • daydream/training/export.py

Walkthrough

This PR adds a training-record JSONL exporter: archive metadata now includes merge-base SHA and changed files; training package, v1 JSON Schema, and file-backed exclusion/copyleft lists are added; trajectories are converted to schema v1 records (spans, labels, code_context); a parameterized SQL query/filter pipeline with copyleft handling and skill->stack mapping is implemented; stack-based stratification is supported; run_export orchestrates counting, querying, stratifying, and atomic JSONL+schema emission; a synchronous CLI subcommand exposes the flow; fixture-backed tests cover the end-to-end behavior.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(training): JSONL exporter for ATIF trajectories' accurately summarizes the main change: introducing a JSONL export pipeline for archived ATIF trajectories. |
| Description check | ✅ Passed | The description comprehensively details the PR's purpose, changes, and context related to the JSONL exporter, archive manifest extensions, filtering, stratification, and test coverage. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@daydream/cli.py`:
- Around line 379-383: Detect when both args.include_all_labels and args.label
are provided and fail fast instead of silently overriding: in the CLI handling
code where labels is computed (the block referencing args.include_all_labels,
args.label, and labels), add a check that if args.include_all_labels is true and
args.label is non-empty, call the argument parser's error/exit (e.g.,
parser.error or raise SystemExit with a clear message) or refactor the flags
into an argparse mutually exclusive group so the parser prevents both from being
set; ensure the error message clearly states that --include-all-labels cannot be
used with --label.
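
One way to implement the mutually exclusive variant of this suggestion is sketched below. The flag names come from the comment above; the action types (store_true, append) and the parser layout are assumptions about daydream/cli.py rather than its actual definitions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="export-jsonl")
    # argparse rejects the command line itself when both flags appear,
    # so no manual check is needed after parsing.
    labels = parser.add_mutually_exclusive_group()
    labels.add_argument("--include-all-labels", action="store_true")
    labels.add_argument("--label", action="append")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.include_all_labels, args.label)
```

Passing both flags makes argparse exit with a usage error naming the two conflicting arguments, which matches the fail-fast behaviour the comment asks for.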

In `@daydream/git_ops.py`:
- Around line 502-505: The call to _run_git inside diff_name_only can raise
GitError and must be caught to preserve the function's contract of returning []
on subprocess failure; wrap the _run_git call in a try/except that catches
GitError (the exception type raised by _run_git) and return [] from the except
block, keeping the existing behavior that also returns [] when proc.returncode
!= 0 and leaving the timeout and arguments to _run_git unchanged.
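
The soft-failure contract described here might look like the sketch below. GitError and _run_git are stand-ins written for this example (the real ones live in daydream/git_ops.py and their signatures are not shown in this PR), and the run_git parameter is added only so the sketch can be exercised without a repository.

```python
import subprocess
from typing import Callable, Sequence


class GitError(Exception):
    """Stand-in for the exception raised by the real _run_git."""


def _run_git(args: Sequence[str], cwd: str) -> subprocess.CompletedProcess:
    """Stand-in runner: wraps launch failures and timeouts in GitError."""
    try:
        return subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True,
            timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        raise GitError(str(exc)) from exc


def diff_name_only(
    base: str, head: str, cwd: str,
    run_git: Callable[..., subprocess.CompletedProcess] = _run_git,
) -> list[str]:
    """Return changed file paths, or [] on any git failure."""
    try:
        proc = run_git(["diff", "--name-only", base, head], cwd)
    except GitError:
        # Soft failure: the caller gets an empty list, not an exception.
        return []
    if proc.returncode != 0:
        return []
    # Drop blank lines so trailing newlines never yield empty paths.
    return [line for line in proc.stdout.splitlines() if line.strip()]
```

Both failure modes (an exception from the runner and a non-zero exit code) collapse to the same empty result, so archive creation can continue without changed_files.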

In `@daydream/training/export.py`:
- Around line 63-69: The step_id parsing in the loop over
trajectory.get("steps", []) (inside daydream/training/export.py) can raise
ValueError/TypeError when calling int(step.get("step_id", i + 1)); guard this by
wrapping the int(...) conversion in a try/except (catching ValueError and
TypeError) and on failure fall back to a safe default (e.g., use i + 1 or None)
and optionally log/debug the malformed step; update the code around the step_id
assignment so malformed step_id values do not abort export.
- Around line 164-169: The parsing of raw_labels (outcome_labels) silently
swallows JSON errors and drops data; modify the try/except around
json.loads(raw_labels) so that on JSONDecodeError/TypeError you log the error
and the offending raw_labels (and any record identifier available) before
falling back to an empty list, or re-raise if that better fits upstream
handling; update the except block to call the module logger (e.g.,
logging.exception or logger.warning with exc_info=True) referencing raw_labels
and keep the subsequent isinstance(labels, list) check to enforce type safety.
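
Both export.py findings above amount to "warn and fall back instead of aborting or dropping data silently". A minimal sketch follows; the helper names coerce_step_id and parse_outcome_labels, their parameters, and the log messages are hypothetical and do not reflect the module's actual structure.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def coerce_step_id(raw: Any, position: int) -> int:
    """Return raw as an int, or the 1-based position if it is malformed."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        logger.warning("malformed step_id %r; using position %d", raw, position)
        return position


def parse_outcome_labels(raw_labels: Any, session_id: str) -> list:
    """Decode a JSON-encoded label list, logging instead of hiding errors."""
    try:
        labels = json.loads(raw_labels)
    except (json.JSONDecodeError, TypeError):
        logger.warning(
            "unreadable outcome_labels for session %s: %r",
            session_id, raw_labels, exc_info=True,
        )
        return []
    if not isinstance(labels, list):
        logger.warning(
            "outcome_labels for session %s is not a list: %r",
            session_id, labels,
        )
        return []
    return labels
```

A caller iterating over steps would pass i + 1 as position, mirroring the default already used in the flagged code.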

In `@tests/fixtures/training/build_archive.py`:
- Around line 27-37: The FixtureSession dataclass docstring is missing a
Google-style "Attributes:" section and the public function build_fixture_archive
lacks a "Returns:" section; update the FixtureSession docstring to include an
Attributes: block that lists session_id (str), repo_slug (str), skill (str),
grounding_rate (float), outcome_labels (tuple[str, ...]), status (str, default
"complete"), and notes (str, default ""), and update the build_fixture_archive
docstring to include a Google-style "Returns:" section describing the return
type and what the returned value represents (and add or complete Args: and
Raises: sections if applicable), ensuring wording matches existing docstring
style and uses the exact symbol names FixtureSession and build_fixture_archive
to locate the spots to edit.

In `@tests/test_archive.py`:
- Around line 85-93: Update the inline `# noqa` comments on the subprocess.run
calls in tests/test_archive.py (the git init/config/commit invocations) to
include an explicit rationale for S607 in addition to S603; e.g., augment each
`# noqa: S603, S607` comment to state that arguments are not user-controlled and
that the `git` command is a hardcoded, trusted command (same change also for the
later subprocess.run calls around lines 113-137).

In `@tests/test_training_export.py`:
- Line 16: The test suite imports the jsonschema package (seen in
test_training_export.py and test_training_record.py) but jsonschema is not
declared as a dependency; add jsonschema to the test/dev dependencies in
pyproject.toml (e.g., under [tool.poetry.dev-dependencies] or
[project.optional-dependencies."test"]) so CI installs it before running tests
and import errors are avoided.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5e0e1b36-f6a8-44c7-bd46-f8ad8718001b

📥 Commits

Reviewing files that changed from the base of the PR and between f25b1ba and 2d1bc51.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • daydream/archive/git_context.py
  • daydream/archive/index.py
  • daydream/archive/manifest.py
  • daydream/cli.py
  • daydream/git_ops.py
  • daydream/training/__init__.py
  • daydream/training/exclusion.py
  • daydream/training/export.py
  • daydream/training/schema.py
  • daydream/training/schema/copyleft.txt
  • daydream/training/schema/exclusion.txt
  • daydream/training/schema/v1.json
  • tests/fixtures/training/__init__.py
  • tests/fixtures/training/build_archive.py
  • tests/test_archive.py
  • tests/test_git_ops.py
  • tests/test_training_exclusion.py
  • tests/test_training_export.py
  • tests/test_training_query.py
  • tests/test_training_record.py
  • tests/test_training_stratify.py

… deps

- Fail fast on conflicting --include-all-labels + --label flags (cli.py)
- Use parse_intermixed_args for feedback subcommand so optional TARGET
  positional is recognized after flags (cli.py)
- Catch GitError in diff_name_only() to honor soft-failure contract (git_ops.py)
- Warn instead of silently dropping malformed step_id and outcome_labels
  in export pipeline (training/export.py)
- Add jsonschema to dev dependencies — used directly in tests (pyproject.toml)
- Add Attributes section to FixtureSession docstring (build_archive.py)

Co-Authored-By: Claude <noreply@anthropic.com>