[ENG-2059] Implement crash-safe checkpointing for vf-eval#3
Draft
AmeenP wants to merge 1 commit into
Draft
Conversation
ae16222 to
904d5ee
Compare
Add SimpleCheckpoint system with automatic resume capability: Core Implementation: - verifiers/utils/checkpoint.py: SimpleCheckpoint class with immediate fsync writes - verifiers/scripts/eval.py: Integrated checkpointing with simplified CLI - Simplified to 3 parameters: --output-dir, --checkpoint-every, --seed Key Features: - Crash-safe: Immediate append + fsync for both successes and failures - Auto-resume: Signature-based validation, skips completed work - Auto-retry: Failed items automatically retried on resume - Always skip-on-error: Failures don't crash evaluation Exit Codes: - 0: All items completed successfully - 1: Some items failed (check failures.jsonl) - 2: Partial completion (interrupted, can resume) Files Created: - results.jsonl: All successful completions (append-only) - failures.jsonl: Current failures (snapshot at checkpoint) - manifest.json: Run config and counters (atomic writes)
904d5ee to
6f48eb3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements crash-safe checkpointing for the vf-eval CLI with automatic resume capability. The system provides zero-data-loss guarantees through immediate fsync writes and atomic file operations.
Implementation
Core Components
SimpleCheckpoint (
verifiers/utils/checkpoint.py):os.replace()for manifest and failures snapshotCLI Integration (
verifiers/scripts/eval.py):--output-dir,--checkpoint-every,--seedKey Features
✅ Crash Safety
✅ Idempotent Resume
✅ Automatic Retry
✅ Data Integrity
Testing
Test Coverage: 17/17 Passing ✅
Unit Tests (7 tests -
tests/test_checkpoint.py):Integration Tests (8 tests -
tests/test_checkpoint_integration.py):CLI Tests (2 tests -
tests/test_eval_cli.py):Real API Validation
Validated with 336+ real API calls to OpenRouter:
Test Results:
See
test_results/directory for:Usage
Basic Evaluation
Custom Output Directory
Resume Interrupted Run
# Same command - auto-resumes from checkpoint python verifiers/scripts/eval.py gsm8k \ --model openai/gpt-4o-mini \ --num-examples 100 \ --output-dir ./my_eval_runOutput Files
Each evaluation creates three files in the output directory:
results.jsonl - All successful completions (append-only, never modified)
{"key": "0/0", "idx": 0, "rollout": 0, "status": "ok", "metrics": {"reward": 1.0}, ...} {"key": "1/0", "idx": 1, "rollout": 0, "status": "ok", "metrics": {"reward": 1.0}, ...}failures.jsonl - Current failures (rewritten at checkpoints as clean snapshot)
{"key": "5/0", "idx": 5, "rollout": 0, "status": "error", "error": "TimeoutError: ...", ...}manifest.json - Run configuration and counters (atomic writes)
{ "version": 1, "signature": "sha256:...", "config": { "model": "...", "num_examples": 100, ... }, "counters": { "total": 100, "done": 100, "failed": 0 }, "paths": { "results": ".../results.jsonl", "failures": ".../failures.jsonl" } }Changes Summary
Modified Files
verifiers/scripts/eval.py- Integrated SimpleCheckpoint, simplified CLItests/test_eval_cli.py- Updated for new checkpoint parametersNew Files
verifiers/utils/checkpoint.py- Core SimpleCheckpoint implementation (~220 lines)tests/test_checkpoint.py- Unit tests (7 tests)tests/test_checkpoint_integration.py- Integration tests (8 tests)test_results/- Complete testing documentation and scriptsStatistics
Performance
Backward Compatibility
✅ All checkpoint parameters have sensible defaults
✅ Existing code continues to work without modification
✅ No breaking changes to existing functionality
✅ Optional checkpoint parameters
Production Readiness
✅ All tests passing (17/17)
✅ Thoroughly validated (336+ real API calls)
✅ Zero data loss (all scenarios tested)
✅ Zero corruption (atomic writes, fsync)
✅ Performance validated (<1% overhead)
✅ Comprehensive documentation (test_results/)
Additional Notes
Related
test_results/directorytest_results/scripts/