Releases: MaverickHQ/executable-world-models
v0.8.5.1 — Evidence Policy Feedback Loop
This release completes the deterministic policy feedback loop for Executable World Models.
What changed:
- Added evidence-policy feedback layer
- Experiments can now influence future trading decisions
- Added policy module, policy builder, and policy feedback demo
- Added tests for evidence-policy and feedback loop
- Fixed the version reported by the deployed /health endpoint so that it matches the release
Why it matters:
The architecture now closes the loop end to end: environment interaction produces trajectories, which feed evaluation, experiments, and evidence datasets, which in turn drive policy updates and improved future decisions.
This is deterministic policy feedback, not reinforcement learning.
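A minimal sketch of what deterministic policy feedback could look like, assuming an evidence dataset made of per-run records. The names EvidenceRecord, build_policy, and min_approval_rate are hypothetical and are not taken from the repository; the sketch only illustrates that identical evidence always yields an identical policy, with no learning step or randomness involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical evidence row: the outcome of one evaluated run."""
    strategy: str
    approved: bool


def build_policy(evidence: list, min_approval_rate: float = 0.5) -> dict:
    """Derive an allow/deny flag per strategy from recorded evidence.

    Pure function: no clock, no randomness, no gradient update. The same
    evidence (in any order) always produces the same policy.
    """
    totals: dict = {}
    approvals: dict = {}
    for record in evidence:
        totals[record.strategy] = totals.get(record.strategy, 0) + 1
        approvals[record.strategy] = approvals.get(record.strategy, 0) + int(record.approved)
    # Iterate in sorted order so the resulting mapping is stable.
    return {
        name: approvals[name] / totals[name] >= min_approval_rate
        for name in sorted(totals)
    }


evidence = [
    EvidenceRecord("momentum", True),
    EvidenceRecord("momentum", True),
    EvidenceRecord("mean_reversion", False),
]
policy = build_policy(evidence)
print(policy)
```

A future decision step could then consult the policy mapping before acting, which is one way experiments can influence later decisions without any reinforcement learning.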
v0.8.3 – Structural Evaluation Layer
v0.8.3 introduces the first formal evaluation layer for Executable World Models.
This release adds deterministic, schema-aware structural evaluation for both single runs and experiments, built on top of the canonical v2 manifest schema.
Run-Level Structural Evaluation
Evaluate a single run’s artifacts:
ewm run evaluate --artifacts-dir <path>
ewm run evaluate --artifacts-dir <root> --run-id <id>
Produces:
• Deterministic evaluation.json
• Integrity validation (manifest v2 enforcement)
• Constraint checks (runtime_budgets_max_steps, policy_limits)
• Structural metrics (steps_executed, truncated_by_budget)
Key properties:
• Deterministic output (no timestamps)
• Stable error codes (manifest_missing, run_id_mismatch, etc.)
• Writes evaluation output even on failure
• No AWS or network dependencies
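The properties above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the manifest file name (manifest.json), the manifest keys, the shape of evaluation.json, and the rule used for truncated_by_budget are all hypothetical. Only the error codes manifest_missing and run_id_mismatch and the output file name come from the notes above.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def evaluate_run(artifacts_dir: Path, run_id: Optional[str] = None) -> dict:
    """Structurally evaluate one run and always write evaluation.json."""
    integrity_errors = []
    metrics = {}
    manifest_path = artifacts_dir / "manifest.json"  # assumed file name

    if not manifest_path.exists():
        integrity_errors.append("manifest_missing")
    else:
        manifest = json.loads(manifest_path.read_text())
        if run_id is not None and manifest.get("run_id") != run_id:
            integrity_errors.append("run_id_mismatch")
        steps = manifest.get("steps_executed", 0)
        max_steps = manifest.get("runtime_budgets_max_steps")
        metrics = {
            "steps_executed": steps,
            # Assumed rule: a run is truncated once it reaches its step budget.
            "truncated_by_budget": max_steps is not None and steps >= max_steps,
        }

    evaluation = {
        "ok": not integrity_errors,
        "integrity_errors": integrity_errors,
        "metrics": metrics,
    }
    # No timestamps and sorted keys, so repeated evaluations are byte-identical.
    # The file is written on the failure path as well.
    output = json.dumps(evaluation, sort_keys=True, indent=2)
    (artifacts_dir / "evaluation.json").write_text(output)
    return evaluation


with tempfile.TemporaryDirectory() as tmp:
    failed = evaluate_run(Path(tmp))  # empty directory, so no manifest
    first = (Path(tmp) / "evaluation.json").read_bytes()
    evaluate_run(Path(tmp))
    second = (Path(tmp) / "evaluation.json").read_bytes()
```

Because the output contains no clock-derived fields, two evaluations of the same artifacts can be compared with a plain byte comparison or a diff.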
Experiment-Level Structural Aggregation
Aggregate metrics across multiple runs:
ewm experiment evaluate --experiment-dir <path>
Produces:
• evaluation_summary.json
• evaluation_runs.csv
Metrics include:
• total_runs
• avg_steps_executed
• pct_truncated_by_budget
• integrity summaries
• per-run structural results
Guarantees and validation:
• Manifest v2 canonical schema enforced
• Deterministic JSON (sort_keys=True)
• No optional dependency coupling
• No runtime/AWS imports in evaluation layer
• 231 unit tests passing
• Full AWS integration suite passing
• Observability validation passing
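The experiment-level aggregation described above could be computed along these lines. This is a hedged sketch: the per-run record layout, the CSV column set, and the choice to express pct_truncated_by_budget on a 0 to 100 scale are assumptions, not details confirmed by the release.

```python
import csv
import io
import json


def summarize(runs: list) -> dict:
    """Aggregate per-run structural results into experiment-level metrics."""
    total = len(runs)
    truncated = sum(1 for run in runs if run["truncated_by_budget"])
    steps = sum(run["steps_executed"] for run in runs)
    return {
        "total_runs": total,
        "avg_steps_executed": round(steps / total, 4) if total else 0.0,
        # Assumed to be a percentage in the range 0-100.
        "pct_truncated_by_budget": round(100.0 * truncated / total, 2) if total else 0.0,
    }


def runs_csv(runs: list) -> str:
    """Render per-run rows in a stable order (sorted by run_id)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["run_id", "steps_executed", "truncated_by_budget"],
        lineterminator="\n",
    )
    writer.writeheader()
    for run in sorted(runs, key=lambda run: run["run_id"]):
        writer.writerow(run)
    return buffer.getvalue()


runs = [
    {"run_id": "b", "steps_executed": 4, "truncated_by_budget": True},
    {"run_id": "a", "steps_executed": 2, "truncated_by_budget": False},
]
summary = summarize(runs)
summary_json = json.dumps(summary, sort_keys=True)
csv_text = runs_csv(runs)
```

Sorting rows by run_id and serializing with sort_keys=True keeps both output files independent of the order in which runs are discovered on disk.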
⸻
🧪 CLI Examples
Single run:
ewm run evaluate --artifacts-dir tmp/artifacts --run-id abc-123
Experiment:
ewm experiment evaluate --experiment-dir tmp/experiment_001
📦 Technical Notes
• runtime_budgets_max_steps is canonical (runtime_budget_max_steps retained for backward compatibility)
• integrity_errors now use stable error codes
• Evaluation writes output even if the manifest is invalid
• correlation_id remains canonical; trace_id retained for backward compatibility
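One way to honor a canonical key while still accepting its legacy spelling is a small fallback reader. The helper name read_with_fallback and the flat dictionary layout are assumptions made for illustration; only the four key names come from the notes above.

```python
def read_with_fallback(record: dict, canonical: str, legacy: str, default=None):
    """Return the canonical key if present, otherwise the legacy key."""
    if canonical in record:
        return record[canonical]
    return record.get(legacy, default)


def max_steps(manifest: dict):
    # runtime_budgets_max_steps is canonical; the singular form is legacy.
    return read_with_fallback(
        manifest, "runtime_budgets_max_steps", "runtime_budget_max_steps"
    )


def correlation_id(record: dict):
    # correlation_id is canonical; trace_id is accepted for older records.
    return read_with_fallback(record, "correlation_id", "trace_id")


old_manifest = {"runtime_budget_max_steps": 5}
new_manifest = {"runtime_budgets_max_steps": 7, "runtime_budget_max_steps": 5}
print(max_steps(old_manifest), max_steps(new_manifest))
```

Checking for key presence rather than truthiness means a canonical value of 0 is still preferred over a legacy value.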
v0.8.2.3
Stability & Test Baseline Release
This patch release improves CLI robustness and establishes a clean-green test baseline across local and AWS environments.
Improvements
- Lazy CLI import for experiment command: Optional dependencies are no longer required for non-experiment commands (mode, env, target, cost, runs, etc.).
- Removed eager certifi import: The experiment module now loads optional HTTPS dependencies only when needed (AWS target), eliminating unnecessary startup coupling.
- Subprocess test reliability: CLI subprocess tests now use sys.executable, ensuring consistent interpreter usage across environments.
- Improved AWS integration test portability: Integration tests resolve the artifacts bucket via CloudFormation outputs when ARTIFACT_BUCKET is not set.
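The lazy-import and sys.executable points above can be illustrated with a short sketch. The run_command dispatcher is hypothetical, and the standard-library json module stands in for an optional dependency such as certifi so that the example runs anywhere.

```python
import importlib
import subprocess
import sys


def run_command(name: str) -> str:
    """Dispatch a CLI command, importing optional modules only when needed."""
    if name == "experiment":
        # Deferred import: only the experiment path pays for the dependency.
        # `json` is a stand-in for an optional package in this sketch.
        module = importlib.import_module("json")
        return f"experiment ready ({module.__name__} loaded)"
    # Other commands never touch the optional dependency.
    return f"{name} ok"


# Subprocess tests invoke the interpreter that is running the tests,
# rather than whichever `python` happens to be first on PATH.
result = subprocess.run(
    [sys.executable, "-c", "print('cli ok')"],
    capture_output=True,
    text=True,
    check=True,
)
print(run_command("env"), "|", result.stdout.strip())
```

With this shape, a missing optional package only surfaces as an error when the experiment command is actually invoked.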
Test Baseline
• make lint → 0 errors
• pytest tests/ --ignore=infra → 224 passed, 5 skipped, 0 failed
• AWS deployment verified (/health returns 0.8.2.3)
• Observability verification script passes
Compatibility
• No breaking changes.
• No runtime behavior changes beyond improved CLI dependency handling.
v0.8.1-Agent-Runtime
This release introduces the first deployed Agent Runtime for executable-world-models.
Highlights
• Deployed /agentcore/loop execution endpoint
• Deterministic artifact upload to S3 (decision.json, trajectory.json, deltas.json)
• DynamoDB run persistence
• Budget semantics enforcement
• Correlation ID propagation across API, logs, and EMF metrics
• Structured observability and latency tracking
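As an illustration of carrying a correlation ID through EMF metrics, the sketch below builds one log line in the CloudWatch Embedded Metric Format. The namespace, metric name, and dimension name are hypothetical; the release does not state which ones the runtime uses.

```python
import json
import time


def emf_latency_record(correlation_id: str, latency_ms: float, timestamp_ms: int) -> str:
    """Build a single EMF log line for a latency measurement."""
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,  # milliseconds since the epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": "ExampleAgentRuntime",  # hypothetical name
                    "Dimensions": [["Endpoint"]],
                    "Metrics": [{"Name": "LoopLatency", "Unit": "Milliseconds"}],
                }
            ],
        },
        "Endpoint": "/agentcore/loop",
        "LoopLatency": latency_ms,
        # Plain property, not a dimension: searchable in logs without
        # creating one metric series per request.
        "correlation_id": correlation_id,
    }
    return json.dumps(record, sort_keys=True)


line = emf_latency_record("abc-123", 41.5, int(time.time() * 1000))
print(line)
```

Keeping correlation_id out of the Dimensions list is a deliberate choice in this sketch, since a per-request dimension would produce a very high number of distinct metrics.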
This release establishes the execution and deployment layers required for experimental evaluation of agent-based world models.
v0.8.1-fixes
What’s in this release
• S3 artifacts enabled: agentcore/loop now uploads decision.json, deltas.json, and trajectory.json to s3://beyondtokensstack-artifactsbucket2aac5544-vayurszcre4w/artifacts/<run_id>/
• Correlation ID propagation: x-correlation-id is preferred, falling back to the X-Ray trace ID and then to a UUID. The resolved ID is:
  • Returned in the API response as correlation_id
  • Logged in a grep-friendly line: correlation_id=
  • Used in EMF metrics (no trace_id)
• Response rounding: cash_balance is now rounded to 2 decimal places in API responses.
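The correlation ID fallback order and the response rounding can be sketched as below. Reading the X-Ray trace ID from the X-Amzn-Trace-Id request header is an assumption about where the service obtains it, and the function names are hypothetical.

```python
import uuid


def resolve_correlation_id(headers: dict) -> str:
    """Prefer x-correlation-id, then the X-Ray trace ID, then a fresh UUID."""
    normalized = {key.lower(): value for key, value in headers.items()}

    explicit = normalized.get("x-correlation-id")
    if explicit:
        return explicit

    # The X-Ray header looks like "Root=1-...;Parent=...;Sampled=1".
    for part in normalized.get("x-amzn-trace-id", "").split(";"):
        key, _, value = part.strip().partition("=")
        if key == "Root" and value:
            return value

    return str(uuid.uuid4())


def round_cash(value: float) -> float:
    """Round a cash balance to 2 decimal places for API responses."""
    return round(value, 2)


trace_header = "Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1"
print(resolve_correlation_id({"X-Correlation-Id": "abc-123"}))
print(resolve_correlation_id({"X-Amzn-Trace-Id": trace_header}))
print(round_cash(1234.5678))
```

Header names are lowercased before lookup because HTTP header names are case-insensitive and gateways may change their casing.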
Verified
• Deployed and tested in us-east-1
• GET /health ✅
• POST /agentcore/loop ✅
• DynamoDB run persistence ✅
• S3 artifact presence for new runs ✅
• CloudWatch logs include correlation_id ✅
Notes
• Older artifacts remain in S3 under prior run IDs; new runs now consistently write the three key artifacts.
v0.7.12-cli — Deterministic run inspection + guardrail transparency
This release strengthens the inspection and observability layer of the executable world model loop.
Key improvements:
• Human-readable decision field (APPROVED, REJECTED, UNKNOWN)
• Rounded financial values for deterministic CLI output
• Rejection summaries with step index, action, and limiter reason
• Support for --raw and --json output modes
• Expanded unit test coverage for runs inspection
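A rough sketch of how such an inspection summary might be assembled. The stored record layout (approved, cash_balance, rejections with step, action, and reason) and the function name summarize_run are hypothetical; only the three decision labels and the rounding behavior come from the list above.

```python
import json


def summarize_run(run: dict) -> dict:
    """Produce a human-readable, deterministic summary of one stored run."""
    decision = {True: "APPROVED", False: "REJECTED"}.get(run.get("approved"), "UNKNOWN")
    rejections = [
        f"step {item['step']}: {item['action']} rejected ({item['reason']})"
        for item in run.get("rejections", [])
    ]
    return {
        "decision": decision,
        # Rounded so repeated CLI output does not differ in float noise.
        "cash_balance": round(run.get("cash_balance", 0.0), 2),
        "rejections": rejections,
    }


run = {
    "approved": False,
    "cash_balance": 1000.456,
    "rejections": [{"step": 2, "action": "BUY", "reason": "max_position_exceeded"}],
}
summary = summarize_run(run)
# A --json style output would serialize the summary with stable key order.
print(json.dumps(summary, sort_keys=True))
```

A missing or unrecognized approved value maps to UNKNOWN rather than raising, so older records can still be inspected.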
Why this matters:
The loop is now externally inspectable and deterministic.
Guardrail decisions are transparent, reproducible, and versioned via release tags.