etalbert102/decision_space_harness

Decision Space Harness

Decision Space Harness is a configuration-driven evaluation harness for measuring how LLM-agent pipelines preserve or compress disagreement, frame diversity, and option breadth during evidence synthesis.

Current State

The repo currently supports:

  • 3 registered task families: conflict_preservation, frame_diversity, option_space
  • 4 built-in agents: baseline_direct, retrieve_then_synthesize, option_generation, structured_conflict_preserving
  • heuristic and Ollama-backed model providers
  • named evidence providers including top1, top2, top5, top10, and precomputed_top2
  • benchmark_single_run/v1 and perturbation_group/v1 study protocols
  • structural metrics for conflict retention, frame preservation, option breadth, path dependence, and lexical overlap
  • append-only attempts.jsonl telemetry with derived runs.jsonl, metrics.jsonl, steps.jsonl, and optional message_records.jsonl
  • experiment report artifacts, repro bundles, and fidelity traces
  • rerun support for failed cells without deleting prior attempt history
  • 40 conflict-preservation benchmark tasks plus a frame-diversity smoke fixture
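The append-only attempt history and the rerun behaviour listed above can be pictured with a small sketch. The field names (`cell_id`, `attempt`, `status`) and the "highest attempt number wins" selection rule are assumptions made for illustration only; the harness's actual schema and selection logic may differ.

```python
from typing import Dict, Iterable, List


def select_runs(attempts: Iterable[dict]) -> Dict[str, dict]:
    """Pick one attempt per cell: the one with the highest attempt number.

    Hypothetical rule; earlier attempts are never removed from the input.
    """
    selected: Dict[str, dict] = {}
    for rec in attempts:
        current = selected.get(rec["cell_id"])
        if current is None or rec["attempt"] > current["attempt"]:
            selected[rec["cell_id"]] = rec
    return selected


def failed_cells(selected: Dict[str, dict]) -> List[str]:
    """Return the cells whose selected attempt failed, in sorted order."""
    return sorted(c for c, rec in selected.items() if rec["status"] == "failed")


# Illustrative attempt history: task_a failed once and was then rerun.
attempts = [
    {"cell_id": "task_a/baseline_direct", "attempt": 1, "status": "failed"},
    {"cell_id": "task_b/baseline_direct", "attempt": 1, "status": "failed"},
    {"cell_id": "task_a/baseline_direct", "attempt": 2, "status": "ok"},
]

selected = select_runs(attempts)
print(failed_cells(selected))  # ['task_b/baseline_direct']
```

In this sketch the failed first attempt for task_a stays in the history, but the selected view points at the later successful attempt, so only task_b would be picked up by a failed-only rerun.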

Quick Start

Run a sample experiment:

cd /home/etalbert102/decision_space_harness
source .venv/bin/activate
PYTHONPATH=src python -m decision_space_harness.experiments.runner experiments/configs/conflict_smoothing_v1.yaml

Rerun only the cells whose currently selected run failed:

PYTHONPATH=src python -m decision_space_harness.experiments.runner \
  --rerun-failed-only \
  experiments/configs/conflict_smoothing_v1.yaml

Run the frame-diversity smoke experiment:

PYTHONPATH=src python -m decision_space_harness.experiments.runner experiments/configs/frame_diversity_smoke_v1.yaml
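The runner invocations above can also be scripted from Python, for example to queue several configs. The following is a minimal sketch that only assembles the same command line shown above; the helper names `build_runner_command` and `run_experiment` are hypothetical and not part of the harness.

```python
import os
import subprocess
import sys
from typing import List

RUNNER_MODULE = "decision_space_harness.experiments.runner"


def build_runner_command(config_path: str, rerun_failed_only: bool = False) -> List[str]:
    """Assemble the argv for the runner module, mirroring the shell commands."""
    cmd = [sys.executable, "-m", RUNNER_MODULE]
    if rerun_failed_only:
        cmd.append("--rerun-failed-only")
    cmd.append(config_path)
    return cmd


def run_experiment(config_path: str, rerun_failed_only: bool = False) -> int:
    """Run the command from the repo root with PYTHONPATH=src; return the exit code."""
    env = dict(os.environ, PYTHONPATH="src")
    cmd = build_runner_command(config_path, rerun_failed_only)
    return subprocess.run(cmd, env=env, check=False).returncode
```

Calling `run_experiment("experiments/configs/conflict_smoothing_v1.yaml", rerun_failed_only=True)` from the repository root, with the virtual environment active, would be equivalent to the second shell command above.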

Run tests:

source .venv/bin/activate
PYTHONPATH=src pytest -q

Run fidelity assessment:

source .venv/bin/activate
PYTHONPATH=src python -m decision_space_harness.fidelity.assessment experiments/configs/conflict_smoothing_v1.yaml

Main Outputs

Each experiment writes into outputs/experiments/<experiment_id>/:

  • attempts.jsonl: canonical append-only attempt history
  • runs.jsonl: deterministic selected-run view
  • metrics.jsonl: scored metric rows
  • steps.jsonl: boundary and diagnostic step records
  • message_records.jsonl: optional intra-attempt communication trace
  • summary.json: experiment summary
  • report.md: markdown report
  • figures/: text and HTML metric summaries
  • repro_bundle/: config, manifests, environment, and supporting reproducibility artifacts
  • fidelity_trace.json: fidelity-framework event trace
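Because most of these outputs are JSON Lines files, they can be inspected with a few lines of standard-library Python. This is a minimal sketch; the `status` key used in the usage note is an assumed field name, so check the actual records before relying on it.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse an iterable of JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_jsonl(path: str) -> List[dict]:
    """Read a .jsonl file such as runs.jsonl or metrics.jsonl into a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return parse_jsonl(fh)


def count_by(rows: Iterable[dict], key: str) -> Counter:
    """Count rows by the value stored under `key`; rows without it are grouped."""
    return Counter(row.get(key, "<missing>") for row in rows)
```

For example, `count_by(load_jsonl("outputs/experiments/<experiment_id>/runs.jsonl"), "status")` would tally selected runs by status, assuming each row carries a `status` field.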

Example Configs

  • experiments/configs/conflict_smoothing_v1.yaml
  • experiments/configs/conflict_full_suite_v1.yaml
  • experiments/configs/conflict_phase3_analysis_v1.yaml
  • experiments/configs/conflict_confirmatory_demo_v1.yaml
  • experiments/configs/conflict_ollama_demo_v1.yaml
  • experiments/configs/frame_diversity_smoke_v1.yaml

Documentation
