diff --git a/.gitignore b/.gitignore
index 17aa1eeb..ba03db69 100644
--- a/.gitignore
+++ b/.gitignore
@@ -168,4 +168,5 @@ OAI_CONFIG_LIST
 *.gv.pdf
 # jupyter book API output
-docs/api/*
\ No newline at end of file
+docs/api/*
+examples/notebooks/bbeh/
+t6_m2_bbeh_2.ipynb
diff --git a/README.md b/README.md
index fdfc153c..e46a2e00 100644
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Or for development, clone the repo and run the following.
 pip install -e .
-The library requires Python >= 3.9. By default (starting with v0.1.3.5), we use [LiteLLM](https://github.com/BerriAI/litellm) as the backend of LLMs. For backward compatibility, we provide backend-support with [AutoGen](https://github.com/microsoft/autogen); when installing, users can add `[autogen]` tag to install a compatible AutoGen version (e.g., `pip install trace-opt[autogen]`). You may require [Git Large File Storage](https://git-lfs.com/) if
+The library requires Python >= 3.10. By default (starting with v0.1.3.5), we use [LiteLLM](https://github.com/BerriAI/litellm) as the backend of LLMs. For backward compatibility, we provide backend-support with [AutoGen](https://github.com/microsoft/autogen); when installing, users can add `[autogen]` tag to install a compatible AutoGen version (e.g., `pip install trace-opt[autogen]`). You may require [Git Large File Storage](https://git-lfs.com/) if
 git is unable to clone the repository. **For questions or reporting bugs, please use Github Issues or post on our [Discord channel](https://discord.gg/4VeAvwFcWy).
We actively check these channels.**
@@ -241,6 +241,36 @@ Defining and training an agent through Trace will give you more flexibility and
 | Advanced | [Robotic Arm Control](https://agentopt.github.io/Trace/examples/robotics/metaworld.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/Trace/blob/website/docs/examples/robotics/metaworld.ipynb) | Trace can optimize code to control a robotic arm after observing a full trajectory of interactions. |
+## Multi-Objective Optimization
+
+Trace supports **multi-objective optimization** where candidates are evaluated on
+multiple metrics simultaneously (e.g. accuracy + token cost, or base loss +
+regularization loss).
+
+See the full guide: **[docs/multi_objective_scores.md](docs/multi_objective_scores.md)**
+
+Key features:
+- **Vector scores** — `Guide.get_score_dict()` returns `Dict[str, float]` with named metrics
+- **Weighted scalarization** and **Pareto dominance** ranking via `ObjectiveConfig`
+- Supported in `BasicSearchAlgorithm`, `BeamsearchAlgorithm`, and `BeamsearchHistoryAlgorithm`
+- Token-minimization pattern using `UsageTrackingLLM` + `TokenUsageAugmentingGuide`
+
+Canonical notebooks:
+
+| Notebook | Description |
+|---|---|
+| [multiobjective_quickstart](examples/notebooks/multiobjective_quickstart.ipynb) | Core vector-score infrastructure and BasicSearch integration |
+| [multiobjective_trainers](examples/notebooks/multiobjective_trainers.ipynb) | Beamsearch and PrioritySearch multi-objective support |
+| [multiobjective_bbeh_langgraph](examples/notebooks/multiobjective_bbeh_langgraph.ipynb) | Real LLM task: BBEH boolean expressions with accuracy + execution time |
+
+Trace-Bench multi-objective benchmarks (in [AgentOpt/Trace-Bench](https://github.com/AgentOpt/Trace-Bench)):
+
+| Notebook | Task | Metrics |
+|---|---|---|
+| `multiobjective_convex` | SixHumpCamel | base_loss, reg_loss |
+| `multiobjective_bbeh` | BBEH
boolean_expressions | accuracy, execution_time_s | +| `multiobjective_gsm8k` | GSM8K + token usage | error, tokens_in, tokens_out | + ## Supported Optimizers Currently, we support three optimizers: diff --git a/docs/T6_technical_plan.md b/docs/T6_technical_plan.md deleted file mode 100644 index 60891818..00000000 --- a/docs/T6_technical_plan.md +++ /dev/null @@ -1,843 +0,0 @@ -# T6 Technical Plan — Multi-Objective Vector Scores for Trainer Selection - -**Version:** 1.0 (Refined) -**Author:** Carlos Rodriguez -**Date:** February 9, 2026 -**Status:** M0 Deliverable — Analysis + Architecture + Interface Spec - -**Target repos / branches:** -- **Primary (implementation + PR):** [`AgentOpt/OpenTrace@experimental`](https://github.com/AgentOpt/OpenTrace/tree/experimental) -- **Benchmark integration (M3):** [`AgentOpt/Trace-Bench`](https://github.com/AgentOpt/Trace-Bench) - ---- - -## Table of Contents - -1. [Executive Summary](#1-executive-summary) -2. [Goals, Non-Goals, Success Criteria](#2-goals-non-goals-success-criteria) -3. [Current Code Reality (Baseline)](#3-current-code-reality-baseline) -4. [Proposed Architecture (Minimal Delta)](#4-proposed-architecture-minimal-delta) -5. [Public API & Data Contracts](#5-public-api--data-contracts) -6. [Module Modifications (Files to Create / Modify)](#6-module-modifications) -7. [Edge Cases & Defensive Design](#7-edge-cases--defensive-design) -8. [Milestones & Validation Gates](#8-milestones--validation-gates) -9. [Test Plan](#9-test-plan) -10. [Risks & Mitigation](#10-risks--mitigation) -11. [Design Decisions (Resolved)](#11-design-decisions-resolved) -12. [Appendix: Code Touchpoints](#12-appendix-code-touchpoints) - ---- - -## 1. Executive Summary - -Today, trainer selection in Trace is driven by a **single scalar score**. 
Guides return `Tuple[float, str]` via `get_feedback()`, evaluators produce `np.array` of floats, and trainers (`BasicSearchAlgorithm`, `BeamsearchAlgorithm`) select candidates via scalar comparison (`max(candidates, key=lambda x: x[0])` and `sorted(..., key=lambda x: x[0])` respectively). This blocks trainer-side search from exploiting multiple metrics like `{accuracy, latency_ms, cost}`. - -**Motivation note (from team discussion):** -Putting multiple metrics into the *feedback dict/text* is useful for optimizers (OptoPrime/OPRO), but trainers (BasicSearch/UCB/PrioritySearch/GEPA) typically only inspect the **scalar score** for ranking/UCB and ignore additional feedback structure. Therefore, enabling **vector score / score-as-dict** (with backward-compatible scalar reduction) is required for multi-objective trainer selection. - -### What this plan adds - -| Component | Change | -|-----------|--------| -| **Score contract** | `Dict[str, float]` returned by guides (optional), with backward-compatible scalar fallback | -| **ObjectiveConfig** | Frozen dataclass defining selection mode: `scalar` (default), `weighted`, or `pareto` | -| **objectives.py** (new) | All multi-objective logic isolated in pure, testable functions | -| **Evaluators** | Vector-score aggregation helpers (`evaluate_vector`, `aggregate_vector_scores`) | -| **BasicSearchAlgorithm** | Selection via `select_best(candidates, objective_config)` | -| **BeamsearchAlgorithm** | Selection via `select_top_k(candidates, objective_config, k)` | -| **PrioritySearch** (optional) | Scalarize heap priority via ObjectiveConfig; store dict for logging | -| **Benchmarks** (M3) | 3 simple benchmarks integrated into Trace-Bench | - -### Guiding principles - -- **Backward compatibility is non-negotiable.** `mode="scalar"` (the default) preserves identical behavior. -- **Isolate complexity.** All multi-objective logic lives in `objectives.py` — pure functions, easy to test. 
-- **Minimal churn.** Trainers gain an optional `objective_config` parameter; existing call sites are untouched. -- **Determinism.** Fixed `seed` → deterministic selection, especially Pareto tie-breaks. - ---- - -## 2. Goals, Non-Goals, Success Criteria - -### 2.1 Goals - -| ID | Goal | Acceptance Signal | -|----|------|-------------------| -| G1 | **Backward compatibility** | Existing scalar-score guides/trainers produce identical results when `objective_config` is `None` or `mode="scalar"` | -| G2 | **Vector score support** | Guide returns `{"accuracy": 1.0, "latency_ms": 120.0}` and trainers select candidates using weighted or Pareto mode | -| G3 | **Determinism** | Fixed `seed` → identical selection across runs (tested in CI) | -| G4 | **Actionability** | Every milestone: Colab notebook + pytest coverage (M1+) | -| G5 | **Benchmarks** | 3 benchmarks defined, integrated into Trace-Bench, runnable from notebooks | - -### 2.2 Non-goals (explicit) - -- No multi-objective UCB (MO-UCB) — too risky for v1 scope. -- No Pareto archive / non-dominated set management inside PrioritySearch. -- No changes to optimizer internals or new telemetry infrastructure. -- No modification to `get_feedback()` return signature (we use a helper instead). - -### 2.3 Crisp success criteria - -All of the following must be true: - -1. Scalar-only trainers still work and produce same results by default. -2. Multi-objective guide dict works end-to-end for BasicSearch + Beamsearch. -3. Deterministic behavior with fixed seed (tests + notebook). -4. Each milestone delivers a runnable Colab notebook. -5. From M1 onward, new functions have pytest tests and CI is green. -6. M3: three benchmarks exist, run, and Trace-Bench integration works. - ---- - -## 3. 
Current Code Reality (Baseline) - -### 3.1 Guide — scalar score contract - -```python -# opto/trainer/guide.py - -class Guide: - def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]: - raise NotImplementedError - - def metric(self, query, response, reference=None, **kwargs) -> float: - return self.get_feedback(query, response, reference)[0] # extracts scalar -``` - -**Implication:** `metric()` always returns `float`. Multi-metric feedback is not usable for selection. - -### 3.2 Evaluators — scalar arrays - -```python -# opto/trainer/evaluators.py - -def evaluate(agent, guide, inputs, infos, ...) -> np.ndarray: - # Calls guide.metric() per example → float - # Returns np.array of shape (N,) or (N, num_samples) -``` - -**Implication:** All scores are numeric scalars aggregated via `np.mean()`. - -### 3.3 BasicSearchAlgorithm — scalar max selection - -```python -# opto/trainer/algorithms/basic_algorithms.py :: BasicSearchAlgorithm.optimizer_step() - -def validate(): - scores = evaluate(self.agent, self.validate_guide, ...) - return np.mean(scores) if all([s is not None for s in scores]) else -np.inf - -# Selection: -candidates.append((score, update_dict)) # score is float -best_score, best_update = max(candidates, key=lambda x: x[0]) # scalar max -``` - -**Insertion point:** Replace `max(candidates, ...)` with `select_best(candidates, objective_config)`. - -### 3.4 BeamsearchAlgorithm — scalar sort selection - -```python -# opto/trainer/algorithms/beamsearch_algorithm.py :: BeamsearchAlgorithm.select() - -scored_candidates.append((validation_score, candidate_params)) # float -sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True) -selected_candidates = sorted_candidates[:beam_width] # take top-k by scalar -``` - -**Insertion point:** Replace scalar sort with `select_top_k(scored_candidates, objective_config, k=beam_width)`. 
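The planned call-site change can be sketched as follows. This is a minimal scalar-mode stand-in for the proposed `select_top_k` (the real function, specified later in §5.5, also handles weighted and Pareto modes); the candidate data is illustrative.

```python
# Hypothetical stand-in for the proposed select_top_k (scalar fallback only).
def select_top_k(candidates, objective_config=None, k=1):
    # Stable descending sort by scalar score; lower original index wins ties.
    order = sorted(range(len(candidates)),
                   key=lambda i: candidates[i][0], reverse=True)
    return order[:k]

# Target call site in BeamsearchAlgorithm.select():
scored_candidates = [(0.7, "params_a"), (0.9, "params_b"), (0.8, "params_c")]
beam_width = 2
top = select_top_k(scored_candidates, objective_config=None, k=beam_width)
selected_candidates = [scored_candidates[i] for i in top]
# selected_candidates == [(0.9, "params_b"), (0.8, "params_c")]
```

Returning indices rather than sorted tuples keeps the `(score, params)` pairing intact, which matters once scores become dicts.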
- -### 3.5 Shared patterns across both trainers - -| Pattern | BasicSearch | Beamsearch | -|---------|-------------|------------| -| Validate | `np.mean(scores)` → float | `np.mean(validation_scores)` → float | -| Store | `(score, update_dict)` | `(validation_score, candidate_params)` | -| Select | `max(candidates, key=λ x: x[0])` | `sorted(candidates, key=λ x: x[0])[:k]` | -| Fallback | `-np.inf` | `-np.inf` | - -Both converge to the same abstraction: **given a list of `(score, params)` pairs, select the best or top-k.** This is exactly what `objectives.py` will provide. - -### 3.6 Existing infrastructure we leverage - -- **Logger abstraction:** `BaseLogger` with `log(name, value, step)` — can log each metric in a vector score. -- **StubLLM / DummyLLM:** Wraps deterministic callables — usable for CI and no-keys notebooks. -- **`batch_run` / `async_run`:** Parallelism utilities already in place. - ---- - -## 4. Proposed Architecture (Minimal Delta) - -### 4.1 Core idea - -Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**: - -``` -to_score_dict() → scalar/dict to dict conversion (neutral name) -apply_minimize() → flip signs for minimize metrics -weighted_scalarize()→ dict → float via weighted sum -pareto_rank() → dominance ranking + tie-break -select_best() → given candidates + config, return best index -select_top_k() → given candidates + config, return top-k indices -``` - -Trainers call these functions instead of inline `max()` / `sorted()`. When `objective_config` is `None`, the functions fall through to scalar comparison — **identical to current behavior**. 
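To make the Pareto path concrete, here is a toy illustration of dominance-based selection (assumed semantics only; the exact signatures are specified in §5.5, and all score dicts are taken to be in higher-is-better form):

```python
# Toy dominance check and Pareto front over higher-is-better score dicts.
def dominates(a, b):
    keys = set(a) | set(b)
    return (all(a[k] >= b[k] for k in keys)
            and any(a[k] > b[k] for k in keys))

def pareto_front(score_dicts):
    # Indices of candidates not dominated by any other candidate.
    return [i for i, c in enumerate(score_dicts)
            if not any(dominates(o, c)
                       for j, o in enumerate(score_dicts) if j != i)]

cands = [{"accuracy": 0.9, "speed": 0.1},   # best accuracy
         {"accuracy": 0.5, "speed": 0.9},   # best speed: a genuine trade-off
         {"accuracy": 0.4, "speed": 0.8}]   # dominated by the one above
print(pareto_front(cands))  # -> [0, 1]
```

The first two candidates survive because neither beats the other on every metric; this is exactly the set that the tie-break policy must then reduce to a single winner.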
- -### 4.2 Data flow (target) - -``` -Guide.get_feedback() - │ - ├── returns (float, str) ← existing path, unchanged - └── returns (Dict[str,float], str) ← new path (via get_score_dict helper) - │ - ▼ -Evaluator.evaluate_vector() - │ - ├── per-example: List[Dict[str, float]] - └── aggregated: Dict[str, float] (mean per metric) - │ - ▼ -Trainer selection (objectives.py) - │ - ├── mode="scalar" → max(mean_scores) ← unchanged - ├── mode="weighted" → max(weighted_scalarize()) ← new - └── mode="pareto" → pareto_rank() + tie-break ← new -``` - -### 4.3 Backward compatibility guarantee - -The entire vector-score path is **opt-in**: - -1. If `objective_config` is `None` → existing scalar path, no new code executed. -2. If guide returns `float` and `objective_config` is provided → `to_score_dict()` wraps it as `{"score": float}`, weights default to `{"score": 1.0}`. -3. If guide returns `Dict[str, float]` and `objective_config` is `None` → `ValueError` is raised (no hidden hard-coded dict→scalar reduction). Pass an explicit `ObjectiveConfig(mode="scalar", scalarize_dict="mean")` to reduce via mean, or `scalarize_dict="score"` to use a single key. - ---- - -## 5. Public API & Data Contracts - -### 5.1 Score types - -```python -from typing import Union, Dict - -ScalarScore = float -VectorScore = Dict[str, float] # JSON-serializable, all values finite -ScoreLike = Union[int, float, bool, Dict[str, float]] -``` - -**Contract:** -- "Higher is better" by default for all metrics. -- Metrics to minimize are declared in `ObjectiveConfig.minimize` (semantics: negate internally). -- All dict values must be finite floats. `NaN` / `±inf` in a dict raises `ValueError`. -- `int` and `bool` scalar scores are accepted and converted to `float` (e.g., `LLMJudge` returns `int` 0/1, test guides return `bool`). 
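The contract above can be sketched as follows (assumed behavior of the proposed `to_score_dict`; the authoritative signature lives in §5.5):

```python
import math

# Hypothetical sketch of the score-normalization contract.
def to_score_dict(score):
    if isinstance(score, (int, float)):          # bool is a subclass of int
        return {"score": float(score)}
    if isinstance(score, dict):
        if not score:
            raise ValueError("Score dict must not be empty")
        out = {k: float(v) for k, v in score.items()}
        if any(not math.isfinite(v) for v in out.values()):
            raise ValueError("Score dict contains non-finite value")
        return out
    raise TypeError("Score must be int, float, bool, or Dict[str, float]")

assert to_score_dict(True) == {"score": 1.0}   # bool from test guides
assert to_score_dict(1) == {"score": 1.0}      # int from LLMJudge
assert to_score_dict({"accuracy": 0.9}) == {"accuracy": 0.9}
```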
- -### 5.2 ObjectiveConfig - -```python -from dataclasses import dataclass, field -from typing import Literal, Optional, Dict, Tuple - -@dataclass(frozen=True) -class ObjectiveConfig: - """Configuration for multi-objective candidate selection. - - Attributes: - mode: Selection strategy. - - "scalar": Use existing scalar comparison (default, backward-compatible). - - "weighted": Scalarize via weighted sum, then select max. - - "pareto": Pareto dominance ranking with configurable tie-break. - weights: Per-metric weights for weighted scalarization. - Missing metrics use missing_value. Metrics not present in the weights dict - are ignored (not included in the weighted sum). - If empty dict in weighted mode, all present metrics get equal weight 1.0. - minimize: Frozenset of metric names where lower is better (users can pass set; auto-converted). - These are negated internally before comparison ("higher-is-better" normalization). - missing_value: Score assigned to missing metrics in a candidate's score dict. - Default: float('-inf') (effectively disqualifies candidates missing required metrics). - pareto_metrics: Subset of metrics to use for Pareto dominance. - If None, all metrics present across candidates are used. - tie_break: Strategy for breaking ties among Pareto-equivalent candidates. - - "weighted": Fall back to weighted scalarization among tied candidates. - - "lexicographic": Sort by metrics in alphabetical order. - - "random_seeded": Seeded random shuffle. - seed: Random seed for deterministic tie-breaking. 
- """ - mode: Literal["scalar", "weighted", "pareto"] = "scalar" - weights: Dict[str, float] = field(default_factory=dict) - minimize: frozenset = field(default_factory=frozenset) - missing_value: float = float("-inf") - pareto_metrics: Optional[Tuple[str, ...]] = None - tie_break: Literal["weighted", "lexicographic", "random_seeded"] = "weighted" - seed: int = 0 - - def __post_init__(self): - # Convert set → frozenset for true immutability + hashability - if isinstance(self.minimize, set): - object.__setattr__(self, 'minimize', frozenset(self.minimize)) - # Validate weights are non-negative - for k, v in self.weights.items(): - if v < 0: - raise ValueError(f"Weight for '{k}' must be non-negative, got {v}") - # Validate pareto_metrics - if self.pareto_metrics is not None and len(self.pareto_metrics) == 0: - raise ValueError("pareto_metrics must be None (auto) or non-empty tuple") -``` - -**Validation rules (enforced in `__post_init__`):** -- `minimize` is stored as `frozenset` for true immutability (users can pass `set` for convenience; it's auto-converted). -- `mode="weighted"` with empty `weights` → auto-assign equal weight 1.0 to all encountered metrics. -- `mode="pareto"` with `pareto_metrics=None` → use union of all metric keys across candidates. -- `mode="pareto"` with `pareto_metrics=()` → `ValueError`. -- All weight values must be non-negative. -- `minimize` metric names must be valid strings (warning if not found in any candidate). - -### 5.3 Guide helper method - -```python -# Added to Guide base class (non-breaking) - -class Guide: - # ... existing methods unchanged ... - - def get_score_dict(self, query: str, response: str, reference=None, **kwargs) -> Dict[str, float]: - """Return evaluation score as a dict (multi-objective selection path). - - Default implementation wraps the scalar training score from get_feedback() as: - {"score": float_value} - - Guides that need multiple metrics should override *get_score_dict()* and return - e.g. 
{"accuracy": 0.9, "brevity": 0.8, "latency_s": 0.05}. - - Note: get_feedback() should remain scalar (float) for training-loop backward - compatibility. If a subclass returns a dict from get_feedback(), metric() and - scalar evaluators may break; prefer overriding get_score_dict(). - """ - score, _ = self.get_feedback(query, response, reference, **kwargs) - if isinstance(score, dict): - return {k: float(v) for k, v in score.items()} - return {"score": float(score)} -``` - -**Why this approach:** -- `get_score_dict()` is a new method — zero risk of breaking existing subclasses. -- `metric()` always returns `float` — the existing `evaluate()` function (which calls `guide.metric()` and passes results to `np.array()`) and the training loop (which calls `np.mean(scores)`) are completely unaffected. -- Dict scores are only accessible via `get_score_dict()` → `evaluate_vector()`, keeping the two data paths cleanly separated. - -### 5.4 Evaluator additions - -```python -# Added to opto/trainer/evaluators.py - -def evaluate_vector(agent, guide, inputs, infos, min_score=None, - num_samples=1, num_threads=None, description=None - ) -> list: - """Like evaluate(), but returns List[ScoreLike] (float or dict per example). - - Uses guide.get_score_dict() to obtain dict scores per example. - When guide returns scalar, get_score_dict() wraps it as {"score": float}. - - When num_samples > 1: for each example, collects num_samples score dicts, - computes per-key mean across the samples, and returns one aggregated dict - per example. Final output is always List[Dict[str, float]] of length N. - """ - ... - -def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]: - """Aggregate per-example scores into a single summary score. - - - If all scores are float: returns np.mean (existing behavior). - - If all scores are dict: returns per-metric mean dict. - - Mixed float/dict: normalizes all to dict via to_score_dict(), then averages. 
-
-    Args:
-        scores: List of float or Dict[str, float] values.
-
-    Returns:
-        float (if all scalar) or Dict[str, float] (if any dicts present).
-    """
-    ...
-```
-
-### 5.5 objectives.py — complete function signatures
-
-```python
-# opto/trainer/objectives.py (NEW FILE)
-
-from typing import Any, Union, Dict, List, Set, Optional, Tuple, Literal
-from dataclasses import dataclass, field
-
-# --- ObjectiveConfig defined here (see §5.2) ---
-
-# --- Score type aliases ---
-ScalarScore = float
-VectorScore = Dict[str, float]
-ScoreLike = Union[float, Dict[str, float]]
-
-# --- Pure utility functions ---
-
-def to_score_dict(score: ScoreLike) -> Dict[str, float]:
-    """Convert any score to dict form (neutral name).
-
-    - int/float/bool → {"score": float(value)}
-    - Dict[str, float] → returned as-is (validated: all values finite)
-
-    Handles int (LLMJudge returns 0/1) and bool (test guides) via isinstance(score, (int, float, bool)).
-    Backward-compatible alias: `normalize_score = to_score_dict`
-
-    Raises:
-        TypeError: if score is not int, float, bool, or dict
-        ValueError: if dict contains non-finite values or is empty
-    """
-    ...
-
-def apply_minimize(score_dict: Dict[str, float],
-                   minimize: Set[str]) -> Dict[str, float]:
-    """Negate values for minimize metrics (higher-is-better normalization).
-
-    Returns a new dict with minimize metrics negated.
-    Metrics not in minimize set are unchanged.
-    """
-    ...
-
-def weighted_scalarize(score_dict: Dict[str, float],
-                       weights: Dict[str, float],
-                       missing_value: float = float("-inf")) -> float:
-    """Compute weighted sum of score dict.
-
-    For each metric in weights:
-    - If present in score_dict: weight * value
-    - If missing: weight * missing_value
-
-    Metrics in score_dict but NOT in weights are ignored.
-    If weights is empty, all metrics get equal weight 1.0.
-
-    Returns:
-        Weighted scalar score.
-    """
-    ...
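-# Worked example (illustrative), chaining the two helpers above with
-# minimize={"latency_ms"} and weights={"accuracy": 0.7, "latency_ms": 0.3}:
-#   raw   = {"accuracy": 0.8, "latency_ms": 120.0}
-#   norm  = apply_minimize(raw, {"latency_ms"})  # {"accuracy": 0.8, "latency_ms": -120.0}
-#   value = weighted_scalarize(norm, weights)    # 0.7*0.8 + 0.3*(-120.0) = -35.44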
-
-def dominates(a: Dict[str, float], b: Dict[str, float],
-              metrics: Optional[Tuple[str, ...]] = None) -> bool:
-    """Check if candidate 'a' Pareto-dominates candidate 'b'.
-
-    a dominates b iff:
-    - a[m] >= b[m] for all metrics m, AND
-    - a[m] > b[m] for at least one metric m
-
-    Both dicts must already be in "higher-is-better" form (post apply_minimize).
-    Missing metrics are treated as missing_value (caller should handle before call).
-
-    Args:
-        a, b: Score dicts (higher-is-better normalized).
-        metrics: Subset of metrics to compare. If None, use union of keys.
-    """
-    ...
-
-def pareto_rank(candidates: List[Dict[str, float]],
-                metrics: Optional[Tuple[str, ...]] = None) -> List[int]:
-    """Assign Pareto rank to each candidate (0 = non-dominated front).
-
-    Uses standard non-dominated sorting.
-
-    Args:
-        candidates: List of score dicts (higher-is-better normalized).
-        metrics: Subset of metrics for dominance. If None, use all present.
-
-    Returns:
-        List of integer ranks (same length as candidates). Rank 0 = Pareto front.
-    """
-    ...
-
-def select_best(candidates: List[Tuple[ScoreLike, Any]],
-                objective_config: Optional['ObjectiveConfig'] = None) -> int:
-    """Select the single best candidate index.
-
-    Args:
-        candidates: List of (score, payload) tuples.
-        objective_config: Selection config. If None, uses scalar max (backward-compatible).
-
-    Returns:
-        Index of best candidate.
-
-    Behavior by mode:
-    - scalar/None: max(score) where score is float (or mean of dict values).
-    - weighted: max(weighted_scalarize(normalize(score), config.weights)).
-    - pareto: rank candidates, tie-break among rank-0 front, return winner.
-
-    Call-site transformation (BasicSearch):
-        # Current:
-        best_score, best_update = max(candidates, key=lambda x: x[0])
-        # Target:
-        best_idx = select_best(candidates, objective_config)
-        best_score, best_update = candidates[best_idx]
-    """
-    ...
-
-def select_top_k(candidates: List[Tuple[ScoreLike, Any]],
-                 objective_config: Optional['ObjectiveConfig'] = None,
-                 k: int = 1) -> List[int]:
-    """Select the top-k candidate indices.
-
-    Same logic as select_best, but returns k indices.
-
-    For pareto mode: returns rank-0 front (up to k). If front < k,
-    includes rank-1 candidates by tie-break order, etc.
-
-    Deterministic ordering guaranteed with fixed seed.
-    """
-    ...
-```
-
----
-
-## 6. Module Modifications
-
-### 6.1 Files to CREATE
-
-| File | Contents | Milestone |
-|------|----------|-----------|
-| `opto/trainer/objectives.py` | `ObjectiveConfig`, `to_score_dict`, `apply_minimize`, `weighted_scalarize`, `dominates`, `pareto_rank`, `select_best`, `select_top_k`, `score_dict_to_scalar`, `to_scalar_score`, `aggregate_score_dicts` | M1 |
-| `tests/test_objectives.py` | Unit tests for all functions in objectives.py | M1 |
-| `tests/test_evaluators_vector.py` | Tests for evaluate_vector + aggregate_vector_scores | M1 |
-| `tests/test_trainers_multiobjective.py` | Integration tests for BasicSearch + Beamsearch with ObjectiveConfig | M2 |
-| `examples/notebooks/t6_m0_analysis.ipynb` | M0 analysis notebook | M0 |
-| `examples/notebooks/t6_m1_vector_scores.ipynb` | M1 demo notebook | M1 |
-| `examples/notebooks/t6_m2_trainers.ipynb` | M2 demo notebook | M2 |
-| `examples/notebooks/t6_m3_benchmarks.ipynb` | M3 benchmark notebook | M3 |
-| `docs/T6_technical_plan.md` | This document | M0 |
-| `docs/multi_objective_scores.md` | User-facing documentation | M4 |
-
-### 6.2 Files to MODIFY
-
-| File | Change | Milestone |
-|------|--------|-----------|
-| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Keep training loop scalar-safe (`metric()` returns `float`). Dict/vector scores are accessed via `get_score_dict()` for trainer-side selection. | M1 |
-| `opto/trainer/evaluators.py` | Add `evaluate_vector()` and `aggregate_vector_scores()`. Existing `evaluate()` unchanged.
| M1 | -| `opto/trainer/algorithms/basic_algorithms.py` | Add `objective_config` param to `BasicSearchAlgorithm.train()`. Replace `max(candidates, ...)` with `select_best()` in `optimizer_step()`. | M1 (minimal) / M2 (robust) | -| `opto/trainer/algorithms/beamsearch_algorithm.py` | Add `objective_config` param to `BeamsearchAlgorithm.train()`. Replace scalar sort in `select()` with `select_top_k()`. | M2 | -| `opto/features/priority_search/priority_search.py` | (Optional) Add `objective_config` param. Scalarize heap key via weighted mode. Store dict for logging. Pareto falls back to weighted. | M2 | - -### 6.3 Files NOT modified - -- `opto/trace/` — no changes to trace primitives. -- `opto/optimizers/` — optimizers are upstream of selection; they produce candidates, not rank them. -- Existing tests — no modifications; they validate backward compatibility by continuing to pass. - ---- - -## 7. Edge Cases & Defensive Design - -### 7.1 Score validation - -| Case | Behavior | -|------|----------| -| `score = 0.85` (float) | `to_score_dict()` → `{"score": 0.85}` | -| `score = 1` (int) | `to_score_dict()` → `{"score": 1.0}` (LLMJudge returns int 0/1) | -| `score = True` (bool) | `to_score_dict()` → `{"score": 1.0}` (test guides return bool) | -| `score = {"accuracy": 0.9, "latency_ms": 120.0}` | Returned as-is after validation | -| `score = {}` (empty dict) | `ValueError("Score dict must not be empty")` | -| `score = {"accuracy": float('nan')}` | `ValueError("Score dict contains non-finite value")` | -| `score = {"accuracy": float('inf')}` | `ValueError("Score dict contains non-finite value")` | -| `score = "text"` (wrong type) | `TypeError("Score must be int, float, bool, or Dict[str, float]")` | - -### 7.2 Missing metrics across candidates - -| Case | Behavior | -|------|----------| -| Candidate A has `{accuracy, latency}`, B has `{accuracy}` | B gets `latency = missing_value` (default `-inf`) | -| `weights = {"accuracy": 0.7, "latency": 0.3}`, candidate missing 
`latency` | Weighted sum uses `0.3 * missing_value` |
-| All candidates missing a weighted metric | Warning logged; metric still contributes `weight * missing_value` |
-
-### 7.3 Mixed scalar/dict batches
-
-| Case | Behavior |
-|------|----------|
-| All scores are `float` (or `int`/`bool`) | `aggregate_vector_scores()` returns `float` via `np.mean()` (existing behavior) |
-| All scores are `dict` with same keys | `aggregate_vector_scores()` returns per-metric mean `Dict[str, float]` |
-| Mixed `float` and `dict` in same batch | `ValueError("All scores in a batch must be the same type (all float or all dict)")` |
-
-A mixed batch most likely indicates a bug in the guide implementation (e.g., returning `float` on some inputs and `dict` on others). Failing loudly prevents silent incorrect aggregation.
-
-### 7.4 Single-metric dict
-
-| Case | Behavior |
-|------|----------|
-| Guide returns `{"accuracy": 0.9}` with `mode="weighted"` | Weighted sum = `weight * 0.9` (trivially correct) |
-| Guide returns `{"accuracy": 0.9}` with `mode="pareto"` | Pareto degenerates to scalar max (single dimension — no tradeoffs). Warning logged. |
-
-### 7.5 Tie-breaking
-
-| Case | Behavior |
-|------|----------|
-| Two candidates with identical weighted score | Deterministic: lower original index wins (stable sort) |
-| Pareto front with 3 equivalent candidates, `tie_break="weighted"` | Fall back to weighted scalarization among the 3; select max |
-| Pareto front with 3 equivalent candidates, `tie_break="lexicographic"` | Sort by metric names alphabetically, compare values in order |
-| Pareto front with 3 equivalent candidates, `tie_break="random_seeded"` | Seeded shuffle with `config.seed`; same seed → same order always |
-
-### 7.6 Training loop safety
-
-The training loop has a **separate data path** from evaluation/selection. In `standard_optimization_step()` (basic_algorithms.py:46) and `standard_forward()` (sampler.py:130):
-
-```python
-score, feedback = guide(x, target.data, info)
-```
-
-This `score` flows into `MinibatchAlgorithm.update()` where `np.mean(scores)` is computed (basic_algorithms.py:511). **This path must always receive `float`.**
-
-| Constraint | Enforcement |
-|-----------|-------------|
-| `guide.__call__()` / `get_feedback()` return type is **NOT widened** | No changes to `get_feedback()` signature; it still returns `Tuple[float, str]` |
-| Training loop always receives scalar `score` | `metric()` always returns `float`. Vector/dict scores are not used by the training loop and are accessed via `get_score_dict()` for trainer-side selection. |
-| Dict scores flow through a separate path | `get_score_dict()` → `evaluate_vector()` → `select_best()` / `select_top_k()` |
-| A multi-objective guide must return `(float, str)` from `get_feedback()` for the training loop | The float is a collapsed scalar summary; the full dict is extracted via `get_score_dict()` during selection |
-
-**Two data paths (by design):**
-```
-Training loop:  guide() → score (float) → np.mean(scores)            ← UNCHANGED
-Selection path: get_score_dict() → evaluate_vector() → objectives.py ← NEW
-```
-
-### 7.7 ObjectiveConfig validation
-
-| Case | Behavior |
-|------|----------|
-| `mode="weighted"`, `weights={}` | Auto-assign equal weight 1.0 to all metrics encountered at selection time |
-| `mode="pareto"`, `pareto_metrics=()` (empty tuple) | `ValueError("pareto_metrics must be None (auto) or non-empty tuple")` |
-| `weights={"accuracy": -0.5}` (negative weight) | `ValueError("All weights must be non-negative")` |
-| `minimize={"unknown_metric"}` | Warning logged at selection time if metric never appears; no error (tolerant) |
-
----
-
-## 8.
Milestones & Validation Gates - -### Milestone 0 — Analysis + technical plan + interface spec - -**Deliverables:** -- `docs/T6_technical_plan.md` — this document, finalized -- `examples/notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook - -**Notebook demonstrates:** -- Current Guide score contract (`get_feedback` → `Tuple[float, str]`, `metric` → `float`) -- Where scalar selection happens in BasicSearch (`max(candidates, ...)`) and Beamsearch (`sorted(...)[:k]`) -- Planned behavior prototype: deterministic toy guide returning dict metrics, showing weighted vs Pareto selection on dummy candidates - -**SMART validation:** -- Plan includes final API signatures and precise file list (create/modify) ✓ -- Notebook runs without API keys ✓ -- Notebook prints: current score contract, selection touchpoints, planned selection outputs ✓ - ---- - -### Milestone 1 — ObjectiveConfig + utilities + evaluator support + BasicSearch minimal - -**Deliverables:** -- `opto/trainer/objectives.py` (new) -- `opto/trainer/guide.py` (add `get_score_dict`) -- `opto/trainer/evaluators.py` (add `evaluate_vector`, `aggregate_vector_scores`) -- `opto/trainer/algorithms/basic_algorithms.py` (BasicSearch: accept/use ObjectiveConfig) -- `tests/test_objectives.py`, `tests/test_evaluators_vector.py` -- `examples/notebooks/t6_m1_vector_scores.ipynb` - -**Notebook demonstrates:** -- StubLLM mode: BasicSearchAlgorithm on small candidate set (5-10) with deterministic dummy guide returning dict metrics -- Shows: (a) scalar baseline, (b) weighted mode, (c) Pareto mode, (d) deterministic tie-break under fixed seed -- Real LLM mode (required): tiny dataset (≤5 items) producing ≥2 metrics - -**SMART validation:** -- `pytest -q` passes (all new functions covered) -- Notebook runs in Colab: weighted selection result changes when weights change -- Pareto returns tradeoffs and is deterministic under fixed seed -- Scalar path produces identical results to pre-change behavior - ---- - -### Milestone 2 — Trainer 
upgrades (Beamsearch + robust BasicSearch) - -**Deliverables:** -- `opto/trainer/algorithms/beamsearch_algorithm.py` (accept ObjectiveConfig, vector selection) -- Expanded BasicSearch tests (edge cases, missing metrics, tie-break policies) -- Optional: minimal PrioritySearch support (weighted scalarization for heap, dict stored for logging) -- `tests/test_trainers_multiobjective.py` -- `examples/notebooks/t6_m2_trainers.ipynb` - -**Notebook demonstrates:** -- BasicSearch + Beamsearch in: scalar mode (baseline), weighted mode, Pareto mode -- StubLLM + real LLM sections - -**SMART validation:** -- `pytest -q` green -- Integration test confirms: weighted vs Pareto select different candidates where expected -- Scalar-only example produces same final best score when `objective_config=None` -- Deterministic tie-break is stable across runs - ---- - -### Milestone 3 — Benchmarks (Trace-Bench integration) - -**Deliverables:** -- PR to Trace-Bench: benchmark configs/tasks + notebook - - **Trace-Bench touchpoints (update `main` if default branch differs):** - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark.py - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark_tasks_validation.py - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/benchmark_tasks/index.json - - https://github.com/AgentOpt/Trace-Bench/tree/main/LLM4AD/benchmark_tasks - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/llm4ad_loader.py - - https://github.com/AgentOpt/Trace-Bench/blob/main/tests/test_lite_optimize_llm4ad.py -- 3 benchmarks: - 1. **Accuracy vs latency** (toy QA dataset) - 2. **Accuracy vs response length** (penalize verbosity) - 3. 
**Accuracy vs tool calls** (penalize excessive tool usage) -- Trace-Bench notebook: `notebooks/t6_multiobjective_benchmarks.ipynb` (in Trace-Bench repo) - -**SMART validation:** -- Notebook outputs per-benchmark table: weighted-mode best candidate metrics + Pareto-mode set of tradeoffs -- Benchmarks run in StubLLM mode (fast/deterministic) and real LLM mode (small sample) -- Trace-Bench run completes without private datasets -- `pytest -q` green (smoke tests for benchmark integration) - ---- - -### Milestone 4 — Documentation + polished notebooks - -**Deliverables:** -- `docs/multi_objective_scores.md` — user-facing documentation -- README update with pointers to docs and notebooks -- Polished "How-to" notebook: installs from GitHub, runs BasicSearch weighted + Pareto, prints metric tradeoffs - -**SMART validation:** -- Fresh Colab runtime runs how-to notebook without manual patching -- CI green, no behavioral changes beyond documentation/polish - ---- - -## 9. Test Plan - -### 9.1 Unit tests — `tests/test_objectives.py` (M1) - -| Test | Validates | -|------|-----------| -| `test_to_score_dict_from_float` | `0.85` → `{"score": 0.85}` | -| `test_to_score_dict_from_dict` | `{"a": 1.0, "b": 2.0}` → same dict | -| `test_to_score_dict_empty_dict_raises` | `{}` → `ValueError` | -| `test_to_score_dict_nan_raises` | `{"a": float('nan')}` → `ValueError` | -| `test_to_score_dict_wrong_type_raises` | `"text"` → `TypeError` | -| `test_apply_minimize` | `{"acc": 0.9, "lat": 100}` with `minimize={"lat"}` → `{"acc": 0.9, "lat": -100}` | -| `test_apply_minimize_empty_set` | No metrics negated | -| `test_weighted_scalarize_basic` | `{"a": 0.8, "b": 0.2}` with `weights={"a": 0.7, "b": 0.3}` → `0.7*0.8 + 0.3*0.2` | -| `test_weighted_scalarize_missing_metric` | Missing metric uses `missing_value` | -| `test_weighted_scalarize_empty_weights` | Equal weight 1.0 for all metrics | -| `test_dominates_true` | A dominates B (all ≥, at least one >) | -| `test_dominates_false_equal` | A == B → 
does not dominate | -| `test_dominates_false_tradeoff` | A better on one, B better on another | -| `test_pareto_rank_simple` | 3 candidates with clear rank 0, 1, 2 | -| `test_pareto_rank_all_nondominated` | All candidates rank 0 | -| `test_select_best_scalar_mode` | Falls back to scalar max | -| `test_select_best_weighted_mode` | Returns highest weighted score | -| `test_select_best_pareto_mode` | Returns Pareto-optimal by tie-break | -| `test_select_best_none_config` | `objective_config=None` → scalar max (backward compat) | -| `test_select_top_k_weighted` | Returns k highest weighted scores | -| `test_select_top_k_pareto` | Returns k from Pareto front + spillover | -| `test_deterministic_tie_break_seeded` | Same seed → same result across 100 runs | -| `test_deterministic_tie_break_different_seeds` | Different seeds → potentially different result | - -### 9.2 Unit tests — `tests/test_evaluators_vector.py` (M1) - -| Test | Validates | -|------|-----------| -| `test_aggregate_vector_scores_all_scalar` | `[0.8, 0.9, 0.7]` → `np.mean` (backward compat) | -| `test_aggregate_vector_scores_all_dict` | Per-metric mean computed correctly | -| `test_aggregate_vector_scores_mixed` | Scalars normalized to dict, then averaged | -| `test_evaluate_vector_returns_correct_types` | Returns list of ScoreLike matching guide output | - -### 9.3 Integration tests — `tests/test_trainers_multiobjective.py` (M2) - -| Test | Validates | -|------|-----------| -| `test_basicsearch_scalar_unchanged` | Default behavior identical to pre-change | -| `test_basicsearch_weighted_selects_expected` | Weighted mode picks correct candidate | -| `test_basicsearch_pareto_selects_expected` | Pareto mode picks different candidate than weighted | -| `test_beamsearch_scalar_unchanged` | Default behavior identical | -| `test_beamsearch_weighted_selects_top_k` | Weighted mode picks correct top-k | -| `test_beamsearch_pareto_selects_front` | Pareto mode returns non-dominated front | -| 
`test_deterministic_across_runs` | Fixed seed → same selections in 5 repeated runs | - -### 9.4 Notebook validation (human / Trace team) - -Each notebook contains: -- **StubLLM (no keys) section:** deterministic dummy guide, runs quickly -- **Real LLM section (required):** small N (5-20 examples), prints cost/latency caveats, requires API key - ---- - -## 10. Risks & Mitigation - -| Risk | Severity | Mitigation | -|------|----------|------------| -| **R1: Missing metrics across candidates** | Medium | `missing_value` in ObjectiveConfig (default `-inf`). Enforce metric presence for configured weights (or warn). | -| **R2: Pareto nondeterminism** | High | Deterministic ordering via stable sort + explicit tie-break rules. Seeded randomness only when requested. | -| **R3: Multi-thread eval ordering** | Medium | Tests run with `num_threads=1` to guarantee stability. Document thread-safety considerations. | -| **R4: Breaking Guide subclasses** | High | Use `get_score_dict()` helper — never change `get_feedback()` signature. Union type on `metric()` is safe because existing callers only receive floats. | -| **R5: Performance regression** | Low | `objectives.py` functions are O(n²) for Pareto ranking on n candidates, but n is typically ≤20 (num_proposals). No concern at this scale. | -| **R6: Mixed scalar/dict in same batch** | Medium | `aggregate_vector_scores()` rejects mixed batches with `ValueError`. A mixed batch indicates a bug in the guide. | -| **R7: Training loop receives dict score** | High | `guide.__call__()` / `get_feedback()` return type is NOT widened. `metric()` always returns `float`. Dict scores only flow through `get_score_dict()` → `evaluate_vector()`. See §7.7. | - ---- - -## 11. Design Decisions (Resolved) - -> **Post-review update (Ching-An, Feb 2026):** All dict→scalar reduction is now controlled by `ObjectiveConfig.scalarize_dict` (values: `"score"`, `"mean"`, `"weighted"`). Guide produces raw metrics only. 
`normalize_score` renamed to `to_score_dict` (neutral name; backward-compat alias kept). `aggregate_score_dicts()` moved from evaluators to objectives.py (Objective-side policy). Dict scores with `config=None` now raise `ValueError` (no hidden hard-coded reduction). - -### D1: Where to implement scalar→dict normalization? - -**Decision: Option A — `Guide.get_score_dict()` helper + `objectives.to_score_dict()`** - -- `get_score_dict()` on Guide provides a clean entry point for subclasses. -- `to_score_dict()` in objectives.py is the canonical utility (pure function, testable). Renamed from `normalize_score` per Ching-An's review (neutral name; backward-compat alias kept). -- All dict→scalar reduction is controlled by `ObjectiveConfig` (via `scalarize_dict` field). No hidden hard-coded defaults. -- Avoids widening `get_feedback()` return type (higher churn, breaks typing). - -### D2: Pareto selection definition - -**Decision: Option A — Standard dominance on aggregated metrics, return single best by tie-break.** - -- `select_best()` returns one winner. `select_top_k()` returns k winners. -- Trainers don't need to manage a "front" — they just get indices. -- Beamsearch naturally uses `select_top_k(k=beam_width)`. - -### D3: PrioritySearch scope - -**Decision: Minimal (in-scope).** - -- Scalarize heap priority via `weighted_scalarize()`. -- Store full `score_dict` on each candidate for logging. -- `mode="pareto"` falls back to weighted with documented warning. -- Pareto archive is out-of-scope for v1. - ---- - -## 12. 
Appendix: Code Touchpoints - -### OpenTrace / experimental - -| File | URL | -|------|-----| -| Guide base | [guide.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/guide.py) | -| Evaluators | [evaluators.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/evaluators.py) | -| BasicSearch | [basic_algorithms.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/basic_algorithms.py) | -| Beamsearch | [beamsearch_algorithm.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/beamsearch_algorithm.py) | -| PrioritySearch | [priority_search.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/features/priority_search/priority_search.py) | - -### Trace-Bench - -| File | URL | -|------|-----| -| Repo | [Trace-Bench](https://github.com/AgentOpt/Trace-Bench) | - -### Selection logic summary (current → target) - -| Trainer | Current Code | Target Code | -|---------|-------------|-------------| -| BasicSearch | `max(candidates, key=lambda x: x[0])` | `select_best(candidates, objective_config)` | -| Beamsearch | `sorted(candidates, key=lambda x: x[0], reverse=True)[:k]` | `select_top_k(candidates, objective_config, k)` | -| PrioritySearch | scalar heap key | `weighted_scalarize(score_dict, config)` for heap key | diff --git a/examples/notebooks/t6_m0_analysis.ipynb b/docs/dev/multi_objective_design_exploration.ipynb similarity index 97% rename from examples/notebooks/t6_m0_analysis.ipynb rename to docs/dev/multi_objective_design_exploration.ipynb index 2549d76a..50c92a70 100644 --- a/examples/notebooks/t6_m0_analysis.ipynb +++ b/docs/dev/multi_objective_design_exploration.ipynb @@ -35,7 +35,7 @@ "cell_type": "markdown", "id": "b1a58d26", "metadata": {}, - "source": "# T6 Multi-Objective Vector Scores — M0 Analysis\n\n[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m0_analysis.ipynb)\n\n**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n\nThis notebook demonstrates:\n1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n3. **Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n\n**Motivation (why score-as-dict):** adding extra metrics into the *feedback dict/text* can help optimizers (OptoPrime/OPRO), but trainers typically only use the scalar score for ranking/UCB and ignore additional feedback structure. To enable Pareto/weighted multi-objective selection at the trainer level, we need vector score (score-as-dict) with backward-compatible scalar reduction.\n\n**No API keys required for M0.** All examples use deterministic dummy data. (From M1 onward, milestone notebooks must validate both StubLLM and real LLM modes.)\n\n---" + "source": "# T6 Multi-Objective Vector Scores — M0 Analysis\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/docs/dev/multi_objective_design_exploration.ipynb)\n\n**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n\nThis notebook demonstrates:\n1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n3. 
**Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n\n**Motivation (why score-as-dict):** adding extra metrics into the *feedback dict/text* can help optimizers (OptoPrime/OPRO), but trainers typically only use the scalar score for ranking/UCB and ignore additional feedback structure. To enable Pareto/weighted multi-objective selection at the trainer level, we need vector score (score-as-dict) with backward-compatible scalar reduction.\n\n**No API keys required for M0.** All examples use deterministic dummy data. (From M1 onward, milestone notebooks must validate both StubLLM and real LLM modes.)\n\n---" }, { "cell_type": "markdown", @@ -932,4 +932,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/docs/multi_objective_scores.md b/docs/multi_objective_scores.md new file mode 100644 index 00000000..3128f14f --- /dev/null +++ b/docs/multi_objective_scores.md @@ -0,0 +1,423 @@ +# Multi-Objective Vector Scores + +## Why multi-objective optimization? + +Standard single-objective optimization collapses all concerns into one scalar +score. This works when you care about exactly one metric, but real tasks have +competing concerns: accuracy vs. API cost, quality vs. latency, base loss vs. +regularization. A single number hides these trade-offs — you can't tell whether +a candidate is cheap-but-wrong or correct-but-expensive. + +Multi-objective mode makes each metric explicit. Instead of `score = 0.85`, you +get `{"accuracy": 0.95, "tokens_out": 120, "latency_s": 0.3}`. The trainer +then uses weighted scalarization or Pareto ranking to select candidates, giving +you visibility into trade-offs and the ability to re-prioritize without +retraining. + +**Best use cases:** 2-4 competing metrics, minimizing API cost while +maintaining quality, understanding accuracy-vs-speed trade-offs, regularized +optimization problems. 
+
+**What you gain:** explicit per-metric tracking, Pareto frontier exploration,
+tunable weight priorities, token-efficient candidate selection.
+
+**Jump to:**
+[Switching to multi-objective](#switching-from-scalar-to-multi-objective) |
+[ObjectiveConfig reference](#objectiveconfig-reference) |
+[Token minimization](#adding-token-minimization) |
+[Canonical demos](#canonical-demos) |
+[Data flow](#data-flow) |
+[Running in Trace-Bench](#running-in-trace-bench)
+
+---
+
+## Overview
+
+By default, OpenTrace guides return a **scalar** score (a single float).
+Multi-objective mode extends this to **vector scores** — a `Dict[str, float]`
+where each key is a named metric (e.g. `accuracy`, `tokens_out`, `base_loss`).
+
+The training loop evaluates candidates on all metrics simultaneously, then uses
+an `ObjectiveConfig` to decide which candidate is best — either by weighted
+scalarization or by Pareto dominance ranking.
+
+### When to use multi-objective
+
+| Scenario | Recommendation |
+|---|---|
+| Single quality metric (accuracy, loss) | Scalar mode — no changes needed |
+| Quality + cost (accuracy + token usage) | Multi-objective weighted mode |
+| Multiple competing losses (base_loss + reg_loss) | Multi-objective weighted or pareto |
+| Exploring trade-off frontiers | Pareto mode |
+
+### What works well
+
+- **BasicSearchAlgorithm** — full multi-objective support via `select_best()`.
+- **BeamsearchAlgorithm** — full support via `select_top_k()` for beam ranking.
+- **BeamsearchHistoryAlgorithm** — full support (inherits from Beamsearch).
+- **PrioritySearch** — supported through the Trace-Bench runner.
+
+### Current limitations
+
+- **UCBSearch** does not support multi-objective selection. It uses its own
+  internal scoring and ignores `ObjectiveConfig`.
+- Pareto ranking with many metrics (>4) becomes expensive. Weight-based
+  scalarization is more efficient when relative metric importance is known.
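The cost note above comes from the dominance check, which compares every candidate against every other one. A minimal standalone sketch of non-dominated sorting (a hypothetical re-implementation for illustration; the actual `dominates()` and `pareto_rank()` helpers live in `opto/trainer/objectives.py` and additionally handle `minimize` negation and tie-breaking):

```python
from typing import Dict, List, Sequence

def dominates(a: Dict[str, float], b: Dict[str, float], metrics: Sequence[str]) -> bool:
    """a dominates b iff a >= b on every metric and a > b on at least one."""
    return all(a[m] >= b[m] for m in metrics) and any(a[m] > b[m] for m in metrics)

def pareto_rank(candidates: List[Dict[str, float]], metrics: Sequence[str]) -> List[int]:
    """Rank 0 = Pareto front; each pass costs O(n^2) dominance checks."""
    ranks: List[int] = [-1] * len(candidates)
    remaining, rank = list(range(len(candidates))), 0
    while remaining:
        # A candidate is on the current front if nothing remaining dominates it.
        front = [i for i in remaining
                 if not any(dominates(candidates[j], candidates[i], metrics)
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining = [i for i in remaining if i not in front]
        rank += 1
    return ranks

scores = [
    {"accuracy": 0.9, "speed": 0.2},  # trade-off: accurate but slow
    {"accuracy": 0.6, "speed": 0.9},  # trade-off: fast but less accurate
    {"accuracy": 0.5, "speed": 0.1},  # dominated by the first candidate
]
print(pareto_rank(scores, ("accuracy", "speed")))  # → [0, 0, 1]
```

Each pass peels off one front, so the worst case (a total order of candidates) is cubic; with the small candidate counts trainers use (typically `num_proposals` ≤ 20) this is negligible, but it grows quickly with many metrics and candidates.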
+ +--- + +## Switching from scalar to multi-objective + +### Step 1 — Return a score dict from your Guide + +Override `get_score_dict()` in your `Guide` subclass to return a dict instead +of relying on the default scalar wrapper: + +```python +from opto.trainer.guide import Guide + +class MyGuide(Guide): + def get_feedback(self, query, response, reference=None, **kwargs): + # ... compute score and feedback ... + return score, feedback + + def get_score_dict(self, query, response, reference=None, **kwargs): + # Return multiple named metrics + accuracy = 1.0 if response.strip() == reference.strip() else 0.0 + length_penalty = len(response) / 1000.0 + return {"accuracy": accuracy, "length": length_penalty} +``` + +The base `Guide.get_score_dict()` wraps the scalar from `get_feedback()` as +`{"score": float_value}`. Override it to return your own metric names. + +### Step 2 — Create an ObjectiveConfig + +```python +from opto.trainer.objectives import ObjectiveConfig + +config = ObjectiveConfig( + mode="weighted", # or "pareto" + weights={"accuracy": 1.0, "length": 0.5}, # relative importance + minimize=frozenset({"length"}), # lower is better +) +``` + +### Step 3 — Pass it to the trainer + +```python +from opto.trainer.algorithms.basic_algorithms import BasicSearchAlgorithm + +trainer = BasicSearchAlgorithm(agent, optimizer) +trainer.train( + guide, + train_dataset, + objective_config=config, + num_proposals=4, + num_epochs=3, +) +``` + +The trainer will call `guide.get_score_dict()` via `evaluate_vector()`, aggregate +per-metric means via `aggregate_score_dicts()`, and use `select_best()` (or +`select_top_k()` for beam search) with your config to pick the winning candidate. 
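Conceptually, weighted selection negates the `minimize` metrics so that higher is always better, scalarizes each candidate's metric dict, and takes an argmax. A simplified standalone sketch of that behavior (illustrative only; the real `apply_minimize()`, `weighted_scalarize()`, and `select_best()` in `opto/trainer/objectives.py` also handle `missing_value` and tie-break policies):

```python
from typing import Dict, FrozenSet, List

def apply_minimize(score: Dict[str, float], minimize: FrozenSet[str]) -> Dict[str, float]:
    # Negate "lower is better" metrics so every metric is maximized.
    return {k: -v if k in minimize else v for k, v in score.items()}

def weighted_scalarize(score: Dict[str, float], weights: Dict[str, float]) -> float:
    # Empty weights = equal weight 1.0 for all metrics; metrics not in
    # weights are ignored; missing metrics fall back to -inf.
    w = weights or {k: 1.0 for k in score}
    return sum(wv * score.get(k, float("-inf")) for k, wv in w.items())

def select_best_weighted(scores: List[Dict[str, float]],
                         weights: Dict[str, float],
                         minimize: FrozenSet[str]) -> int:
    adjusted = [apply_minimize(s, minimize) for s in scores]
    return max(range(len(scores)), key=lambda i: weighted_scalarize(adjusted[i], weights))

candidates = [
    {"accuracy": 1.0, "length": 0.9},
    {"accuracy": 1.0, "length": 0.2},  # equally accurate but shorter
]
best = select_best_weighted(candidates, {"accuracy": 1.0, "length": 0.5}, frozenset({"length"}))
print(best)  # → 1 (ties on accuracy, wins on the penalized length metric)
```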
+
+---
+
+## ObjectiveConfig reference
+
+```python
+from dataclasses import dataclass, field
+from typing import Dict, Optional, Tuple
+
+@dataclass(frozen=True)
+class ObjectiveConfig:
+    mode: str = "scalar"                  # "scalar" | "weighted" | "pareto"
+    weights: Dict[str, float] = field(default_factory=dict)  # per-metric weights (empty = equal weight 1.0)
+    minimize: frozenset = frozenset()     # metric names where lower is better
+    missing_value: float = float("-inf")  # fallback for missing metrics
+    pareto_metrics: Optional[Tuple[str, ...]] = None  # subset for Pareto dominance (None = all)
+    tie_break: str = "weighted"           # "weighted" | "lexicographic" | "random_seeded"
+    seed: int = 0                         # for deterministic tie-breaking
+    scalarize_dict: str = "score"         # "score" | "mean" | "weighted"
+    score_key: str = "score"              # key used when scalarize_dict="score"
+```
+
+### Mode: `"weighted"`
+
+Computes a weighted sum of all metrics (after negating those in `minimize`),
+then selects the candidate with the highest scalarized value.
+
+```python
+ObjectiveConfig(
+    mode="weighted",
+    weights={"accuracy": 1.0, "tokens_out": 1e-3},
+    minimize=frozenset({"tokens_out"}),
+)
+```
+
+Metrics not listed in `weights` are ignored. If `weights` is empty, all metrics
+get equal weight 1.0.
+
+### Mode: `"pareto"`
+
+Performs non-dominated sorting (Pareto ranking). Rank-0 candidates are on the
+Pareto front. If multiple candidates share rank 0, the `tie_break` strategy
+resolves the winner:
+
+- `"weighted"` — fall back to weighted scalarization among the front.
+- `"lexicographic"` — sort by metric name alphabetically, pick highest.
+- `"random_seeded"` — seeded random shuffle (deterministic).
+
+```python
+ObjectiveConfig(
+    mode="pareto",
+    weights={"accuracy": 1.0, "tokens_out": 1e-3},  # used for tie-break
+    minimize=frozenset({"tokens_out"}),
+    tie_break="weighted",
+    seed=42,
+)
+```
+
+### Mode: `"scalar"` (default)
+
+Backward-compatible. Treats scores as single floats. Dict scores are reduced
+via `scalarize_dict`:
+- `"score"` — extract `score_dict[score_key]` (default).
+- `"mean"` — `mean(score_dict.values())`.
+- `"weighted"` — `weighted_scalarize()`. + +--- + +## Adding token minimization + +The GSM8K demo shows how to add token-count metrics to any existing guide +without modifying it. The pattern uses two components: + +### UsageTrackingLLM + +A wrapper around any LLM that records token counts (input and output) using a +`contextvars.ContextVar`. It works transparently — the wrapped LLM behaves +identically, but token counts are captured per-call. + +```python +from trace_bench.examples.multiobjective_gsm8k import UsageTrackingLLM + +# Wrap your LLM +tracked_llm = UsageTrackingLLM(base_llm) +``` + +### TokenUsageAugmentingGuide + +A decorator guide that wraps an existing guide and appends `tokens_in` and +`tokens_out` to its score dict: + +```python +from trace_bench.examples.multiobjective_gsm8k import TokenUsageAugmentingGuide + +base_guide = MyGuide() +guide = TokenUsageAugmentingGuide(base_guide, tracked_llm) + +# guide.get_score_dict() now returns e.g.: +# {"accuracy": 1.0, "tokens_in": 350.0, "tokens_out": 120.0} +``` + +### Full configuration with token minimization + +```python +config = ObjectiveConfig( + mode="weighted", + weights={"error": 1.0, "tokens_in": 1e-3, "tokens_out": 1e-3}, + minimize=frozenset({"error", "tokens_in", "tokens_out"}), +) +``` + +The small weights on token metrics (1e-3) ensure that accuracy dominates the +selection, but among equally accurate candidates, the one using fewer tokens +wins. + +--- + +## Canonical demos + +Three reference implementations demonstrate multi-objective patterns. Each lives +in the Trace-Bench repository under `trace_bench/examples/` with companion +notebooks under `notebooks/`. + +### Convex (SixHumpCamel) + +**File:** `trace_bench/examples/multiobjective_convex.py` +**Notebook:** `notebooks/multiobjective_convex.ipynb` + +Optimizes a 2D input to minimize two losses independently: +- `base_loss` — the Six-Hump Camel function value. +- `reg_loss` — L2-squared regularization. 
+
+The `ConvexRewardGuide.get_score_dict()` returns both metrics. This is the
+simplest multi-objective example — no LLM, no external dependencies.
+
+### BBEH (boolean_expressions)
+
+**File:** `trace_bench/examples/multiobjective_bbeh.py`
+**Notebook:** `notebooks/multiobjective_bbeh.ipynb`
+
+Optimizes a code-generation agent on BIG-Bench Extra Hard boolean expression
+problems with two objectives:
+- `accuracy` — exact-match correctness (minimize error).
+- `execution_time_s` — wall-clock time for code execution (minimize).
+
+Uses PAL (Program-Aided Language) strategy: the agent writes Python code that
+is executed to extract the answer.
+
+### GSM8K + Token Usage
+
+**File:** `trace_bench/examples/multiobjective_gsm8k.py`
+**Notebook:** `notebooks/multiobjective_gsm8k.ipynb`
+
+Optimizes a math-solving agent on GSM8K with three objectives:
+- `error` — 1 minus exact-match accuracy (minimize).
+- `tokens_in` — input token count (minimize).
+- `tokens_out` — output token count (minimize).
+
+Demonstrates the `UsageTrackingLLM` + `TokenUsageAugmentingGuide` pattern for
+adding token metrics to any task.
+
+---
+
+## Data flow
+
+### Evaluation pipeline
+
+```mermaid
+graph TD
+    A["Guide.get_score_dict()"] -->|"per-example Dict[str, float]"| B["evaluate_vector()"]
+    B -->|"List[Dict[str, float]]"| C["aggregate_score_dicts()"]
+    C -->|"per-candidate mean Dict"| D["select_best() / select_top_k()"]
+    D -->|"uses ObjectiveConfig"| E["Best candidate selected"]
+```
+
+1. **evaluate_vector()** (`opto/trainer/evaluators.py`) calls
+   `guide.get_score_dict()` for each input and returns a
+   `List[Dict[str, float]]`.
+2. **aggregate_score_dicts()** (`opto/trainer/objectives.py`) computes
+   per-metric means across all examples for a single candidate.
+3. **select_best()** / **select_top_k()** rank candidates according to the
+   `ObjectiveConfig` and return the winning index/indices.
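The aggregation step (step 2) is just a per-metric mean over the per-example dicts. A minimal sketch, assuming all dicts share the same metric keys (a simplified stand-in for `aggregate_score_dicts()` in `opto/trainer/objectives.py`, which also rejects mixed scalar/dict batches):

```python
from typing import Dict, List

def aggregate_score_dicts(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Per-metric mean across all examples for one candidate."""
    if not scores:
        raise ValueError("no scores to aggregate")
    keys = scores[0].keys()
    return {k: sum(s[k] for s in scores) / len(scores) for k in keys}

# Two evaluated examples for a single candidate:
per_example = [
    {"accuracy": 1.0, "tokens_out": 100.0},
    {"accuracy": 0.0, "tokens_out": 140.0},
]
print(aggregate_score_dicts(per_example))  # → {'accuracy': 0.5, 'tokens_out': 120.0}
```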
+ +### Selection mode decision + +```mermaid +graph TD + S["ObjectiveConfig.mode"] -->|"scalar"| SC["to_scalar_score() → argmax"] + S -->|"weighted"| W["apply_minimize() → weighted_scalarize() → argmax"] + S -->|"pareto"| P["apply_minimize() → pareto_rank()"] + P --> F{"Single rank-0\ncandidate?"} + F -->|"Yes"| R["Return winner"] + F -->|"No"| TB["tie_break strategy"] + TB -->|"weighted"| TW["weighted_scalarize among front → argmax"] + TB -->|"lexicographic"| TL["sort by metric name → pick highest"] + TB -->|"random_seeded"| TR["seeded shuffle → pick first"] +``` + +Trainer algorithms (BasicSearch, Beamsearch) call this pipeline internally when +`objective_config` is provided and `mode != "scalar"`. + +--- + +## Running in Trace-Bench + +### CLI + +```bash +# List available multi-objective tasks +trace-bench list-tasks --bench internal + +# Validate a config without running +trace-bench validate --config configs/m3_multiobjective.yaml + +# Run the full multi-objective benchmark +export TRACE_LITELLM_MODEL=openrouter/x-ai/grok-4.1-fast +trace-bench run --config configs/m3_multiobjective.yaml +``` + +### YAML config format + +The multi-objective config (`configs/m3_multiobjective.yaml`) uses this +structure: + +```yaml +mode: real +seeds: [42] +max_workers: 6 +resume: auto +job_timeout: 1200 + +tasks: + - id: "internal:multiobjective_convex" + eval_kwargs: + objective_mode: "weighted" + + - id: "internal:multiobjective_convex" + eval_kwargs: + objective_mode: "pareto" + + # ... same pattern for bbeh and gsm8k + +trainers: + - id: BasicSearchAlgorithm + params_variants: + - num_proposals: 4 + num_epochs: 2 + batch_size: 1 + + - id: BeamsearchAlgorithm + params_variants: + - beam_width: 2 + num_proposals: 4 + max_depth: 2 + batch_size: 1 +``` + +**Key fields:** +- `tasks[].id` — registry task ID (e.g. `internal:multiobjective_bbeh`). +- `tasks[].eval_kwargs.objective_mode` — `"weighted"` or `"pareto"`. 
Passed to + the task's `build_trace_problem()` which constructs the `ObjectiveConfig`. +- `trainers[].id` — algorithm name. +- `trainers[].params_variants` — list of parameter sets. The runner expands + tasks x trainers x variants x seeds into individual jobs. + +### Task registration + +Each task module exposes a `build_trace_problem(**eval_kwargs)` function that +returns a dict with: + +```python +{ + "param": trace.node(..., trainable=True), + "guide": MyGuide(), + "train_dataset": {"inputs": [...], "infos": [...]}, + "optimizer_kwargs": {"objective": "...", "memory_size": 10}, + "objective_config": ObjectiveConfig(...), + "metadata": {"benchmark": "multiobjective", ...}, +} +``` + +The `objective_config` is consumed by the trainer's `train()` method. The +`eval_kwargs` from the YAML `tasks[].eval_kwargs` are forwarded directly to +`build_trace_problem()`. + +### LLM model selection + +The LLM is selected at runtime via the `TRACE_LITELLM_MODEL` environment +variable. Common provider configurations: + +```bash +# OpenRouter +export OPENROUTER_API_KEY=... +export TRACE_LITELLM_MODEL=openrouter/x-ai/grok-4.1-fast + +# Direct provider +export XAI_API_KEY=... +export TRACE_LITELLM_MODEL=xai/grok-4.1-fast + +# DeepSeek +export DEEPSEEK_API_KEY=... 
+export TRACE_LITELLM_MODEL=deepseek/deepseek-chat +``` diff --git a/examples/notebooks/t6_m2_bbeh.ipynb b/examples/notebooks/multiobjective_bbeh_langgraph.ipynb similarity index 73% rename from examples/notebooks/t6_m2_bbeh.ipynb rename to examples/notebooks/multiobjective_bbeh_langgraph.ipynb index d53632d5..ec5e48dc 100644 --- a/examples/notebooks/t6_m2_bbeh.ipynb +++ b/examples/notebooks/multiobjective_bbeh_langgraph.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "id": "cell-title", "metadata": {}, - "source": "# T6 M2 — BBEH Boolean Expressions with Multi-Objective Instrumentation\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m2_bbeh.ipynb)\n\n**Milestone 2 Deliverable** — Multi-objective scoring on a real LLM task\n\nThis notebook demonstrates multi-objective optimization on the **BBEH boolean_expressions** benchmark\nusing the **PAL (Program-Aided Language model)** strategy from Xavier's original experiment.\n\nTwo objectives are tracked:\n- **accuracy** (binary: 1.0 = correct, 0.0 = wrong)\n- **execution_time_s** (end-to-end wall-clock seconds per example: LLM call + code execution)\n\nThe `LangGraphGuide.get_score_dict()` method returns both metrics per example,\nenabling the M2 multi-objective infrastructure to track and visualize tradeoffs.\n\n**Requires a real LLM API key** (OpenRouter recommended, default model: `openai/gpt-5-nano`).\n\n---" + "source": "# T6 M2 — BBEH Boolean Expressions with Multi-Objective Instrumentation\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/examples/notebooks/multiobjective_bbeh_langgraph.ipynb)\n\n**Milestone 2 Deliverable** — Multi-objective scoring on a real LLM task\n\nThis notebook demonstrates multi-objective optimization on the **BBEH boolean_expressions** benchmark\nusing 
the **PAL (Program-Aided Language model)** strategy from Xavier's original experiment.\n\nTwo objectives are tracked:\n- **accuracy** (binary: 1.0 = correct, 0.0 = wrong)\n- **execution_time_s** (end-to-end wall-clock seconds per example: LLM call + code execution)\n\nThe `LangGraphGuide.get_score_dict()` method returns both metrics per example,\nenabling the M2 multi-objective infrastructure to track and visualize tradeoffs.\n\n**Requires a real LLM API key** (OpenRouter recommended, default model: `openai/gpt-5-nano`).\n\n---" }, { "cell_type": "code", @@ -20,7 +20,7 @@ "id": "cell-setup", "metadata": {}, "outputs": [], - "source": "import os, sys, subprocess\n\nif IN_COLAB:\n if not os.path.exists('/content/Trace'):\n print(\"Setting up Trace...\")\n !pip install langgraph langchain langchain_openai datasets tqdm langchain_community litellm dspy black matplotlib pandas\n !git clone https://github.com/AgentOpt/OpenTrace.git /content/Trace\n %cd /content/Trace\n !git pull origin experimental && git checkout experimental\n !sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n !pip install -e .\n sys.path.append('/content/Trace')\nelse:\n # Local: add repo root to sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n\n# Clone BBEH benchmark tasks\nif not os.path.exists('bbeh'):\n !git clone https://github.com/google-deepmind/bbeh.git\nelse:\n print(\"bbeh/ already exists, skipping clone.\")\n\n# Soft-import display\ntry:\n from IPython.display import display\nexcept Exception:\n def display(*args, **kwargs):\n return None\n\nprint(f\"{IN_COLAB=}\")" + "source": "import os, sys, subprocess\n\nif IN_COLAB:\n if not os.path.exists('/content/Trace'):\n print(\"Setting up Trace...\")\n !pip install langgraph langchain langchain_openai datasets tqdm langchain_community litellm dspy black matplotlib 
pandas\n !git clone https://github.com/AgentOpt/OpenTrace.git /content/Trace\n %cd /content/Trace\n !git pull origin experimental && git checkout experimental\n !pip install -e .\n sys.path.append('/content/Trace')\nelse:\n # Local: add repo root to sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n\n# Clone BBEH benchmark tasks\nif not os.path.exists('bbeh'):\n !git clone https://github.com/google-deepmind/bbeh.git\nelse:\n print(\"bbeh/ already exists, skipping clone.\")\n\n# Soft-import display\ntry:\n from IPython.display import display\nexcept Exception:\n def display(*args, **kwargs):\n return None\n\nprint(f\"{IN_COLAB=}\")" }, { "cell_type": "code", @@ -205,15 +205,16 @@ "# -----------------------\n", "class LangGraphTrainer(_TraceMinibatch):\n", " def __init__(self, *, graph_root_function: str, graph_agents_functions: list[str], scope: dict,\n", - " optimizer, parameters: list):\n", + " optimizer, parameters: list,\n", + " original_root=None, original_agents=None):\n", " object.__init__(self)\n", " self.root_name = graph_root_function\n", " self.agent_names = list(graph_agents_functions)\n", " self.scope = scope\n", " self.optimizer = optimizer\n", " self.parameters = list(parameters)\n", - " self._original_root = scope[graph_root_function]\n", - " self._original_agents = {n: scope[n] for n in graph_agents_functions if n in scope}\n", + " self._original_root = original_root if original_root is not None else scope[graph_root_function]\n", + " self._original_agents = original_agents if original_agents is not None else {n: scope[n] for n in graph_agents_functions if n in scope}\n", "\n", " def restore_originals(self):\n", " self.scope[self.root_name] = self._original_root\n", @@ -439,6 +440,10 @@ " if isinstance(scope.get(graph_root_function), FunModule):\n", " scope[graph_root_function] = 
scope[graph_root_function]._fun\n", "\n", + " # Save original (pre-bind) functions so trainer can restore on corruption\n", + " original_root = scope.get(graph_root_function)\n", + " original_agents = {name: scope[name] for name in graph_agents_functions if name in scope}\n", + "\n", " parameters = []\n", " for name in graph_agents_functions:\n", " if name not in scope:\n", @@ -509,6 +514,8 @@ " scope=scope,\n", " optimizer=opt,\n", " parameters=parameters,\n", + " original_root=original_root,\n", + " original_agents=original_agents,\n", " )\n", " modified, history, best_state, last_state = trainer.train(\n", " guide=guide,\n", @@ -548,7 +555,113 @@ "id": "cell-pal-strategy", "metadata": {}, "outputs": [], - "source": "import re, time\nfrom langgraph.graph import StateGraph, START, END\n\n# -----------------------\n# Strategy: PAL (Program-Aided Language model)\n# -----------------------\nprompt_parse_problem = node(\n \"Read the problem and write Python code that sets a variable named `result` to the final answer.\\n\"\n \"- Output ONLY valid Python (no markdown fences).\\n\"\n \"- If the task is multiple-choice, set result to the option label exactly (e.g., '(A)').\\n\\n\"\n \"Problem:\\n\",\n trainable=True,\n description=\"PAL prompt that generates python code producing a `result`.\"\n)\n\n# Global variable to capture execution time from the graph invocation.\n# This is read by run_solver_on_example() to populate the guide's score_dict.\n_last_exec_time_s = 0.0\n\ndef parse_problem(state: dict):\n question = get_no_node(state.get(\"question\", \"\"))\n prompt = prompt_parse_problem + question\n code_str = llm_call(get_no_node(prompt))\n return {\"code\": code_str.strip(), \"question\": question}\n\ndef execute_code(state: dict):\n \"\"\"Execute LLM-generated Python code.\n\n The PAL strategy: exec() the code produced by the LLM and extract the\n `result` variable as the final answer.\n \"\"\"\n def strip_python_tags(code: str) -> str:\n return re.sub(\n 
r'(?s)(?:.*?```(?:python)?\\s*\\n(.*?)(?:\\n```.*)?|(.*))',\n lambda m: m.group(1) if m.group(1) is not None else m.group(2),\n code,\n )\n\n update = {}\n try:\n code_to_run = strip_python_tags(get_no_node(state.get(\"code\", \"\")))\n local_vars = {}\n exec(code_to_run, {}, local_vars) # noqa: S102 - intentional PAL strategy\n local_vars.pop(\"__builtins__\", None)\n\n if \"result\" in local_vars:\n update[\"final_answer\"] = node(local_vars[\"result\"])\n elif len(local_vars) == 1:\n update[\"final_answer\"] = node(next(iter(local_vars.values())))\n else:\n update[\"final_answer\"] = node(None)\n\n except Exception as e:\n update[\"final_answer\"] = node(None)\n update[\"error\"] = str(e)\n\n return update\n\ndef create_graph_solve_with_PAL_Strategy():\n g = StateGraph(dict)\n g.add_node(\"parse\", parse_problem)\n g.add_node(\"calculate\", execute_code)\n g.add_edge(START, \"parse\")\n g.add_edge(\"parse\", \"calculate\")\n g.add_edge(\"calculate\", END)\n return g\n\ndef solve_with_PAL_Strategy(problem: str) -> dict:\n global _last_exec_time_s\n _last_exec_time_s = 0.0 # reset before each invocation\n\n g = create_graph_solve_with_PAL_Strategy()\n compiled = g.compile()\n\n if SHOW_MERMAID_GRAPH:\n try:\n from IPython.display import Image, display\n display(Image(compiled.get_graph(xray=1).draw_mermaid_png()))\n except Exception:\n pass\n\n # --- M2: measure end-to-end graph execution time (LLM call + code exec) ---\n t0 = time.perf_counter()\n result = compiled.invoke({\"question\": get_no_node(problem)})\n t1 = time.perf_counter()\n _last_exec_time_s = t1 - t0\n\n if \"final_answer\" not in result:\n return {\"final_answer\": node(\"No solution found\")}\n if isinstance(result[\"final_answer\"], str):\n return {\"final_answer\": node(result[\"final_answer\"])}\n return result\n\n# Default graph spec\nGRAPH_ROOT = \"solve_with_PAL_Strategy\"\nGRAPH_AGENTS = [\"parse_problem\", \"execute_code\"]\nGRAPH_PROMPTS = [prompt_parse_problem]\n\nprint(\"PAL strategy 
loaded (with end-to-end timing instrumentation).\")\nprint(\"solve_with_PAL_Strategy() measures total graph time (LLM + code exec) via time.perf_counter().\")"
+   "source": [
+    "import re, time\n",
+    "from langgraph.graph import StateGraph, START, END\n",
+    "\n",
+    "# -----------------------\n",
+    "# Strategy: PAL (Program-Aided Language model)\n",
+    "# -----------------------\n",
+    "prompt_parse_problem = node(\n",
+    "    \"Read the problem and write Python code that sets a variable named `result` to the final answer.\\n\"\n",
+    "    \"- Output ONLY valid Python (no markdown fences).\\n\"\n",
+    "    \"- If the task is multiple-choice, set result to the option label exactly (e.g., '(A)').\\n\\n\"\n",
+    "    \"Problem:\\n\",\n",
+    "    trainable=True,\n",
+    "    description=\"PAL prompt that generates python code producing a `result`.\"\n",
+    ")\n",
+    "\n",
+    "# Global variable to capture execution time from the graph invocation.\n",
+    "# This is read by run_solver_on_example() to populate the guide's score_dict.\n",
+    "_last_exec_time_s = 0.0\n",
+    "\n",
+    "def parse_problem(state: dict):\n",
+    "    state = get_no_node(state)\n",
+    "    question = get_no_node(state.get(\"question\", \"\"))\n",
+    "    prompt = prompt_parse_problem + question\n",
+    "    code_str = llm_call(get_no_node(prompt))\n",
+    "    return {\"code\": code_str.strip(), \"question\": question}\n",
+    "\n",
+    "def execute_code(state: dict):\n",
+    "    \"\"\"Execute LLM-generated Python code.\n",
+    "\n",
+    "    The PAL strategy: exec() the code produced by the LLM and extract the\n",
+    "    `result` variable as the final answer.\n",
+    "    \"\"\"\n",
+    "    def strip_python_tags(code: str) -> str:\n",
+    "        # Extract the fenced code body if a ```python fence is present;\n",
+    "        # otherwise return the input unchanged. (A single lazy re.sub over\n",
+    "        # the whole string can match an empty body and leave the closing\n",
+    "        # fence behind, which makes exec() raise SyntaxError.)\n",
+    "        m = re.search(r'```(?:python)?\\s*\\n(.*?)\\n?```', code, re.DOTALL)\n",
+    "        return m.group(1) if m is not None else code\n",
+    "\n",
+    "    update = {}\n",
+    "    try:\n",
+    "        code_to_run = strip_python_tags(get_no_node(state.get(\"code\", \"\")))\n",
+    "        local_vars = {}\n",
+    "        exec(code_to_run, 
{}, local_vars) # noqa: S102 - intentional PAL strategy\n", + " local_vars.pop(\"__builtins__\", None)\n", + "\n", + " if \"result\" in local_vars:\n", + " update[\"final_answer\"] = node(local_vars[\"result\"])\n", + " elif len(local_vars) == 1:\n", + " update[\"final_answer\"] = node(next(iter(local_vars.values())))\n", + " else:\n", + " update[\"final_answer\"] = node(None)\n", + "\n", + " except Exception as e:\n", + " update[\"final_answer\"] = node(None)\n", + " update[\"error\"] = str(e)\n", + "\n", + " return update\n", + "\n", + "def create_graph_solve_with_PAL_Strategy():\n", + " g = StateGraph(dict)\n", + " g.add_node(\"parse\", parse_problem)\n", + " g.add_node(\"calculate\", execute_code)\n", + " g.add_edge(START, \"parse\")\n", + " g.add_edge(\"parse\", \"calculate\")\n", + " g.add_edge(\"calculate\", END)\n", + " return g\n", + "\n", + "def solve_with_PAL_Strategy(problem: str) -> dict:\n", + " global _last_exec_time_s\n", + " _last_exec_time_s = 0.0 # reset before each invocation\n", + "\n", + " g = create_graph_solve_with_PAL_Strategy()\n", + " compiled = g.compile()\n", + "\n", + " if SHOW_MERMAID_GRAPH:\n", + " try:\n", + " from IPython.display import Image, display\n", + " display(Image(compiled.get_graph(xray=1).draw_mermaid_png()))\n", + " except Exception:\n", + " pass\n", + "\n", + " # --- M2: measure end-to-end graph execution time (LLM call + code exec) ---\n", + " t0 = time.perf_counter()\n", + " try:\n", + " result = compiled.invoke({\"question\": get_no_node(problem)})\n", + " except Exception as e:\n", + " _last_exec_time_s = time.perf_counter() - t0\n", + " return {\"final_answer\": node(None), \"error\": str(e)}\n", + " t1 = time.perf_counter()\n", + " _last_exec_time_s = t1 - t0\n", + "\n", + " if \"final_answer\" not in result:\n", + " return {\"final_answer\": node(\"No solution found\")}\n", + " if isinstance(result[\"final_answer\"], str):\n", + " return {\"final_answer\": node(result[\"final_answer\"])}\n", + " return 
result\n", + "\n", + "# Default graph spec\n", + "GRAPH_ROOT = \"solve_with_PAL_Strategy\"\n", + "GRAPH_AGENTS = [\"parse_problem\", \"execute_code\"]\n", + "GRAPH_PROMPTS = [prompt_parse_problem]\n", + "\n", + "print(\"PAL strategy loaded (with end-to-end timing instrumentation).\")\n", + "print(\"solve_with_PAL_Strategy() measures total graph time (LLM + code exec) via time.perf_counter().\")" + ] }, { "cell_type": "code", @@ -662,7 +775,187 @@ "id": "cell-training", "metadata": {}, "outputs": [], - "source": "from typing import List, Dict, Tuple\nimport time\n\n# -----------------------\n# Multi-objective instrumented solver + evaluator\n# -----------------------\n\n# Build the guide with multi-objective support\nguide = LangGraphGuide(\n feedback_func=feedback_answer_bbeh,\n answer_key=\"final_answer\",\n allowed_answer_set=allowed_set,\n)\n\ndef run_solver_on_example(ex: dict) -> Tuple[bool, str, str, Dict[str, float]]:\n \"\"\"Run solver and return (ok, pred, feedback, score_dict).\n\n score_dict contains {accuracy, execution_time_s} from get_score_dict().\n \"\"\"\n global _last_exec_time_s\n _last_exec_time_s = 0.0\n\n out = solve_with_PAL_Strategy(ex[\"question\"])\n pred = get_no_node(out.get(\"final_answer\"))\n ok, fb = feedback_answer_bbeh(pred, ex[\"solution\"], allowed_set)\n\n # Populate guide's execution time from the global, then get score_dict\n guide._last_execution_time_s = _last_exec_time_s\n score_dict = guide.get_score_dict(ex[\"question\"], out, ex[\"solution\"])\n\n return ok, str(pred), fb, score_dict\n\ndef evaluate(examples: List[dict], *, name: str) -> Tuple[float, List[Dict[str, float]]]:\n \"\"\"Evaluate examples, returning (accuracy, list of score_dicts).\"\"\"\n n_ok = 0\n all_score_dicts = []\n for i, ex in enumerate(examples, 1):\n ok, pred, fb, sd = run_solver_on_example(ex)\n n_ok += int(ok)\n all_score_dicts.append(sd)\n print(f\"[{name}] {i:02d}/{len(examples)} ok={ok} pred={pred} \"\n 
f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n acc = n_ok / max(1, len(examples))\n mean_time = sum(sd['execution_time_s'] for sd in all_score_dicts) / max(1, len(all_score_dicts))\n print(f\"[{name}] accuracy = {acc:.3f} ({n_ok}/{len(examples)}), mean exec_time = {mean_time:.4f}s\")\n return acc, all_score_dicts\n\n\n# =====================================================================\n# Baseline evaluation\n# =====================================================================\nprint(\"=\" * 60)\nprint(\"BASELINE evaluation on validation set\")\nprint(\"=\" * 60)\nbaseline_acc, baseline_score_dicts = evaluate(val_set, name=\"baseline/val\")\n\n# =====================================================================\n# Per-step metric collection during curriculum training\n# =====================================================================\n# Stores {step, phase, accuracy, execution_time_s, example_idx} per observation\nmetric_log = []\nstep_counter = 0\n\n# Record baseline metrics\nfor i, sd in enumerate(baseline_score_dicts):\n metric_log.append({\n \"step\": 0,\n \"phase\": \"baseline\",\n \"example_idx\": i,\n **sd,\n })\n\n# =====================================================================\n# Curriculum training (Mode B) with metric collection\n# =====================================================================\nif SKIP_OPTIMIZATION:\n print(\"SKIP_OPTIMIZATION=1 -> skipping optimization/training.\")\nelse:\n last_successes: List[dict] = []\n\n for idx, ex in enumerate(train_set, 1):\n step_counter += 1\n ok, pred, fb, sd = run_solver_on_example(ex)\n print(f\"[train] {idx:02d}/{len(train_set)} ok={ok} pred={pred} \"\n f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n\n # Log pre-optimization metric\n metric_log.append({\n \"step\": step_counter,\n \"phase\": \"train_pre\",\n \"example_idx\": idx - 1,\n **sd,\n })\n\n if ok:\n last_successes.append(ex)\n last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n continue\n\n # 
Optimize on the failing example\n modified, dump_file, history, chosen_state, run_dir = optimize_langgraph(\n graph_root_function=GRAPH_ROOT,\n graph_agents_functions=GRAPH_AGENTS,\n graph_prompts_list=GRAPH_PROMPTS,\n question=ex[\"question\"],\n solution=ex[\"solution\"],\n answer_feedback_func=feedback_answer_bbeh,\n allowed_answer_set=allowed_set,\n validation_set=last_successes,\n accumulation_steps=ACCUMULATION_STEPS,\n retry=LEARNING_RETRY,\n max_attempts=MAX_ATTEMPTS,\n test_optimization=True,\n stop_on_success=True,\n seed=SEED,\n dump_prefix=f\"BBEH_{BBEH_TASK_NAME}__PAL__\",\n output_folder=OUTPUT_FOLDER,\n )\n\n print(\"[train] optimize_langgraph:\", {\"modified\": modified, \"dump_file\": dump_file, \"run_dir\": run_dir})\n if history:\n print(\"[train] last history entry:\", history[-1])\n\n # Re-test after optimization.\n # Wrapped in try/except: when optimization fails to update params,\n # the Trace bundle state can be corrupted (Node objects where dicts\n # are expected), causing ExecutionError in the re-test.\n try:\n ok2, pred2, fb2, sd2 = run_solver_on_example(ex)\n print(f\"[train] after-opt ok={ok2} pred={pred2} \"\n f\"exec_time={sd2['execution_time_s']:.4f}s :: {fb2}\")\n\n # Log post-optimization metric\n metric_log.append({\n \"step\": step_counter,\n \"phase\": \"train_post\",\n \"example_idx\": idx - 1,\n **sd2,\n })\n\n if ok2:\n last_successes.append(ex)\n last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n except Exception as e:\n print(f\"[train] after-opt re-test failed (graph state corrupted): {type(e).__name__}: {e}\")\n print(\"[train] skipping this example and continuing.\")\n\n# =====================================================================\n# Post-training evaluation\n# =====================================================================\nprint(\"\\n\" + \"=\" * 60)\nprint(\"POST-TRAINING evaluation on validation set\")\nprint(\"=\" * 60)\nfinal_acc, final_score_dicts = evaluate(val_set, name=\"final/val\")\n\n# 
Record final eval metrics\nstep_counter += 1\nfor i, sd in enumerate(final_score_dicts):\n metric_log.append({\n \"step\": step_counter,\n \"phase\": \"final\",\n \"example_idx\": i,\n **sd,\n })\n\nprint(f\"\\nSummary: baseline_val_acc={baseline_acc:.3f}, final_val_acc={final_acc:.3f}\")\nprint(f\"Total metric observations collected: {len(metric_log)}\")" + "source": [ + "from typing import List, Dict, Tuple\n", + "import time\n", + "\n", + "# -----------------------\n", + "# Multi-objective instrumented solver + evaluator\n", + "# -----------------------\n", + "\n", + "# Build the guide with multi-objective support\n", + "guide = LangGraphGuide(\n", + " feedback_func=feedback_answer_bbeh,\n", + " answer_key=\"final_answer\",\n", + " allowed_answer_set=allowed_set,\n", + ")\n", + "\n", + "def run_solver_on_example(ex: dict) -> Tuple[bool, str, str, Dict[str, float]]:\n", + " \"\"\"Run solver and return (ok, pred, feedback, score_dict).\n", + "\n", + " score_dict contains {accuracy, execution_time_s} from get_score_dict().\n", + " \"\"\"\n", + " global _last_exec_time_s\n", + " _last_exec_time_s = 0.0\n", + "\n", + " out = solve_with_PAL_Strategy(ex[\"question\"])\n", + " pred = get_no_node(out[\"final_answer\"])\n", + " ok, fb = feedback_answer_bbeh(pred, ex[\"solution\"], allowed_set)\n", + "\n", + " # Populate guide's execution time from the global, then get score_dict\n", + " guide._last_execution_time_s = _last_exec_time_s\n", + " score_dict = guide.get_score_dict(ex[\"question\"], out, ex[\"solution\"])\n", + "\n", + " return ok, str(pred), fb, score_dict\n", + "\n", + "def evaluate(examples: List[dict], *, name: str) -> Tuple[float, List[Dict[str, float]]]:\n", + " \"\"\"Evaluate examples, returning (accuracy, list of score_dicts).\"\"\"\n", + " n_ok = 0\n", + " all_score_dicts = []\n", + " for i, ex in enumerate(examples, 1):\n", + " ok, pred, fb, sd = run_solver_on_example(ex)\n", + " n_ok += int(ok)\n", + " all_score_dicts.append(sd)\n", + " 
print(f\"[{name}] {i:02d}/{len(examples)} ok={ok} pred={pred} \"\n", + " f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n", + " acc = n_ok / max(1, len(examples))\n", + " mean_time = sum(sd['execution_time_s'] for sd in all_score_dicts) / max(1, len(all_score_dicts))\n", + " print(f\"[{name}] accuracy = {acc:.3f} ({n_ok}/{len(examples)}), mean exec_time = {mean_time:.4f}s\")\n", + " return acc, all_score_dicts\n", + "\n", + "\n", + "# =====================================================================\n", + "# Baseline evaluation\n", + "# =====================================================================\n", + "print(\"=\" * 60)\n", + "print(\"BASELINE evaluation on validation set\")\n", + "print(\"=\" * 60)\n", + "baseline_acc, baseline_score_dicts = evaluate(val_set, name=\"baseline/val\")\n", + "\n", + "# =====================================================================\n", + "# Per-step metric collection during curriculum training\n", + "# =====================================================================\n", + "# Stores {step, phase, accuracy, execution_time_s, example_idx} per observation\n", + "metric_log = []\n", + "step_counter = 0\n", + "\n", + "# Record baseline metrics\n", + "for i, sd in enumerate(baseline_score_dicts):\n", + " metric_log.append({\n", + " \"step\": 0,\n", + " \"phase\": \"baseline\",\n", + " \"example_idx\": i,\n", + " **sd,\n", + " })\n", + "\n", + "# =====================================================================\n", + "# Curriculum training (Mode B) with metric collection\n", + "# =====================================================================\n", + "if SKIP_OPTIMIZATION:\n", + " print(\"SKIP_OPTIMIZATION=1 -> skipping optimization/training.\")\n", + "else:\n", + " last_successes: List[dict] = []\n", + "\n", + " for idx, ex in enumerate(train_set, 1):\n", + " step_counter += 1\n", + " try:\n", + " ok, pred, fb, sd = run_solver_on_example(ex)\n", + " except Exception as e:\n", + " print(f\"[train] 
{idx:02d}/{len(train_set)} CRASHED: {type(e).__name__}: {e}\")\n", + " ok, pred, fb = False, \"ERROR\", str(e)\n", + " sd = {\"accuracy\": 0.0, \"execution_time_s\": 0.0}\n", + " print(f\"[train] {idx:02d}/{len(train_set)} ok={ok} pred={pred} \"\n", + " f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n", + "\n", + " # Log pre-optimization metric\n", + " metric_log.append({\n", + " \"step\": step_counter,\n", + " \"phase\": \"train_pre\",\n", + " \"example_idx\": idx - 1,\n", + " **sd,\n", + " })\n", + "\n", + " if ok:\n", + " last_successes.append(ex)\n", + " last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n", + " continue\n", + "\n", + " # Save pre-optimization function references for crash recovery\n", + " _pre_opt_agents = {name: globals().get(name) for name in GRAPH_AGENTS}\n", + "\n", + " # Optimize on the failing example\n", + " modified, dump_file, history, chosen_state, run_dir = optimize_langgraph(\n", + " graph_root_function=GRAPH_ROOT,\n", + " graph_agents_functions=GRAPH_AGENTS,\n", + " graph_prompts_list=GRAPH_PROMPTS,\n", + " question=ex[\"question\"],\n", + " solution=ex[\"solution\"],\n", + " answer_feedback_func=feedback_answer_bbeh,\n", + " allowed_answer_set=allowed_set,\n", + " validation_set=last_successes,\n", + " accumulation_steps=ACCUMULATION_STEPS,\n", + " retry=LEARNING_RETRY,\n", + " max_attempts=MAX_ATTEMPTS,\n", + " test_optimization=True,\n", + " stop_on_success=True,\n", + " seed=SEED,\n", + " dump_prefix=f\"BBEH_{BBEH_TASK_NAME}__PAL__\",\n", + " output_folder=OUTPUT_FOLDER,\n", + " )\n", + "\n", + " print(\"[train] optimize_langgraph:\", {\"modified\": modified, \"dump_file\": dump_file, \"run_dir\": run_dir})\n", + " if history:\n", + " print(\"[train] last history entry:\", history[-1])\n", + "\n", + " # Re-test after optimization.\n", + " # Wrapped in try/except: when optimization fails to update params,\n", + " # the Trace bundle state can be corrupted (Node objects where dicts\n", + " # are expected), causing 
ExecutionError in the re-test.\n", + " try:\n", + " ok2, pred2, fb2, sd2 = run_solver_on_example(ex)\n", + " print(f\"[train] after-opt ok={ok2} pred={pred2} \"\n", + " f\"exec_time={sd2['execution_time_s']:.4f}s :: {fb2}\")\n", + "\n", + " # Log post-optimization metric\n", + " metric_log.append({\n", + " \"step\": step_counter,\n", + " \"phase\": \"train_post\",\n", + " \"example_idx\": idx - 1,\n", + " **sd2,\n", + " })\n", + "\n", + " if ok2:\n", + " last_successes.append(ex)\n", + " last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n", + " except Exception as e:\n", + " print(f\"[train] after-opt re-test failed (graph state corrupted): {type(e).__name__}: {e}\")\n", + " # Restore pre-optimization functions so next example doesn't crash\n", + " for _name, _orig in _pre_opt_agents.items():\n", + " if _orig is not None:\n", + " globals()[_name] = _orig\n", + " print(\"[train] restored original functions; continuing.\")\n", + "\n", + "# =====================================================================\n", + "# Post-training evaluation\n", + "# =====================================================================\n", + "print(\"\\n\" + \"=\" * 60)\n", + "print(\"POST-TRAINING evaluation on validation set\")\n", + "print(\"=\" * 60)\n", + "final_acc, final_score_dicts = evaluate(val_set, name=\"final/val\")\n", + "\n", + "# Record final eval metrics\n", + "step_counter += 1\n", + "for i, sd in enumerate(final_score_dicts):\n", + " metric_log.append({\n", + " \"step\": step_counter,\n", + " \"phase\": \"final\",\n", + " \"example_idx\": i,\n", + " **sd,\n", + " })\n", + "\n", + "print(f\"\\nSummary: baseline_val_acc={baseline_acc:.3f}, final_val_acc={final_acc:.3f}\")\n", + "print(f\"Total metric observations collected: {len(metric_log)}\")" + ] }, { "cell_type": "markdown", @@ -856,7 +1149,7 @@ "\n", "This demonstrates the M2 multi-objective infrastructure on a real LLM task.\n", "The same get_score_dict() interface works with BasicSearch, 
BeamsearchAlgorithm,\n", - "and PrioritySearch (see t6_m2_trainers.ipynb for those algorithms).\n", + "and PrioritySearch (see multiobjective_trainers.ipynb for those algorithms).\n", "\"\"\")" ] } @@ -874,4 +1167,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/examples/notebooks/t6_m1_vector_scores.ipynb b/examples/notebooks/multiobjective_quickstart.ipynb similarity index 98% rename from examples/notebooks/t6_m1_vector_scores.ipynb rename to examples/notebooks/multiobjective_quickstart.ipynb index 52bc5c73..07732b73 100644 --- a/examples/notebooks/t6_m1_vector_scores.ipynb +++ b/examples/notebooks/multiobjective_quickstart.ipynb @@ -6,7 +6,7 @@ "id": "a0000001", "metadata": {}, "outputs": [], - "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n !git clone https://github.com/carlosrod723/OpenTrace.git Trace\n %cd Trace\n !git checkout t6-multi-objective-m0\n !sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n !pip install -e .\nelse:\n # Local: ensure repo root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")" + "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n !git clone https://github.com/AgentOpt/OpenTrace.git Trace\n %cd Trace\n !git checkout experimental\n !pip install -e .\nelse:\n # Local: ensure repo root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n 
_repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")" }, { "cell_type": "markdown", @@ -15,7 +15,7 @@ "source": [ "# T6 Multi-Objective Vector Scores — M1 Implementation\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m1_vector_scores.ipynb)\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/examples/notebooks/multiobjective_quickstart.ipynb)\n", "\n", "**Milestone 1 Deliverable** — Core multi-objective infrastructure\n", "\n", @@ -1124,4 +1124,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/examples/notebooks/t6_m2_trainers.ipynb b/examples/notebooks/multiobjective_trainers.ipynb similarity index 96% rename from examples/notebooks/t6_m2_trainers.ipynb rename to examples/notebooks/multiobjective_trainers.ipynb index 8984f43e..5c97209b 100644 --- a/examples/notebooks/t6_m2_trainers.ipynb +++ b/examples/notebooks/multiobjective_trainers.ipynb @@ -6,7 +6,7 @@ "id": "cell-setup", "metadata": {}, "outputs": [], - "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n %cd /content\n !rm -rf Trace # clean slate\n !git clone https://github.com/carlosrod723/OpenTrace.git Trace\n %cd Trace\n !git checkout t6-multi-objective-m0\n !sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n !pip install -e .\n !pip install cvxpy matplotlib pandas\n _repo_root = os.getcwd() # /content/Trace after %cd\nelse:\n # Local: ensure repo 
root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")\n\nprint(f\"Repo root: {_repo_root}\")\n\n# Verify cvxpy is available (required for SixHumpCamel SOS certificate)\ntry:\n import cvxpy\n print(f\"cvxpy {cvxpy.__version__} available\")\nexcept ImportError:\n raise ImportError(\"cvxpy is required: pip install cvxpy\")" + "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n %cd /content\n !rm -rf Trace # clean slate\n !git clone https://github.com/AgentOpt/OpenTrace.git Trace\n %cd Trace\n !git checkout experimental\n !pip install -e .\n !pip install cvxpy matplotlib pandas\n _repo_root = os.getcwd() # /content/Trace after %cd\nelse:\n # Local: ensure repo root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")\n\nprint(f\"Repo root: {_repo_root}\")\n\n# Verify cvxpy is available (required for SixHumpCamel SOS certificate)\ntry:\n import cvxpy\n print(f\"cvxpy {cvxpy.__version__} available\")\nexcept ImportError:\n raise ImportError(\"cvxpy is required: pip install cvxpy\")" }, { "cell_type": "markdown", @@ -15,7 +15,7 @@ "source": [ "# T6 M2 — BeamsearchAlgorithm & PrioritySearch Multi-Objective\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m2_trainers.ipynb)\n", + 
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/examples/notebooks/multiobjective_trainers.ipynb)\n", "\n", "**Milestone 2 Deliverable** — Multi-objective support in BeamsearchAlgorithm and PrioritySearch\n", "\n", @@ -182,4 +182,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/setup.py b/setup.py index 8fdfd139..dbd60be5 100644 --- a/setup.py +++ b/setup.py @@ -29,5 +29,5 @@ long_description=open('README.md', encoding="utf8").read(), packages=setuptools.find_packages(include=["opto*"]), install_requires=install_requires, - python_requires=">=3.13", + python_requires=">=3.10", )
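For reference, the two-objective scheme the notebooks in this patch describe (per-example score dicts with `accuracy` and `execution_time_s`, ranked by weighted scalarization or Pareto dominance) can be sketched in a few lines of plain Python. This is an illustrative stand-in only: `scalarize`, `dominates`, and the weight values below are hypothetical helpers, not the `ObjectiveConfig` / `Guide.get_score_dict()` API itself.

```python
from typing import Dict, List

def scalarize(score_dict: Dict[str, float], weights: Dict[str, float]) -> float:
    # Weighted sum of named metrics; a negative weight on execution_time_s
    # penalizes slow candidates while still rewarding accuracy.
    return sum(weights[k] * score_dict[k] for k in weights)

def dominates(a: Dict[str, float], b: Dict[str, float], maximize: List[str]) -> bool:
    # Pareto dominance: a is at least as good as b on every objective
    # and strictly better on at least one.
    ge = all(a[k] >= b[k] for k in maximize)
    gt = any(a[k] > b[k] for k in maximize)
    return ge and gt

candidates = [
    {"accuracy": 1.0, "execution_time_s": 2.5},
    {"accuracy": 1.0, "execution_time_s": 0.8},
    {"accuracy": 0.0, "execution_time_s": 0.3},
]

# Negate time so "higher is better" holds for both objectives.
as_max = [{"accuracy": c["accuracy"], "neg_time": -c["execution_time_s"]} for c in candidates]
front = [c for c in as_max
         if not any(dominates(o, c, ["accuracy", "neg_time"]) for o in as_max)]

best = max(candidates, key=lambda c: scalarize(c, {"accuracy": 1.0, "execution_time_s": -0.1}))
print(best)        # -> {'accuracy': 1.0, 'execution_time_s': 0.8}
print(len(front))  # -> 2 (the accurate-and-fast candidate plus the fastest one)
```

Scalarization picks a single winner given fixed weights, while the Pareto front keeps every candidate that is not strictly worse on all metrics; the trainers in the notebooks use the same distinction when ranking beam candidates.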