diff --git a/.gitignore b/.gitignore
index 17aa1eeb..ba03db69 100644
--- a/.gitignore
+++ b/.gitignore
@@ -168,4 +168,5 @@ OAI_CONFIG_LIST
 *.gv.pdf
 # jupyter book API output
-docs/api/*
\ No newline at end of file
+docs/api/*
+examples/notebooks/bbeh/
+t6_m2_bbeh_2.ipynb
diff --git a/README.md b/README.md
index fdfc153c..e46a2e00 100644
--- a/README.md
+++ b/README.md
@@ -32,7 +32,7 @@ Or for development, clone the repo and run the following.
 pip install -e .
-The library requires Python >= 3.9. By default (starting with v0.1.3.5), we use [LiteLLM](https://github.com/BerriAI/litellm) as the backend of LLMs. For backward compatibility, we provide backend-support with [AutoGen](https://github.com/microsoft/autogen); when installing, users can add `[autogen]` tag to install a compatible AutoGen version (e.g., `pip install trace-opt[autogen]`). You may require [Git Large File Storage](https://git-lfs.com/) if
+The library requires Python >= 3.10. By default (starting with v0.1.3.5), we use [LiteLLM](https://github.com/BerriAI/litellm) as the backend of LLMs. For backward compatibility, we provide backend-support with [AutoGen](https://github.com/microsoft/autogen); when installing, users can add `[autogen]` tag to install a compatible AutoGen version (e.g., `pip install trace-opt[autogen]`). You may require [Git Large File Storage](https://git-lfs.com/) if
 git is unable to clone the repository. **For questions or reporting bugs, please use Github Issues or post on our [Discord channel](https://discord.gg/4VeAvwFcWy).
We actively check these channels.**
@@ -241,6 +241,36 @@ Defining and training an agent through Trace will give you more flexibility and
 | Advanced | [Robotic Arm Control](https://agentopt.github.io/Trace/examples/robotics/metaworld.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/Trace/blob/website/docs/examples/robotics/metaworld.ipynb) | Trace can optimize code to control a robotic arm after observing a full trajectory of interactions. |
+## Multi-Objective Optimization
+
+Trace supports **multi-objective optimization** where candidates are evaluated on
+multiple metrics simultaneously (e.g. accuracy + token cost, or base loss +
+regularization loss).
+
+See the full guide: **[docs/multi_objective_scores.md](docs/multi_objective_scores.md)**
+
+Key features:
+- **Vector scores** — `Guide.get_score_dict()` returns `Dict[str, float]` with named metrics
+- **Weighted scalarization** and **Pareto dominance** ranking via `ObjectiveConfig`
+- Supported in `BasicSearchAlgorithm`, `BeamsearchAlgorithm`, and `BeamsearchHistoryAlgorithm`
+- Token-minimization pattern using `UsageTrackingLLM` + `TokenUsageAugmentingGuide`
+
+Canonical notebooks:
+
+| Notebook | Description |
+|---|---|
+| [multiobjective_quickstart](examples/notebooks/multiobjective_quickstart.ipynb) | Core vector-score infrastructure and BasicSearch integration |
+| [multiobjective_trainers](examples/notebooks/multiobjective_trainers.ipynb) | Beamsearch and PrioritySearch multi-objective support |
+| [multiobjective_bbeh_langgraph](examples/notebooks/multiobjective_bbeh_langgraph.ipynb) | Real LLM task: BBEH boolean expressions with accuracy + execution time |
+
+Trace-Bench multi-objective benchmarks (in [AgentOpt/Trace-Bench](https://github.com/AgentOpt/Trace-Bench)):
+
+| Notebook | Task | Metrics |
+|---|---|---|
+| `multiobjective_convex` | SixHumpCamel | base_loss, reg_loss |
+| `multiobjective_bbeh` | BBEH
boolean_expressions | accuracy, execution_time_s | +| `multiobjective_gsm8k` | GSM8K + token usage | error, tokens_in, tokens_out | + ## Supported Optimizers Currently, we support three optimizers: diff --git a/docs/T6_technical_plan.md b/docs/T6_technical_plan.md deleted file mode 100644 index 60891818..00000000 --- a/docs/T6_technical_plan.md +++ /dev/null @@ -1,843 +0,0 @@ -# T6 Technical Plan — Multi-Objective Vector Scores for Trainer Selection - -**Version:** 1.0 (Refined) -**Author:** Carlos Rodriguez -**Date:** February 9, 2026 -**Status:** M0 Deliverable — Analysis + Architecture + Interface Spec - -**Target repos / branches:** -- **Primary (implementation + PR):** [`AgentOpt/OpenTrace@experimental`](https://github.com/AgentOpt/OpenTrace/tree/experimental) -- **Benchmark integration (M3):** [`AgentOpt/Trace-Bench`](https://github.com/AgentOpt/Trace-Bench) - ---- - -## Table of Contents - -1. [Executive Summary](#1-executive-summary) -2. [Goals, Non-Goals, Success Criteria](#2-goals-non-goals-success-criteria) -3. [Current Code Reality (Baseline)](#3-current-code-reality-baseline) -4. [Proposed Architecture (Minimal Delta)](#4-proposed-architecture-minimal-delta) -5. [Public API & Data Contracts](#5-public-api--data-contracts) -6. [Module Modifications (Files to Create / Modify)](#6-module-modifications) -7. [Edge Cases & Defensive Design](#7-edge-cases--defensive-design) -8. [Milestones & Validation Gates](#8-milestones--validation-gates) -9. [Test Plan](#9-test-plan) -10. [Risks & Mitigation](#10-risks--mitigation) -11. [Design Decisions (Resolved)](#11-design-decisions-resolved) -12. [Appendix: Code Touchpoints](#12-appendix-code-touchpoints) - ---- - -## 1. Executive Summary - -Today, trainer selection in Trace is driven by a **single scalar score**. 
Guides return `Tuple[float, str]` via `get_feedback()`, evaluators produce `np.array` of floats, and trainers (`BasicSearchAlgorithm`, `BeamsearchAlgorithm`) select candidates via scalar comparison (`max(candidates, key=lambda x: x[0])` and `sorted(..., key=lambda x: x[0])` respectively). This blocks trainer-side search from exploiting multiple metrics like `{accuracy, latency_ms, cost}`. - -**Motivation note (from team discussion):** -Putting multiple metrics into the *feedback dict/text* is useful for optimizers (OptoPrime/OPRO), but trainers (BasicSearch/UCB/PrioritySearch/GEPA) typically only inspect the **scalar score** for ranking/UCB and ignore additional feedback structure. Therefore, enabling **vector score / score-as-dict** (with backward-compatible scalar reduction) is required for multi-objective trainer selection. - -### What this plan adds - -| Component | Change | -|-----------|--------| -| **Score contract** | `Dict[str, float]` returned by guides (optional), with backward-compatible scalar fallback | -| **ObjectiveConfig** | Frozen dataclass defining selection mode: `scalar` (default), `weighted`, or `pareto` | -| **objectives.py** (new) | All multi-objective logic isolated in pure, testable functions | -| **Evaluators** | Vector-score aggregation helpers (`evaluate_vector`, `aggregate_vector_scores`) | -| **BasicSearchAlgorithm** | Selection via `select_best(candidates, objective_config)` | -| **BeamsearchAlgorithm** | Selection via `select_top_k(candidates, objective_config, k)` | -| **PrioritySearch** (optional) | Scalarize heap priority via ObjectiveConfig; store dict for logging | -| **Benchmarks** (M3) | 3 simple benchmarks integrated into Trace-Bench | - -### Guiding principles - -- **Backward compatibility is non-negotiable.** `mode="scalar"` (the default) preserves identical behavior. -- **Isolate complexity.** All multi-objective logic lives in `objectives.py` — pure functions, easy to test. 
-- **Minimal churn.** Trainers gain an optional `objective_config` parameter; existing call sites are untouched. -- **Determinism.** Fixed `seed` → deterministic selection, especially Pareto tie-breaks. - ---- - -## 2. Goals, Non-Goals, Success Criteria - -### 2.1 Goals - -| ID | Goal | Acceptance Signal | -|----|------|-------------------| -| G1 | **Backward compatibility** | Existing scalar-score guides/trainers produce identical results when `objective_config` is `None` or `mode="scalar"` | -| G2 | **Vector score support** | Guide returns `{"accuracy": 1.0, "latency_ms": 120.0}` and trainers select candidates using weighted or Pareto mode | -| G3 | **Determinism** | Fixed `seed` → identical selection across runs (tested in CI) | -| G4 | **Actionability** | Every milestone: Colab notebook + pytest coverage (M1+) | -| G5 | **Benchmarks** | 3 benchmarks defined, integrated into Trace-Bench, runnable from notebooks | - -### 2.2 Non-goals (explicit) - -- No multi-objective UCB (MO-UCB) — too risky for v1 scope. -- No Pareto archive / non-dominated set management inside PrioritySearch. -- No changes to optimizer internals or new telemetry infrastructure. -- No modification to `get_feedback()` return signature (we use a helper instead). - -### 2.3 Crisp success criteria - -All of the following must be true: - -1. Scalar-only trainers still work and produce same results by default. -2. Multi-objective guide dict works end-to-end for BasicSearch + Beamsearch. -3. Deterministic behavior with fixed seed (tests + notebook). -4. Each milestone delivers a runnable Colab notebook. -5. From M1 onward, new functions have pytest tests and CI is green. -6. M3: three benchmarks exist, run, and Trace-Bench integration works. - ---- - -## 3. 
Current Code Reality (Baseline) - -### 3.1 Guide — scalar score contract - -```python -# opto/trainer/guide.py - -class Guide: - def get_feedback(self, query, response, reference=None, **kwargs) -> Tuple[float, str]: - raise NotImplementedError - - def metric(self, query, response, reference=None, **kwargs) -> float: - return self.get_feedback(query, response, reference)[0] # extracts scalar -``` - -**Implication:** `metric()` always returns `float`. Multi-metric feedback is not usable for selection. - -### 3.2 Evaluators — scalar arrays - -```python -# opto/trainer/evaluators.py - -def evaluate(agent, guide, inputs, infos, ...) -> np.ndarray: - # Calls guide.metric() per example → float - # Returns np.array of shape (N,) or (N, num_samples) -``` - -**Implication:** All scores are numeric scalars aggregated via `np.mean()`. - -### 3.3 BasicSearchAlgorithm — scalar max selection - -```python -# opto/trainer/algorithms/basic_algorithms.py :: BasicSearchAlgorithm.optimizer_step() - -def validate(): - scores = evaluate(self.agent, self.validate_guide, ...) - return np.mean(scores) if all([s is not None for s in scores]) else -np.inf - -# Selection: -candidates.append((score, update_dict)) # score is float -best_score, best_update = max(candidates, key=lambda x: x[0]) # scalar max -``` - -**Insertion point:** Replace `max(candidates, ...)` with `select_best(candidates, objective_config)`. - -### 3.4 BeamsearchAlgorithm — scalar sort selection - -```python -# opto/trainer/algorithms/beamsearch_algorithm.py :: BeamsearchAlgorithm.select() - -scored_candidates.append((validation_score, candidate_params)) # float -sorted_candidates = sorted(scored_candidates, key=lambda x: x[0], reverse=True) -selected_candidates = sorted_candidates[:beam_width] # take top-k by scalar -``` - -**Insertion point:** Replace scalar sort with `select_top_k(scored_candidates, objective_config, k=beam_width)`. 
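The planned call-site change can be sketched as follows. This is a minimal scalar-mode stand-in for the proposed `select_top_k` (the real function, specified later in §5.5, also handles weighted and Pareto modes); the candidate data is illustrative.

```python
# Hypothetical stand-in for the proposed select_top_k (scalar fallback only).
def select_top_k(candidates, objective_config=None, k=1):
    # Stable descending sort by scalar score; lower original index wins ties.
    order = sorted(range(len(candidates)),
                   key=lambda i: candidates[i][0], reverse=True)
    return order[:k]

# Target call site in BeamsearchAlgorithm.select():
scored_candidates = [(0.7, "params_a"), (0.9, "params_b"), (0.8, "params_c")]
beam_width = 2
top = select_top_k(scored_candidates, objective_config=None, k=beam_width)
selected_candidates = [scored_candidates[i] for i in top]
# selected_candidates == [(0.9, "params_b"), (0.8, "params_c")]
```

Returning indices rather than sorted tuples keeps the `(score, params)` pairing intact, which matters once scores become dicts.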
- -### 3.5 Shared patterns across both trainers - -| Pattern | BasicSearch | Beamsearch | -|---------|-------------|------------| -| Validate | `np.mean(scores)` → float | `np.mean(validation_scores)` → float | -| Store | `(score, update_dict)` | `(validation_score, candidate_params)` | -| Select | `max(candidates, key=λ x: x[0])` | `sorted(candidates, key=λ x: x[0])[:k]` | -| Fallback | `-np.inf` | `-np.inf` | - -Both converge to the same abstraction: **given a list of `(score, params)` pairs, select the best or top-k.** This is exactly what `objectives.py` will provide. - -### 3.6 Existing infrastructure we leverage - -- **Logger abstraction:** `BaseLogger` with `log(name, value, step)` — can log each metric in a vector score. -- **StubLLM / DummyLLM:** Wraps deterministic callables — usable for CI and no-keys notebooks. -- **`batch_run` / `async_run`:** Parallelism utilities already in place. - ---- - -## 4. Proposed Architecture (Minimal Delta) - -### 4.1 Core idea - -Isolate all multi-objective logic into one new module (`opto/trainer/objectives.py`) containing **pure functions**: - -``` -to_score_dict() → scalar/dict to dict conversion (neutral name) -apply_minimize() → flip signs for minimize metrics -weighted_scalarize()→ dict → float via weighted sum -pareto_rank() → dominance ranking + tie-break -select_best() → given candidates + config, return best index -select_top_k() → given candidates + config, return top-k indices -``` - -Trainers call these functions instead of inline `max()` / `sorted()`. When `objective_config` is `None`, the functions fall through to scalar comparison — **identical to current behavior**. 
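To make the Pareto path concrete, here is a toy illustration of dominance-based selection (assumed semantics only; the exact signatures are specified in §5.5, and all score dicts are taken to be in higher-is-better form):

```python
# Toy dominance check and Pareto front over higher-is-better score dicts.
def dominates(a, b):
    keys = set(a) | set(b)
    return (all(a[k] >= b[k] for k in keys)
            and any(a[k] > b[k] for k in keys))

def pareto_front(score_dicts):
    # Indices of candidates not dominated by any other candidate.
    return [i for i, c in enumerate(score_dicts)
            if not any(dominates(o, c)
                       for j, o in enumerate(score_dicts) if j != i)]

cands = [{"accuracy": 0.9, "speed": 0.1},   # best accuracy
         {"accuracy": 0.5, "speed": 0.9},   # best speed: a genuine trade-off
         {"accuracy": 0.4, "speed": 0.8}]   # dominated by the one above
print(pareto_front(cands))  # -> [0, 1]
```

The first two candidates survive because neither beats the other on every metric; this is exactly the set that the tie-break policy must then reduce to a single winner.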
- -### 4.2 Data flow (target) - -``` -Guide.get_feedback() - │ - ├── returns (float, str) ← existing path, unchanged - └── returns (Dict[str,float], str) ← new path (via get_score_dict helper) - │ - ▼ -Evaluator.evaluate_vector() - │ - ├── per-example: List[Dict[str, float]] - └── aggregated: Dict[str, float] (mean per metric) - │ - ▼ -Trainer selection (objectives.py) - │ - ├── mode="scalar" → max(mean_scores) ← unchanged - ├── mode="weighted" → max(weighted_scalarize()) ← new - └── mode="pareto" → pareto_rank() + tie-break ← new -``` - -### 4.3 Backward compatibility guarantee - -The entire vector-score path is **opt-in**: - -1. If `objective_config` is `None` → existing scalar path, no new code executed. -2. If guide returns `float` and `objective_config` is provided → `to_score_dict()` wraps it as `{"score": float}`, weights default to `{"score": 1.0}`. -3. If guide returns `Dict[str, float]` and `objective_config` is `None` → `ValueError` is raised (no hidden hard-coded dict→scalar reduction). Pass an explicit `ObjectiveConfig(mode="scalar", scalarize_dict="mean")` to reduce via mean, or `scalarize_dict="score"` to use a single key. - ---- - -## 5. Public API & Data Contracts - -### 5.1 Score types - -```python -from typing import Union, Dict - -ScalarScore = float -VectorScore = Dict[str, float] # JSON-serializable, all values finite -ScoreLike = Union[int, float, bool, Dict[str, float]] -``` - -**Contract:** -- "Higher is better" by default for all metrics. -- Metrics to minimize are declared in `ObjectiveConfig.minimize` (semantics: negate internally). -- All dict values must be finite floats. `NaN` / `±inf` in a dict raises `ValueError`. -- `int` and `bool` scalar scores are accepted and converted to `float` (e.g., `LLMJudge` returns `int` 0/1, test guides return `bool`). 
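The contract above can be sketched as follows (assumed behavior of the proposed `to_score_dict`; the authoritative signature lives in §5.5):

```python
import math

# Hypothetical sketch of the score-normalization contract.
def to_score_dict(score):
    if isinstance(score, (int, float)):          # bool is a subclass of int
        return {"score": float(score)}
    if isinstance(score, dict):
        if not score:
            raise ValueError("Score dict must not be empty")
        out = {k: float(v) for k, v in score.items()}
        if any(not math.isfinite(v) for v in out.values()):
            raise ValueError("Score dict contains non-finite value")
        return out
    raise TypeError("Score must be int, float, bool, or Dict[str, float]")

assert to_score_dict(True) == {"score": 1.0}   # bool from test guides
assert to_score_dict(1) == {"score": 1.0}      # int from LLMJudge
assert to_score_dict({"accuracy": 0.9}) == {"accuracy": 0.9}
```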
- -### 5.2 ObjectiveConfig - -```python -from dataclasses import dataclass, field -from typing import Literal, Optional, Dict, Tuple - -@dataclass(frozen=True) -class ObjectiveConfig: - """Configuration for multi-objective candidate selection. - - Attributes: - mode: Selection strategy. - - "scalar": Use existing scalar comparison (default, backward-compatible). - - "weighted": Scalarize via weighted sum, then select max. - - "pareto": Pareto dominance ranking with configurable tie-break. - weights: Per-metric weights for weighted scalarization. - Missing metrics use missing_value. Metrics not present in the weights dict - are ignored (not included in the weighted sum). - If empty dict in weighted mode, all present metrics get equal weight 1.0. - minimize: Frozenset of metric names where lower is better (users can pass set; auto-converted). - These are negated internally before comparison ("higher-is-better" normalization). - missing_value: Score assigned to missing metrics in a candidate's score dict. - Default: float('-inf') (effectively disqualifies candidates missing required metrics). - pareto_metrics: Subset of metrics to use for Pareto dominance. - If None, all metrics present across candidates are used. - tie_break: Strategy for breaking ties among Pareto-equivalent candidates. - - "weighted": Fall back to weighted scalarization among tied candidates. - - "lexicographic": Sort by metrics in alphabetical order. - - "random_seeded": Seeded random shuffle. - seed: Random seed for deterministic tie-breaking. 
- """ - mode: Literal["scalar", "weighted", "pareto"] = "scalar" - weights: Dict[str, float] = field(default_factory=dict) - minimize: frozenset = field(default_factory=frozenset) - missing_value: float = float("-inf") - pareto_metrics: Optional[Tuple[str, ...]] = None - tie_break: Literal["weighted", "lexicographic", "random_seeded"] = "weighted" - seed: int = 0 - - def __post_init__(self): - # Convert set → frozenset for true immutability + hashability - if isinstance(self.minimize, set): - object.__setattr__(self, 'minimize', frozenset(self.minimize)) - # Validate weights are non-negative - for k, v in self.weights.items(): - if v < 0: - raise ValueError(f"Weight for '{k}' must be non-negative, got {v}") - # Validate pareto_metrics - if self.pareto_metrics is not None and len(self.pareto_metrics) == 0: - raise ValueError("pareto_metrics must be None (auto) or non-empty tuple") -``` - -**Validation rules (enforced in `__post_init__`):** -- `minimize` is stored as `frozenset` for true immutability (users can pass `set` for convenience; it's auto-converted). -- `mode="weighted"` with empty `weights` → auto-assign equal weight 1.0 to all encountered metrics. -- `mode="pareto"` with `pareto_metrics=None` → use union of all metric keys across candidates. -- `mode="pareto"` with `pareto_metrics=()` → `ValueError`. -- All weight values must be non-negative. -- `minimize` metric names must be valid strings (warning if not found in any candidate). - -### 5.3 Guide helper method - -```python -# Added to Guide base class (non-breaking) - -class Guide: - # ... existing methods unchanged ... - - def get_score_dict(self, query: str, response: str, reference=None, **kwargs) -> Dict[str, float]: - """Return evaluation score as a dict (multi-objective selection path). - - Default implementation wraps the scalar training score from get_feedback() as: - {"score": float_value} - - Guides that need multiple metrics should override *get_score_dict()* and return - e.g. 
{"accuracy": 0.9, "brevity": 0.8, "latency_s": 0.05}. - - Note: get_feedback() should remain scalar (float) for training-loop backward - compatibility. If a subclass returns a dict from get_feedback(), metric() and - scalar evaluators may break; prefer overriding get_score_dict(). - """ - score, _ = self.get_feedback(query, response, reference, **kwargs) - if isinstance(score, dict): - return {k: float(v) for k, v in score.items()} - return {"score": float(score)} -``` - -**Why this approach:** -- `get_score_dict()` is a new method — zero risk of breaking existing subclasses. -- `metric()` always returns `float` — the existing `evaluate()` function (which calls `guide.metric()` and passes results to `np.array()`) and the training loop (which calls `np.mean(scores)`) are completely unaffected. -- Dict scores are only accessible via `get_score_dict()` → `evaluate_vector()`, keeping the two data paths cleanly separated. - -### 5.4 Evaluator additions - -```python -# Added to opto/trainer/evaluators.py - -def evaluate_vector(agent, guide, inputs, infos, min_score=None, - num_samples=1, num_threads=None, description=None - ) -> list: - """Like evaluate(), but returns List[ScoreLike] (float or dict per example). - - Uses guide.get_score_dict() to obtain dict scores per example. - When guide returns scalar, get_score_dict() wraps it as {"score": float}. - - When num_samples > 1: for each example, collects num_samples score dicts, - computes per-key mean across the samples, and returns one aggregated dict - per example. Final output is always List[Dict[str, float]] of length N. - """ - ... - -def aggregate_vector_scores(scores: list) -> Union[float, Dict[str, float]]: - """Aggregate per-example scores into a single summary score. - - - If all scores are float: returns np.mean (existing behavior). - - If all scores are dict: returns per-metric mean dict. - - Mixed float/dict: normalizes all to dict via to_score_dict(), then averages. 
-
-    Args:
-        scores: List of float or Dict[str, float] values.
-
-    Returns:
-        float (if all scalar) or Dict[str, float] (if any dicts present).
-    """
-    ...
-```
-
-### 5.5 objectives.py — complete function signatures
-
-```python
-# opto/trainer/objectives.py (NEW FILE)
-
-from typing import Any, Union, Dict, List, Set, Optional, Tuple, Literal
-from dataclasses import dataclass, field
-
-# --- ObjectiveConfig defined here (see §5.2) ---
-
-# --- Score type aliases ---
-ScalarScore = float
-VectorScore = Dict[str, float]
-ScoreLike = Union[float, Dict[str, float]]
-
-# --- Pure utility functions ---
-
-def to_score_dict(score: ScoreLike) -> Dict[str, float]:
-    """Convert any score to dict form (neutral name).
-
-    - int/float/bool → {"score": float(value)}
-    - Dict[str, float] → returned as-is (validated: all values finite)
-
-    Handles int (LLMJudge returns 0/1) and bool (test guides) via isinstance(score, (int, float, bool)).
-    Backward-compatible alias: `normalize_score = to_score_dict`
-
-    Raises:
-        TypeError: if score is not int, float, bool, or dict
-        ValueError: if dict contains non-finite values or is empty
-    """
-    ...
-
-def apply_minimize(score_dict: Dict[str, float],
-                   minimize: Set[str]) -> Dict[str, float]:
-    """Negate values for minimize metrics (higher-is-better normalization).
-
-    Returns a new dict with minimize metrics negated.
-    Metrics not in minimize set are unchanged.
-    """
-    ...
-
-def weighted_scalarize(score_dict: Dict[str, float],
-                       weights: Dict[str, float],
-                       missing_value: float = float("-inf")) -> float:
-    """Compute weighted sum of score dict.
-
-    For each metric in weights:
-    - If present in score_dict: weight * value
-    - If missing: weight * missing_value
-
-    Metrics in score_dict but NOT in weights are ignored.
-    If weights is empty, all metrics get equal weight 1.0.
-
-    Returns:
-        Weighted scalar score.
-    """
-    ...
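-# Worked example (illustrative), chaining the two helpers above with
-# minimize={"latency_ms"} and weights={"accuracy": 0.7, "latency_ms": 0.3}:
-#   raw   = {"accuracy": 0.8, "latency_ms": 120.0}
-#   norm  = apply_minimize(raw, {"latency_ms"})  # {"accuracy": 0.8, "latency_ms": -120.0}
-#   value = weighted_scalarize(norm, weights)    # 0.7*0.8 + 0.3*(-120.0) = -35.44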
-
-def dominates(a: Dict[str, float], b: Dict[str, float],
-              metrics: Optional[Tuple[str, ...]] = None) -> bool:
-    """Check if candidate 'a' Pareto-dominates candidate 'b'.
-
-    a dominates b iff:
-    - a[m] >= b[m] for all metrics m, AND
-    - a[m] > b[m] for at least one metric m
-
-    Both dicts must already be in "higher-is-better" form (post apply_minimize).
-    Missing metrics are treated as missing_value (caller should handle before call).
-
-    Args:
-        a, b: Score dicts (higher-is-better normalized).
-        metrics: Subset of metrics to compare. If None, use union of keys.
-    """
-    ...
-
-def pareto_rank(candidates: List[Dict[str, float]],
-                metrics: Optional[Tuple[str, ...]] = None) -> List[int]:
-    """Assign Pareto rank to each candidate (0 = non-dominated front).
-
-    Uses standard non-dominated sorting.
-
-    Args:
-        candidates: List of score dicts (higher-is-better normalized).
-        metrics: Subset of metrics for dominance. If None, use all present.
-
-    Returns:
-        List of integer ranks (same length as candidates). Rank 0 = Pareto front.
-    """
-    ...
-
-def select_best(candidates: List[Tuple[ScoreLike, Any]],
-                objective_config: Optional['ObjectiveConfig'] = None) -> int:
-    """Select the single best candidate index.
-
-    Args:
-        candidates: List of (score, payload) tuples.
-        objective_config: Selection config. If None, uses scalar max (backward-compatible).
-
-    Returns:
-        Index of best candidate.
-
-    Behavior by mode:
-    - scalar/None: max(score) where score is float (or mean of dict values).
-    - weighted: max(weighted_scalarize(normalize(score), config.weights)).
-    - pareto: rank candidates, tie-break among rank-0 front, return winner.
-
-    Call-site transformation (BasicSearch):
-        # Current:
-        best_score, best_update = max(candidates, key=lambda x: x[0])
-        # Target:
-        best_idx = select_best(candidates, objective_config)
-        best_score, best_update = candidates[best_idx]
-    """
-    ...
-
-def select_top_k(candidates: List[Tuple[ScoreLike, Any]],
-                 objective_config: Optional['ObjectiveConfig'] = None,
-                 k: int = 1) -> List[int]:
-    """Select the top-k candidate indices.
-
-    Same logic as select_best, but returns k indices.
-
-    For pareto mode: returns rank-0 front (up to k). If front < k,
-    includes rank-1 candidates by tie-break order, etc.
-
-    Deterministic ordering guaranteed with fixed seed.
-    """
-    ...
-```
-
----
-
-## 6. Module Modifications
-
-### 6.1 Files to CREATE
-
-| File | Contents | Milestone |
-|------|----------|-----------|
-| `opto/trainer/objectives.py` | `ObjectiveConfig`, `to_score_dict`, `apply_minimize`, `weighted_scalarize`, `dominates`, `pareto_rank`, `select_best`, `select_top_k`, `score_dict_to_scalar`, `to_scalar_score`, `aggregate_score_dicts` | M1 |
-| `tests/test_objectives.py` | Unit tests for all functions in objectives.py | M1 |
-| `tests/test_evaluators_vector.py` | Tests for evaluate_vector + aggregate_vector_scores | M1 |
-| `tests/test_trainers_multiobjective.py` | Integration tests for BasicSearch + Beamsearch with ObjectiveConfig | M2 |
-| `examples/notebooks/t6_m0_analysis.ipynb` | M0 analysis notebook | M0 |
-| `examples/notebooks/t6_m1_vector_scores.ipynb` | M1 demo notebook | M1 |
-| `examples/notebooks/t6_m2_trainers.ipynb` | M2 demo notebook | M2 |
-| `examples/notebooks/t6_m3_benchmarks.ipynb` | M3 benchmark notebook | M3 |
-| `docs/T6_technical_plan.md` | This document | M0 |
-| `docs/multi_objective_scores.md` | User-facing documentation | M4 |
-
-### 6.2 Files to MODIFY
-
-| File | Change | Milestone |
-|------|--------|-----------|
-| `opto/trainer/guide.py` | Add `get_score_dict()` method to `Guide` base class. Keep training loop scalar-safe (`metric()` returns `float`). Dict/vector scores are accessed via `get_score_dict()` for trainer-side selection. | M1 |
-| `opto/trainer/evaluators.py` | Add `evaluate_vector()` and `aggregate_vector_scores()`. Existing `evaluate()` unchanged.
| M1 | -| `opto/trainer/algorithms/basic_algorithms.py` | Add `objective_config` param to `BasicSearchAlgorithm.train()`. Replace `max(candidates, ...)` with `select_best()` in `optimizer_step()`. | M1 (minimal) / M2 (robust) | -| `opto/trainer/algorithms/beamsearch_algorithm.py` | Add `objective_config` param to `BeamsearchAlgorithm.train()`. Replace scalar sort in `select()` with `select_top_k()`. | M2 | -| `opto/features/priority_search/priority_search.py` | (Optional) Add `objective_config` param. Scalarize heap key via weighted mode. Store dict for logging. Pareto falls back to weighted. | M2 | - -### 6.3 Files NOT modified - -- `opto/trace/` — no changes to trace primitives. -- `opto/optimizers/` — optimizers are upstream of selection; they produce candidates, not rank them. -- Existing tests — no modifications; they validate backward compatibility by continuing to pass. - ---- - -## 7. Edge Cases & Defensive Design - -### 7.1 Score validation - -| Case | Behavior | -|------|----------| -| `score = 0.85` (float) | `to_score_dict()` → `{"score": 0.85}` | -| `score = 1` (int) | `to_score_dict()` → `{"score": 1.0}` (LLMJudge returns int 0/1) | -| `score = True` (bool) | `to_score_dict()` → `{"score": 1.0}` (test guides return bool) | -| `score = {"accuracy": 0.9, "latency_ms": 120.0}` | Returned as-is after validation | -| `score = {}` (empty dict) | `ValueError("Score dict must not be empty")` | -| `score = {"accuracy": float('nan')}` | `ValueError("Score dict contains non-finite value")` | -| `score = {"accuracy": float('inf')}` | `ValueError("Score dict contains non-finite value")` | -| `score = "text"` (wrong type) | `TypeError("Score must be int, float, bool, or Dict[str, float]")` | - -### 7.2 Missing metrics across candidates - -| Case | Behavior | -|------|----------| -| Candidate A has `{accuracy, latency}`, B has `{accuracy}` | B gets `latency = missing_value` (default `-inf`) | -| `weights = {"accuracy": 0.7, "latency": 0.3}`, candidate missing 
`latency` | Weighted sum uses `0.3 * missing_value` |
-| All candidates missing a weighted metric | Warning logged; metric still contributes `weight * missing_value` |
-
-### 7.3 Mixed scalar/dict batches
-
-| Case | Behavior |
-|------|----------|
-| All scores are `float` (or `int`/`bool`) | `aggregate_vector_scores()` returns `float` via `np.mean()` (existing behavior) |
-| All scores are `dict` with same keys | `aggregate_vector_scores()` returns per-metric mean `Dict[str, float]` |
-| Mixed `float` and `dict` in same batch | `ValueError("All scores in a batch must be the same type (all float or all dict)")` |
-
-A mixed batch most likely indicates a bug in the guide implementation (e.g., returning `float` on some inputs and `dict` on others). Failing loudly prevents silent incorrect aggregation.
-
-### 7.4 Single-metric dict
-
-| Case | Behavior |
-|------|----------|
-| Guide returns `{"accuracy": 0.9}` with `mode="weighted"` | Weighted sum = `weight * 0.9` (trivially correct) |
-| Guide returns `{"accuracy": 0.9}` with `mode="pareto"` | Pareto degenerates to scalar max (single dimension — no tradeoffs). Warning logged. |
-
-### 7.5 Tie-breaking
-
-| Case | Behavior |
-|------|----------|
-| Two candidates with identical weighted score | Deterministic: lower original index wins (stable sort) |
-| Pareto front with 3 equivalent candidates, `tie_break="weighted"` | Fall back to weighted scalarization among the 3; select max |
-| Pareto front with 3 equivalent candidates, `tie_break="lexicographic"` | Sort by metric names alphabetically, compare values in order |
-| Pareto front with 3 equivalent candidates, `tie_break="random_seeded"` | Seeded shuffle with `config.seed`; same seed → same order always |
-
-### 7.6 Training loop safety
-
-The training loop has a **separate data path** from evaluation/selection. In `standard_optimization_step()` (basic_algorithms.py:46) and `standard_forward()` (sampler.py:130):
-
-```python
-score, feedback = guide(x, target.data, info)
-```
-
-This `score` flows into `MinibatchAlgorithm.update()` where `np.mean(scores)` is computed (basic_algorithms.py:511). **This path must always receive `float`.**
-
-| Constraint | Enforcement |
-|-----------|-------------|
-| `guide.__call__()` / `get_feedback()` return type is **NOT widened** | No changes to `get_feedback()` signature; it still returns `Tuple[float, str]` |
-| Training loop always receives scalar `score` | `metric()` always returns `float`. Vector/dict scores are not used by the training loop and are accessed via `get_score_dict()` for trainer-side selection. |
-| Dict scores flow through a separate path | `get_score_dict()` → `evaluate_vector()` → `select_best()` / `select_top_k()` |
-| A multi-objective guide must return `(float, str)` from `get_feedback()` for the training loop | The float is a collapsed scalar summary; the full dict is extracted via `get_score_dict()` during selection |
-
-**Two data paths (by design):**
-```
-Training loop:  guide() → score (float) → np.mean(scores)            ← UNCHANGED
-Selection path: get_score_dict() → evaluate_vector() → objectives.py ← NEW
-```
-
-### 7.7 ObjectiveConfig validation
-
-| Case | Behavior |
-|------|----------|
-| `mode="weighted"`, `weights={}` | Auto-assign equal weight 1.0 to all metrics encountered at selection time |
-| `mode="pareto"`, `pareto_metrics=()` (empty tuple) | `ValueError("pareto_metrics must be None (auto) or non-empty tuple")` |
-| `weights={"accuracy": -0.5}` (negative weight) | `ValueError("All weights must be non-negative")` |
-| `minimize={"unknown_metric"}` | Warning logged at selection time if metric never appears; no error (tolerant) |
-
----
-
-## 8.
Milestones & Validation Gates - -### Milestone 0 — Analysis + technical plan + interface spec - -**Deliverables:** -- `docs/T6_technical_plan.md` — this document, finalized -- `examples/notebooks/t6_m0_analysis.ipynb` — Colab-ready notebook - -**Notebook demonstrates:** -- Current Guide score contract (`get_feedback` → `Tuple[float, str]`, `metric` → `float`) -- Where scalar selection happens in BasicSearch (`max(candidates, ...)`) and Beamsearch (`sorted(...)[:k]`) -- Planned behavior prototype: deterministic toy guide returning dict metrics, showing weighted vs Pareto selection on dummy candidates - -**SMART validation:** -- Plan includes final API signatures and precise file list (create/modify) ✓ -- Notebook runs without API keys ✓ -- Notebook prints: current score contract, selection touchpoints, planned selection outputs ✓ - ---- - -### Milestone 1 — ObjectiveConfig + utilities + evaluator support + BasicSearch minimal - -**Deliverables:** -- `opto/trainer/objectives.py` (new) -- `opto/trainer/guide.py` (add `get_score_dict`) -- `opto/trainer/evaluators.py` (add `evaluate_vector`, `aggregate_vector_scores`) -- `opto/trainer/algorithms/basic_algorithms.py` (BasicSearch: accept/use ObjectiveConfig) -- `tests/test_objectives.py`, `tests/test_evaluators_vector.py` -- `examples/notebooks/t6_m1_vector_scores.ipynb` - -**Notebook demonstrates:** -- StubLLM mode: BasicSearchAlgorithm on small candidate set (5-10) with deterministic dummy guide returning dict metrics -- Shows: (a) scalar baseline, (b) weighted mode, (c) Pareto mode, (d) deterministic tie-break under fixed seed -- Real LLM mode (required): tiny dataset (≤5 items) producing ≥2 metrics - -**SMART validation:** -- `pytest -q` passes (all new functions covered) -- Notebook runs in Colab: weighted selection result changes when weights change -- Pareto returns tradeoffs and is deterministic under fixed seed -- Scalar path produces identical results to pre-change behavior - ---- - -### Milestone 2 — Trainer 
upgrades (Beamsearch + robust BasicSearch) - -**Deliverables:** -- `opto/trainer/algorithms/beamsearch_algorithm.py` (accept ObjectiveConfig, vector selection) -- Expanded BasicSearch tests (edge cases, missing metrics, tie-break policies) -- Optional: minimal PrioritySearch support (weighted scalarization for heap, dict stored for logging) -- `tests/test_trainers_multiobjective.py` -- `examples/notebooks/t6_m2_trainers.ipynb` - -**Notebook demonstrates:** -- BasicSearch + Beamsearch in: scalar mode (baseline), weighted mode, Pareto mode -- StubLLM + real LLM sections - -**SMART validation:** -- `pytest -q` green -- Integration test confirms: weighted vs Pareto select different candidates where expected -- Scalar-only example produces same final best score when `objective_config=None` -- Deterministic tie-break is stable across runs - ---- - -### Milestone 3 — Benchmarks (Trace-Bench integration) - -**Deliverables:** -- PR to Trace-Bench: benchmark configs/tasks + notebook - - **Trace-Bench touchpoints (update `main` if default branch differs):** - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark.py - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/trainers_benchmark_tasks_validation.py - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/benchmark_tasks/index.json - - https://github.com/AgentOpt/Trace-Bench/tree/main/LLM4AD/benchmark_tasks - - https://github.com/AgentOpt/Trace-Bench/blob/main/LLM4AD/llm4ad_loader.py - - https://github.com/AgentOpt/Trace-Bench/blob/main/tests/test_lite_optimize_llm4ad.py -- 3 benchmarks: - 1. **Accuracy vs latency** (toy QA dataset) - 2. **Accuracy vs response length** (penalize verbosity) - 3. 
**Accuracy vs tool calls** (penalize excessive tool usage) -- Trace-Bench notebook: `notebooks/t6_multiobjective_benchmarks.ipynb` (in Trace-Bench repo) - -**SMART validation:** -- Notebook outputs per-benchmark table: weighted-mode best candidate metrics + Pareto-mode set of tradeoffs -- Benchmarks run in StubLLM mode (fast/deterministic) and real LLM mode (small sample) -- Trace-Bench run completes without private datasets -- `pytest -q` green (smoke tests for benchmark integration) - ---- - -### Milestone 4 — Documentation + polished notebooks - -**Deliverables:** -- `docs/multi_objective_scores.md` — user-facing documentation -- README update with pointers to docs and notebooks -- Polished "How-to" notebook: installs from GitHub, runs BasicSearch weighted + Pareto, prints metric tradeoffs - -**SMART validation:** -- Fresh Colab runtime runs how-to notebook without manual patching -- CI green, no behavioral changes beyond documentation/polish - ---- - -## 9. Test Plan - -### 9.1 Unit tests — `tests/test_objectives.py` (M1) - -| Test | Validates | -|------|-----------| -| `test_to_score_dict_from_float` | `0.85` → `{"score": 0.85}` | -| `test_to_score_dict_from_dict` | `{"a": 1.0, "b": 2.0}` → same dict | -| `test_to_score_dict_empty_dict_raises` | `{}` → `ValueError` | -| `test_to_score_dict_nan_raises` | `{"a": float('nan')}` → `ValueError` | -| `test_to_score_dict_wrong_type_raises` | `"text"` → `TypeError` | -| `test_apply_minimize` | `{"acc": 0.9, "lat": 100}` with `minimize={"lat"}` → `{"acc": 0.9, "lat": -100}` | -| `test_apply_minimize_empty_set` | No metrics negated | -| `test_weighted_scalarize_basic` | `{"a": 0.8, "b": 0.2}` with `weights={"a": 0.7, "b": 0.3}` → `0.7*0.8 + 0.3*0.2` | -| `test_weighted_scalarize_missing_metric` | Missing metric uses `missing_value` | -| `test_weighted_scalarize_empty_weights` | Equal weight 1.0 for all metrics | -| `test_dominates_true` | A dominates B (all ≥, at least one >) | -| `test_dominates_false_equal` | A == B → 
does not dominate | -| `test_dominates_false_tradeoff` | A better on one, B better on another | -| `test_pareto_rank_simple` | 3 candidates with clear rank 0, 1, 2 | -| `test_pareto_rank_all_nondominated` | All candidates rank 0 | -| `test_select_best_scalar_mode` | Falls back to scalar max | -| `test_select_best_weighted_mode` | Returns highest weighted score | -| `test_select_best_pareto_mode` | Returns Pareto-optimal by tie-break | -| `test_select_best_none_config` | `objective_config=None` → scalar max (backward compat) | -| `test_select_top_k_weighted` | Returns k highest weighted scores | -| `test_select_top_k_pareto` | Returns k from Pareto front + spillover | -| `test_deterministic_tie_break_seeded` | Same seed → same result across 100 runs | -| `test_deterministic_tie_break_different_seeds` | Different seeds → potentially different result | - -### 9.2 Unit tests — `tests/test_evaluators_vector.py` (M1) - -| Test | Validates | -|------|-----------| -| `test_aggregate_vector_scores_all_scalar` | `[0.8, 0.9, 0.7]` → `np.mean` (backward compat) | -| `test_aggregate_vector_scores_all_dict` | Per-metric mean computed correctly | -| `test_aggregate_vector_scores_mixed` | Scalars normalized to dict, then averaged | -| `test_evaluate_vector_returns_correct_types` | Returns list of ScoreLike matching guide output | - -### 9.3 Integration tests — `tests/test_trainers_multiobjective.py` (M2) - -| Test | Validates | -|------|-----------| -| `test_basicsearch_scalar_unchanged` | Default behavior identical to pre-change | -| `test_basicsearch_weighted_selects_expected` | Weighted mode picks correct candidate | -| `test_basicsearch_pareto_selects_expected` | Pareto mode picks different candidate than weighted | -| `test_beamsearch_scalar_unchanged` | Default behavior identical | -| `test_beamsearch_weighted_selects_top_k` | Weighted mode picks correct top-k | -| `test_beamsearch_pareto_selects_front` | Pareto mode returns non-dominated front | -| 
`test_deterministic_across_runs` | Fixed seed → same selections in 5 repeated runs | - -### 9.4 Notebook validation (human / Trace team) - -Each notebook contains: -- **StubLLM (no keys) section:** deterministic dummy guide, runs quickly -- **Real LLM section (required):** small N (5-20 examples), prints cost/latency caveats, requires API key - ---- - -## 10. Risks & Mitigation - -| Risk | Severity | Mitigation | -|------|----------|------------| -| **R1: Missing metrics across candidates** | Medium | `missing_value` in ObjectiveConfig (default `-inf`). Enforce metric presence for configured weights (or warn). | -| **R2: Pareto nondeterminism** | High | Deterministic ordering via stable sort + explicit tie-break rules. Seeded randomness only when requested. | -| **R3: Multi-thread eval ordering** | Medium | Tests run with `num_threads=1` to guarantee stability. Document thread-safety considerations. | -| **R4: Breaking Guide subclasses** | High | Use `get_score_dict()` helper — never change `get_feedback()` signature. Union type on `metric()` is safe because existing callers only receive floats. | -| **R5: Performance regression** | Low | `objectives.py` functions are O(n²) for Pareto ranking on n candidates, but n is typically ≤20 (num_proposals). No concern at this scale. | -| **R6: Mixed scalar/dict in same batch** | Medium | `aggregate_vector_scores()` rejects mixed batches with `ValueError`. A mixed batch indicates a bug in the guide. | -| **R7: Training loop receives dict score** | High | `guide.__call__()` / `get_feedback()` return type is NOT widened. `metric()` always returns `float`. Dict scores only flow through `get_score_dict()` → `evaluate_vector()`. See §7.7. | - ---- - -## 11. Design Decisions (Resolved) - -> **Post-review update (Ching-An, Feb 2026):** All dict→scalar reduction is now controlled by `ObjectiveConfig.scalarize_dict` (values: `"score"`, `"mean"`, `"weighted"`). Guide produces raw metrics only. 
`normalize_score` renamed to `to_score_dict` (neutral name; backward-compat alias kept). `aggregate_score_dicts()` moved from evaluators to objectives.py (Objective-side policy). Dict scores with `config=None` now raise `ValueError` (no hidden hard-coded reduction). - -### D1: Where to implement scalar→dict normalization? - -**Decision: Option A — `Guide.get_score_dict()` helper + `objectives.to_score_dict()`** - -- `get_score_dict()` on Guide provides a clean entry point for subclasses. -- `to_score_dict()` in objectives.py is the canonical utility (pure function, testable). Renamed from `normalize_score` per Ching-An's review (neutral name; backward-compat alias kept). -- All dict→scalar reduction is controlled by `ObjectiveConfig` (via `scalarize_dict` field). No hidden hard-coded defaults. -- Avoids widening `get_feedback()` return type (higher churn, breaks typing). - -### D2: Pareto selection definition - -**Decision: Option A — Standard dominance on aggregated metrics, return single best by tie-break.** - -- `select_best()` returns one winner. `select_top_k()` returns k winners. -- Trainers don't need to manage a "front" — they just get indices. -- Beamsearch naturally uses `select_top_k(k=beam_width)`. - -### D3: PrioritySearch scope - -**Decision: Minimal (in-scope).** - -- Scalarize heap priority via `weighted_scalarize()`. -- Store full `score_dict` on each candidate for logging. -- `mode="pareto"` falls back to weighted with documented warning. -- Pareto archive is out-of-scope for v1. - ---- - -## 12. 
Appendix: Code Touchpoints - -### OpenTrace / experimental - -| File | URL | -|------|-----| -| Guide base | [guide.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/guide.py) | -| Evaluators | [evaluators.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/evaluators.py) | -| BasicSearch | [basic_algorithms.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/basic_algorithms.py) | -| Beamsearch | [beamsearch_algorithm.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/trainer/algorithms/beamsearch_algorithm.py) | -| PrioritySearch | [priority_search.py](https://github.com/AgentOpt/OpenTrace/blob/experimental/opto/features/priority_search/priority_search.py) | - -### Trace-Bench - -| File | URL | -|------|-----| -| Repo | [Trace-Bench](https://github.com/AgentOpt/Trace-Bench) | - -### Selection logic summary (current → target) - -| Trainer | Current Code | Target Code | -|---------|-------------|-------------| -| BasicSearch | `max(candidates, key=lambda x: x[0])` | `select_best(candidates, objective_config)` | -| Beamsearch | `sorted(candidates, key=lambda x: x[0], reverse=True)[:k]` | `select_top_k(candidates, objective_config, k)` | -| PrioritySearch | scalar heap key | `weighted_scalarize(score_dict, config)` for heap key | diff --git a/examples/notebooks/t6_m0_analysis.ipynb b/docs/dev/multi_objective_design_exploration.ipynb similarity index 97% rename from examples/notebooks/t6_m0_analysis.ipynb rename to docs/dev/multi_objective_design_exploration.ipynb index 2549d76a..50c92a70 100644 --- a/examples/notebooks/t6_m0_analysis.ipynb +++ b/docs/dev/multi_objective_design_exploration.ipynb @@ -35,7 +35,7 @@ "cell_type": "markdown", "id": "b1a58d26", "metadata": {}, - "source": "# T6 Multi-Objective Vector Scores — M0 Analysis\n\n[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m0_analysis.ipynb)\n\n**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n\nThis notebook demonstrates:\n1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n3. **Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n\n**Motivation (why score-as-dict):** adding extra metrics into the *feedback dict/text* can help optimizers (OptoPrime/OPRO), but trainers typically only use the scalar score for ranking/UCB and ignore additional feedback structure. To enable Pareto/weighted multi-objective selection at the trainer level, we need vector score (score-as-dict) with backward-compatible scalar reduction.\n\n**No API keys required for M0.** All examples use deterministic dummy data. (From M1 onward, milestone notebooks must validate both StubLLM and real LLM modes.)\n\n---" + "source": "# T6 Multi-Objective Vector Scores — M0 Analysis\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/docs/dev/multi_objective_design_exploration.ipynb)\n\n**Milestone 0 Deliverable** — Analysis + Technical Plan + Interface Spec\n\nThis notebook demonstrates:\n1. **Current baseline**: How Guide returns scalar scores, how evaluators aggregate, where selection happens\n2. **Exact touchpoints**: The specific lines of code in BasicSearch and Beamsearch that perform scalar selection\n3. 
**Planned behavior**: A deterministic prototype showing weighted vs Pareto selection on toy candidates\n\n**Motivation (why score-as-dict):** adding extra metrics into the *feedback dict/text* can help optimizers (OptoPrime/OPRO), but trainers typically only use the scalar score for ranking/UCB and ignore additional feedback structure. To enable Pareto/weighted multi-objective selection at the trainer level, we need vector score (score-as-dict) with backward-compatible scalar reduction.\n\n**No API keys required for M0.** All examples use deterministic dummy data. (From M1 onward, milestone notebooks must validate both StubLLM and real LLM modes.)\n\n---" }, { "cell_type": "markdown", @@ -932,4 +932,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/docs/multi_objective_scores.md b/docs/multi_objective_scores.md new file mode 100644 index 00000000..3128f14f --- /dev/null +++ b/docs/multi_objective_scores.md @@ -0,0 +1,423 @@ +# Multi-Objective Vector Scores + +## Why multi-objective optimization? + +Standard single-objective optimization collapses all concerns into one scalar +score. This works when you care about exactly one metric, but real tasks have +competing concerns: accuracy vs. API cost, quality vs. latency, base loss vs. +regularization. A single number hides these trade-offs — you can't tell whether +a candidate is cheap-but-wrong or correct-but-expensive. + +Multi-objective mode makes each metric explicit. Instead of `score = 0.85`, you +get `{"accuracy": 0.95, "tokens_out": 120, "latency_s": 0.3}`. The trainer +then uses weighted scalarization or Pareto ranking to select candidates, giving +you visibility into trade-offs and the ability to re-prioritize without +retraining. + +**Best use cases:** 2-4 competing metrics, minimizing API cost while +maintaining quality, understanding accuracy-vs-speed trade-offs, regularized +optimization problems. 
+
+**What you gain:** explicit per-metric tracking, Pareto frontier exploration,
+tunable weight priorities, token-efficient candidate selection.
+
+**Jump to:**
+[Switching to multi-objective](#switching-from-scalar-to-multi-objective) |
+[ObjectiveConfig reference](#objectiveconfig-reference) |
+[Token minimization](#adding-token-minimization) |
+[Canonical demos](#canonical-demos) |
+[Data flow](#data-flow) |
+[Running in Trace-Bench](#running-in-trace-bench)
+
+---
+
+## Overview
+
+By default, OpenTrace guides return a **scalar** score (a single float).
+Multi-objective mode extends this to **vector scores** — a `Dict[str, float]`
+where each key is a named metric (e.g. `accuracy`, `tokens_out`, `base_loss`).
+
+The training loop evaluates candidates on all metrics simultaneously, then uses
+an `ObjectiveConfig` to decide which candidate is best — either by weighted
+scalarization or by Pareto dominance ranking.
+
+### When to use multi-objective
+
+| Scenario | Recommendation |
+|---|---|
+| Single quality metric (accuracy, loss) | Scalar mode — no changes needed |
+| Quality + cost (accuracy + token usage) | Multi-objective weighted mode |
+| Multiple competing losses (base_loss + reg_loss) | Multi-objective weighted or pareto |
+| Exploring trade-off frontiers | Pareto mode |
+
+### What works well
+
+- **BasicSearchAlgorithm** — full multi-objective support via `select_best()`.
+- **BeamsearchAlgorithm** — full support via `select_top_k()` for beam ranking.
+- **BeamsearchHistoryAlgorithm** — full support (inherits from Beamsearch).
+- **PrioritySearch** — supported through the Trace-Bench runner.
+
+### Current limitations
+
+- **UCBSearch** does not support multi-objective selection. It uses its own
+  internal scoring and ignores `ObjectiveConfig`.
+- Pareto ranking with many metrics (>4) becomes expensive. Weight-based
+  scalarization is more efficient when relative metric importance is known.
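The cost note above comes from the dominance check, which compares every candidate against every other one. A minimal standalone sketch of non-dominated sorting (a hypothetical re-implementation for illustration; the actual `dominates()` and `pareto_rank()` helpers live in `opto/trainer/objectives.py` and additionally handle `minimize` negation and tie-breaking):

```python
from typing import Dict, List, Sequence

def dominates(a: Dict[str, float], b: Dict[str, float], metrics: Sequence[str]) -> bool:
    """a dominates b iff a >= b on every metric and a > b on at least one."""
    return all(a[m] >= b[m] for m in metrics) and any(a[m] > b[m] for m in metrics)

def pareto_rank(candidates: List[Dict[str, float]], metrics: Sequence[str]) -> List[int]:
    """Rank 0 = Pareto front; each pass costs O(n^2) dominance checks."""
    ranks: List[int] = [-1] * len(candidates)
    remaining, rank = list(range(len(candidates))), 0
    while remaining:
        # A candidate is on the current front if nothing remaining dominates it.
        front = [i for i in remaining
                 if not any(dominates(candidates[j], candidates[i], metrics)
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining = [i for i in remaining if i not in front]
        rank += 1
    return ranks

scores = [
    {"accuracy": 0.9, "speed": 0.2},  # trade-off: accurate but slow
    {"accuracy": 0.6, "speed": 0.9},  # trade-off: fast but less accurate
    {"accuracy": 0.5, "speed": 0.1},  # dominated by the first candidate
]
print(pareto_rank(scores, ("accuracy", "speed")))  # → [0, 0, 1]
```

Each pass peels off one front, so the worst case (a total order of candidates) is cubic; with the small candidate counts trainers use (typically `num_proposals` ≤ 20) this is negligible, but it grows quickly with many metrics and candidates.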
+ +--- + +## Switching from scalar to multi-objective + +### Step 1 — Return a score dict from your Guide + +Override `get_score_dict()` in your `Guide` subclass to return a dict instead +of relying on the default scalar wrapper: + +```python +from opto.trainer.guide import Guide + +class MyGuide(Guide): + def get_feedback(self, query, response, reference=None, **kwargs): + # ... compute score and feedback ... + return score, feedback + + def get_score_dict(self, query, response, reference=None, **kwargs): + # Return multiple named metrics + accuracy = 1.0 if response.strip() == reference.strip() else 0.0 + length_penalty = len(response) / 1000.0 + return {"accuracy": accuracy, "length": length_penalty} +``` + +The base `Guide.get_score_dict()` wraps the scalar from `get_feedback()` as +`{"score": float_value}`. Override it to return your own metric names. + +### Step 2 — Create an ObjectiveConfig + +```python +from opto.trainer.objectives import ObjectiveConfig + +config = ObjectiveConfig( + mode="weighted", # or "pareto" + weights={"accuracy": 1.0, "length": 0.5}, # relative importance + minimize=frozenset({"length"}), # lower is better +) +``` + +### Step 3 — Pass it to the trainer + +```python +from opto.trainer.algorithms.basic_algorithms import BasicSearchAlgorithm + +trainer = BasicSearchAlgorithm(agent, optimizer) +trainer.train( + guide, + train_dataset, + objective_config=config, + num_proposals=4, + num_epochs=3, +) +``` + +The trainer will call `guide.get_score_dict()` via `evaluate_vector()`, aggregate +per-metric means via `aggregate_score_dicts()`, and use `select_best()` (or +`select_top_k()` for beam search) with your config to pick the winning candidate. 
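Conceptually, weighted selection negates the `minimize` metrics so that higher is always better, scalarizes each candidate's metric dict, and takes an argmax. A simplified standalone sketch of that behavior (illustrative only; the real `apply_minimize()`, `weighted_scalarize()`, and `select_best()` in `opto/trainer/objectives.py` also handle `missing_value` and tie-break policies):

```python
from typing import Dict, FrozenSet, List

def apply_minimize(score: Dict[str, float], minimize: FrozenSet[str]) -> Dict[str, float]:
    # Negate "lower is better" metrics so every metric is maximized.
    return {k: -v if k in minimize else v for k, v in score.items()}

def weighted_scalarize(score: Dict[str, float], weights: Dict[str, float]) -> float:
    # Empty weights = equal weight 1.0 for all metrics; metrics not in
    # weights are ignored; missing metrics fall back to -inf.
    w = weights or {k: 1.0 for k in score}
    return sum(wv * score.get(k, float("-inf")) for k, wv in w.items())

def select_best_weighted(scores: List[Dict[str, float]],
                         weights: Dict[str, float],
                         minimize: FrozenSet[str]) -> int:
    adjusted = [apply_minimize(s, minimize) for s in scores]
    return max(range(len(scores)), key=lambda i: weighted_scalarize(adjusted[i], weights))

candidates = [
    {"accuracy": 1.0, "length": 0.9},
    {"accuracy": 1.0, "length": 0.2},  # equally accurate but shorter
]
best = select_best_weighted(candidates, {"accuracy": 1.0, "length": 0.5}, frozenset({"length"}))
print(best)  # → 1 (ties on accuracy, wins on the penalized length metric)
```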
+
+---
+
+## ObjectiveConfig reference
+
+```python
+from dataclasses import dataclass, field
+from typing import Dict, Optional, Tuple
+
+@dataclass(frozen=True)
+class ObjectiveConfig:
+    mode: str = "scalar"                  # "scalar" | "weighted" | "pareto"
+    weights: Dict[str, float] = field(default_factory=dict)  # per-metric weights (empty = equal weight 1.0)
+    minimize: frozenset = frozenset()     # metric names where lower is better
+    missing_value: float = float("-inf")  # fallback for missing metrics
+    pareto_metrics: Optional[Tuple[str, ...]] = None  # subset for Pareto dominance (None = all)
+    tie_break: str = "weighted"           # "weighted" | "lexicographic" | "random_seeded"
+    seed: int = 0                         # for deterministic tie-breaking
+    scalarize_dict: str = "score"         # "score" | "mean" | "weighted"
+    score_key: str = "score"              # key used when scalarize_dict="score"
+```
+
+### Mode: `"weighted"`
+
+Computes a weighted sum of all metrics (after negating those in `minimize`),
+then selects the candidate with the highest scalarized value.
+
+```python
+ObjectiveConfig(
+    mode="weighted",
+    weights={"accuracy": 1.0, "tokens_out": 1e-3},
+    minimize=frozenset({"tokens_out"}),
+)
+```
+
+Metrics not listed in `weights` are ignored. If `weights` is empty, all metrics
+get equal weight 1.0.
+
+### Mode: `"pareto"`
+
+Performs non-dominated sorting (Pareto ranking). Rank-0 candidates are on the
+Pareto front. If multiple candidates share rank 0, the `tie_break` strategy
+resolves the winner:
+
+- `"weighted"` — fall back to weighted scalarization among the front.
+- `"lexicographic"` — sort by metric name alphabetically, pick highest.
+- `"random_seeded"` — seeded random shuffle (deterministic).
+
+```python
+ObjectiveConfig(
+    mode="pareto",
+    weights={"accuracy": 1.0, "tokens_out": 1e-3},  # used for tie-break
+    minimize=frozenset({"tokens_out"}),
+    tie_break="weighted",
+    seed=42,
+)
+```
+
+### Mode: `"scalar"` (default)
+
+Backward-compatible. Treats scores as single floats. Dict scores are reduced
+via `scalarize_dict`:
+- `"score"` — extract `score_dict[score_key]` (default).
+- `"mean"` — `mean(score_dict.values())`.
+- `"weighted"` — `weighted_scalarize()`. + +--- + +## Adding token minimization + +The GSM8K demo shows how to add token-count metrics to any existing guide +without modifying it. The pattern uses two components: + +### UsageTrackingLLM + +A wrapper around any LLM that records token counts (input and output) using a +`contextvars.ContextVar`. It works transparently — the wrapped LLM behaves +identically, but token counts are captured per-call. + +```python +from trace_bench.examples.multiobjective_gsm8k import UsageTrackingLLM + +# Wrap your LLM +tracked_llm = UsageTrackingLLM(base_llm) +``` + +### TokenUsageAugmentingGuide + +A decorator guide that wraps an existing guide and appends `tokens_in` and +`tokens_out` to its score dict: + +```python +from trace_bench.examples.multiobjective_gsm8k import TokenUsageAugmentingGuide + +base_guide = MyGuide() +guide = TokenUsageAugmentingGuide(base_guide, tracked_llm) + +# guide.get_score_dict() now returns e.g.: +# {"accuracy": 1.0, "tokens_in": 350.0, "tokens_out": 120.0} +``` + +### Full configuration with token minimization + +```python +config = ObjectiveConfig( + mode="weighted", + weights={"error": 1.0, "tokens_in": 1e-3, "tokens_out": 1e-3}, + minimize=frozenset({"error", "tokens_in", "tokens_out"}), +) +``` + +The small weights on token metrics (1e-3) ensure that accuracy dominates the +selection, but among equally accurate candidates, the one using fewer tokens +wins. + +--- + +## Canonical demos + +Three reference implementations demonstrate multi-objective patterns. Each lives +in the Trace-Bench repository under `trace_bench/examples/` with companion +notebooks under `notebooks/`. + +### Convex (SixHumpCamel) + +**File:** `trace_bench/examples/multiobjective_convex.py` +**Notebook:** `notebooks/multiobjective_convex.ipynb` + +Optimizes a 2D input to minimize two losses independently: +- `base_loss` — the Six-Hump Camel function value. +- `reg_loss` — L2-squared regularization. 
+
+The `ConvexRewardGuide.get_score_dict()` returns both metrics. This is the
+simplest multi-objective example — no LLM, no external dependencies.
+
+### BBEH (boolean_expressions)
+
+**File:** `trace_bench/examples/multiobjective_bbeh.py`
+**Notebook:** `notebooks/multiobjective_bbeh.ipynb`
+
+Optimizes a code-generation agent on BIG-Bench Extra Hard boolean expression
+problems with two objectives:
+- `accuracy` — exact-match correctness (minimize error).
+- `execution_time_s` — wall-clock time for code execution (minimize).
+
+Uses PAL (Program-Aided Language) strategy: the agent writes Python code that
+is executed to extract the answer.
+
+### GSM8K + Token Usage
+
+**File:** `trace_bench/examples/multiobjective_gsm8k.py`
+**Notebook:** `notebooks/multiobjective_gsm8k.ipynb`
+
+Optimizes a math-solving agent on GSM8K with three objectives:
+- `error` — 1 minus exact-match accuracy (minimize).
+- `tokens_in` — input token count (minimize).
+- `tokens_out` — output token count (minimize).
+
+Demonstrates the `UsageTrackingLLM` + `TokenUsageAugmentingGuide` pattern for
+adding token metrics to any task.
+
+---
+
+## Data flow
+
+### Evaluation pipeline
+
+```mermaid
+graph TD
+    A["Guide.get_score_dict()"] -->|"per-example Dict[str, float]"| B["evaluate_vector()"]
+    B -->|"List[Dict[str, float]]"| C["aggregate_score_dicts()"]
+    C -->|"per-candidate mean Dict"| D["select_best() / select_top_k()"]
+    D -->|"uses ObjectiveConfig"| E["Best candidate selected"]
+```
+
+1. **evaluate_vector()** (`opto/trainer/evaluators.py`) calls
+   `guide.get_score_dict()` for each input and returns a
+   `List[Dict[str, float]]`.
+2. **aggregate_score_dicts()** (`opto/trainer/objectives.py`) computes
+   per-metric means across all examples for a single candidate.
+3. **select_best()** / **select_top_k()** rank candidates according to the
+   `ObjectiveConfig` and return the winning index/indices.
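The aggregation step (step 2) is just a per-metric mean over the per-example dicts. A minimal sketch, assuming all dicts share the same metric keys (a simplified stand-in for `aggregate_score_dicts()` in `opto/trainer/objectives.py`, which also rejects mixed scalar/dict batches):

```python
from typing import Dict, List

def aggregate_score_dicts(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Per-metric mean across all examples for one candidate."""
    if not scores:
        raise ValueError("no scores to aggregate")
    keys = scores[0].keys()
    return {k: sum(s[k] for s in scores) / len(scores) for k in keys}

# Two evaluated examples for a single candidate:
per_example = [
    {"accuracy": 1.0, "tokens_out": 100.0},
    {"accuracy": 0.0, "tokens_out": 140.0},
]
print(aggregate_score_dicts(per_example))  # → {'accuracy': 0.5, 'tokens_out': 120.0}
```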
+ +### Selection mode decision + +```mermaid +graph TD + S["ObjectiveConfig.mode"] -->|"scalar"| SC["to_scalar_score() → argmax"] + S -->|"weighted"| W["apply_minimize() → weighted_scalarize() → argmax"] + S -->|"pareto"| P["apply_minimize() → pareto_rank()"] + P --> F{"Single rank-0\ncandidate?"} + F -->|"Yes"| R["Return winner"] + F -->|"No"| TB["tie_break strategy"] + TB -->|"weighted"| TW["weighted_scalarize among front → argmax"] + TB -->|"lexicographic"| TL["sort by metric name → pick highest"] + TB -->|"random_seeded"| TR["seeded shuffle → pick first"] +``` + +Trainer algorithms (BasicSearch, Beamsearch) call this pipeline internally when +`objective_config` is provided and `mode != "scalar"`. + +--- + +## Running in Trace-Bench + +### CLI + +```bash +# List available multi-objective tasks +trace-bench list-tasks --bench internal + +# Validate a config without running +trace-bench validate --config configs/m3_multiobjective.yaml + +# Run the full multi-objective benchmark +export TRACE_LITELLM_MODEL=openrouter/x-ai/grok-4.1-fast +trace-bench run --config configs/m3_multiobjective.yaml +``` + +### YAML config format + +The multi-objective config (`configs/m3_multiobjective.yaml`) uses this +structure: + +```yaml +mode: real +seeds: [42] +max_workers: 6 +resume: auto +job_timeout: 1200 + +tasks: + - id: "internal:multiobjective_convex" + eval_kwargs: + objective_mode: "weighted" + + - id: "internal:multiobjective_convex" + eval_kwargs: + objective_mode: "pareto" + + # ... same pattern for bbeh and gsm8k + +trainers: + - id: BasicSearchAlgorithm + params_variants: + - num_proposals: 4 + num_epochs: 2 + batch_size: 1 + + - id: BeamsearchAlgorithm + params_variants: + - beam_width: 2 + num_proposals: 4 + max_depth: 2 + batch_size: 1 +``` + +**Key fields:** +- `tasks[].id` — registry task ID (e.g. `internal:multiobjective_bbeh`). +- `tasks[].eval_kwargs.objective_mode` — `"weighted"` or `"pareto"`. 
Passed to + the task's `build_trace_problem()` which constructs the `ObjectiveConfig`. +- `trainers[].id` — algorithm name. +- `trainers[].params_variants` — list of parameter sets. The runner expands + tasks x trainers x variants x seeds into individual jobs. + +### Task registration + +Each task module exposes a `build_trace_problem(**eval_kwargs)` function that +returns a dict with: + +```python +{ + "param": trace.node(..., trainable=True), + "guide": MyGuide(), + "train_dataset": {"inputs": [...], "infos": [...]}, + "optimizer_kwargs": {"objective": "...", "memory_size": 10}, + "objective_config": ObjectiveConfig(...), + "metadata": {"benchmark": "multiobjective", ...}, +} +``` + +The `objective_config` is consumed by the trainer's `train()` method. The +`eval_kwargs` from the YAML `tasks[].eval_kwargs` are forwarded directly to +`build_trace_problem()`. + +### LLM model selection + +The LLM is selected at runtime via the `TRACE_LITELLM_MODEL` environment +variable. Common provider configurations: + +```bash +# OpenRouter +export OPENROUTER_API_KEY=... +export TRACE_LITELLM_MODEL=openrouter/x-ai/grok-4.1-fast + +# Direct provider +export XAI_API_KEY=... +export TRACE_LITELLM_MODEL=xai/grok-4.1-fast + +# DeepSeek +export DEEPSEEK_API_KEY=... 
+export TRACE_LITELLM_MODEL=deepseek/deepseek-chat +``` diff --git a/examples/notebooks/t6_m2_bbeh.ipynb b/examples/notebooks/multiobjective_bbeh_langgraph.ipynb similarity index 73% rename from examples/notebooks/t6_m2_bbeh.ipynb rename to examples/notebooks/multiobjective_bbeh_langgraph.ipynb index d53632d5..ec5e48dc 100644 --- a/examples/notebooks/t6_m2_bbeh.ipynb +++ b/examples/notebooks/multiobjective_bbeh_langgraph.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "id": "cell-title", "metadata": {}, - "source": "# T6 M2 — BBEH Boolean Expressions with Multi-Objective Instrumentation\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m2_bbeh.ipynb)\n\n**Milestone 2 Deliverable** — Multi-objective scoring on a real LLM task\n\nThis notebook demonstrates multi-objective optimization on the **BBEH boolean_expressions** benchmark\nusing the **PAL (Program-Aided Language model)** strategy from Xavier's original experiment.\n\nTwo objectives are tracked:\n- **accuracy** (binary: 1.0 = correct, 0.0 = wrong)\n- **execution_time_s** (end-to-end wall-clock seconds per example: LLM call + code execution)\n\nThe `LangGraphGuide.get_score_dict()` method returns both metrics per example,\nenabling the M2 multi-objective infrastructure to track and visualize tradeoffs.\n\n**Requires a real LLM API key** (OpenRouter recommended, default model: `openai/gpt-5-nano`).\n\n---" + "source": "# T6 M2 — BBEH Boolean Expressions with Multi-Objective Instrumentation\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/examples/notebooks/multiobjective_bbeh_langgraph.ipynb)\n\n**Milestone 2 Deliverable** — Multi-objective scoring on a real LLM task\n\nThis notebook demonstrates multi-objective optimization on the **BBEH boolean_expressions** benchmark\nusing 
the **PAL (Program-Aided Language model)** strategy from Xavier's original experiment.\n\nTwo objectives are tracked:\n- **accuracy** (binary: 1.0 = correct, 0.0 = wrong)\n- **execution_time_s** (end-to-end wall-clock seconds per example: LLM call + code execution)\n\nThe `LangGraphGuide.get_score_dict()` method returns both metrics per example,\nenabling the M2 multi-objective infrastructure to track and visualize tradeoffs.\n\n**Requires a real LLM API key** (OpenRouter recommended, default model: `openai/gpt-5-nano`).\n\n---" }, { "cell_type": "code", @@ -20,7 +20,7 @@ "id": "cell-setup", "metadata": {}, "outputs": [], - "source": "import os, sys, subprocess\n\nif IN_COLAB:\n if not os.path.exists('/content/Trace'):\n print(\"Setting up Trace...\")\n !pip install langgraph langchain langchain_openai datasets tqdm langchain_community litellm dspy black matplotlib pandas\n !git clone https://github.com/AgentOpt/OpenTrace.git /content/Trace\n %cd /content/Trace\n !git pull origin experimental && git checkout experimental\n !sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n !pip install -e .\n sys.path.append('/content/Trace')\nelse:\n # Local: add repo root to sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n\n# Clone BBEH benchmark tasks\nif not os.path.exists('bbeh'):\n !git clone https://github.com/google-deepmind/bbeh.git\nelse:\n print(\"bbeh/ already exists, skipping clone.\")\n\n# Soft-import display\ntry:\n from IPython.display import display\nexcept Exception:\n def display(*args, **kwargs):\n return None\n\nprint(f\"{IN_COLAB=}\")" + "source": "import os, sys, subprocess\n\nif IN_COLAB:\n if not os.path.exists('/content/Trace'):\n print(\"Setting up Trace...\")\n !pip install langgraph langchain langchain_openai datasets tqdm langchain_community litellm dspy black matplotlib 
pandas\n !git clone https://github.com/AgentOpt/OpenTrace.git /content/Trace\n %cd /content/Trace\n !git pull origin experimental && git checkout experimental\n !pip install -e .\n sys.path.append('/content/Trace')\nelse:\n # Local: add repo root to sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n\n# Clone BBEH benchmark tasks\nif not os.path.exists('bbeh'):\n !git clone https://github.com/google-deepmind/bbeh.git\nelse:\n print(\"bbeh/ already exists, skipping clone.\")\n\n# Soft-import display\ntry:\n from IPython.display import display\nexcept Exception:\n def display(*args, **kwargs):\n return None\n\nprint(f\"{IN_COLAB=}\")" }, { "cell_type": "code", @@ -205,15 +205,16 @@ "# -----------------------\n", "class LangGraphTrainer(_TraceMinibatch):\n", " def __init__(self, *, graph_root_function: str, graph_agents_functions: list[str], scope: dict,\n", - " optimizer, parameters: list):\n", + " optimizer, parameters: list,\n", + " original_root=None, original_agents=None):\n", " object.__init__(self)\n", " self.root_name = graph_root_function\n", " self.agent_names = list(graph_agents_functions)\n", " self.scope = scope\n", " self.optimizer = optimizer\n", " self.parameters = list(parameters)\n", - " self._original_root = scope[graph_root_function]\n", - " self._original_agents = {n: scope[n] for n in graph_agents_functions if n in scope}\n", + " self._original_root = original_root if original_root is not None else scope[graph_root_function]\n", + " self._original_agents = original_agents if original_agents is not None else {n: scope[n] for n in graph_agents_functions if n in scope}\n", "\n", " def restore_originals(self):\n", " self.scope[self.root_name] = self._original_root\n", @@ -439,6 +440,10 @@ " if isinstance(scope.get(graph_root_function), FunModule):\n", " scope[graph_root_function] = 
scope[graph_root_function]._fun\n", "\n", + " # Save original (pre-bind) functions so trainer can restore on corruption\n", + " original_root = scope.get(graph_root_function)\n", + " original_agents = {name: scope[name] for name in graph_agents_functions if name in scope}\n", + "\n", " parameters = []\n", " for name in graph_agents_functions:\n", " if name not in scope:\n", @@ -509,6 +514,8 @@ " scope=scope,\n", " optimizer=opt,\n", " parameters=parameters,\n", + " original_root=original_root,\n", + " original_agents=original_agents,\n", " )\n", " modified, history, best_state, last_state = trainer.train(\n", " guide=guide,\n", @@ -548,7 +555,113 @@ "id": "cell-pal-strategy", "metadata": {}, "outputs": [], - "source": "import re, time\nfrom langgraph.graph import StateGraph, START, END\n\n# -----------------------\n# Strategy: PAL (Program-Aided Language model)\n# -----------------------\nprompt_parse_problem = node(\n \"Read the problem and write Python code that sets a variable named `result` to the final answer.\\n\"\n \"- Output ONLY valid Python (no markdown fences).\\n\"\n \"- If the task is multiple-choice, set result to the option label exactly (e.g., '(A)').\\n\\n\"\n \"Problem:\\n\",\n trainable=True,\n description=\"PAL prompt that generates python code producing a `result`.\"\n)\n\n# Global variable to capture execution time from the graph invocation.\n# This is read by run_solver_on_example() to populate the guide's score_dict.\n_last_exec_time_s = 0.0\n\ndef parse_problem(state: dict):\n question = get_no_node(state.get(\"question\", \"\"))\n prompt = prompt_parse_problem + question\n code_str = llm_call(get_no_node(prompt))\n return {\"code\": code_str.strip(), \"question\": question}\n\ndef execute_code(state: dict):\n \"\"\"Execute LLM-generated Python code.\n\n The PAL strategy: exec() the code produced by the LLM and extract the\n `result` variable as the final answer.\n \"\"\"\n def strip_python_tags(code: str) -> str:\n return re.sub(\n 
r'(?s)(?:.*?```(?:python)?\\s*\\n(.*?)(?:\\n```.*)?|(.*))',\n lambda m: m.group(1) if m.group(1) is not None else m.group(2),\n code,\n )\n\n update = {}\n try:\n code_to_run = strip_python_tags(get_no_node(state.get(\"code\", \"\")))\n local_vars = {}\n exec(code_to_run, {}, local_vars) # noqa: S102 - intentional PAL strategy\n local_vars.pop(\"__builtins__\", None)\n\n if \"result\" in local_vars:\n update[\"final_answer\"] = node(local_vars[\"result\"])\n elif len(local_vars) == 1:\n update[\"final_answer\"] = node(next(iter(local_vars.values())))\n else:\n update[\"final_answer\"] = node(None)\n\n except Exception as e:\n update[\"final_answer\"] = node(None)\n update[\"error\"] = str(e)\n\n return update\n\ndef create_graph_solve_with_PAL_Strategy():\n g = StateGraph(dict)\n g.add_node(\"parse\", parse_problem)\n g.add_node(\"calculate\", execute_code)\n g.add_edge(START, \"parse\")\n g.add_edge(\"parse\", \"calculate\")\n g.add_edge(\"calculate\", END)\n return g\n\ndef solve_with_PAL_Strategy(problem: str) -> dict:\n global _last_exec_time_s\n _last_exec_time_s = 0.0 # reset before each invocation\n\n g = create_graph_solve_with_PAL_Strategy()\n compiled = g.compile()\n\n if SHOW_MERMAID_GRAPH:\n try:\n from IPython.display import Image, display\n display(Image(compiled.get_graph(xray=1).draw_mermaid_png()))\n except Exception:\n pass\n\n # --- M2: measure end-to-end graph execution time (LLM call + code exec) ---\n t0 = time.perf_counter()\n result = compiled.invoke({\"question\": get_no_node(problem)})\n t1 = time.perf_counter()\n _last_exec_time_s = t1 - t0\n\n if \"final_answer\" not in result:\n return {\"final_answer\": node(\"No solution found\")}\n if isinstance(result[\"final_answer\"], str):\n return {\"final_answer\": node(result[\"final_answer\"])}\n return result\n\n# Default graph spec\nGRAPH_ROOT = \"solve_with_PAL_Strategy\"\nGRAPH_AGENTS = [\"parse_problem\", \"execute_code\"]\nGRAPH_PROMPTS = [prompt_parse_problem]\n\nprint(\"PAL strategy 
loaded (with end-to-end timing instrumentation).\")\nprint(\"solve_with_PAL_Strategy() measures total graph time (LLM + code exec) via time.perf_counter().\")"
+   "source": [
+    "import re, time\n",
+    "from langgraph.graph import StateGraph, START, END\n",
+    "\n",
+    "# -----------------------\n",
+    "# Strategy: PAL (Program-Aided Language model)\n",
+    "# -----------------------\n",
+    "prompt_parse_problem = node(\n",
+    "    \"Read the problem and write Python code that sets a variable named `result` to the final answer.\\n\"\n",
+    "    \"- Output ONLY valid Python (no markdown fences).\\n\"\n",
+    "    \"- If the task is multiple-choice, set result to the option label exactly (e.g., '(A)').\\n\\n\"\n",
+    "    \"Problem:\\n\",\n",
+    "    trainable=True,\n",
+    "    description=\"PAL prompt that generates python code producing a `result`.\"\n",
+    ")\n",
+    "\n",
+    "# Global variable to capture execution time from the graph invocation.\n",
+    "# This is read by run_solver_on_example() to populate the guide's score_dict.\n",
+    "_last_exec_time_s = 0.0\n",
+    "\n",
+    "def parse_problem(state: dict):\n",
+    "    state = get_no_node(state)\n",
+    "    question = get_no_node(state.get(\"question\", \"\"))\n",
+    "    prompt = prompt_parse_problem + question\n",
+    "    code_str = llm_call(get_no_node(prompt))\n",
+    "    return {\"code\": code_str.strip(), \"question\": question}\n",
+    "\n",
+    "def execute_code(state: dict):\n",
+    "    \"\"\"Execute LLM-generated Python code.\n",
+    "\n",
+    "    The PAL strategy: exec() the code produced by the LLM and extract the\n",
+    "    `result` variable as the final answer.\n",
+    "    \"\"\"\n",
+    "    def strip_python_tags(code: str) -> str:\n",
+    "        # Extract the fenced code body if a ```python fence is present;\n",
+    "        # otherwise return the input unchanged. (A single lazy re.sub over\n",
+    "        # the whole string can match an empty body and leave the closing\n",
+    "        # fence behind, which makes exec() raise SyntaxError.)\n",
+    "        m = re.search(r'```(?:python)?\\s*\\n(.*?)\\n?```', code, re.DOTALL)\n",
+    "        return m.group(1) if m is not None else code\n",
+    "\n",
+    "    update = {}\n",
+    "    try:\n",
+    "        code_to_run = strip_python_tags(get_no_node(state.get(\"code\", \"\")))\n",
+    "        local_vars = {}\n",
+    "        exec(code_to_run, 
{}, local_vars) # noqa: S102 - intentional PAL strategy\n", + " local_vars.pop(\"__builtins__\", None)\n", + "\n", + " if \"result\" in local_vars:\n", + " update[\"final_answer\"] = node(local_vars[\"result\"])\n", + " elif len(local_vars) == 1:\n", + " update[\"final_answer\"] = node(next(iter(local_vars.values())))\n", + " else:\n", + " update[\"final_answer\"] = node(None)\n", + "\n", + " except Exception as e:\n", + " update[\"final_answer\"] = node(None)\n", + " update[\"error\"] = str(e)\n", + "\n", + " return update\n", + "\n", + "def create_graph_solve_with_PAL_Strategy():\n", + " g = StateGraph(dict)\n", + " g.add_node(\"parse\", parse_problem)\n", + " g.add_node(\"calculate\", execute_code)\n", + " g.add_edge(START, \"parse\")\n", + " g.add_edge(\"parse\", \"calculate\")\n", + " g.add_edge(\"calculate\", END)\n", + " return g\n", + "\n", + "def solve_with_PAL_Strategy(problem: str) -> dict:\n", + " global _last_exec_time_s\n", + " _last_exec_time_s = 0.0 # reset before each invocation\n", + "\n", + " g = create_graph_solve_with_PAL_Strategy()\n", + " compiled = g.compile()\n", + "\n", + " if SHOW_MERMAID_GRAPH:\n", + " try:\n", + " from IPython.display import Image, display\n", + " display(Image(compiled.get_graph(xray=1).draw_mermaid_png()))\n", + " except Exception:\n", + " pass\n", + "\n", + " # --- M2: measure end-to-end graph execution time (LLM call + code exec) ---\n", + " t0 = time.perf_counter()\n", + " try:\n", + " result = compiled.invoke({\"question\": get_no_node(problem)})\n", + " except Exception as e:\n", + " _last_exec_time_s = time.perf_counter() - t0\n", + " return {\"final_answer\": node(None), \"error\": str(e)}\n", + " t1 = time.perf_counter()\n", + " _last_exec_time_s = t1 - t0\n", + "\n", + " if \"final_answer\" not in result:\n", + " return {\"final_answer\": node(\"No solution found\")}\n", + " if isinstance(result[\"final_answer\"], str):\n", + " return {\"final_answer\": node(result[\"final_answer\"])}\n", + " return 
result\n", + "\n", + "# Default graph spec\n", + "GRAPH_ROOT = \"solve_with_PAL_Strategy\"\n", + "GRAPH_AGENTS = [\"parse_problem\", \"execute_code\"]\n", + "GRAPH_PROMPTS = [prompt_parse_problem]\n", + "\n", + "print(\"PAL strategy loaded (with end-to-end timing instrumentation).\")\n", + "print(\"solve_with_PAL_Strategy() measures total graph time (LLM + code exec) via time.perf_counter().\")" + ] }, { "cell_type": "code", @@ -662,7 +775,187 @@ "id": "cell-training", "metadata": {}, "outputs": [], - "source": "from typing import List, Dict, Tuple\nimport time\n\n# -----------------------\n# Multi-objective instrumented solver + evaluator\n# -----------------------\n\n# Build the guide with multi-objective support\nguide = LangGraphGuide(\n feedback_func=feedback_answer_bbeh,\n answer_key=\"final_answer\",\n allowed_answer_set=allowed_set,\n)\n\ndef run_solver_on_example(ex: dict) -> Tuple[bool, str, str, Dict[str, float]]:\n \"\"\"Run solver and return (ok, pred, feedback, score_dict).\n\n score_dict contains {accuracy, execution_time_s} from get_score_dict().\n \"\"\"\n global _last_exec_time_s\n _last_exec_time_s = 0.0\n\n out = solve_with_PAL_Strategy(ex[\"question\"])\n pred = get_no_node(out.get(\"final_answer\"))\n ok, fb = feedback_answer_bbeh(pred, ex[\"solution\"], allowed_set)\n\n # Populate guide's execution time from the global, then get score_dict\n guide._last_execution_time_s = _last_exec_time_s\n score_dict = guide.get_score_dict(ex[\"question\"], out, ex[\"solution\"])\n\n return ok, str(pred), fb, score_dict\n\ndef evaluate(examples: List[dict], *, name: str) -> Tuple[float, List[Dict[str, float]]]:\n \"\"\"Evaluate examples, returning (accuracy, list of score_dicts).\"\"\"\n n_ok = 0\n all_score_dicts = []\n for i, ex in enumerate(examples, 1):\n ok, pred, fb, sd = run_solver_on_example(ex)\n n_ok += int(ok)\n all_score_dicts.append(sd)\n print(f\"[{name}] {i:02d}/{len(examples)} ok={ok} pred={pred} \"\n 
f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n acc = n_ok / max(1, len(examples))\n mean_time = sum(sd['execution_time_s'] for sd in all_score_dicts) / max(1, len(all_score_dicts))\n print(f\"[{name}] accuracy = {acc:.3f} ({n_ok}/{len(examples)}), mean exec_time = {mean_time:.4f}s\")\n return acc, all_score_dicts\n\n\n# =====================================================================\n# Baseline evaluation\n# =====================================================================\nprint(\"=\" * 60)\nprint(\"BASELINE evaluation on validation set\")\nprint(\"=\" * 60)\nbaseline_acc, baseline_score_dicts = evaluate(val_set, name=\"baseline/val\")\n\n# =====================================================================\n# Per-step metric collection during curriculum training\n# =====================================================================\n# Stores {step, phase, accuracy, execution_time_s, example_idx} per observation\nmetric_log = []\nstep_counter = 0\n\n# Record baseline metrics\nfor i, sd in enumerate(baseline_score_dicts):\n metric_log.append({\n \"step\": 0,\n \"phase\": \"baseline\",\n \"example_idx\": i,\n **sd,\n })\n\n# =====================================================================\n# Curriculum training (Mode B) with metric collection\n# =====================================================================\nif SKIP_OPTIMIZATION:\n print(\"SKIP_OPTIMIZATION=1 -> skipping optimization/training.\")\nelse:\n last_successes: List[dict] = []\n\n for idx, ex in enumerate(train_set, 1):\n step_counter += 1\n ok, pred, fb, sd = run_solver_on_example(ex)\n print(f\"[train] {idx:02d}/{len(train_set)} ok={ok} pred={pred} \"\n f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n\n # Log pre-optimization metric\n metric_log.append({\n \"step\": step_counter,\n \"phase\": \"train_pre\",\n \"example_idx\": idx - 1,\n **sd,\n })\n\n if ok:\n last_successes.append(ex)\n last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n continue\n\n # 
Optimize on the failing example\n modified, dump_file, history, chosen_state, run_dir = optimize_langgraph(\n graph_root_function=GRAPH_ROOT,\n graph_agents_functions=GRAPH_AGENTS,\n graph_prompts_list=GRAPH_PROMPTS,\n question=ex[\"question\"],\n solution=ex[\"solution\"],\n answer_feedback_func=feedback_answer_bbeh,\n allowed_answer_set=allowed_set,\n validation_set=last_successes,\n accumulation_steps=ACCUMULATION_STEPS,\n retry=LEARNING_RETRY,\n max_attempts=MAX_ATTEMPTS,\n test_optimization=True,\n stop_on_success=True,\n seed=SEED,\n dump_prefix=f\"BBEH_{BBEH_TASK_NAME}__PAL__\",\n output_folder=OUTPUT_FOLDER,\n )\n\n print(\"[train] optimize_langgraph:\", {\"modified\": modified, \"dump_file\": dump_file, \"run_dir\": run_dir})\n if history:\n print(\"[train] last history entry:\", history[-1])\n\n # Re-test after optimization.\n # Wrapped in try/except: when optimization fails to update params,\n # the Trace bundle state can be corrupted (Node objects where dicts\n # are expected), causing ExecutionError in the re-test.\n try:\n ok2, pred2, fb2, sd2 = run_solver_on_example(ex)\n print(f\"[train] after-opt ok={ok2} pred={pred2} \"\n f\"exec_time={sd2['execution_time_s']:.4f}s :: {fb2}\")\n\n # Log post-optimization metric\n metric_log.append({\n \"step\": step_counter,\n \"phase\": \"train_post\",\n \"example_idx\": idx - 1,\n **sd2,\n })\n\n if ok2:\n last_successes.append(ex)\n last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n except Exception as e:\n print(f\"[train] after-opt re-test failed (graph state corrupted): {type(e).__name__}: {e}\")\n print(\"[train] skipping this example and continuing.\")\n\n# =====================================================================\n# Post-training evaluation\n# =====================================================================\nprint(\"\\n\" + \"=\" * 60)\nprint(\"POST-TRAINING evaluation on validation set\")\nprint(\"=\" * 60)\nfinal_acc, final_score_dicts = evaluate(val_set, name=\"final/val\")\n\n# 
Record final eval metrics\nstep_counter += 1\nfor i, sd in enumerate(final_score_dicts):\n metric_log.append({\n \"step\": step_counter,\n \"phase\": \"final\",\n \"example_idx\": i,\n **sd,\n })\n\nprint(f\"\\nSummary: baseline_val_acc={baseline_acc:.3f}, final_val_acc={final_acc:.3f}\")\nprint(f\"Total metric observations collected: {len(metric_log)}\")" + "source": [ + "from typing import List, Dict, Tuple\n", + "import time\n", + "\n", + "# -----------------------\n", + "# Multi-objective instrumented solver + evaluator\n", + "# -----------------------\n", + "\n", + "# Build the guide with multi-objective support\n", + "guide = LangGraphGuide(\n", + " feedback_func=feedback_answer_bbeh,\n", + " answer_key=\"final_answer\",\n", + " allowed_answer_set=allowed_set,\n", + ")\n", + "\n", + "def run_solver_on_example(ex: dict) -> Tuple[bool, str, str, Dict[str, float]]:\n", + " \"\"\"Run solver and return (ok, pred, feedback, score_dict).\n", + "\n", + " score_dict contains {accuracy, execution_time_s} from get_score_dict().\n", + " \"\"\"\n", + " global _last_exec_time_s\n", + " _last_exec_time_s = 0.0\n", + "\n", + " out = solve_with_PAL_Strategy(ex[\"question\"])\n", + " pred = get_no_node(out[\"final_answer\"])\n", + " ok, fb = feedback_answer_bbeh(pred, ex[\"solution\"], allowed_set)\n", + "\n", + " # Populate guide's execution time from the global, then get score_dict\n", + " guide._last_execution_time_s = _last_exec_time_s\n", + " score_dict = guide.get_score_dict(ex[\"question\"], out, ex[\"solution\"])\n", + "\n", + " return ok, str(pred), fb, score_dict\n", + "\n", + "def evaluate(examples: List[dict], *, name: str) -> Tuple[float, List[Dict[str, float]]]:\n", + " \"\"\"Evaluate examples, returning (accuracy, list of score_dicts).\"\"\"\n", + " n_ok = 0\n", + " all_score_dicts = []\n", + " for i, ex in enumerate(examples, 1):\n", + " ok, pred, fb, sd = run_solver_on_example(ex)\n", + " n_ok += int(ok)\n", + " all_score_dicts.append(sd)\n", + " 
print(f\"[{name}] {i:02d}/{len(examples)} ok={ok} pred={pred} \"\n", + " f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n", + " acc = n_ok / max(1, len(examples))\n", + " mean_time = sum(sd['execution_time_s'] for sd in all_score_dicts) / max(1, len(all_score_dicts))\n", + " print(f\"[{name}] accuracy = {acc:.3f} ({n_ok}/{len(examples)}), mean exec_time = {mean_time:.4f}s\")\n", + " return acc, all_score_dicts\n", + "\n", + "\n", + "# =====================================================================\n", + "# Baseline evaluation\n", + "# =====================================================================\n", + "print(\"=\" * 60)\n", + "print(\"BASELINE evaluation on validation set\")\n", + "print(\"=\" * 60)\n", + "baseline_acc, baseline_score_dicts = evaluate(val_set, name=\"baseline/val\")\n", + "\n", + "# =====================================================================\n", + "# Per-step metric collection during curriculum training\n", + "# =====================================================================\n", + "# Stores {step, phase, accuracy, execution_time_s, example_idx} per observation\n", + "metric_log = []\n", + "step_counter = 0\n", + "\n", + "# Record baseline metrics\n", + "for i, sd in enumerate(baseline_score_dicts):\n", + " metric_log.append({\n", + " \"step\": 0,\n", + " \"phase\": \"baseline\",\n", + " \"example_idx\": i,\n", + " **sd,\n", + " })\n", + "\n", + "# =====================================================================\n", + "# Curriculum training (Mode B) with metric collection\n", + "# =====================================================================\n", + "if SKIP_OPTIMIZATION:\n", + " print(\"SKIP_OPTIMIZATION=1 -> skipping optimization/training.\")\n", + "else:\n", + " last_successes: List[dict] = []\n", + "\n", + " for idx, ex in enumerate(train_set, 1):\n", + " step_counter += 1\n", + " try:\n", + " ok, pred, fb, sd = run_solver_on_example(ex)\n", + " except Exception as e:\n", + " print(f\"[train] 
{idx:02d}/{len(train_set)} CRASHED: {type(e).__name__}: {e}\")\n", + " ok, pred, fb = False, \"ERROR\", str(e)\n", + " sd = {\"accuracy\": 0.0, \"execution_time_s\": 0.0}\n", + " print(f\"[train] {idx:02d}/{len(train_set)} ok={ok} pred={pred} \"\n", + " f\"exec_time={sd['execution_time_s']:.4f}s :: {fb}\")\n", + "\n", + " # Log pre-optimization metric\n", + " metric_log.append({\n", + " \"step\": step_counter,\n", + " \"phase\": \"train_pre\",\n", + " \"example_idx\": idx - 1,\n", + " **sd,\n", + " })\n", + "\n", + " if ok:\n", + " last_successes.append(ex)\n", + " last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n", + " continue\n", + "\n", + " # Save pre-optimization function references for crash recovery\n", + " _pre_opt_agents = {name: globals().get(name) for name in GRAPH_AGENTS}\n", + "\n", + " # Optimize on the failing example\n", + " modified, dump_file, history, chosen_state, run_dir = optimize_langgraph(\n", + " graph_root_function=GRAPH_ROOT,\n", + " graph_agents_functions=GRAPH_AGENTS,\n", + " graph_prompts_list=GRAPH_PROMPTS,\n", + " question=ex[\"question\"],\n", + " solution=ex[\"solution\"],\n", + " answer_feedback_func=feedback_answer_bbeh,\n", + " allowed_answer_set=allowed_set,\n", + " validation_set=last_successes,\n", + " accumulation_steps=ACCUMULATION_STEPS,\n", + " retry=LEARNING_RETRY,\n", + " max_attempts=MAX_ATTEMPTS,\n", + " test_optimization=True,\n", + " stop_on_success=True,\n", + " seed=SEED,\n", + " dump_prefix=f\"BBEH_{BBEH_TASK_NAME}__PAL__\",\n", + " output_folder=OUTPUT_FOLDER,\n", + " )\n", + "\n", + " print(\"[train] optimize_langgraph:\", {\"modified\": modified, \"dump_file\": dump_file, \"run_dir\": run_dir})\n", + " if history:\n", + " print(\"[train] last history entry:\", history[-1])\n", + "\n", + " # Re-test after optimization.\n", + " # Wrapped in try/except: when optimization fails to update params,\n", + " # the Trace bundle state can be corrupted (Node objects where dicts\n", + " # are expected), causing 
ExecutionError in the re-test.\n", + " try:\n", + " ok2, pred2, fb2, sd2 = run_solver_on_example(ex)\n", + " print(f\"[train] after-opt ok={ok2} pred={pred2} \"\n", + " f\"exec_time={sd2['execution_time_s']:.4f}s :: {fb2}\")\n", + "\n", + " # Log post-optimization metric\n", + " metric_log.append({\n", + " \"step\": step_counter,\n", + " \"phase\": \"train_post\",\n", + " \"example_idx\": idx - 1,\n", + " **sd2,\n", + " })\n", + "\n", + " if ok2:\n", + " last_successes.append(ex)\n", + " last_successes = last_successes[-VALIDATE_ON_LAST_N:]\n", + " except Exception as e:\n", + " print(f\"[train] after-opt re-test failed (graph state corrupted): {type(e).__name__}: {e}\")\n", + " # Restore pre-optimization functions so next example doesn't crash\n", + " for _name, _orig in _pre_opt_agents.items():\n", + " if _orig is not None:\n", + " globals()[_name] = _orig\n", + " print(\"[train] restored original functions; continuing.\")\n", + "\n", + "# =====================================================================\n", + "# Post-training evaluation\n", + "# =====================================================================\n", + "print(\"\\n\" + \"=\" * 60)\n", + "print(\"POST-TRAINING evaluation on validation set\")\n", + "print(\"=\" * 60)\n", + "final_acc, final_score_dicts = evaluate(val_set, name=\"final/val\")\n", + "\n", + "# Record final eval metrics\n", + "step_counter += 1\n", + "for i, sd in enumerate(final_score_dicts):\n", + " metric_log.append({\n", + " \"step\": step_counter,\n", + " \"phase\": \"final\",\n", + " \"example_idx\": i,\n", + " **sd,\n", + " })\n", + "\n", + "print(f\"\\nSummary: baseline_val_acc={baseline_acc:.3f}, final_val_acc={final_acc:.3f}\")\n", + "print(f\"Total metric observations collected: {len(metric_log)}\")" + ] }, { "cell_type": "markdown", @@ -856,7 +1149,7 @@ "\n", "This demonstrates the M2 multi-objective infrastructure on a real LLM task.\n", "The same get_score_dict() interface works with BasicSearch, 
BeamsearchAlgorithm,\n", - "and PrioritySearch (see t6_m2_trainers.ipynb for those algorithms).\n", + "and PrioritySearch (see multiobjective_trainers.ipynb for those algorithms).\n", "\"\"\")" ] } @@ -874,4 +1167,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/examples/notebooks/t6_m1_vector_scores.ipynb b/examples/notebooks/multiobjective_quickstart.ipynb similarity index 98% rename from examples/notebooks/t6_m1_vector_scores.ipynb rename to examples/notebooks/multiobjective_quickstart.ipynb index 52bc5c73..07732b73 100644 --- a/examples/notebooks/t6_m1_vector_scores.ipynb +++ b/examples/notebooks/multiobjective_quickstart.ipynb @@ -6,7 +6,7 @@ "id": "a0000001", "metadata": {}, "outputs": [], - "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n !git clone https://github.com/carlosrod723/OpenTrace.git Trace\n %cd Trace\n !git checkout t6-multi-objective-m0\n !sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n !pip install -e .\nelse:\n # Local: ensure repo root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")" + "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n !git clone https://github.com/AgentOpt/OpenTrace.git Trace\n %cd Trace\n !git checkout experimental\n !pip install -e .\nelse:\n # Local: ensure repo root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n 
_repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")" }, { "cell_type": "markdown", @@ -15,7 +15,7 @@ "source": [ "# T6 Multi-Objective Vector Scores — M1 Implementation\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m1_vector_scores.ipynb)\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/examples/notebooks/multiobjective_quickstart.ipynb)\n", "\n", "**Milestone 1 Deliverable** — Core multi-objective infrastructure\n", "\n", @@ -1124,4 +1124,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/examples/notebooks/t6_m2_trainers.ipynb b/examples/notebooks/multiobjective_trainers.ipynb similarity index 96% rename from examples/notebooks/t6_m2_trainers.ipynb rename to examples/notebooks/multiobjective_trainers.ipynb index 8984f43e..5c97209b 100644 --- a/examples/notebooks/t6_m2_trainers.ipynb +++ b/examples/notebooks/multiobjective_trainers.ipynb @@ -6,7 +6,7 @@ "id": "cell-setup", "metadata": {}, "outputs": [], - "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n %cd /content\n !rm -rf Trace # clean slate\n !git clone https://github.com/carlosrod723/OpenTrace.git Trace\n %cd Trace\n !git checkout t6-multi-objective-m0\n !sed -i 's/python_requires=\">=3.13\"/python_requires=\">=3.12\"/' setup.py\n !pip install -e .\n !pip install cvxpy matplotlib pandas\n _repo_root = os.getcwd() # /content/Trace after %cd\nelse:\n # Local: ensure repo 
root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")\n\nprint(f\"Repo root: {_repo_root}\")\n\n# Verify cvxpy is available (required for SixHumpCamel SOS certificate)\ntry:\n import cvxpy\n print(f\"cvxpy {cvxpy.__version__} available\")\nexcept ImportError:\n raise ImportError(\"cvxpy is required: pip install cvxpy\")" + "source": "import os, sys\n\n# In Colab: clone and install from GitHub\n# Locally: add repo root to sys.path so opto is importable\ntry:\n import google.colab\n IN_COLAB = True\nexcept ImportError:\n IN_COLAB = False\n\nif IN_COLAB:\n %cd /content\n !rm -rf Trace # clean slate\n !git clone https://github.com/AgentOpt/OpenTrace.git Trace\n %cd Trace\n !git checkout experimental\n !pip install -e .\n !pip install cvxpy matplotlib pandas\n _repo_root = os.getcwd() # /content/Trace after %cd\nelse:\n # Local: ensure repo root is on sys.path\n _nb_dir = os.path.dirname(os.path.abspath(\"__file__\"))\n _repo_root = os.path.abspath(os.path.join(_nb_dir, \"..\", \"..\"))\n if _repo_root not in sys.path:\n sys.path.insert(0, _repo_root)\n import opto\n print(f\"Using local opto from: {os.path.dirname(opto.__file__)}\")\n\nprint(f\"Repo root: {_repo_root}\")\n\n# Verify cvxpy is available (required for SixHumpCamel SOS certificate)\ntry:\n import cvxpy\n print(f\"cvxpy {cvxpy.__version__} available\")\nexcept ImportError:\n raise ImportError(\"cvxpy is required: pip install cvxpy\")" }, { "cell_type": "markdown", @@ -15,7 +15,7 @@ "source": [ "# T6 M2 — BeamsearchAlgorithm & PrioritySearch Multi-Objective\n", "\n", - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/pull/61/head/examples/notebooks/t6_m2_trainers.ipynb)\n", + 
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AgentOpt/OpenTrace/blob/experimental/examples/notebooks/multiobjective_trainers.ipynb)\n", "\n", "**Milestone 2 Deliverable** — Multi-objective support in BeamsearchAlgorithm and PrioritySearch\n", "\n", @@ -182,4 +182,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/setup.py b/setup.py index 8fdfd139..dbd60be5 100644 --- a/setup.py +++ b/setup.py @@ -29,5 +29,5 @@ long_description=open('README.md', encoding="utf8").read(), packages=setuptools.find_packages(include=["opto*"]), install_requires=install_requires, - python_requires=">=3.13", + python_requires=">=3.10", )
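For reference, the two-objective scheme the notebooks in this patch describe (per-example score dicts with `accuracy` and `execution_time_s`, ranked by weighted scalarization or Pareto dominance) can be sketched in a few lines of plain Python. This is an illustrative stand-in only: `scalarize`, `dominates`, and the weight values below are hypothetical helpers, not the `ObjectiveConfig` / `Guide.get_score_dict()` API itself.

```python
from typing import Dict, List

def scalarize(score_dict: Dict[str, float], weights: Dict[str, float]) -> float:
    # Weighted sum of named metrics; a negative weight on execution_time_s
    # penalizes slow candidates while still rewarding accuracy.
    return sum(weights[k] * score_dict[k] for k in weights)

def dominates(a: Dict[str, float], b: Dict[str, float], maximize: List[str]) -> bool:
    # Pareto dominance: a is at least as good as b on every objective
    # and strictly better on at least one.
    ge = all(a[k] >= b[k] for k in maximize)
    gt = any(a[k] > b[k] for k in maximize)
    return ge and gt

candidates = [
    {"accuracy": 1.0, "execution_time_s": 2.5},
    {"accuracy": 1.0, "execution_time_s": 0.8},
    {"accuracy": 0.0, "execution_time_s": 0.3},
]

# Negate time so "higher is better" holds for both objectives.
as_max = [{"accuracy": c["accuracy"], "neg_time": -c["execution_time_s"]} for c in candidates]
front = [c for c in as_max
         if not any(dominates(o, c, ["accuracy", "neg_time"]) for o in as_max)]

best = max(candidates, key=lambda c: scalarize(c, {"accuracy": 1.0, "execution_time_s": -0.1}))
print(best)        # -> {'accuracy': 1.0, 'execution_time_s': 0.8}
print(len(front))  # -> 2 (the accurate-and-fast candidate plus the fastest one)
```

Scalarization picks a single winner given fixed weights, while the Pareto front keeps every candidate that is not strictly worse on all metrics; the trainers in the notebooks use the same distinction when ranking beam candidates.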