Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 74 additions & 0 deletions docs/proposals/1-triage.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
title: PlanExe Proposal Triage — 80/20 Landscape
date: 2026-02-25
status: working note
author: Egon + Larry
---

# Overview

Simon asked us to triage the proposal space with an 80:20 lens. The goal of this note is to capture:
1. Which proposals deliver outsized value (the 20% that unlock 80% of the architecture)
2. Which other proposals are nearby in the graph and could reuse their artifacts or reasoning
3. High-leverage parameter tweaks, code tweaks, and second/third order effects
4. Gaps in the current docs and ideas for new proposals
5. Relevant questions/tasks you might not have asked yet

We focused on the most recent proposals ("67+" cluster) plus the ones directly touching the validation/orchestration story that FermiSanityCheck will unlock.

# High-Leverage Proposals (the 20%)

1. **#07 Elo Ranking System (1,751 lines)** – Core ranking mechanism for comparing idea variants, plan quality, and post-plan summaries. Louis-level heuristics here inform nearly every downstream comparison use case.
2. **#63 Luigi Agent Integration & #64 Post-plan Orchestration Layer** – These three documents (#63, #64, #66) describe how PlanExe schedules, retries, and enriches its Luigi DAG. Any change to the DAG (including FermiSanityCheck or arcgentica-style loops) ripples through this cluster.
3. **#62 Agent-first Frontend Discoverability (609 lines)** – Defines the agent UX, which depends on the scoring/ranking engine (#07) and the reliability signals that our validation cluster will provide.
4. **#69 Arcgentica Agent Patterns (279 lines)** – The arcgentica comparison is already referencing our validation work and sets the guardrails for self-evaluation/soft-autonomy.
5. **#41 Autonomous Execution of Plan & #05 Semantic Plan Search Graph** – These represent core system-level capabilities (distributed execution and semantic search) whose outputs feed the ranking and reporting layers.

These documents together unlock most of the architectural work. They interlock around: planning quality signals (#07, #69, Fermi), orchestration (#63, #64, #66), and the interfaces (#62, #41, #05).

# Related Proposals & Reuse Opportunities

- **#07 Elo Ranking + #62 Agent-first Frontend** can share heuristics. Instead of reinventing ranking weights in #62, reuse the cost/feasibility tradeoffs defined in #07 plus FermiSanityCheck flags as features.
- **#63-66 orchestration cluster** already describe Luigi tasks. The validation loop doc should be cross-referenced there to show where FermiSanityCheck sits in the DAG and how downstream tasks like WBS, Scheduler, ExpertOrchestrator should consume the validation report.
- **#69 + #56 (Adversarial Red Team) + #43 (Assumption Drift Monitor)** form a validation cluster. FermiSanityCheck is the front line; these others are observers (red team, drift monitor) that should consume the validation report and escalate to human review.
- **#32 Gantt Parallelization & #33 CBS** could re-use the same thresholds as FermiSanityCheck when calculating duration plausibility (e.g., if duration falls outside the published feasible range, highlight the same issue in the Gantt UI).

# 80:20 Tweaks & Parameter Changes

- **Ranking weights (#07)** – adjust cost vs. feasibility vs. confidence to surface plans that pass quantitative grounding. No rewrite needed; just new weights (e.g., penalize plans where FermiSanityCheck flags >3 assumptions).
- **Batch size thresholds (#63)** – the Luigi DAG currently runs every task. We can gate the WBS tasks with a flag that only fires if FermiSanityCheck passes or fails softly, enabling a smaller workflow for low-risk inputs without re-architecting.
- **Risk terminology alignment (#38 & #44)** – harmonize the words used in the risk propagation network and investor audit pack so they can share visualization tooling, reducing duplicate explanations.

# Second/Third Order Effects

- **Validation loop → downstream trust**: Once FermiSanityCheck is in place, client reports (e.g., #60 plan-to-repo, #41 autonomous execution) can annotate numbers with the validation status, reducing rework.
- **Arcgentica/agent patterns**: Hardening PlanExe encourages stricter typed outputs (#69). This lets the UI (#08) and ranking engine (#07) rely on structured data instead of parsing Markdown.
- **Quantitative grounding improves ranking** (#07, #62) which in turn makes downstream dashboards (#60, #62) more actionable and reduces QA overhead.
- **Clustering proposals** (#63-66, #69, #56) around validation/orchestration helps the next human reviewer (Simon) make a single decision that affects multiple docs.

# Gaps & Future Proposal Ideas

- **FermiSanityCheck Implementation Roadmap** – Document how MakeAssumptions output becomes QuantifiedAssumption, where heuristics live, and how Luigi tasks consume the validation_report. (We have the spec in `planexe-validation-loop-spec.md` but not a public proposal yet.)
- **Validation Observability Dashboard** – A proposal capturing how the validation report is surfaced to humans (per #44, #60). Could cover alerts (Slack/Discord) when FermiSanityCheck fails or when repeated fails accumulate.
- **Arbitration Workflow** – When FermiSanityCheck fails and ReviewPlan still thinks the plan is OK, we need a human-in-the-loop workflow. This is not yet documented anywhere.

# Questions You Might Not Be Asking

1. What are the acceptance criteria for FermiSanityCheck? (confidence levels, heuristics, why 100× spans?)
2. Who owns the validation report downstream? Should ExpertOrchestrator or Governance phases be responsible for acting on it?
3. Does FermiSanityCheck expire per run or is it stored for audit trails (per #42 evidence traceability)?
4. Can we reuse the same heuristics for other tasks (#32 Gantt, #34 finance) to maximize payoff?
5. How do we rank the outputs once FermiSanityCheck is added? Should ranking (#07) penalize low confidence even if the costs look good?
6. Do we need a battle plan for manual overrides when FermiSanityCheck is overzealous (e.g., ROI assumptions where domain experts know the average is >100×)?

# Tasks We Can Own Now

- Extract the QuantifiedAssumption schema (claim, lower_bound, upper_bound, unit, confidence, evidence) and add it to PlanExe’s assumption bundle.
- Implement a FermiSanityCheck Luigi task that runs immediately after MakeAssumptions and produces validation_report.json.
- Hook the validation report into DistillAssumptions / ReviewAssumptions by adding a `validation_passed` flag.
- Update #69 and #56 docs with references to the validation report to keep the narrative cohesive.
- Create the proposed dashboard proposal (validation observability) to track how many plans fail numeric sanity each week.

# Summary

The high-leverage 20% of proposals are: ranking (#07), orchestration (#63-66), UI (#62), arcgentica patterns (#69), and autonomous execution/search (#41, #05). We can activate them by implementing FermiSanityCheck, aligning their heuristics, and surfacing the new validation signals in the UI/dashboards. The docs already cover most of the research; now we need a short, focused proposal/clustering doc (this one) plus the Fermi implementation and dashboards. After Simon approves, we can execute the chosen cluster.
2 changes: 2 additions & 0 deletions worker_plan/worker_plan_api/filenames.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,8 @@ class FilenameEnum(str, Enum):
REVIEW_ASSUMPTIONS_MARKDOWN = "003-9-review_assumptions.md"
CONSOLIDATE_ASSUMPTIONS_FULL_MARKDOWN = "003-10-consolidate_assumptions_full.md"
CONSOLIDATE_ASSUMPTIONS_SHORT_MARKDOWN = "003-11-consolidate_assumptions_short.md"
FERMI_SANITY_CHECK_REPORT = "003-12-fermi_sanity_check_report.json"
FERMI_SANITY_CHECK_SUMMARY = "003-13-fermi_sanity_check_summary.md"
PRE_PROJECT_ASSESSMENT_RAW = "004-1-pre_project_assessment_raw.json"
PRE_PROJECT_ASSESSMENT = "004-2-pre_project_assessment.json"
PROJECT_PLAN_RAW = "005-1-project_plan_raw.json"
Expand Down
284 changes: 284 additions & 0 deletions worker_plan/worker_plan_internal/assume/domain_normalizer.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,284 @@
"""
Author: Larry (Claude Opus 4.6)
Date: 2026-02-25
PURPOSE: Domain-aware normalization for FermiSanityCheck. Loads domain profiles (YAML),
auto-detects project domain from assumptions, and normalizes currency/units/confidence
to standard metric/English output for AI agents.
SRP/DRY check: Pass - Consumes QuantifiedAssumption schema + domain profile YAML.
Outputs normalized assumptions ready for validation.
"""

import logging
import yaml
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Dict, Any
from pathlib import Path

from worker_plan_internal.assume.quantified_assumptions import (
QuantifiedAssumption,
ConfidenceLevel,
)

LOGGER = logging.getLogger(__name__)

# Find domain profiles YAML
DOMAIN_PROFILES_PATH = Path(__file__).parent.parent / "docs" / "domain-profiles" / "domain-profile-schema.md"


class DomainProfile:
"""Represents a single domain profile (carpenter, dentist, etc.)"""

def __init__(self, profile_dict: Dict[str, Any]):
self.id = profile_dict.get("id")
self.name = profile_dict.get("name")
self.description = profile_dict.get("description")

# Currency
currency_cfg = profile_dict.get("currency", {})
self.default_currency = currency_cfg.get("default", "USD")
self.currency_aliases = set(currency_cfg.get("aliases", []))
self.currency_aliases.add(self.default_currency.lower())

# Units
units_cfg = profile_dict.get("units", {})
self.metric_first = units_cfg.get("metric", True)
self.unit_conversions = {}
for conv in units_cfg.get("convert", []):
self.unit_conversions[conv["from"].lower()] = {
"to": conv["to"],
"factor": conv["factor"],
}

# Heuristics
heuristics = profile_dict.get("heuristics", {})
self.budget_keywords = set(heuristics.get("budget_keywords", []))
self.timeline_keywords = set(heuristics.get("timeline_keywords", []))
self.team_keywords = set(heuristics.get("team_keywords", []))

confidence_kw = heuristics.get("confidence_keywords", {})
self.high_confidence_words = set(confidence_kw.get("high", []))
self.medium_confidence_words = set(confidence_kw.get("medium", []))
self.low_confidence_words = set(confidence_kw.get("low", []))

# Detection
detection = profile_dict.get("detection", {})
self.currency_signals = set(detection.get("currency_signals", []))
self.unit_signals = set(detection.get("unit_signals", []))
self.keyword_signals = set(detection.get("keyword_signals", []))

def score_match(self, currency_found: List[str], units_found: List[str], keywords_found: List[str]) -> int:
"""Score how well this profile matches the found signals."""
score = 0
for c in currency_found:
if c.lower() in [s.lower() for s in self.currency_signals]:
score += 10
for u in units_found:
if u.lower() in [s.lower() for s in self.unit_signals]:
score += 5
for k in keywords_found:
if k.lower() in [s.lower() for s in self.keyword_signals]:
score += 3
return score


@dataclass
class NormalizedAssumption:
"""Assumption after domain-aware normalization."""
assumption_id: str
original_claim: str
normalized_claim: str
domain_id: str
currency: str # Normalized to domain default
currency_eur_equivalent: Optional[float] = None # For comparison
unit: str = "metric" # All converted to metric
confidence: ConfidenceLevel = ConfidenceLevel.medium
notes: List[str] = field(default_factory=list)


class DomainNormalizer:
"""Loads domain profiles and normalizes assumptions to metric/currency/confidence."""

def __init__(self, profiles_yaml_path: Optional[str] = None):
self.profiles: Dict[str, DomainProfile] = {}
self.default_profile = None

path = Path(profiles_yaml_path) if profiles_yaml_path else DOMAIN_PROFILES_PATH
self._load_profiles(path)

def _load_profiles(self, yaml_path: Path) -> None:
"""Load domain profiles from YAML file."""
if not yaml_path.exists():
LOGGER.warning(f"Domain profiles not found at {yaml_path}; using defaults")
self._create_default_profiles()
return

try:
with open(yaml_path, "r") as f:
content = f.read()
# Extract YAML from markdown code block
if "```yaml" in content:
yaml_start = content.index("```yaml") + 7
yaml_end = content.index("```", yaml_start)
yaml_str = content[yaml_start:yaml_end]
else:
yaml_str = content

data = yaml.safe_load(yaml_str)
if data and "profiles" in data:
for profile_dict in data["profiles"]:
profile = DomainProfile(profile_dict)
self.profiles[profile.id] = profile
if not self.default_profile:
self.default_profile = profile

LOGGER.info(f"Loaded {len(self.profiles)} domain profiles from {yaml_path}")
except Exception as e:
LOGGER.error(f"Error loading domain profiles: {e}; using defaults")
self._create_default_profiles()

def _create_default_profiles(self) -> None:
"""Create minimal default profiles if YAML not available."""
default_profile_dict = {
"id": "default",
"name": "General Business",
"description": "Default profile for unclassified projects.",
"currency": {"default": "USD", "aliases": ["usd", "$"]},
"units": {"metric": True, "convert": []},
"heuristics": {
"budget_keywords": ["budget", "cost"],
"timeline_keywords": ["days", "weeks"],
"team_keywords": ["team", "people"],
"confidence_keywords": {
"high": ["guarantee", "have done"],
"medium": ["plan to", "expect"],
"low": ["estimate", "maybe"],
},
},
"detection": {
"currency_signals": ["USD", "$"],
"unit_signals": [],
"keyword_signals": [],
},
}
self.default_profile = DomainProfile(default_profile_dict)
self.profiles["default"] = self.default_profile

def detect_domain(self, assumption: QuantifiedAssumption) -> DomainProfile:
"""Auto-detect domain profile from assumption metadata."""
# Extract signals from assumption
currency_found = []
if assumption.unit:
currency_found.append(assumption.unit)

units_found = []
if assumption.unit:
units_found.append(assumption.unit)

keywords_found = []
# Extract keywords from claim + evidence
claim_lower = assumption.claim.lower()
evidence_lower = (assumption.evidence or "").lower()
combined = f"{claim_lower} {evidence_lower}".split()

# Score all profiles
scores = {}
for profile_id, profile in self.profiles.items():
score = profile.score_match(currency_found, units_found, combined)
scores[profile_id] = score

# Pick highest scoring profile
if scores:
best_profile_id = max(scores, key=scores.get)
if scores[best_profile_id] > 0:
return self.profiles[best_profile_id]

return self.default_profile

def normalize_currency(
self, value: Optional[float], from_currency: str, to_profile: DomainProfile
) -> tuple[Optional[float], Optional[float]]:
"""
Convert currency to profile default.
Returns (normalized_value, eur_equivalent).
"""
if value is None:
return None, None

# Placeholder conversion rates (in production, use real FX API)
fx_rates = {
"USD": 0.92, # USD → EUR
"DKK": 0.124, # DKK → EUR
"EUR": 1.0,
}

# For now, assume all inputs are in the detected currency or profile default
normalized = value
eur_equiv = value * fx_rates.get(to_profile.default_currency, 1.0)

return normalized, eur_equiv

def normalize_unit(self, value: Optional[float], from_unit: str, to_profile: DomainProfile) -> Optional[float]:
"""Convert unit to metric (based on profile conversions)."""
if value is None or not from_unit:
return value

from_unit_lower = from_unit.lower()
if from_unit_lower in to_profile.unit_conversions:
conversion = to_profile.unit_conversions[from_unit_lower]
return value * conversion["factor"]

return value

def normalize_confidence(self, assumption: QuantifiedAssumption, domain: DomainProfile) -> ConfidenceLevel:
"""Re-assess confidence level based on domain keywords."""
claim_lower = assumption.claim.lower()
evidence_lower = (assumption.evidence or "").lower()
combined = f"{claim_lower} {evidence_lower}"

# Check high confidence
if any(word in combined for word in domain.high_confidence_words):
return ConfidenceLevel.high

# Check low confidence
if any(word in combined for word in domain.low_confidence_words):
return ConfidenceLevel.low

# Default to medium
return ConfidenceLevel.medium

def normalize(self, assumption: QuantifiedAssumption) -> NormalizedAssumption:
"""Normalize a QuantifiedAssumption to domain standards."""
domain = self.detect_domain(assumption)

# Normalize currency
norm_currency, eur_equiv = self.normalize_currency(assumption.lower_bound, assumption.unit or "", domain)

# Normalize unit (keep as "metric" for now)
norm_unit = "metric"

# Re-assess confidence per domain
norm_confidence = self.normalize_confidence(assumption, domain)

# Build normalized claim
norm_claim = f"{assumption.claim} [normalized to {domain.id} domain]"

notes = []
if domain.id != "default":
notes.append(f"Auto-detected domain: {domain.name}")

return NormalizedAssumption(
assumption_id=assumption.assumption_id,
original_claim=assumption.claim,
normalized_claim=norm_claim,
domain_id=domain.id,
currency=domain.default_currency,
currency_eur_equivalent=eur_equiv,
unit=norm_unit,
confidence=norm_confidence,
notes=notes,
)

def normalize_batch(self, assumptions: List[QuantifiedAssumption]) -> List[NormalizedAssumption]:
"""Normalize a batch of assumptions."""
return [self.normalize(assumption) for assumption in assumptions]
Loading