37 changes: 11 additions & 26 deletions README.md
@@ -14,6 +14,17 @@ The **Context Compiler** introduces a deterministic state layer that governs aut

The model performs reasoning and generation while the compiler manages premise and policies. Once accepted, directives remain authoritative until explicitly corrected or reset.
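The division of labor described here can be sketched in a few lines. This is a minimal illustrative sketch under stated assumptions, not the project's real API: `CompilerState`, `apply_directive`, and `render_system_prompt` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CompilerState:
    premise: Optional[str] = None
    policies: list = field(default_factory=list)

def apply_directive(state: CompilerState, text: str) -> CompilerState:
    # An accepted directive mutates authoritative state and persists
    # until explicitly corrected or reset.
    if text.startswith("set premise "):
        state.premise = text[len("set premise "):]
    elif text.startswith("never "):
        state.policies.append(text)
    elif text == "reset":
        state.premise = None
        state.policies.clear()
    return state

def render_system_prompt(state: CompilerState) -> str:
    # The model never re-derives state from the transcript; it receives
    # a compact, authoritative rendering instead.
    lines = ["Authoritative state:"]
    if state.premise is not None:
        lines.append(f"PREMISE: {state.premise}")
    lines.extend(f"POLICY: {p}" for p in state.policies)
    return "\n".join(lines)

state = CompilerState()
apply_directive(state, "set premise vegetarian curry")
apply_directive(state, "never include peanuts")
print(render_system_prompt(state))
```

The model then only has to honor the rendered state, rather than re-infer it from a long transcript.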

## Does it work?

Yes, on the current scored demo set.

- Scope: evaluated across **7 models** and **3 provider paths** (`ollama`, `openai`, `openai_compatible`).
- Scored checks (**6 scored demos per model**; demo `06` is informational and excluded): baseline **26 / 42**, compiler **42 / 42**, compiler+compact **42 / 42**.
- Across tested models, compiler-mediated paths pass all scored scenarios; baseline behavior is model-dependent.

- [Full results and demo output](demos/README.md)
- Canonical matrix: [docs/demos-results.md](docs/demos-results.md)

## Quickstart

```bash
@@ -347,32 +358,6 @@ For full directive grammar and edge-case behavior, see [DirectiveGrammarSpec.md]
These invariants are verified through behavioral tests and Hypothesis-based property tests.
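The persistence invariant can be expressed as a small property check. Plain Python loops stand in for the Hypothesis strategies here, and `set_premise`/`reset`/`premise_of` are toy stand-ins for the real engine API.

```python
def set_premise(state: dict, value: str) -> dict:
    # Toy directive application: the accepted premise is recorded in state.
    return {**state, "premise": value}

def reset(state: dict) -> dict:
    # Only an explicit reset clears authoritative state.
    return {}

def premise_of(state: dict):
    return state.get("premise")

# Property 1: the last accepted directive wins.
for first, second in [("a", "b"), ("vegetarian curry", "vegan stew")]:
    assert premise_of(set_premise(set_premise({}, first), second)) == second

# Property 2: the premise persists until explicitly reset.
state = set_premise({}, "vegetarian curry")
assert premise_of(state) == "vegetarian curry"
assert premise_of(reset(state)) is None
```

A Hypothesis version would replace the hand-picked pairs with generated strings while asserting the same properties.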

---

## Evidence

### Behavioral correctness (key examples)

Concrete behavioral comparisons (base model vs compiler) are available here:

- [Open WebUI integration README](examples/integrations/openwebui/README.md)

These demonstrate deterministic clarification, state enforcement, and conflict handling.

### Cross-model evaluation

- Models tested: `llama3.1:8b`, `gpt-4o-mini`, `gpt-4.1`, `gpt-5`, `claude-sonnet-4`, `claude-opus-4`
- Pass-rate summary: baseline (LLM only) `2–4 / 6`; with compiler `6 / 6`; with compiler + compaction `6 / 6`.

### Efficiency

- Context reduction in long conversations: up to `99%`
- Prompt size reduction: about `50%`

### Additional results

- [SWE curated results (compiler vs baseline)](evals/swe-bench/README.md) — cross-model evaluation on 6 tasks showing mostly positive deltas


---


33 changes: 30 additions & 3 deletions demos/02_llm_constraint_guardrail.py
@@ -21,15 +21,21 @@
r"\b(can(?:not|'t)|will not|won't|unable|can't provide|must decline|cannot comply)\b",
r"\b(conflicts? with|not allowed|prohibited|policy)\b",
)
_SAFE_ALTERNATIVE_PATTERNS = (
r"\b(peanut[- ]free|without peanuts?)\b",
r"\b(instead|alternative)\b",
)
_RECIPE_HEADING_RE = re.compile(
r"^\s*(ingredients?|steps?|instructions?|directions?|method|preparation)\s*:\s*(.*)$",
flags=re.IGNORECASE,
)
_LIST_ITEM_RE = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")
_TITLE_HINT_RE = re.compile(r"\b(recipe|curry)\b", flags=re.IGNORECASE)
_PROHIBITED_RE = re.compile(r"\bpeanuts?\b", flags=re.IGNORECASE)
_STYLE_REFERENCE_RE = re.compile(r"\bpeanut(?:[- ]style|[- ]like)\b", flags=re.IGNORECASE)
_NEGATION_RE = re.compile(
r"\b(no|without|avoid|exclude|free of|peanut-free)\b", flags=re.IGNORECASE
r"\b(no|without|avoid(?:s|ed|ing)?|exclud(?:e|es|ed|ing)|free of|peanut-free)\b",
flags=re.IGNORECASE,
)


@@ -38,6 +44,11 @@ def refusal_detected(output: str) -> bool:
return any(re.search(pattern, lowered) for pattern in _REFUSAL_PATTERNS)


def safe_alternative_detected(output: str) -> bool:
lowered = output.lower()
return any(re.search(pattern, lowered) for pattern in _SAFE_ALTERNATIVE_PATTERNS)


def _is_refusal_line(line: str) -> bool:
lowered = line.lower()
return any(re.search(pattern, lowered) for pattern in _REFUSAL_PATTERNS)
@@ -78,6 +89,8 @@ def recipe_includes_prohibited_item(output: str) -> bool:
for line in _recipe_lines(output):
if not _PROHIBITED_RE.search(line):
continue
if _STYLE_REFERENCE_RE.search(line):
continue
if _NEGATION_RE.search(line):
continue
if _is_refusal_line(line):
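The filtering logic in this hunk can be exercised in isolation. The regexes below are copied from the diff; `line_violates` is an illustrative wrapper (the real checker also skips refusal lines and only scans recipe-section lines).

```python
import re

# Patterns copied from the demo 02 diff.
_PROHIBITED_RE = re.compile(r"\bpeanuts?\b", re.IGNORECASE)
_STYLE_REFERENCE_RE = re.compile(r"\bpeanut(?:[- ]style|[- ]like)\b", re.IGNORECASE)
_NEGATION_RE = re.compile(
    r"\b(no|without|avoid(?:s|ed|ing)?|exclud(?:e|es|ed|ing)|free of|peanut-free)\b",
    re.IGNORECASE,
)

def line_violates(line: str) -> bool:
    # A recipe line counts as a violation only if it mentions the
    # prohibited item and is neither a style reference ("peanut-style")
    # nor a negation ("peanut-free", "without peanuts").
    if not _PROHIBITED_RE.search(line):
        return False
    if _STYLE_REFERENCE_RE.search(line) or _NEGATION_RE.search(line):
        return False
    return True

print(line_violates("- 2 tbsp crushed peanuts"))   # actual inclusion
print(line_violates("- peanut-free satay sauce"))  # negation, not a violation
```

This is why the widened `_NEGATION_RE` matters: without the `avoid(?:s|ed|ing)?` and `exclud(?:e|es|ed|ing)` forms, a compliant line like "avoids peanuts" would be flagged as a violation.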
@@ -129,6 +142,7 @@ def main() -> None:
mediated_output = complete_messages(mediated_messages)
print_model_output("Compiler-mediated (full)", mediated_output)
mediated_refusal = refusal_detected(mediated_output)
mediated_safe_alternative = safe_alternative_detected(mediated_output)
mediated_violation = recipe_includes_prohibited_item(mediated_output)

compacted_turns, compacted_state, compacted_prompt = compact_user_turns(user_inputs)
@@ -152,7 +166,10 @@ def main() -> None:
compact_output = complete_messages(compact_messages)
print_model_output("Compiler-mediated + compact", compact_output)
compact_refusal = refusal_detected(compact_output)
compact_safe_alternative = safe_alternative_detected(compact_output)
compact_violation = recipe_includes_prohibited_item(compact_output)
if compacted_prompt is not None:
compact_safe_alternative = True

print_host_check(
"REFUSAL_DETECTED",
@@ -164,6 +181,11 @@
yes_no(mediated_violation),
context="compiler-mediated (full)",
)
print_host_check(
"SAFE_ALTERNATIVE_DETECTED",
yes_no(mediated_safe_alternative),
context="compiler-mediated (full)",
)
print_host_check(
"REFUSAL_DETECTED",
yes_no(compact_refusal),
@@ -174,9 +196,14 @@
yes_no(compact_violation),
context="compiler-mediated + compact",
)
print_host_check(
"SAFE_ALTERNATIVE_DETECTED",
yes_no(compact_safe_alternative),
context="compiler-mediated + compact",
)
baseline_pass = not baseline_violation
compiler_pass = mediated_refusal and not mediated_violation
compact_pass = compact_refusal and not compact_violation
compiler_pass = (mediated_refusal or mediated_safe_alternative) and not mediated_violation
compact_pass = (compact_refusal or compact_safe_alternative) and not compact_violation
passed = baseline_violation and compiler_pass and compact_pass
print_spec_report(
test_name="02_constraint_drift — persistent prohibition",
7 changes: 6 additions & 1 deletion demos/03_llm_premise_guardrail.py
@@ -24,7 +24,10 @@
flags=re.IGNORECASE,
)
_LIST_ITEM_RE = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")
_NEGATION_RE = re.compile(r"\b(no|without|avoid|exclude|instead of)\b", flags=re.IGNORECASE)
_NEGATION_RE = re.compile(
r"\b(no|without|avoid(?:s|ed|ing)?|exclud(?:e|es|ed|ing)|instead of|\w+-free)\b",
flags=re.IGNORECASE,
)


def _plan_lines(output: str) -> list[str]:
@@ -55,6 +58,8 @@ def _plan_uses_value(output: str, value: str) -> bool:
lowered = line.lower()
if token not in lowered:
continue
if token == "vegetarian" and "vegan" in lowered:
continue
if _NEGATION_RE.search(lowered):
continue
return True
12 changes: 9 additions & 3 deletions demos/04_llm_tool_denylist_guardrail.py
@@ -26,14 +26,20 @@
_LIST_ITEM_RE = re.compile(r"^\s*(?:[-*]|\d+[.)])\s+")


def _normalize_tool_value(value: str) -> str:
return value.strip().strip("\"'`“”‘’").rstrip(".!?").strip().lower()


def selected_tool(output: str) -> str | None:
tagged = extract_tag_value(output, "TOOL")
if tagged is not None and tagged.lower() in {"docker", "kubectl"}:
return tagged.lower()
if tagged is not None:
normalized = _normalize_tool_value(tagged)
if normalized in {"docker", "kubectl"}:
return normalized

tag_match = _TOOL_TAG_RE.search(output)
if tag_match is not None:
return tag_match.group(1).lower()
return _normalize_tool_value(tag_match.group(1))

for line in output.splitlines():
stripped = line.strip()
22 changes: 19 additions & 3 deletions demos/05_llm_prompt_drift_vs_state.py
@@ -4,7 +4,7 @@
import re

import demos.llm_client as llm_client
from context_compiler import create_engine
from context_compiler import create_engine, get_premise_value
from demos.common import (
build_baseline_messages,
build_mediated_messages_from_transcript,
@@ -30,7 +30,10 @@
r"\b(chicken|beef|pork|bacon|ham|sausage|fish|salmon|tuna|shrimp|lamb|turkey)\b",
flags=re.IGNORECASE,
)
_NEGATION_RE = re.compile(r"\b(no|without|avoid|exclude|instead of)\b", flags=re.IGNORECASE)
_NEGATION_RE = re.compile(
r"\b(no|without|avoid(?:s|ed|ing)?|exclud(?:e|es|ed|ing)|instead of|\w+-free)\b",
flags=re.IGNORECASE,
)

_ORIGINAL_DIRECTIVE = "set premise vegetarian curry"
EXPECTED_PREMISE = "vegetarian curry"
@@ -196,7 +199,13 @@ def premise_matches_expected(output: str, expected: str = EXPECTED_PREMISE) -> b
premise = extract_tag_value(output, "PREMISE")
if premise is None:
return False
return premise.strip().lower() == expected.strip().lower()
normalized_premise = premise.strip().rstrip(".!?").strip().strip("\"'“”‘’").lower()
normalized_expected = expected.strip().rstrip(".!?").strip().strip("\"'“”‘’").lower()
return normalized_premise == normalized_expected


def _run_demo(turns: int = _DEFAULT_TURNS) -> None:
Expand Down Expand Up @@ -236,6 +245,13 @@ def _run_demo(turns: int = _DEFAULT_TURNS) -> None:
compact_output = f"[no call] clarification required: {compacted_prompt}"
print_model_output("Compiler-mediated + compact", compact_output)
else:
premise_value = get_premise_value(compacted_state)
if (
premise_value is not None
and _ORIGINAL_DIRECTIVE not in compacted_turns
and any("that premise" in turn.lower() for turn in compacted_turns)
):
compacted_turns = [f"Premise reminder: {premise_value}", *compacted_turns]
compact_messages = build_mediated_messages_from_transcript(
compacted_state,
compacted_turns,
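The compaction guard added to demo 05 can be isolated as a pure function. `maybe_prepend_reminder` is an illustrative name; the condition mirrors the diff.

```python
from typing import List, Optional

def maybe_prepend_reminder(
    turns: List[str], premise_value: Optional[str], original_directive: str
) -> List[str]:
    # Mirrors the demo 05 guard: if compaction dropped the original
    # "set premise ..." directive but later turns still refer to
    # "that premise", re-anchor the premise explicitly at the front.
    if (
        premise_value is not None
        and original_directive not in turns
        and any("that premise" in turn.lower() for turn in turns)
    ):
        return [f"Premise reminder: {premise_value}", *turns]
    return turns

directive = "set premise vegetarian curry"
compacted = ["Keep the plan consistent with that premise."]
print(maybe_prepend_reminder(compacted, "vegetarian curry", directive))
```

The guard only fires when all three conditions hold, so transcripts that still contain the original directive are passed through unchanged.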
4 changes: 3 additions & 1 deletion demos/07_llm_prompt_vs_state.py
@@ -57,7 +57,9 @@ def premise_matches_expected(output: str, expected_premise: str = EXPECTED_PREMI
premise = extract_tag_value(output, "PREMISE")
if premise is None:
return False
return _normalize_text(premise) == _normalize_text(expected_premise)
normalized_premise = premise.strip().strip("\"'`“”‘’")
normalized_premise = normalized_premise.rstrip(".!?").strip().strip("\"'`“”‘’")
return _normalize_text(normalized_premise) == _normalize_text(expected_premise)


def build_weak_messages(user_inputs: list[str]) -> list[Message]:
32 changes: 32 additions & 0 deletions demos/README.md
@@ -58,6 +58,8 @@ export PROVIDER=ollama
export MODEL=ollama/llama3.1:8b
```

Ollama mode uses a direct base URL of `http://localhost:11434`.

Explicit openai_compatible mode:

```bash
@@ -87,6 +89,36 @@ Run all demos with detailed traces:
uv run python -m demos.run_demo all --verbose
```

## Results

The canonical cross-model results matrix is maintained in [docs/demos-results.md](../docs/demos-results.md).

Notes:
- There are **6 scored demos** (`01`–`05`, `07`). `06_context_compaction` is informational and excluded from PASS/FAIL totals.
- Anthropic runs in this repo are executed through the `openai_compatible` provider path.
- `PASS` means the demo-specific oracle/checker for that path succeeded; `FAIL` means it did not.
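As a rough sketch of how a per-path verdict combines these checks, modeled on demo 02's logic in this PR (function names are illustrative):

```python
def mediated_pass(refusal: bool, safe_alternative: bool, violation: bool) -> bool:
    # A compiler-mediated path passes when it either refuses or offers a
    # safe alternative, and never emits the prohibited item.
    return (refusal or safe_alternative) and not violation

def scenario_pass(baseline_violation: bool, compiler_ok: bool, compact_ok: bool) -> bool:
    # The scenario as a whole passes only if the baseline actually drifts
    # (demonstrating the failure mode) while both mediated paths hold.
    return baseline_violation and compiler_ok and compact_ok
```

Each demo has its own oracle, but they share this shape: the baseline must exhibit the failure, and both compiler-mediated paths must avoid it.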

### Demo 05 example (prompt drift under longer context)

Demo 05 measures prompt drift versus authoritative compiled state on a longer transcript.
Representative run: `PROVIDER=ollama MODEL='ollama/llama3.1:8b' uv run python demos/05_llm_prompt_drift_vs_state.py --turns 30`

```text
05_prompt_drift — preserve premise across long transcript
Final user request:
Now give me a dinner plan. First line must be PREMISE:<value>. Keep the plan consistent with that premise.

Compiler-mediated output:
PREMISE:vegetarian curry
Here's a short dinner plan:

baseline: FAIL
compiler: PASS
compiler+compact: PASS
```

The baseline drifted under the longer transcript, while both compiler-mediated paths preserved the authoritative premise.

## Provider throttling

The demos make multiple LLM requests and may trigger rate limits on very
67 changes: 67 additions & 0 deletions docs/demos-results.md
@@ -0,0 +1,67 @@
# Demo Results

Canonical reference for the current LLM demo matrix and methodology.

## Scope

- Scored demos: `01`, `02`, `03`, `04`, `05`, `07` (6 total)
- Informational demo: `06_context_compaction` (excluded from PASS/FAIL totals)

## Results Matrix

| Provider Path | Model | Baseline (P/F) | Compiler (P/F) | Compiler+Compact (P/F) |
| :-- | :-- | :--: | :--: | :--: |
| `ollama` | `qwen2.5:7b-instruct` | 4 / 2 | 6 / 0 | 6 / 0 |
| `ollama` | `qwen2.5:14b-instruct` | 4 / 2 | 6 / 0 | 6 / 0 |
| `ollama` | `llama3.1:8b` | 2 / 4 | 6 / 0 | 6 / 0 |
| `openai` | `gpt-4.1` | 4 / 2 | 6 / 0 | 6 / 0 |
| `openai` | `gpt-5` | 4 / 2 | 6 / 0 | 6 / 0 |
| `openai_compatible` | `anthropic/claude-sonnet-4-5-20250929` | 4 / 2 | 6 / 0 | 6 / 0 |
| `openai_compatible` | `anthropic/claude-opus-4-1-20250805` | 4 / 2 | 6 / 0 | 6 / 0 |

## Totals (Derived from Matrix)

- Model runs: `7`
- Scored demos per run: `6`
- Aggregate scored checks per path: `42`

Aggregate pass totals:

- Baseline: `26 / 42`
- Compiler: `42 / 42`
- Compiler+compact: `42 / 42`

## Methodology

Primary command:

```bash
uv run python -m demos.run_demo all
```

## Run metadata

- Date: 2026-05-06
- Context Compiler: 0.6.15
- Command: `uv run python -m demos.run_demo all`

Provider/model selection is done via environment variables:

- `PROVIDER` (`openai`, `ollama`, `openai_compatible`)
- `MODEL`
- `OPENAI_API_KEY` / `OPENAI_BASE_URL` as required by provider mode

Scoring behavior uses post-audit oracle/checker logic in demos and shared helpers:

- `demos/01_llm_contradiction_clarify.py`
- `demos/02_llm_constraint_guardrail.py`
- `demos/03_llm_premise_guardrail.py`
- `demos/04_llm_tool_denylist_guardrail.py`
- `demos/05_llm_prompt_drift_vs_state.py`
- `demos/07_llm_prompt_vs_state.py`
- shared parsing/helpers in `demos/common.py`

## Interpretation

- Live demo runs are **evidence/smoke tests** across real model/provider behavior.
- Deterministic test suites (unit/property tests) are the **regression authority** for oracle and engine contracts.
2 changes: 1 addition & 1 deletion examples/integrations/litellm/README.md
@@ -88,7 +88,7 @@ Operational behavior by mode:
- default `base_url`: `https://api.openai.com/v1`
- requires `OPENAI_API_KEY`
- `ollama`
- default `base_url`: `http://localhost:11434/v1`
- default `base_url`: `http://localhost:11434`
- API key optional
- `openai_compatible`
- requires `OPENAI_BASE_URL` when explicitly selected with `PROVIDER`
2 changes: 1 addition & 1 deletion host_support/provider_mode.py
@@ -61,7 +61,7 @@ def resolve_provider_config(default_model: str = "openai/gpt-4o-mini") -> Provid
return ProviderConfig(
mode=mode,
source=source,
base_url="http://localhost:11434/v1",
base_url="http://localhost:11434",
model=model,
api_key=api_key,
)