
Baselines #19

Open
Vishak-Bhat30 wants to merge 18 commits into main from baselines

Conversation

@Vishak-Bhat30
Collaborator

ToT + Gt + BoK

Vishak-Bhat30 and others added 18 commits February 17, 2026 10:24
The model was interpreting 'directly to the left' as 'immediately
adjacent', and argued with the verifier across all 5 correction attempts.
The dataset defines it as a general compass direction (west = same row,
lower column, regardless of distance or walls).

Updated feedback to explicitly state:
- This is about GENERAL COMPASS DIRECTION, not adjacency
- Do NOT consider adjacency or walls
- Just compare row/col coordinates
- This is the verified correct answer — do not argue

Result: 3/3 previously-failing RP examples now correct on first correction.
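The coordinate-comparison rule the updated feedback enforces can be sketched as follows (a hypothetical helper written for illustration, not the repo's actual verifier code; positions are assumed to be (row, col) tuples):

```python
# Sketch of the compass-direction rule described above: "X is west of Y"
# holds whenever X shares Y's row and has a lower column, regardless of
# distance or walls. Adjacency is deliberately NOT considered.
def is_direction(pos_x, pos_y, direction):
    """Return True if pos_x lies in the given compass direction
    relative to pos_y. Both positions are (row, col) tuples."""
    rx, cx = pos_x
    ry, cy = pos_y
    if direction == "west":
        return rx == ry and cx < cy
    if direction == "east":
        return rx == ry and cx > cy
    if direction == "north":
        return cx == cy and rx < ry
    if direction == "south":
        return cx == cy and rx > ry
    raise ValueError(f"unknown direction: {direction}")
```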
- Rewrite thinking-phase prompt to instruct model to use full entity
  names and full direction words (no abbreviations) in the exact
  'X is to the [direction] of Y' format
- Pre-fill STEP 1 from parsed relations so model jumps to STEP 2
- Add abbreviation expansion (NE/NW/SE/SW → full words) in
  parse_directional_claims_from_text before regex matching
- Strip square brackets around entity names ([Foo] → Foo)
- Verified claims now 11-17 per example (was 0 before)
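The abbreviation expansion and bracket stripping described in the bullets above might look roughly like this (function and table names are hypothetical, not the actual internals of parse_directional_claims_from_text):

```python
import re

# Sketch of the claim-text preprocessing: expand NE/NW/SE/SW to full
# direction words and strip square brackets around entity names, so the
# downstream 'X is to the [direction] of Y' regex can match.
_ABBREV = {"NE": "northeast", "NW": "northwest",
           "SE": "southeast", "SW": "southwest"}

def normalize_claim_text(text):
    # [Foo] -> Foo
    text = re.sub(r"\[([^\]]+)\]", r"\1", text)
    # Whole-word abbreviation expansion (avoids touching e.g. "NEW")
    return re.sub(r"\b(NE|NW|SE|SW)\b",
                  lambda m: _ABBREV[m.group(1)], text)
```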
…oning

- extract_solution: only search for \boxed{} after </think>, not inside thinking trace
- extract_solution: return None if <think> opened but never closed (token limit hit)
- extract_solution: return None for empty \boxed{} (from verifier feedback prompts)
- extract_solution: strip trailing '= 24' from expressions
- extract_solution: add \left/\right LaTeX cleanup
- _find_complete_boxed: brace-counting helper for nested LaTeX (e.g. \frac{}{})
  replaces naive regex in thinkingPhaseVerifier.py (4 sites) and stepVerifier.py (2 sites)
- Soundness: exclude 'no solution' and 'no expression found' cases
- Soundness formula: correct / (total - excluded) instead of correct / attempted
- Add 'excluded' column to CSV output
- Switch model config to microsoft/Phi-4-reasoning with Phi-4 ChatML format
- Projected metrics: Accuracy 1062/1362 (77.97%), Soundness 1062/1062 (100%)
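The brace-counting approach behind a helper like _find_complete_boxed can be sketched as follows (a reconstruction for illustration, not the repo's exact code). A naive regex such as r"\\boxed\{([^}]*)\}" stops at the first closing brace, which truncates nested LaTeX like \boxed{\frac{24}{1}}:

```python
# Locate the first complete \boxed{...} span by counting braces, so
# nested LaTeX groups are captured whole.
def find_complete_boxed(text):
    """Return the contents of the first balanced \\boxed{...} span,
    or None if no complete span exists."""
    start = text.find(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    begin = i
    depth = 1
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
        i += 1
    return None  # opened but never closed (e.g. token limit hit)
```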
Copilot AI review requested due to automatic review settings March 17, 2026 15:49

Copilot AI left a comment


Pull request overview

This PR expands the verifier-guided inference stack with “thinking-phase” (side-stream) verification monitors for Maze/Game24/SpatialMap, improves parsing/feedback in existing verifiers, and updates example runners and docs to use the new monitor flow.

Changes:

  • Add thinking-phase verifier monitors (Game24, Maze, SpatialMap) plus shared helpers for robust \boxed{} detection.
  • Extend SpatialMap and Maze verifiers with new parsing/normalization helpers and improved feedback/error messaging.
  • Update step-verifier monitors, stream interjection robustness, and example scripts/READMEs to use the new architecture and reporting.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
interwhen/utils/spatialmap_verifier.py Adds counting/direction/object question parsing + Z3-based possibility checks and abbreviation handling for parsed claims.
interwhen/utils/maze_verifier.py Adds direction/turn-type normalization and improves maze parsing and verifier feedback messages.
interwhen/monitors/thinkingPhaseVerifierMaze.py New monitor that verifies maze tracing during <think> via side-streams and then enforces/verifies a structured template after </think>.
interwhen/monitors/thinkingPhaseVerifierGame24.py New monitor that checks partial expressions during <think> via side-streams and validates the final boxed expression post-</think>.
interwhen/monitors/stepVerifier.py Enhances Maze relative-position verification and adds SpatialMap final-answer verification for direction/object/counting questions; uses robust boxed parsing helper.
interwhen/monitors/_common.py New shared helper to locate complete \boxed{...} spans while handling nested braces.
interwhen/monitors/__init__.py Exports the new thinking-phase monitors.
interwhen/interject.py Makes SSE chunk parsing more robust and adds an early return when a monitor marks the final answer as correct.
examples/TTSwithVerification/tot_baseline.py New Tree-of-Thought baseline runner across multiple datasets with evaluation utilities.
examples/TTSwithVerification/spatialmeta.py New SpatialMap experiment script using the monitor-based architecture.
examples/TTSwithVerification/spatialmap_stepverifier.py Updates SpatialMap experiment to use the thinking-phase verifier and adds CSV/summary reporting.
examples/TTSwithVerification/mazemeta.py New Maze experiment script using the monitor-based architecture.
examples/TTSwithVerification/maze_stepverifier.py Updates Maze experiment to use the thinking-phase verifier and adds CSV/summary reporting.
examples/TTSwithVerification/game24meta.py New Game24 experiment script using the step-verifier monitor and adds output management/reporting.
examples/TTSwithVerification/game24_stepverifier.py Updates Game24 experiment to use the thinking-phase verifier and improves boxed extraction robustness.
examples/README.md Updates example vLLM invocation (tensor parallel change).
examples/EarlyStopping/spatialmap_example.py Refactors MCQ extraction/evaluation and updates output directory defaults.
examples/EarlyStopping/maze_example.py Refactors MCQ extraction/evaluation and updates dataset/config defaults.
examples/EarlyStopping/game24_example.py Improves boxed extraction robustness and sampling config for early-stopping experiments.
README.md Updates the top-level “Set up target LLM server” example configuration.


Comment on lines +191 to +240
    However, since every constraint in the SpatialMap dataset is diagonal
    (NE/NW/SE/SW), no two objects can share an x- or y-coordinate.
    Therefore the strict-cardinal count is always **0** whenever the
    problem only has diagonal constraints — which is exactly the
    ground-truth expectation.

    For diagonal directions:
    - "northeast" → higher x AND higher y
    - "northwest" → lower x AND higher y
    - "southeast" → higher x AND lower y
    - "southwest" → lower x AND lower y

    Returns the count, or ``None`` if the solver cannot determine it
    (e.g. reference entity not found).
    """
    direction = direction.lower().strip()

    # Resolve the reference entity's variable names
    ref_x_key = f"{reference}_x"
    ref_y_key = f"{reference}_y"
    if ref_x_key not in self.entities:
        # Try fuzzy match — dataset names may differ in whitespace
        for key in self.entities:
            if key.endswith("_x") and reference.lower() in key.lower():
                ref_x_key = key
                ref_y_key = key.replace("_x", "_y")
                reference = key[:-2]
                break
        else:
            return None

    ref_x = self.entities[ref_x_key]
    ref_y = self.entities[ref_y_key]

    # Collect all other entity names (unique base names)
    all_entities = set()
    for key in self.entities:
        if key.endswith("_x"):
            ename = key[:-2]
            if ename != reference:
                all_entities.add(ename)

    # Determine x/y constraints for the direction
    is_cardinal = direction in ("north", "south", "east", "west")

    # Since all given constraints are strictly diagonal, any pair of
    # objects cannot share the same x- or y-coordinate. Cardinal
    # directions require an exact match on one axis, which is impossible.
    if is_cardinal:
        return 0
Comment on lines +729 to +732
    direction = direction.lower().strip()
    if direction in ('north', 'south', 'east', 'west'):
        return (0, 0)  # cardinal → always 0 with diagonal-only constraints

Comment on lines +110 to +115
    try:
        value = eval(expr_str, {"__builtins__": None}, {})
        value = float(value)
    except Exception as e:
        errors.append(f"Cannot evaluate expression '{expr_str}': {e}")
        return "error", False, errors, None
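The snippet above guards eval() by removing builtins, which blocks most but not all abuse. A stricter alternative would whitelist arithmetic AST nodes and reject everything else; this is a sketch of that idea, not the project's implementation:

```python
import ast
import operator

# Evaluate an arithmetic expression by walking its AST, permitting only
# numeric constants, the four basic operations, and unary minus. Any
# other syntax (calls, names, attributes, ...) raises ValueError.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}

def safe_eval(expr_str):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return float(walk(ast.parse(expr_str, mode="eval")))
```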
