The model was interpreting 'directly to the left' as 'immediately adjacent', arguing with the verifier across all 5 correction attempts. The dataset defines it as general compass direction (west = same row, lower column, regardless of distance/walls). Updated feedback to explicitly state:
- This is about GENERAL COMPASS DIRECTION, not adjacency
- Do NOT consider adjacency or walls
- Just compare row/col coordinates
- This is the verified correct answer — do not argue

Result: 3/3 previously-failing RP examples now correct on first correction.
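The dataset convention described above reduces to a pure coordinate comparison. A minimal sketch (hypothetical helper name, `(row, col)` tuples assumed — not the repo's actual code):

```python
def is_directly_left(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Dataset convention: 'A is directly to the left of B' means A is
    WEST of B — same row, lower column — regardless of distance or walls."""
    (row_a, col_a), (row_b, col_b) = a, b
    return row_a == row_b and col_a < col_b

# (2, 0) is directly left of (2, 5) even though they are not adjacent:
print(is_directly_left((2, 0), (2, 5)))  # True
print(is_directly_left((2, 4), (2, 5)))  # True (adjacency is irrelevant)
print(is_directly_left((1, 0), (2, 5)))  # False (different row)
```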
- Rewrite thinking-phase prompt to instruct the model to use full entity names and full direction words (no abbreviations) in the exact 'X is to the [direction] of Y' format
- Pre-fill STEP 1 from parsed relations so the model jumps to STEP 2
- Add abbreviation expansion (NE/NW/SE/SW → full words) in parse_directional_claims_from_text before regex matching
- Strip square brackets around entity names ([Foo] → Foo)
- Verified claims now 11-17 per example (was 0 before)
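The abbreviation expansion and bracket stripping described above amount to a text-normalization pass before the claim regex runs. A sketch of that preprocessing (function name and regexes are illustrative, not the actual `parse_directional_claims_from_text` internals):

```python
import re

ABBREVIATIONS = {"NE": "northeast", "NW": "northwest",
                 "SE": "southeast", "SW": "southwest"}

def normalize_claim_text(text: str) -> str:
    """Expand direction abbreviations and strip [brackets] around entity
    names so 'X is to the <direction> of Y' patterns match reliably."""
    # [Foo] -> Foo
    text = re.sub(r"\[([^\]]+)\]", r"\1", text)
    # NE/NW/SE/SW -> full words (whole tokens only, via word boundaries)
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b", full, text)
    return text

print(normalize_claim_text("[Cafe] is to the NE of [Bank]"))
# Cafe is to the northeast of Bank
```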
- extract_solution: only search for \boxed{} after </think>, not inside thinking trace
- extract_solution: return None if <think> opened but never closed (token limit hit)
- extract_solution: return None for empty \boxed{} (from verifier feedback prompts)
- extract_solution: strip trailing '= 24' from expressions
- extract_solution: add \left/\right LaTeX cleanup
- _find_complete_boxed: brace-counting helper for nested LaTeX (e.g. \frac{}{})
replaces naive regex in thinkingPhaseVerifier.py (4 sites) and stepVerifier.py (2 sites)
- Soundness: exclude 'no solution' and 'no expression found' cases
- Soundness formula: correct / (total - excluded) instead of correct / attempted
- Add 'excluded' column to CSV output
- Switch model config to microsoft/Phi-4-reasoning with Phi-4 ChatML format
- Projected metrics: Accuracy 1062/1362 (77.97%), Soundness 1062/1062 (100%)
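The soundness change above moves excluded cases ('no solution', 'no expression found') out of the denominator. A sketch of the formula (standalone function; the 300-excluded figure below is an assumption inferred from the projected metrics, not a reported number):

```python
def soundness(correct: int, total: int, excluded: int) -> float:
    """Soundness over verifiable cases only: examples with no solution
    or no extractable expression are removed from the denominator."""
    attempted = total - excluded
    if attempted == 0:
        return 0.0
    return correct / attempted

# If 1062 of 1362 are correct and the remaining 300 are all excluded,
# soundness is 1062/1062 = 100%, matching the projected metrics.
print(soundness(1062, 1362, 300))  # 1.0
```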
Pull request overview
This PR expands the verifier-guided inference stack with “thinking-phase” (side-stream) verification monitors for Maze/Game24/SpatialMap, improves parsing/feedback in existing verifiers, and updates example runners and docs to use the new monitor flow.
Changes:
- Add thinking-phase verifier monitors (Game24, Maze, SpatialMap) plus shared helpers for robust \boxed{} detection.
- Extend SpatialMap and Maze verifiers with new parsing/normalization helpers and improved feedback/error messaging.
- Update step-verifier monitors, stream interjection robustness, and example scripts/READMEs to use the new architecture and reporting.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| interwhen/utils/spatialmap_verifier.py | Adds counting/direction/object question parsing + Z3-based possibility checks and abbreviation handling for parsed claims. |
| interwhen/utils/maze_verifier.py | Adds direction/turn-type normalization and improves maze parsing and verifier feedback messages. |
| interwhen/monitors/thinkingPhaseVerifierMaze.py | New monitor that verifies maze tracing during <think> via side-streams and then enforces/verifies a structured template after </think>. |
| interwhen/monitors/thinkingPhaseVerifierGame24.py | New monitor that checks partial expressions during <think> via side-streams and validates the final boxed expression post-</think>. |
| interwhen/monitors/stepVerifier.py | Enhances Maze relative-position verification and adds SpatialMap final-answer verification for direction/object/counting questions; uses robust boxed parsing helper. |
| interwhen/monitors/_common.py | New shared helper to locate complete \boxed{...} spans while handling nested braces. |
| interwhen/monitors/__init__.py | Exports the new thinking-phase monitors. |
| interwhen/interject.py | Makes SSE chunk parsing more robust and adds an early return when a monitor marks the final answer as correct. |
| examples/TTSwithVerification/tot_baseline.py | New Tree-of-Thought baseline runner across multiple datasets with evaluation utilities. |
| examples/TTSwithVerification/spatialmeta.py | New SpatialMap experiment script using the monitor-based architecture. |
| examples/TTSwithVerification/spatialmap_stepverifier.py | Updates SpatialMap experiment to use the thinking-phase verifier and adds CSV/summary reporting. |
| examples/TTSwithVerification/mazemeta.py | New Maze experiment script using the monitor-based architecture. |
| examples/TTSwithVerification/maze_stepverifier.py | Updates Maze experiment to use the thinking-phase verifier and adds CSV/summary reporting. |
| examples/TTSwithVerification/game24meta.py | New Game24 experiment script using the step-verifier monitor and adds output management/reporting. |
| examples/TTSwithVerification/game24_stepverifier.py | Updates Game24 experiment to use the thinking-phase verifier and improves boxed extraction robustness. |
| examples/README.md | Updates example vLLM invocation (tensor parallel change). |
| examples/EarlyStopping/spatialmap_example.py | Refactors MCQ extraction/evaluation and updates output directory defaults. |
| examples/EarlyStopping/maze_example.py | Refactors MCQ extraction/evaluation and updates dataset/config defaults. |
| examples/EarlyStopping/game24_example.py | Improves boxed extraction robustness and sampling config for early-stopping experiments. |
| README.md | Updates the top-level “Set up target LLM server” example configuration. |
Comment on lines +191 to +240
```python
However, since every constraint in the SpatialMap dataset is diagonal
(NE/NW/SE/SW), no two objects can share an x- or y-coordinate.
Therefore the strict-cardinal count is always **0** whenever the
problem only has diagonal constraints — which is exactly the
ground-truth expectation.

For diagonal directions:
- "northeast" → higher x AND higher y
- "northwest" → lower x AND higher y
- "southeast" → higher x AND lower y
- "southwest" → lower x AND lower y

Returns the count, or ``None`` if the solver cannot determine it
(e.g. reference entity not found).
"""
direction = direction.lower().strip()

# Resolve the reference entity's variable names
ref_x_key = f"{reference}_x"
ref_y_key = f"{reference}_y"
if ref_x_key not in self.entities:
    # Try fuzzy match — dataset names may differ in whitespace
    for key in self.entities:
        if key.endswith("_x") and reference.lower() in key.lower():
            ref_x_key = key
            ref_y_key = key.replace("_x", "_y")
            reference = key[:-2]
            break
    else:
        return None

ref_x = self.entities[ref_x_key]
ref_y = self.entities[ref_y_key]

# Collect all other entity names (unique base names)
all_entities = set()
for key in self.entities:
    if key.endswith("_x"):
        ename = key[:-2]
        if ename != reference:
            all_entities.add(ename)

# Determine x/y constraints for the direction
is_cardinal = direction in ("north", "south", "east", "west")

# Since all given constraints are strictly diagonal, any pair of
# objects cannot share the same x- or y-coordinate. Cardinal
# directions require an exact match on one axis, which is impossible.
if is_cardinal:
    return 0
```
Comment on lines +729 to +732

```python
direction = direction.lower().strip()
if direction in ('north', 'south', 'east', 'west'):
    return (0, 0)  # cardinal → always 0 with diagonal-only constraints
```
Comment on lines +110 to +115

```python
try:
    value = eval(expr_str, {"__builtins__": None}, {})
    value = float(value)
except Exception as e:
    errors.append(f"Cannot evaluate expression '{expr_str}': {e}")
    return "error", False, errors, None
```
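The snippet above sandboxes `eval` by nulling `__builtins__`, which blocks name lookups but still accepts non-arithmetic syntax. A stricter alternative, sketched here as a suggestion rather than the repo's actual approach, walks the AST and allows only numeric literals and arithmetic operators:

```python
import ast
import operator

# Permitted binary/unary operators for arithmetic-only expressions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Evaluate an arithmetic expression, rejecting anything that is not
    a number, a supported binary op, or unary minus."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"Disallowed syntax: {type(node).__name__}")
    return float(walk(ast.parse(expr, mode="eval").body))

print(safe_eval("(10 - 4) * (8 / 2)"))  # 24.0
```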
ToT + Gt + BoK