
Baselines #19

Open
Vishak-Bhat30 wants to merge 18 commits into main from baselines

Conversation

@Vishak-Bhat30
Collaborator

ToT + Gt + BoK

Vishak-Bhat30 and others added 18 commits February 17, 2026 10:24
The model was interpreting 'directly to the left' as 'immediately
adjacent', and argued with the verifier across all 5 correction attempts.
The dataset defines it as a general compass direction (west = same row,
lower column, regardless of distance or walls).

Updated feedback to explicitly state:
- This is about GENERAL COMPASS DIRECTION, not adjacency
- Do NOT consider adjacency or walls
- Just compare row/col coordinates
- This is the verified correct answer — do not argue

Result: 3/3 previously-failing RP examples now correct on first correction.
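The coordinate-comparison rule the updated feedback enforces can be sketched as follows (a hypothetical helper written for illustration, not the repo's actual verifier code; positions are assumed to be (row, col) tuples):

```python
# Sketch of the compass-direction rule described above: "X is west of Y"
# holds whenever X shares Y's row and has a lower column, regardless of
# distance or walls. Adjacency is deliberately NOT considered.
def is_direction(pos_x, pos_y, direction):
    """Return True if pos_x lies in the given compass direction
    relative to pos_y. Both positions are (row, col) tuples."""
    rx, cx = pos_x
    ry, cy = pos_y
    if direction == "west":
        return rx == ry and cx < cy
    if direction == "east":
        return rx == ry and cx > cy
    if direction == "north":
        return cx == cy and rx < ry
    if direction == "south":
        return cx == cy and rx > ry
    raise ValueError(f"unknown direction: {direction}")
```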
- Rewrite thinking-phase prompt to instruct model to use full entity
  names and full direction words (no abbreviations) in the exact
  'X is to the [direction] of Y' format
- Pre-fill STEP 1 from parsed relations so model jumps to STEP 2
- Add abbreviation expansion (NE/NW/SE/SW → full words) in
  parse_directional_claims_from_text before regex matching
- Strip square brackets around entity names ([Foo] → Foo)
- Verified claims now 11-17 per example (was 0 before)
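The abbreviation expansion and bracket stripping described in the bullets above might look roughly like this (function and table names are hypothetical, not the actual internals of parse_directional_claims_from_text):

```python
import re

# Sketch of the claim-text preprocessing: expand NE/NW/SE/SW to full
# direction words and strip square brackets around entity names, so the
# downstream 'X is to the [direction] of Y' regex can match.
_ABBREV = {"NE": "northeast", "NW": "northwest",
           "SE": "southeast", "SW": "southwest"}

def normalize_claim_text(text):
    # [Foo] -> Foo
    text = re.sub(r"\[([^\]]+)\]", r"\1", text)
    # Whole-word abbreviation expansion (avoids touching e.g. "NEW")
    return re.sub(r"\b(NE|NW|SE|SW)\b",
                  lambda m: _ABBREV[m.group(1)], text)
```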
…oning

- extract_solution: only search for \boxed{} after </think>, not inside thinking trace
- extract_solution: return None if <think> opened but never closed (token limit hit)
- extract_solution: return None for empty \boxed{} (from verifier feedback prompts)
- extract_solution: strip trailing '= 24' from expressions
- extract_solution: add \left/\right LaTeX cleanup
- _find_complete_boxed: brace-counting helper for nested LaTeX (e.g. \frac{}{})
  replaces naive regex in thinkingPhaseVerifier.py (4 sites) and stepVerifier.py (2 sites)
- Soundness: exclude 'no solution' and 'no expression found' cases
- Soundness formula: correct / (total - excluded) instead of correct / attempted
- Add 'excluded' column to CSV output
- Switch model config to microsoft/Phi-4-reasoning with Phi-4 ChatML format
- Projected metrics: Accuracy 1062/1362 (77.97%), Soundness 1062/1062 (100%)
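The brace-counting approach behind a helper like _find_complete_boxed can be sketched as follows (a reconstruction for illustration, not the repo's exact code). A naive regex such as r"\\boxed\{([^}]*)\}" stops at the first closing brace, which truncates nested LaTeX like \boxed{\frac{24}{1}}:

```python
# Locate the first complete \boxed{...} span by counting braces, so
# nested LaTeX groups are captured whole.
def find_complete_boxed(text):
    """Return the contents of the first balanced \\boxed{...} span,
    or None if no complete span exists."""
    start = text.find(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    begin = i
    depth = 1
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
        i += 1
    return None  # opened but never closed (e.g. token limit hit)
```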
Copilot AI review requested due to automatic review settings March 17, 2026 15:49

Copilot AI left a comment


Pull request overview

This PR expands the verifier-guided inference stack with “thinking-phase” (side-stream) verification monitors for Maze/Game24/SpatialMap, improves parsing/feedback in existing verifiers, and updates example runners and docs to use the new monitor flow.

Changes:

  • Add thinking-phase verifier monitors (Game24, Maze, SpatialMap) plus shared helpers for robust \boxed{} detection.
  • Extend SpatialMap and Maze verifiers with new parsing/normalization helpers and improved feedback/error messaging.
  • Update step-verifier monitors, stream interjection robustness, and example scripts/READMEs to use the new architecture and reporting.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
interwhen/utils/spatialmap_verifier.py Adds counting/direction/object question parsing + Z3-based possibility checks and abbreviation handling for parsed claims.
interwhen/utils/maze_verifier.py Adds direction/turn-type normalization and improves maze parsing and verifier feedback messages.
interwhen/monitors/thinkingPhaseVerifierMaze.py New monitor that verifies maze tracing during <think> via side-streams and then enforces/verifies a structured template after </think>.
interwhen/monitors/thinkingPhaseVerifierGame24.py New monitor that checks partial expressions during <think> via side-streams and validates the final boxed expression post-</think>.
interwhen/monitors/stepVerifier.py Enhances Maze relative-position verification and adds SpatialMap final-answer verification for direction/object/counting questions; uses robust boxed parsing helper.
interwhen/monitors/_common.py New shared helper to locate complete \boxed{...} spans while handling nested braces.
interwhen/monitors/__init__.py Exports the new thinking-phase monitors.
interwhen/interject.py Makes SSE chunk parsing more robust and adds an early return when a monitor marks the final answer as correct.
examples/TTSwithVerification/tot_baseline.py New Tree-of-Thought baseline runner across multiple datasets with evaluation utilities.
examples/TTSwithVerification/spatialmeta.py New SpatialMap experiment script using the monitor-based architecture.
examples/TTSwithVerification/spatialmap_stepverifier.py Updates SpatialMap experiment to use the thinking-phase verifier and adds CSV/summary reporting.
examples/TTSwithVerification/mazemeta.py New Maze experiment script using the monitor-based architecture.
examples/TTSwithVerification/maze_stepverifier.py Updates Maze experiment to use the thinking-phase verifier and adds CSV/summary reporting.
examples/TTSwithVerification/game24meta.py New Game24 experiment script using the step-verifier monitor and adds output management/reporting.
examples/TTSwithVerification/game24_stepverifier.py Updates Game24 experiment to use the thinking-phase verifier and improves boxed extraction robustness.
examples/README.md Updates example vLLM invocation (tensor parallel change).
examples/EarlyStopping/spatialmap_example.py Refactors MCQ extraction/evaluation and updates output directory defaults.
examples/EarlyStopping/maze_example.py Refactors MCQ extraction/evaluation and updates dataset/config defaults.
examples/EarlyStopping/game24_example.py Improves boxed extraction robustness and sampling config for early-stopping experiments.
README.md Updates the top-level “Set up target LLM server” example configuration.


Comment on lines +191 to +240
    However, since every constraint in the SpatialMap dataset is diagonal
    (NE/NW/SE/SW), no two objects can share an x- or y-coordinate.
    Therefore the strict-cardinal count is always **0** whenever the
    problem only has diagonal constraints — which is exactly the
    ground-truth expectation.

    For diagonal directions:
    - "northeast" → higher x AND higher y
    - "northwest" → lower x AND higher y
    - "southeast" → higher x AND lower y
    - "southwest" → lower x AND lower y

    Returns the count, or ``None`` if the solver cannot determine it
    (e.g. reference entity not found).
    """
    direction = direction.lower().strip()

    # Resolve the reference entity's variable names
    ref_x_key = f"{reference}_x"
    ref_y_key = f"{reference}_y"
    if ref_x_key not in self.entities:
        # Try fuzzy match — dataset names may differ in whitespace
        for key in self.entities:
            if key.endswith("_x") and reference.lower() in key.lower():
                ref_x_key = key
                ref_y_key = key.replace("_x", "_y")
                reference = key[:-2]
                break
        else:
            return None

    ref_x = self.entities[ref_x_key]
    ref_y = self.entities[ref_y_key]

    # Collect all other entity names (unique base names)
    all_entities = set()
    for key in self.entities:
        if key.endswith("_x"):
            ename = key[:-2]
            if ename != reference:
                all_entities.add(ename)

    # Determine x/y constraints for the direction
    is_cardinal = direction in ("north", "south", "east", "west")

    # Since all given constraints are strictly diagonal, any pair of
    # objects cannot share the same x- or y-coordinate. Cardinal
    # directions require an exact match on one axis, which is impossible.
    if is_cardinal:
        return 0
Comment on lines +729 to +732
    direction = direction.lower().strip()
    if direction in ('north', 'south', 'east', 'west'):
        return (0, 0)  # cardinal → always 0 with diagonal-only constraints

Comment on lines +110 to +115
    try:
        value = eval(expr_str, {"__builtins__": None}, {})
        value = float(value)
    except Exception as e:
        errors.append(f"Cannot evaluate expression '{expr_str}': {e}")
        return "error", False, errors, None
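The snippet above guards eval() by removing builtins, which blocks most but not all abuse. A stricter alternative would whitelist arithmetic AST nodes and reject everything else; this is a sketch of that idea, not the project's implementation:

```python
import ast
import operator

# Evaluate an arithmetic expression by walking its AST, permitting only
# numeric constants, the four basic operations, and unary minus. Any
# other syntax (calls, names, attributes, ...) raises ValueError.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}

def safe_eval(expr_str):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return float(walk(ast.parse(expr_str, mode="eval")))
```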
