ToT Baselines code pushed #26

Open
Vishak-Bhat30 wants to merge 4 commits into main from vishak/ToT_baseline

Conversation

@Vishak-Bhat30
Collaborator

ToT baseline code push

Copilot AI review requested due to automatic review settings March 25, 2026 10:28

Copilot AI left a comment


Pull request overview

Adds a Tree-of-Thought (ToT) baseline implementation under examples/TTSwithVerification/ToT/, including dataset-specific prompts, verina (Lean) evaluation helpers, and ZebraLogic utilities to run and score ToT searches.

Changes:

  • Introduces a ToT search engine with propose/evaluate/search loops and dataset-specific prompt builders.
  • Adds a CLI runner (tot_baseline.py) to execute ToT across multiple tasks and write per-example logs + summaries.
  • Adds helper modules for ZebraLogic parsing/scoring and Verina code/spec compilation-based evaluation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| examples/TTSwithVerification/ToT/utils/zebralogic_helper.py | ZebraLogic dataset processing + JSON extraction + correctness scoring helpers |
| examples/TTSwithVerification/ToT/utils/verina_tot_utils.py | Verina Lean code extraction and compile/unit-test evaluation utilities |
| examples/TTSwithVerification/ToT/utils/verina_spec_tot_utils.py | Verina spec generation extraction + compile-based soundness/completeness testing utilities |
| examples/TTSwithVerification/ToT/utils/value_prompts.py | Few-shot value prompts for ToT "value" scoring across tasks |
| examples/TTSwithVerification/ToT/utils/tree_of_thought.py | Core ToT search implementation and prompt construction helpers |
| examples/TTSwithVerification/ToT/utils/bla.pypy | Duplicate/alternate copy of the ToT implementation (purpose currently unclear) |
| examples/TTSwithVerification/ToT/tot_baseline.py | CLI baseline runner wiring datasets, ToT search, synthesis, and evaluation |
| ToT_problems.txt | Notes/scratch content related to ToT prompting |


Comment on lines +82 to +115
SYSTEM_PROMPT_VANILLA = """\
# Problem Description

You are solving a house grid problem. You are given:
1. Features and Domains
- A fixed number of houses, indexed sequentially (e.g., House 1, House 2, …) from left to right.
- A set of features (e.g., color, name, pet, book genre).
- Each feature has a finite domain of possible values.
2. Constraint:
- Each house has exactly one value per feature.
- No two houses share the same value for the same feature.
3. Clues / Constraints describing:
- Houses and their positions
- Feature values
- Relative ordering (e.g., "next to", "to the left of", "2 houses away from")

# Rules for Solving

1. Reason about the problem in text.
2. You may receive feedback from the user if anything is wrong. Use any feedback to guide your reasoning until a complete solution is reached.
3. Do not stop responding until you've assigned each and every variable.

# Final Answer Reporting Format

```json
{
"House 1": { "feature1": "value1", "feature2": "value2", ... },
"House 2": { "feature1": "value1", "feature2": "value2", ... },
...
}
```

Make sure to use the exact feature/value names as given in the problem description.
Make sure the JSON is valid and parsable."""

Copilot AI Mar 25, 2026


SYSTEM_PROMPT_VANILLA is defined twice in this module (earlier and again here), so the later assignment silently overwrites the first prompt. If both prompts are needed, give them distinct constant names; otherwise remove the duplicate to avoid using the wrong system prompt at runtime.

Comment on lines +166 to +169
clue_text_index, clue_text = clue.strip().split(". ", 1)
clues.append({
"text_index": int(clue_text_index.strip()),
"text": clue_text.strip()

Copilot AI Mar 25, 2026


clue.strip().split('. ', 1) will raise ValueError if a clue line is missing the expected ". " separator (or is empty). Since you already filter out known-bad IDs later, it would be safer to guard this parse (e.g., skip malformed lines or fall back to a different split) to avoid crashing on unexpected dataset formatting.

Suggested change:

```python
clue_line = clue.strip()
if not clue_line:
    # Skip empty or whitespace-only lines
    continue
index_part, sep, text_part = clue_line.partition(". ")
if not sep:
    # Line does not contain the expected ". " separator; skip as malformed
    continue
try:
    text_index = int(index_part.strip())
except ValueError:
    # Index part is not an integer; skip as malformed
    continue
clues.append({
    "text_index": text_index,
    "text": text_part.strip()
```

2. Constraint:
- Each house has exactly one value per feature.
- No two houses share the same value for the same feature.
3. Clues / Constraints descrbing:

Copilot AI Mar 25, 2026


Typo in the prompt text: "descrbing" → "describing".

Suggested change
3. Clues / Constraints descrbing:
3. Clues / Constraints describing:

Comment on lines +10 to +12
import datasets



Copilot AI Mar 25, 2026


Importing datasets at module import time will raise ModuleNotFoundError in environments that install only the base pyproject.toml dependencies (it isn't listed there). To keep this utility usable without extra deps, move the import into get_zebralogic_dataset() (or wrap it in a try/except with a clear error explaining which extra to install).

Suggested change:

```python
try:
    import datasets
except ModuleNotFoundError:
    class _DatasetsImportErrorProxy:
        def __getattr__(self, name):
            raise ModuleNotFoundError(
                "The 'datasets' package is required to use the ZebraLogic dataset "
                "utilities. Please install the appropriate extra, e.g.: "
                "`pip install wildbench[dataset]`."
            )
    datasets = _DatasetsImportErrorProxy()
```

Comment on lines +1281 to +1312
async def dfs(node: TreeNode, depth: int):
nonlocal best_terminal, best_value

if depth >= self.config.max_depth:
return

# Generate proposals
proposals = await self.propose_next_steps(
task,
problem,
node.trajectory,
llm_server,
self.config.branching_factor
)
node.proposals = proposals

for prop in proposals:
new_trajectory = f"{node.trajectory}\n{prop}" if node.trajectory else prop
value = await self.evaluate_state(task, problem, new_trajectory, llm_server)

child = TreeNode(
trajectory=new_trajectory,
depth=depth + 1,
value=value,
parent=node,
)
node.children.append(child)

if value > best_candidate_value:
best_candidate = child
best_candidate_value = value
self.search_stats["total_nodes_in_tree"] += 1

Copilot AI Mar 25, 2026


In _dfs_search, the nested dfs() assigns to best_candidate / best_candidate_value but doesn't declare them nonlocal, so Python will treat them as locals and if value > best_candidate_value will raise UnboundLocalError. Add nonlocal best_candidate, best_candidate_value (or refactor to store these on self) inside dfs().
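The pitfall described here is easy to reproduce in isolation. The sketch below uses hypothetical names (not the PR's actual code) to show why the `nonlocal` declaration is required when a nested function rebinds variables from the enclosing scope:

```python
def search():
    best_value = float("-inf")
    best_node = None

    def visit(value, node):
        # Without this declaration, the assignments below would make
        # best_value/best_node local to visit(), and the comparison on the
        # next line would raise UnboundLocalError on the first call.
        nonlocal best_value, best_node
        if value > best_value:
            best_value = value
            best_node = node

    for v, n in [(0.2, "a"), (0.9, "b"), (0.5, "c")]:
        visit(v, n)
    return best_node, best_value

print(search())  # → ('b', 0.9)
```

Removing the `nonlocal` line reproduces exactly the `UnboundLocalError` failure mode flagged for `dfs()`.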

Comment on lines +14 to +19
import numpy as np
import pandas as pd
import aiohttp
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer

Copilot AI Mar 25, 2026


This module imports many heavy/optional dependencies (numpy, pandas, aiohttp, datasets, transformers, etc.) that are not used anywhere in the file, and some are not in the base pyproject.toml dependencies. This increases import time and can break users who don't have the full environment. Remove unused imports and/or defer optional imports to the functions that need them with clear error messages.

Suggested change (delete the unused imports):

```python
import numpy as np
import pandas as pd
import aiohttp
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer
```

Comment on lines +1 to +10
"""
Tree of Thought implementation for interwhen-style streaming completion.

Implements proper ToT search using:
1. Propose function to generate candidate next steps
2. Value function to evaluate intermediate states
3. Search algorithm (BFS/DFS/beam) to explore the tree
4. Integrated with interwhen's async streaming architecture
"""


Copilot AI Mar 25, 2026


This file appears to be a near-duplicate of tree_of_thought.py but with an unusual .pypy extension. Keeping a duplicate copy will cause maintenance drift and confusion about which implementation is authoritative. If this is not intentionally used, consider removing it (or renaming/adding a clear purpose and ensuring it's referenced).

return json.loads(json_match.strip())

# try without assuming md tags
matches = re.findall(r'\{.*\}\s*?}', json_text, re.DOTALL)

Copilot AI Mar 25, 2026


The fallback regex for extracting JSON (re.findall(r'\{.*\}\s*?}', ...)) appears to have an extra closing brace and will commonly capture invalid substrings, leading to json.loads failures or partial parses. Consider replacing this with a balanced-brace scan (like you do elsewhere) or a non-greedy {...} match that returns a complete JSON object.

Suggested change:

```python
matches = re.findall(r'\{.*?\}', json_text, re.DOTALL)
```

Comment on lines +145 to +152
def evaluate_expression(expr, expected_nums=None):
try:
if expected_nums is not None and not validate_numbers_used(expr, expected_nums):
return False
value = eval(expr, {"__builtins__": None}, {})
return abs(value - 24) < 1e-6
except Exception:
return False

Copilot AI Mar 25, 2026


evaluate_expression() uses Python eval() on model-produced text. Even with __builtins__ removed, eval is still dangerous (e.g., CPU/memory bombs, object construction via literals) and can be exploited if this script is run on untrusted outputs. Prefer a safe expression parser/evaluator (e.g., ast.parse with a whitelist of nodes/operators) or a dedicated math expression evaluator.
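As a sketch of the suggested replacement (the function names here are hypothetical, not part of the PR), an `ast`-based evaluator can restrict model output to pure arithmetic before checking the 24-game condition:

```python
import ast
import operator

# Whitelist of permitted binary operators; anything else is rejected.
_ALLOWED_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval_arith(expr: str) -> float:
    """Evaluate an arithmetic expression using only whitelisted AST nodes."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_BINOPS:
            return _ALLOWED_BINOPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Calls, attribute access, names, etc. all land here and are rejected.
        raise ValueError("disallowed expression element")
    return _eval(ast.parse(expr, mode="eval"))

def is_24(expr: str) -> bool:
    try:
        return abs(safe_eval_arith(expr) - 24) < 1e-6
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False

print(is_24("(8 / (3 - 8 / 3))"))  # → True
print(is_24("__import__('os')"))   # → False (rejected, not executed)
```

Because only numeric literals, the four arithmetic operators, and unary minus are walked, attribute/dunder traversal and function calls can never execute.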

Comment on lines +14 to +19
import numpy as np
import pandas as pd
import aiohttp
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer

Copilot AI Mar 25, 2026


Same as verina_tot_utils.py: there are many unused heavy imports at module import time (numpy, pandas, aiohttp, datasets, transformers, etc.), and they aren't part of the base package dependencies. Consider removing unused imports and deferring optional ones to call sites to avoid ModuleNotFoundError and reduce startup cost.

Suggested change (keep only the used import):

```python
from tqdm import tqdm
```

Copilot AI review requested due to automatic review settings March 25, 2026 18:08

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.




# ============== Prompt Templates ==============

SYSTEM_PROMPT_VANILLA = """\

Copilot AI Mar 25, 2026


SYSTEM_PROMPT_VANILLA is defined twice; the second definition overwrites the first one at import time, which makes the earlier template unreachable and can silently change runtime behavior. Rename one of them (e.g., SYSTEM_PROMPT_VANILLA_FINAL vs SYSTEM_PROMPT_VANILLA_INTERACTIVE) or delete the unintended duplicate.

Suggested change:

```python
SYSTEM_PROMPT_VANILLA_INTERACTIVE = """\
```

Make sure to use the exact feature/value names as given in the problem description.
Make sure the JSON is valid and parsable."""

SYSTEM_PROMPT_VANILLA = """\

Copilot AI Mar 25, 2026


SYSTEM_PROMPT_VANILLA is defined twice; the second definition overwrites the first one at import time, which makes the earlier template unreachable and can silently change runtime behavior. Rename one of them (e.g., SYSTEM_PROMPT_VANILLA_FINAL vs SYSTEM_PROMPT_VANILLA_INTERACTIVE) or delete the unintended duplicate.

Comment on lines +121 to +123
def clean_problem_text(problem_text: str, features: dict) -> str:
"""Clean up problem text giving explicit feature domains."""
desc, clues = problem_text.split('## clues:')

Copilot AI Mar 25, 2026


clean_problem_text splits on the exact substring '## clues:', but elsewhere (in process_zebralogic_problem) clue extraction is done via a case/whitespace-tolerant regex. If the dataset contains variants like ## Clues: or extra whitespace, this will raise ValueError and break dataset processing. Use the same regex split approach here (or otherwise normalize and validate) to make the parser consistent and robust.
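A tolerant split along these lines might look like the following (the helper name is illustrative; the regex mirrors the case/whitespace-insensitive pattern described for process_zebralogic_problem):

```python
import re

# Case-insensitive, whitespace-tolerant match for the clues header,
# so '## clues:', '## Clues:', and '##clues :' all work.
_CLUES_HEADER = re.compile(r"##\s*clues\s*:", re.IGNORECASE)

def split_problem(problem_text: str):
    """Split a problem into (description, clues), validating the header."""
    parts = _CLUES_HEADER.split(problem_text, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("problem text is missing a '## clues:' section")
    desc, clues = parts
    return desc.strip(), clues.strip()

print(split_problem("Houses...\n## Clues:\n1. The cat lives in House 2."))
# → ('Houses...', '1. The cat lives in House 2.')
```

Raising a clear `ValueError` (rather than the bare two-value unpack failing) also makes malformed dataset entries easier to diagnose.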

Comment on lines +237 to +249

# try with md tags
matches = re.findall(r'```json(.*?)```', json_text, re.DOTALL)
if matches and len(matches) > 0:
json_match = matches[-1]
return json.loads(json_match.strip())

# try without assuming md tags
matches = re.findall(r'\{.*\}\s*?}', json_text, re.DOTALL)
if matches and len(matches) > 0:
json_match = matches[0]
return json.loads(json_match.strip())


Copilot AI Mar 25, 2026


The fallback regex r'\{.*\}\s*?}' requires an extra trailing } and will fail to match many valid JSON objects (and may also overmatch across multiple objects due to .*). This can cause extract_last_json to return None even when valid JSON is present. Prefer a balanced-braces scan (like the later approach used in extract_solution_zebralogic in tot_baseline.py) and attempt parsing candidates from the end.

Suggested change:

````python
# First, try to extract from a fenced ```json code block.
matches = re.findall(r'```json(.*?)```', json_text, re.DOTALL)
if matches:
    json_match = matches[-1]
    try:
        return json.loads(json_match.strip())
    except json.JSONDecodeError:
        # Fall through to brace-based extraction if fenced block is malformed.
        pass

# Fallback: scan for balanced braces and try candidates from the end.
text_len = len(json_text)
brace_starts = [i for i, ch in enumerate(json_text) if ch == '{']
for start in reversed(brace_starts):
    depth = 0
    end = None
    for i in range(start, text_len):
        ch = json_text[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                end = i + 1
                break
        # If depth ever goes negative, this start is invalid.
        if depth < 0:
            break
    if end is None:
        continue
    candidate = json_text[start:end].strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Try the next earlier '{' if this substring isn't valid JSON.
        continue
````

Comment on lines +244 to +248
# try without assuming md tags
matches = re.findall(r'\{.*\}\s*?}', json_text, re.DOTALL)
if matches and len(matches) > 0:
json_match = matches[0]
return json.loads(json_match.strip())

Copilot AI Mar 25, 2026


The fallback regex r'\{.*\}\s*?}' requires an extra trailing } and will fail to match many valid JSON objects (and may also overmatch across multiple objects due to .*). This can cause extract_last_json to return None even when valid JSON is present. Prefer a balanced-braces scan (like the later approach used in extract_solution_zebralogic in tot_baseline.py) and attempt parsing candidates from the end.

Suggested change:

```python
# try without assuming md tags: scan for a balanced-brace JSON object from the end
brace_depth = 0
end_idx: Optional[int] = None
for i in range(len(json_text) - 1, -1, -1):
    ch = json_text[i]
    if ch == '}':
        if brace_depth == 0:
            # potential end of a JSON object
            end_idx = i + 1
        brace_depth += 1
    elif ch == '{':
        if brace_depth > 0:
            brace_depth -= 1
        if brace_depth == 0 and end_idx is not None:
            start_idx = i
            candidate = json_text[start_idx:end_idx]
            try:
                return json.loads(candidate.strip())
            except json.JSONDecodeError:
                # continue searching for an earlier balanced object
                end_idx = None
    # ignore other characters
```

best_candidate_value = float('-inf')

async def dfs(node: TreeNode, depth: int):
nonlocal best_terminal, best_value

Copilot AI Mar 25, 2026


In _dfs_search, the nested dfs() assigns to best_candidate and best_candidate_value without declaring them nonlocal. In Python this makes them local variables within dfs(), causing an UnboundLocalError when evaluating if value > best_candidate_value:. Declare nonlocal best_candidate, best_candidate_value inside dfs() (and include them in the existing nonlocal statement) or refactor best-candidate tracking out of the nested function.

Suggested change:

```python
nonlocal best_terminal, best_value, best_candidate, best_candidate_value
```

node.children.append(child)
next_queue.append(child)
self.search_stats["total_nodes_in_tree"] += 1


Copilot AI Mar 25, 2026


BFS limits the next frontier via next_queue[:max_candidates_per_level] without sorting/filtering by node value, so which nodes survive depends on proposal generation order rather than evaluation score. This can prune the best candidates and materially degrade search quality. Sort next_queue by child.value (descending) before slicing, or keep all nodes for true BFS (and only prune by threshold).

Suggested change:

```python
# Limit the next frontier to the highest-value candidates
next_queue.sort(
    key=lambda node: node.value if node.value is not None else float("-inf"),
    reverse=True,
)
```

return sorted(used_nums) == sorted(expected_nums)


def evaluate_expression(expr, expected_nums=None):

Copilot AI Mar 25, 2026


Using Python eval() on model-produced strings is unsafe even with {"__builtins__": None}; sandbox escapes via object/dunder traversal are a known risk. Replace this with a safe arithmetic evaluator (e.g., parse with ast.parse and only allow numeric literals + BinOp over Add/Sub/Mult/Div + parentheses) to prevent arbitrary code execution.

Comment on lines +148 to +149
value = eval(expr, {"__builtins__": None}, {})
return abs(value - 24) < 1e-6

Copilot AI Mar 25, 2026


Using Python eval() on model-produced strings is unsafe even with {"__builtins__": None}; sandbox escapes via object/dunder traversal are a known risk. Replace this with a safe arithmetic evaluator (e.g., parse with ast.parse and only allow numeric literals + BinOp over Add/Sub/Mult/Div + parentheses) to prevent arbitrary code execution.

pre_prompt = (
"You are an expert problem solver. Carefully read the following multiple-choice question "
"and think through the solution step-by-step before providing your final answer. "
"Provide your final answer option by enclosing it within \\boxed{A/B/C/D}.:"

Copilot AI Mar 25, 2026


The prompt text ends with ".:" which looks like an accidental punctuation typo and may reduce instruction clarity. Consider changing it to a single . or : (consistent with other prompts).
