3 changes: 3 additions & 0 deletions .gitignore
@@ -198,6 +198,9 @@ cython_debug/
# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
# refer to https://docs.cursor.com/context/ignore-files
.*/Outputs_TTS/
Outputs_TTS/
Outputs_TTS_temp/
.cursorignore
.cursorindexingignore

67 changes: 67 additions & 0 deletions examples/TTSwithVerification/MULTIPROCESS_README.md
@@ -0,0 +1,67 @@
# Multi-Process vLLM Setup for Best-of-K Baseline

This directory contains scripts and code for running the best-of-K baseline with multi-process vLLM serving.

## Setup

### 1. Start vLLM with 4 processes (2 GPUs each)

```bash
bash start_vllm_multiprocess.sh
```

This launches 4 vLLM OpenAI-compatible API servers:
- **Process 1**: GPUs 0-1, Port 8000
- **Process 2**: GPUs 2-3, Port 8001
- **Process 3**: GPUs 4-5, Port 8002
- **Process 4**: GPUs 6-7, Port 8003

Each process uses `tensor-parallel-size 2` for distributed inference.
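
The launch script itself is not reproduced here, but its per-process loop can be sketched as follows. This is an illustrative reconstruction, not the script's actual contents; it assumes the classic `vllm.entrypoints.openai.api_server` entrypoint (the same one matched by the `pkill` command below) and the defaults listed under Configuration.

```python
# Illustrative reconstruction of the launch loop in start_vllm_multiprocess.sh:
# 4 servers, 2 GPUs each, consecutive ports starting at 8000.
MODEL = "Qwen/QwQ-32B"
NUM_PROCS = 4
GPUS_PER_PROC = 2

commands = []
for i in range(NUM_PROCS):
    # Process i gets GPUs [2i, 2i+1], e.g. process 2 -> "4,5".
    gpus = ",".join(str(g) for g in range(i * GPUS_PER_PROC, (i + 1) * GPUS_PER_PROC))
    cmd = (
        f"CUDA_VISIBLE_DEVICES={gpus} "
        f"python -m vllm.entrypoints.openai.api_server "
        f"--model {MODEL} --tensor-parallel-size {GPUS_PER_PROC} "
        f"--port {8000 + i} --gpu-memory-utilization 0.4 --max-model-len 8192"
    )
    commands.append(cmd)

for c in commands:
    print(c)
```

Each command would be launched in the background (e.g. with a trailing `&` in the shell script) so all four servers run concurrently.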

### 2. Run the baseline

In a separate terminal:

```bash
# Test with 1 example
python bestofk_baseline.py --task game24 --num_examples 1 --k 4 --use_critic

# Run on maze dataset
python bestofk_baseline.py --task maze --num_examples 10 --k 4

# Run on spatialmap dataset
python bestofk_baseline.py --task spatialmap --num_examples 5 --k 4
```

Or use the test script:
```bash
bash run_multiprocess_test.sh game24 5
```

## Load Balancing

- Requests are distributed **round-robin** across the 4 vLLM instances
- Each generation request goes to the next available port (8000 → 8001 → 8002 → 8003 → 8000 ...)
- Critic evaluation requests use separate round-robin tracking (independent counter)
- This ensures even load distribution across all 4 GPU pairs
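
The dispatch logic above can be sketched in a few lines. The actual implementation lives in `bestofk_baseline.py`; the names here are illustrative. The key detail is that generation and critic requests each keep their own counter, so one traffic stream cannot skew the other's distribution across ports.

```python
# Minimal sketch of round-robin dispatch with independent counters
# for generation and critic requests.
from itertools import count

PORTS = [8000, 8001, 8002, 8003]

class RoundRobin:
    """Cycle through the vLLM ports; each instance keeps its own counter."""
    def __init__(self, ports):
        self.ports = ports
        self._n = count()

    def next_port(self):
        return self.ports[next(self._n) % len(self.ports)]

gen_rr = RoundRobin(PORTS)     # generation requests
critic_rr = RoundRobin(PORTS)  # critic requests: separate counter
```

A usage pattern would be `port = gen_rr.next_port()` before each generation call, yielding the 8000 → 8001 → 8002 → 8003 → 8000 cycle described above, while critic calls independently start from 8000.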

## Stopping vLLM

```bash
pkill -9 -f "vllm.entrypoints.openai.api_server"
```

## Configuration

Edit `start_vllm_multiprocess.sh` to change:
- `MODEL`: Model name (default: `Qwen/QwQ-32B`)
- `MAX_TOKENS`: Maximum sequence length (default: 8192)
- `GPU_MEMORY`: GPU memory utilization (default: 0.4)
- `TENSOR_PARALLEL`: Must be ≤ 2 for this 8-GPU setup

## Benefits

- **Better throughput**: 4 independent processes handle requests in parallel
- **Fault tolerance**: If one process crashes, others continue
- **GPU utilization**: Balanced load across all 8 GPUs (2 GPUs per process)
- **Reduced latency**: Each process has dedicated GPU resources
41 changes: 41 additions & 0 deletions examples/TTSwithVerification/README.md
@@ -156,6 +156,39 @@ The Z3 solver handles diagonal directions (`Northwest`, `Northeast`, `Southwest`

---

# Best-of-K Baseline

A simple best-of-K baseline that generates K independent reasoning traces per example and selects the best based on:
1. **Ground-truth matching** (default): greedily select the first correct answer among the K samples
2. **Critic model evaluation** (optional): use a separate critic LLM to judge correctness without access to the ground truth

This baseline demonstrates that, with enough samples, even plain chain-of-thought (CoT) prompting can reach competitive accuracy.
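
The ground-truth selection rule can be sketched as below. The sampler and answer extractor are stand-ins for the real vLLM calls and task-specific parsing in `bestofk_baseline.py`; only the greedy stop-at-first-correct structure is taken from the description above.

```python
# Sketch of greedy best-of-K selection under ground-truth matching:
# stop at the first of K sampled traces whose extracted answer matches
# the reference; otherwise fall back to the last sample.
def best_of_k(sample_fn, extract_answer, ground_truth, k=4):
    chosen = None
    for _ in range(k):
        trace = sample_fn()            # one independent CoT rollout
        chosen = trace
        if extract_answer(trace) == ground_truth:
            return trace, True         # first correct sample wins
    return chosen, False               # no sample matched the reference

# Toy usage with a canned sampler standing in for the model:
samples = iter(["answer: 10", "answer: 24", "answer: 7"])
result, correct = best_of_k(lambda: next(samples),
                            lambda t: t.split(": ")[1], "24", k=3)
```

Note the early return: once a correct sample is found, the remaining rollouts are never drawn, so the average number of generations per example can be well below K.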

## Usage

```bash
# Best-of-K with ground-truth evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 10 --k 4 --use_critic --critic_model Qwen/Qwen3-30B-A3B-Thinking-2507 --critic_port 8001
```

### Parameters

| Argument | Description | Default |
|----------|-------------|---------|
| `--task` | Task: `game24`, `maze`, or `spatialmap` | required |
| `--k` | Number of samples per example | `4` |
| `--use_critic` | Use critic model for evaluation instead of ground truth | `False` |
| `--critic_model` | Model to use for critic evaluation | same as `--main_model` |
| `--critic_port` | vLLM server port for critic model | `8001` |
| `--num_examples`, `-n` | Number of examples to run | varies |
| `--main_model` | Model for generation | `Qwen/Qwen3-30B-A3B-Thinking-2507` |
| `--port` | vLLM server port for main model | `8000` |

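
When `--use_critic` is set, selection works the same way but the correctness check is delegated to the critic model. The sketch below shows one plausible shape for this; the prompt wording and verdict parsing are assumptions, not the script's actual format, and `judge` stands in for a chat-completion call to the server on `--critic_port`.

```python
# Hedged sketch of critic-based selection: ask the critic a yes/no
# correctness question per trace and pick the first accepted trace.
def critic_prompt(question, trace):
    return (f"Question:\n{question}\n\nProposed solution:\n{trace}\n\n"
            "Is this solution correct? Answer 'yes' or 'no'.")

def parse_verdict(reply):
    # Treat any reply starting with "yes" (case-insensitive) as acceptance.
    return reply.strip().lower().startswith("yes")

def select_with_critic(traces, judge, question=""):
    """judge(prompt) -> critic reply string; returns first accepted trace."""
    for t in traces:
        if parse_verdict(judge(critic_prompt(question, t))):
            return t
    return traces[-1]  # critic rejected everything: fall back to last trace
```

Unlike ground-truth matching, this rule can select a wrong answer (if the critic accepts it) or reject a right one, which is exactly the gap the two evaluation modes are meant to measure.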
---

## Example Scripts

Each script runs a full evaluation: loading a dataset, building structured prompts, running inference with step verification, and computing accuracy/token statistics.
@@ -169,6 +202,14 @@ python ./examples/TTSwithVerification/maze_stepverifier.py -n 1

# SpatialMap with step verification
python ./examples/TTSwithVerification/spatialmap_stepverifier.py -n 1

# Best-of-K baseline (standard CoT, no monitors)
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task maze -n 1 --k 4
python ./examples/TTSwithVerification/bestofk_baseline.py --task spatialmap -n 1 --k 4

# Best-of-K with critic model evaluation
python ./examples/TTSwithVerification/bestofk_baseline.py --task game24 -n 1 --k 4 --use_critic
```

### Common arguments