Changes from all commits (105 commits)
a8be14c  Changes for SGLang support (avnermay, Mar 18, 2026)
1b2af07  Small test script (avnermay, Mar 18, 2026)
b9aceb5  Changes (avnermay, Mar 18, 2026)
fb9546a  Runner helpers (avnermay, Mar 18, 2026)
e8f7292  Updates to small test, assert in loader.py (avnermay, Mar 18, 2026)
af8c8ac  Changes (avnermay, Mar 18, 2026)
ff11967  Refactor of runner_helpers for all send/receive commands to use same … (avnermay, Mar 19, 2026)
9f3cb9e  Remove uv.lock (avnermay, Mar 19, 2026)
fc68b48  fix cudagraph_helpers to work with higher version of flashinfer (avnermay, Mar 19, 2026)
6795127  Switch some torch.empty calls back to torch.zeros for correctness (avnermay, Mar 19, 2026)
04439b1  Add PrefillRequest and SpeculationRequest objects in runner_helpers.py (avnermay, Mar 19, 2026)
a3d6cf0  NIT bug fix (avnermay, Mar 20, 2026)
0b8a6e5  Further refactor of PrefillRequest, SpeculationRequest, SpeculationRe… (avnermay, Mar 20, 2026)
6a36a14  Improvements to logging (avnermay, Mar 21, 2026)
b8c1fd7  Support for Phoenix V1 (avnermay, Mar 23, 2026)
4c127df  dist_utils needed for cross-node support (avnermay, Mar 23, 2026)
7a968e8  Merge branch 'avner/sglang' into avner/sglang-phnx (avnermay, Mar 23, 2026)
82ca79c  Fix bugs in how recovery_activations and eagle_activations are set an… (avnermay, Mar 23, 2026)
e632702  Merge branch 'avner/sglang' into avner/sglang-phnx (avnermay, Mar 24, 2026)
7053b80  FA4 initial implementation by CC (avnermay, Mar 28, 2026)
66b8b7b  FA4 support (avnermay, Mar 28, 2026)
65301a3  Add tests and tree_mask.py so that FA4 works (avnermay, Mar 28, 2026)
5256853  Merge branch 'avner/sglang-fa4' into avner/sglang-phnx-fa4 (avnermay, Mar 28, 2026)
fc1130d  Remove debug loading of Eagle activations (avnermay, Mar 28, 2026)
aa50214  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Mar 28, 2026)
42cea6b  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 28, 2026)
d1c9215  Update pyproject.toml to reflect flash-attn 4 dependency, and no more… (Mar 28, 2026)
eb13cd3  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (Mar 28, 2026)
2463748  Fix FA4 import (avnermay, Mar 28, 2026)
7184e54  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 28, 2026)
d86d0fb  Add logging statement once draft process is waiting for target proces… (avnermay, Mar 28, 2026)
ab487ac  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 28, 2026)
1425f32  Trust remote code fix (avnermay, Mar 28, 2026)
743fb40  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 28, 2026)
cb51158  Add logging for draft model warmup (avnermay, Mar 28, 2026)
bfa56fd  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 28, 2026)
e701bfe  More logging (avnermay, Mar 29, 2026)
bfcb931  Switch all attention calls to use FA4 (avnermay, Mar 29, 2026)
80f2f76  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 29, 2026)
cce45eb  Add tests for attention fa4 (avnermay, Mar 29, 2026)
332b1f3  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 29, 2026)
080c4a3  Upgrade transformers, pin FA4 (avnermay, Mar 29, 2026)
37954a6  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 29, 2026)
eb5e612  DUMP_TENSORS=false fix (avnermay, Mar 30, 2026)
08248b2  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 30, 2026)
ff59fdf  Switch from ssh to https git dependency in pyproject.toml (avnermay, Mar 31, 2026)
dbdaa7b  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Mar 31, 2026)
107602a  Higher timeouts, clearer target <-> draft waiting messages, remove re… (avnermay, Apr 1, 2026)
0105932  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Apr 1, 2026)
ddaff75  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 1, 2026)
f8af8e7  Acceptance rate log and force-jit-speculate (avnermay, Apr 10, 2026)
4c6997f  Improvements to benchmarking (avnermay, Apr 10, 2026)
b417d75  NIT: print cache_hits as ints (avnermay, Apr 10, 2026)
c6b6556  Set communicate logits to False in bench.py (avnermay, Apr 14, 2026)
4902095  Include eagle payload in the same fused tensor as the non-Eagle payload (avnermay, Apr 14, 2026)
f2ab9a0  Optimization + better profiling support (avnermay, Apr 14, 2026)
60dfb25  Add phoenix support to bench.py (avnermay, Apr 15, 2026)
cd88d1b  Add profiling and acceptance rate logging (avnermay, Apr 15, 2026)
8acc8c2  Merge branch 'avner/main2' into avner/sglang (avnermay, Apr 15, 2026)
2d25971  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Apr 15, 2026)
22188cc  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 15, 2026)
440539c  Revert adding 19th argument to flashinfer plan, to make branch compat… (avnermay, Apr 15, 2026)
526b719  Merge branch 'avner/main' into avner/main2 (avnermay, Apr 15, 2026)
804b713  Merge branch 'avner/main2' into avner/sglang (avnermay, Apr 15, 2026)
af4ff69  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Apr 15, 2026)
000bca2  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 15, 2026)
f3182b5  DUMP_TENSORS bug (avnermay, Apr 15, 2026)
386ca05  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Apr 15, 2026)
2ea22f2  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 15, 2026)
e8269c5  Bug fix for change in apply_chat_template API in newer transformers v… (avnermay, Apr 15, 2026)
9b96ef6  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 15, 2026)
dc1b104  CC optimization for case where all extends are the same length (same … (avnermay, Apr 15, 2026)
2569541  Add llama-8b support to run_sglang_bench.py (avnermay, Apr 16, 2026)
8862f07  Upgrade sglang-kernel to remain synchronized with latest TGL main branch (avnermay, Apr 16, 2026)
0307ddb  Remove all phoenix-related code from avner/sglang-fa4-new (avnermay, Apr 16, 2026)
b200560  Revert "Remove all phoenix-related code from avner/sglang-fa4-new" (avnermay, Apr 16, 2026)
b1a21d3  SGLang benchmarking update (avnermay, Apr 16, 2026)
cc053d5  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 16, 2026)
584e795  Support for chat template and Llama 3.1 70B in run_sglang_bench.py (avnermay, Apr 17, 2026)
9227931  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 17, 2026)
3df2aae  CC bug fixes during testing (avnermay, Apr 17, 2026)
1c98a9b  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 17, 2026)
7b19eb2  V1 of CC tier 0 and 1 tests (avnermay, Apr 17, 2026)
10ff3a1  Refactor of JIT logic to be much clearer (avnermay, Apr 20, 2026)
c2a32c8  Fuse eagle and non-eagle payload in SpeculationRequest send/receive (avnermay, Apr 21, 2026)
34efc9b  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Apr 21, 2026)
5046d56  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 21, 2026)
12ade23  dump tensors refactor in runner_helpers.py (avnermay, Apr 21, 2026)
d5d803d  Merge branch 'avner/sglang' into avner/sglang-fa4 (avnermay, Apr 21, 2026)
f30f646  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 21, 2026)
5290188  Clean up tensor dumping logic in runner_helpers (avnermay, Apr 27, 2026)
1ec3b89  Clean-up engine tensors on shutdown (avnermay, Apr 27, 2026)
71bcac9  NIT (avnermay, Apr 27, 2026)
b598154  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 27, 2026)
9191f97  Dump tensors logic (avnermay, Apr 27, 2026)
8ef073c  Process cleanup on failure + force-jit-speculate support (avnermay, Apr 27, 2026)
c8078d6  Make pytest import strategy importlib (avnermay, Apr 27, 2026)
885efea  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 27, 2026)
49618d0  HF reference tests (avnermay, Apr 30, 2026)
e45acfb  Merge branch 'avner/main2' into avner/sglang-fa4 (avnermay, Apr 30, 2026)
904ac7a  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 30, 2026)
b5f857e  Undo duplicate dumping of tensors after last merge (avnermay, Apr 30, 2026)
b6ec991  Merge branch 'avner/sglang-fa4' into avner/sglang-fa4-phnx (avnermay, Apr 30, 2026)
2ac8180  Refactor of SSD simulation, now allowing for JIT/fast backups (avnermay, May 1, 2026)
571f48f  Add verbose flag, fix bugs, in tests/hf/test_ssd_vs_hf_reference.py (avnermay, May 5, 2026)
bench/bench.py (24 changes: 17 additions & 7 deletions)

@@ -31,13 +31,14 @@ def parse_arguments():
     # Speculative decoding configuration
     parser.add_argument("--spec", action="store_true", help="Enable speculative decoding")
     parser.add_argument("--eagle", action="store_true", help="Enable eagle speculative decoding (implies --spec, uses default eagle draft for model)")
+    parser.add_argument("--phoenix", action="store_true", help="Enable phoenix speculative decoding (implies --spec, uses default phoenix draft for model)")
     parser.add_argument("--k", type=int, default=6, help="Speculative decoding k value")
     parser.add_argument("--async", action="store_true", help="Enable async speculative decoding")
     parser.add_argument("--f", type=int, default=3, help="Async fan out value")
     parser.add_argument("--fl", type=int, nargs='+', default=None, help="Fan out list (e.g., --fl 1 3 4 becomes [1, 3, 4])")
     parser.add_argument("--flh", type=int, nargs='+', default=None, help="Fan out list (e.g., --flh 1 3 4 becomes [1, 3, 4])")
     parser.add_argument("--flm", type=int, nargs='+', default=None, help="Fan out list miss (e.g., --flm 1 3 4 becomes [1, 3, 4])")
-    parser.add_argument("--backup", type=str, choices=["jit", "fast"], default="jit", help="Backup strategy (jit or fast)")
+    parser.add_argument("--backup", type=str, choices=["jit", "force-jit", "fast"], default="jit", help="Backup strategy (jit, force-jit, or fast)")
 
     # Memory and batching configuration
     parser.add_argument("--block_sz", type=int, default=256, help="KV cache block size (see config.py: kvcache_block_size)")
@@ -80,11 +81,13 @@ def parse_arguments():
     assert not (args.qwen and '--llama' in sys.argv), "--llama and --qwen are mutually exclusive"
     if args.qwen:
         args.llama = False
-    if args.eagle:
+    if args.eagle or args.phoenix:
         args.spec = True
-        assert args.llama, "Eagle currently only supports llama models"
-        assert args.temp == 0.0 and args.dtemp is None, "Eagle currently only supports greedy decoding (temp=0)"
-        assert getattr(args, 'async', False), "Eagle currently only supports async speculative decoding"
+        assert args.llama, "Eagle and Phoenix currently only support llama models"
+        assert args.temp == 0.0 and args.dtemp is None, "Eagle and Phoenix currently only support greedy decoding (temp=0)"
+        assert getattr(args, 'async', False), "Eagle and Phoenix currently only support async speculative decoding"
+    if getattr(args, 'async', False):
+        args.spec = True
     return args
 
 
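A quick illustration (ours, not part of the diff) of how the reworked flag implications compose: --eagle or --phoenix switches --spec on and, per the asserts, requires a llama target, greedy decoding, and async speculation; --async on its own now also implies --spec. A minimal runnable sketch:

import argparse

parser = argparse.ArgumentParser()
for flag in ("--spec", "--eagle", "--phoenix", "--async", "--llama"):
    parser.add_argument(flag, action="store_true")
args = parser.parse_args(["--phoenix", "--async", "--llama"])

if args.eagle or args.phoenix:
    args.spec = True  # Eagle/Phoenix imply speculative decoding
    # "async" is a Python keyword, so args.async is a syntax error; getattr is required
    assert args.llama and getattr(args, "async"), "llama target + async speculation required"
if getattr(args, "async", False):
    args.spec = True  # async alone also implies --spec

print(args.spec)  # True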
@@ -129,7 +132,7 @@ def initialize_wandb(args, run_name):
         "gpus": args.gpus,
         "speculative_decoding": args.spec,
         "async_speculative": getattr(args, 'async', False),
-        "jit_speculative": args.backup == "jit",
+        "backup_strategy": args.backup,
         "k": args.k if args.spec else None,
         "f": args.f,
         "fan_out_list": args.flh,
@@ -143,6 +146,8 @@ def initialize_wandb(args, run_name):
         "b": args.b,
         "block_size": args.block_sz,
         "eager": args.eager,
+        "eagle": args.eagle,
+        "phoenix": args.phoenix,
         "example_mode": args.example,
         "humaneval_mode": args.humaneval,
         "alpaca_mode": args.alpaca,
@@ -172,8 +177,11 @@ def create_llm_kwargs(args, draft_path):
         max_num_seqs=args.b,
         max_model_len=args.max_model_len,
         sampler_x=args.x,
-        jit_speculate=(args.backup == "jit"),
+        jit_speculate=(args.backup == "jit" or args.backup == "force-jit"),
+        force_jit_speculate=(args.backup == "force-jit"),
         max_steps=args.max_steps,
+        communicate_cache_hits=True,
+        communicate_logits=False,
     )
 
     if args.flh is not None:
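As a quick reference, the three --backup choices map onto the two engine kwargs as follows (hypothetical helper, condensed from the diff above):

def backup_kwargs(backup: str) -> dict:
    # "jit" and "force-jit" both enable JIT speculation; only "force-jit"
    # additionally forces it. "fast" enables neither.
    assert backup in ("jit", "force-jit", "fast")
    return {
        "jit_speculate": backup in ("jit", "force-jit"),
        "force_jit_speculate": backup == "force-jit",
    }

print(backup_kwargs("jit"))        # {'jit_speculate': True, 'force_jit_speculate': False}
print(backup_kwargs("force-jit"))  # {'jit_speculate': True, 'force_jit_speculate': True}
print(backup_kwargs("fast"))       # {'jit_speculate': False, 'force_jit_speculate': False}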
@@ -296,6 +304,8 @@ def main():
     llm_kwargs = create_llm_kwargs(args, draft_path)
     if args.eagle:
         llm_kwargs['use_eagle'] = True
+    if args.phoenix:
+        llm_kwargs['use_phoenix'] = True
     if args.debug:
         llm_kwargs['debug_mode'] = True
 
bench/bench_helpers.py (17 changes: 14 additions & 3 deletions)

@@ -6,9 +6,9 @@
 from typing import List, Optional, Tuple
 from transformers import AutoTokenizer
 try:
-    from ssd.paths import DATASET_PATHS, HF_CACHE_DIR, EAGLE3_SPECFORGE_70B, EAGLE3_YUHUILI_8B, EAGLE3_QWEN_32B
+    from ssd.paths import DATASET_PATHS, HF_CACHE_DIR, EAGLE3_SPECFORGE_70B, EAGLE3_YUHUILI_8B, EAGLE3_QWEN_32B, PHOENIX_70B
 except ImportError:
-    from bench_paths import DATASET_PATHS, HF_CACHE_DIR, EAGLE3_SPECFORGE_70B, EAGLE3_YUHUILI_8B, EAGLE3_QWEN_32B
+    from bench_paths import DATASET_PATHS, HF_CACHE_DIR, EAGLE3_SPECFORGE_70B, EAGLE3_YUHUILI_8B, EAGLE3_QWEN_32B, PHOENIX_70B
 
 
 def _get_snapshot_path(base_path: str) -> str:
@@ -62,6 +62,15 @@ def _get_draft_model_path(args, cache_dir: str) -> str:
         else:
             raise ValueError(f"EAGLE draft not available for Qwen size {args.size}")
 
+    if getattr(args, "phoenix", False):
+        if args.llama:
+            if args.size == "70":
+                return PHOENIX_70B
+            else:
+                raise ValueError(f"Phoenix draft not available for Llama size {args.size}")
+        else:
+            raise ValueError("Phoenix draft not available for Qwen models")
+
     if args.llama:
         draft_size_to_model = {
             "1": "Llama-3.2-1B-Instruct",
@@ -157,6 +166,7 @@ def load_dataset_token_ids(
         return None
 
     dataset_file_path = DATASET_PATHS[dataset_name]
+    print(f"Loading dataset '{dataset_name}' from: {dataset_file_path}")
     if not os.path.exists(dataset_file_path):
         print(
             f"Warning: Dataset file not found at {dataset_file_path}, falling back to random tokens")
@@ -172,10 +182,11 @@
             data = json.loads(line.strip())
             text: str = data["text"]
             if use_chat_template and hasattr(tokenizer, 'apply_chat_template'):
-                tokens = tokenizer.apply_chat_template(
+                result = tokenizer.apply_chat_template(
                     [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": text}],
                     add_generation_prompt=True,
                 )
+                tokens = result.input_ids if hasattr(result, 'input_ids') else result
             else:
                 tokens = tokenizer.encode(text, add_special_tokens=False)
 
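Context for the change above (commit e8269c5): depending on the transformers version and kwargs, apply_chat_template may return either a plain list of token ids or a BatchEncoding-like object carrying an input_ids field, so the code now normalizes both shapes. A minimal sketch of the same defensive pattern, using an example model id of our choosing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # assumed example model
result = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    add_generation_prompt=True,
)
# Older releases return list[int]; newer ones may return an object with .input_ids.
tokens = result.input_ids if hasattr(result, "input_ids") else result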
bench/bench_paths.py (24 changes: 23 additions & 1 deletion)

@@ -43,15 +43,29 @@ def _required_env(var_name: str, note: str) -> str:
     f"{HF_CACHE_DIR}/models--RedHatAI--Qwen3-32B-speculator.eagle3",
 )
 
+PHOENIX_70B = f"{HF_CACHE_DIR}/models--togethercomputer--phoenix-Llama-3p2-1B-Instruct-tgt-Llama-3p3-70b-instruct-UNTRAINED"
+
 MODELS = {
     "llama_70b": os.environ.get(
         "BENCH_LLAMA_70B",
         f"{HF_CACHE_DIR}/models--meta-llama--Llama-3.3-70B-Instruct",
     ),
+    "llama_70b_3p1": os.environ.get(
+        "BENCH_LLAMA_70B_3P1",
+        f"{HF_CACHE_DIR}/models--meta-llama--Llama-3.1-70B-Instruct",
+    ),
+    "llama_8b": os.environ.get(
+        "BENCH_LLAMA_8B",
+        f"{HF_CACHE_DIR}/models--meta-llama--Llama-3.1-8B-Instruct",
+    ),
+    "llama_1b": os.environ.get(
+        "BENCH_LLAMA_1B",
+        f"{HF_CACHE_DIR}/models--meta-llama--Llama-3.2-1B-Instruct",
+    ),
     "qwen_8b": os.environ.get(
         "BENCH_QWEN_8B",
         f"{HF_CACHE_DIR}/models--Qwen--Qwen3-8B",
     ),
     "qwen_32b": os.environ.get(
         "BENCH_QWEN_32B",
         f"{HF_CACHE_DIR}/models--Qwen--Qwen3-32B",
@@ -62,12 +76,20 @@ def _required_env(var_name: str, note: str) -> str:
     ),
     "eagle3_llama_70b": os.environ.get(
         "BENCH_EAGLE3_LLAMA_70B",
-        "lmsys/SGLang-EAGLE3-Llama-3.3-70B-Instruct-SpecForge",
+        f"{HF_CACHE_DIR}/models--lmsys--SGLang-EAGLE3-Llama-3.3-70B-Instruct-SpecForge",
     ),
+    "eagle3_llama_8b": os.environ.get(
+        "BENCH_EAGLE3_LLAMA_8B",
+        f"{HF_CACHE_DIR}/models--yuhuili--EAGLE3-LLaMA3.1-Instruct-8B",
+    ),
     "eagle3_qwen_32b": os.environ.get(
         "BENCH_EAGLE3_QWEN_32B",
         "Zhihu-ai/Zhi-Create-Qwen3-32B-Eagle3",
     ),
+    "phoenix2_qwen_8b": os.environ.get(
+        "BENCH_PHOENIX2_QWEN_8B",
+        "togethercomputer/phnx2-llama-decagon-4layer-v1.0",
+    ),
 }
 
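Usage note: every MODELS entry can be redirected without code changes by exporting its BENCH_* variable before the paths module is imported, e.g. (hypothetical path, run from the bench/ directory):

import os
os.environ["BENCH_LLAMA_8B"] = "/data/checkpoints/Llama-3.1-8B-Instruct"  # hypothetical local path
import bench_paths  # the os.environ.get fallbacks above pick up the override
print(bench_paths.MODELS["llama_8b"])  # -> /data/checkpoints/Llama-3.1-8B-Instruct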

Expand Down
Loading