Summary
The non-greedy rejection sampling path does not compute the draft distribution $q$, so it cannot implement the
lossless acceptance/replacement rule from speculative decoding. It currently uses the target distribution $p$
only (with some candidate masking), which deviates from the referenced algorithms.
Expected (lossless) behavior
From the EAGLE-3 paper (end of Sec. 2.1), the replacement distribution on rejection is:
$$\mathrm{norm}(\max(0, p_{j+i} - \hat{p}_{j+i}))$$
Additionally, from Fast Inference from Transformers via Speculative Decoding (2022) (Sec. 2.3, Algorithm 1):
- Accept with probability: $\min(1, p/q)$
- If rejected, sample from:
$$\mathrm{norm}(\max(0, p - q))$$
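Concretely, the lossless accept/replace rule can be sketched as follows (a minimal illustration of the algorithm from the speculative decoding paper, not this repo's code; `p` and `q` are full next-token distributions for the same position):

```python
import torch

def speculative_accept(p, q, x, generator=None):
    """Lossless speculative sampling step (Leviathan et al., 2022, Alg. 1).

    p: target distribution over the vocabulary (1-D tensor, sums to 1)
    q: draft distribution the proposed token x was sampled from (1-D tensor)
    x: proposed token id (int)
    Returns (accepted, token): x if accepted, otherwise a token drawn
    from the residual distribution norm(max(0, p - q)).
    """
    # Accept x with probability min(1, p[x] / q[x]).
    r = torch.rand((), generator=generator)
    if r < torch.clamp(p[x] / q[x], max=1.0):
        return True, x
    # On rejection, sample the replacement from norm(max(0, p - q)).
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return False, torch.multinomial(residual, 1, generator=generator).item()
```

This two-step rule is what makes speculative decoding lossless: the marginal distribution of the emitted token is exactly $p$, regardless of $q$.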
Current behavior in this repo
In `eagle/model/utils.py`, the non-greedy branch uses target logits only:

- `evaluate_posterior` computes:

  ```python
  gt_logits = logits[fi, i - 1][None]
  gt_logits = logits_processor(None, gt_logits)[0]
  gtp = torch.softmax(gt_logits, dim=0)
  ...
  qx = 1.0
  acp = px / qx
  ```

  With `qx` hard-coded to 1.0, this is not $\min(1, p/q)$.
- Replacement sampling uses `sample_p = gtp` (target-only), and `update_inference_inputs` samples from `sample_p`, not from $\mathrm{norm}(\max(0, p - q))$.
Code references
- Acceptance: `evaluate_posterior(...)` in `eagle/model/utils.py` (non-greedy branch), around lines 360–416
- Replacement sampling: `update_inference_inputs(...)`, around lines 460–466
This makes the non-greedy path lossy, deviating from the algorithmic guarantees described in both papers.
Suggested fix
At rejection time:
- Compute the draft distribution $q$ for the same position (run the draft model on the accepted prefix, or use cached KV if available),
- Apply $\mathrm{norm}(\max(0, p - q))$ for replacement,
- Use acceptance probability $\min(1, p/q)$ for the proposed token.
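The steps above could look roughly like the following (an illustrative sketch only: `lossless_reject_step`, `target_logits`, and `draft_logits` are hypothetical names, not the repo's actual API; in practice `draft_logits` would come from re-running the draft model or from its cached KV):

```python
import torch

def lossless_reject_step(target_logits, draft_logits, proposed_token, generator=None):
    """Hypothetical rejection-time step implementing the lossless rule.

    target_logits, draft_logits: unnormalized logits for the same position.
    Returns (accepted, token).
    """
    p = torch.softmax(target_logits, dim=-1)  # target distribution p
    q = torch.softmax(draft_logits, dim=-1)   # draft distribution q
    # Accept the proposed token with probability min(1, p/q).
    acp = torch.clamp(p[proposed_token] / q[proposed_token], max=1.0)
    if torch.rand((), generator=generator) < acp:
        return True, proposed_token
    # Otherwise replace with a sample from norm(max(0, p - q)),
    # instead of sampling from p alone (the current lossy behavior).
    sample_p = torch.clamp(p - q, min=0.0)
    sample_p = sample_p / sample_p.sum()
    return False, torch.multinomial(sample_p, 1, generator=generator).item()
```

Wiring this into `evaluate_posterior`/`update_inference_inputs` would mainly mean replacing the hard-coded `qx = 1.0` with the draft probability and swapping `sample_p = gtp` for the residual distribution.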