fix: use tied embedding for linear CE fusion output weight #2363

jthomson04 wants to merge 2 commits into NVIDIA-NeMo:main
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.
/ok to test f3466a3
terrykong left a comment
LGTM — clean, minimal bug fix. The tied-embedding path was passing None (self.output_layer.weight) as the first positional arg to from_parallel_hidden_states_to_logprobs, while the correct weight was only computed for the second arg (which turns out to be unused in the callee). The fix correctly resolves the weight via shared_embedding_or_output_weight() for tied-embedding models.
Minor observation (non-blocking): from_parallel_hidden_states_to_logprobs accepts an output_weight parameter (its 3rd positional arg) that is never referenced in the function body — only output_weight_layer (2nd positional arg) gets passed into ChunkedDistributedHiddenStatesToLogprobs.apply(). This dead parameter pre-dates this PR, but could be cleaned up in a follow-up to avoid future confusion.
Generated by Claude Code
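For context, a minimal sketch of the failure mode and the fix, assuming Megatron-style module attributes as described above; the exact call site in `nemo_rl` may differ:

```python
# Buggy path: for tied-embedding models (e.g. Qwen3 with
# share_embeddings_and_output_weights=True), output_layer.weight is None,
# so from_parallel_hidden_states_to_logprobs received None positionally.
output_weight_layer = self.output_layer.weight

# Fixed path: resolve the shared weight when embeddings are tied, and
# reuse the same tensor for both positional weight args.
output_weight_layer = (
    self.shared_embedding_or_output_weight()
    if self.share_embeddings_and_output_weights
    else self.output_layer.weight
)
```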
```python
self.shared_embedding_or_output_weight()
if self.share_embeddings_and_output_weights
else self.output_layer.weight,
output_weight_layer,
```
nemo_rl/distributed/model_utils.py:2165
Nit: both args now pass the same output_weight_layer, which is correct since the callee's output_weight parameter (line 1836) is never actually read — only output_weight_layer is forwarded to ChunkedDistributedHiddenStatesToLogprobs.apply(). Consider removing the dead output_weight parameter in a follow-up to prevent future confusion.
Good catch — dropped the dead output_weight parameter in 68e1b23.
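A hedged sketch of what that cleanup looks like; the parameter list here is illustrative and omits the function's other arguments:

```python
# Illustrative signatures only; the real function in
# nemo_rl/distributed/model_utils.py takes additional parameters.

# Before: the third positional arg (output_weight) was accepted but never read.
def from_parallel_hidden_states_to_logprobs(hidden_states, output_weight_layer, output_weight):
    return ChunkedDistributedHiddenStatesToLogprobs.apply(hidden_states, output_weight_layer)

# After 68e1b23: the dead parameter is gone, so callers can no longer
# pass a weight that silently goes unused.
def from_parallel_hidden_states_to_logprobs(hidden_states, output_weight_layer):
    return ChunkedDistributedHiddenStatesToLogprobs.apply(hidden_states, output_weight_layer)
```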
…en_states_to_logprobs

The third positional parameter was never read inside the function — only output_weight_layer is forwarded to ChunkedDistributedHiddenStatesToLogprobs.apply(). Per @terrykong's review note on PR NVIDIA-NeMo#2363, removing the dead arg now to prevent future confusion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In `_gpt_forward_with_linear_ce_fusion`, `output_weight_layer` was set unconditionally from `self.output_layer.weight`, which is `None` for models with tied embeddings (e.g. Qwen3) where `share_embeddings_and_output_weights=True`. This crashed in `from_parallel_hidden_states_to_logprobs` because the weight tensor passed positionally was `None`.

Fetch the weight via `shared_embedding_or_output_weight()` when embeddings are tied, mirroring the logic that was already used for the second positional argument, and reuse the same tensor for both args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Force-pushed 68e1b23 to fc6c3a6
Summary
`_gpt_forward_with_linear_ce_fusion` set `output_weight_layer = self.output_layer.weight`, which is `None` for models with tied embeddings (e.g. Qwen3 with `share_embeddings_and_output_weights=True`). The first positional argument to `from_parallel_hidden_states_to_logprobs` was therefore `None`, crashing logprob computation when `policy.use_linear_ce_fusion_loss=True` is set on a tied-embedding model.

Fetch the weight via `shared_embedding_or_output_weight()` when embeddings are tied, matching the logic already used for the second positional argument, and reuse the same tensor for both args.

Test plan
Run a tied-embedding model with `policy.use_linear_ce_fusion_loss=True` and confirm no crash and matching logprobs vs. the non-fusion path; a sketch of that check follows below.

🤖 Generated with Claude Code
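As a sketch of that comparison, with hypothetical helper names (`compute_logprobs` stands in for whatever entry point the test suite actually uses):

```python
import torch

def check_fusion_logprobs_match(model, batch):
    # compute_logprobs is a hypothetical stand-in for the repo's logprob
    # computation entry point; the flag mirrors policy.use_linear_ce_fusion_loss.
    fused = compute_logprobs(model, batch, use_linear_ce_fusion_loss=True)
    reference = compute_logprobs(model, batch, use_linear_ce_fusion_loss=False)
    # The fused path should reproduce the non-fusion logprobs on a
    # tied-embedding model (e.g. Qwen3) within numerical tolerance.
    torch.testing.assert_close(fused, reference, rtol=0, atol=1e-5)
```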