💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63b45629e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```diff
@@ -130,8 +130,8 @@ def roll(
     return func.roll(shifts, axis)(a)


-def gather(a: Tensor, /, dim: int, index: Tensor) -> Tensor:
-    return func.gather(dim)(a, index)
+def gather(a: Tensor, /, axis: int, index: Tensor) -> Tensor:
```
Preserve dim keyword compatibility in lucid.gather
Renaming the public argument from dim to axis without a compatibility path breaks existing call sites that still pass dim= (for example, lucid/models/text/bert/_model.py uses lucid.gather(..., dim=1, ...) in the default embedding path). In practice, BERT/RoFormer calls that omit token_type_ids will now raise TypeError: gather() got an unexpected keyword argument 'dim', which is a regression outside the new GPT code.
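One way to make the rename non-breaking is a keyword shim that remaps the legacy `dim=` onto `axis` with a deprecation warning. The sketch below is hypothetical, not lucid's actual code: the list-based gather body is a toy stand-in for `func.gather(axis)(a, index)` so the compatibility logic can be shown self-contained.

```python
import warnings


def gather(a, /, axis=None, index=None, *, dim=None):
    # Legacy `dim=` call sites (e.g. lucid.gather(..., dim=1, ...))
    # keep working: remap onto `axis` with a warning instead of
    # raising TypeError.
    if dim is not None:
        if axis is not None:
            raise TypeError("pass either 'axis' or 'dim', not both")
        warnings.warn("'dim' is deprecated; use 'axis'", DeprecationWarning)
        axis = dim
    if axis is None or index is None:
        raise TypeError("gather() requires 'axis' and 'index'")
    # Toy stand-in for func.gather(axis)(a, index): pick index[i]
    # from each row of a list-of-lists along axis 1.
    assert axis == 1, "toy sketch only handles axis=1"
    return [row[i] for row, i in zip(a, index)]
```

Old `dim=` callers then emit a `DeprecationWarning` for one release cycle instead of crashing, and the keyword can be dropped later.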
```python
attr = getattr(_activation, attr_name, None)
if issubclass(attr, Module):
    if attr.__name__.lower() == act_name:
        return attr
```
Return an activation instance, not the activation class
get_activation_module_from_name currently returns the module class itself (e.g., GELU) rather than an instantiated module. _GPTMLP stores this in self.act and then executes self.act(self.c_fc(x)), which invokes class construction with a tensor argument and fails for the default hidden_act='gelu'; this makes GPT forward unusable.
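The fix is to instantiate before returning, i.e. `return attr()` instead of `return attr`. A minimal self-contained sketch (the `Module`/`GELU` classes and the `_activation` registry here are toy stand-ins for lucid's, using the tanh GELU approximation on plain floats):

```python
import math


class Module:  # stand-in for lucid.nn.Module
    def __call__(self, x):
        return self.forward(x)


class GELU(Module):  # tanh-approximation GELU on a scalar
    def forward(self, x):
        return 0.5 * x * (1.0 + math.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


_activation = {"GELU": GELU}  # stand-in for the _activation namespace


def get_activation_module_from_name(act_name):
    for attr in _activation.values():
        # Guard with isinstance(attr, type): issubclass raises
        # TypeError if the looked-up attribute is not a class.
        if isinstance(attr, type) and issubclass(attr, Module):
            if attr.__name__.lower() == act_name:
                return attr()  # instantiate; returning `attr` itself
                               # makes self.act(x) call the constructor
    raise ValueError(f"unknown activation: {act_name}")
```

With the instance returned, `self.act(self.c_fc(x))` dispatches to `forward` as intended instead of attempting `GELU(tensor)` construction.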
```python
max_len = config.max_position_embeddings
self.register_buffer(
    "causal_mask",
    lucid.tril(lucid.ones(max_len, max_len)).reshape(1, 1, max_len, max_len),
```
Build GPT causal mask as additive -inf mask
The registered causal_mask is lower-triangular 0/1, but the attention kernel treats attn_mask as an additive bias to scores (masked positions must be large negative values). Adding this 0/1 mask does not block future tokens, so training/inference are no longer causal and tokens can attend to the future.
```python
attn_out = attn_out.swapaxes(1, 2).reshape(B, T, C)
attn_out = self.resid_drop(self.c_proj(attn_out))

return attn_out, past_key_value if use_cache else None
```
Return valid cache objects when use_cache is enabled
When use_cache=True and no past_key_values are provided (first decoding step), _GPTAttention returns None for each layer’s present cache, so the model returns a list of None. Passing that list back into the next forward causes past_key_values[0].get_seq_length() to crash, which breaks iterative generation with returned caches.
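The attention layer should build the updated (key, value) cache itself and return it whenever `use_cache=True`, including on the first step when no past cache exists. A minimal sketch of that control flow, using plain lists in place of lucid tensors and a tuple in place of the real cache object (both are illustrative assumptions):

```python
def attention_step(k_new, v_new, past_key_value=None, use_cache=False):
    # Extend any cached keys/values with the new step's entries.
    if past_key_value is not None:
        k_past, v_past = past_key_value
        k = k_past + k_new   # stand-in for concatenation along time
        v = v_past + v_new
    else:
        k, v = k_new, v_new
    # ... attention over the full (k, v) would happen here ...
    # Always return the updated cache when caching is requested,
    # never None on the first decoding step.
    present = (k, v) if use_cache else None
    return present
```

Feeding `present` back in as `past_key_value` on the next call then yields a growing cache instead of a list of `None` that crashes on `get_seq_length()`.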