[All] Added better error messages #2705

Open
ptrendx wants to merge 4 commits into NVIDIA:main from ptrendx:pr_better_errors

Conversation

ptrendx (Member) commented Feb 25, 2026

Description

Added better error messages throughout the codebase to provide more information when an error is triggered.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

greptile-apps bot (Contributor) commented Feb 25, 2026

Greptile Summary

Systematically replaced generic assertions and error messages across 26 files with detailed, actionable error messages that include actual parameter values, expected values, shapes, dtypes, and device information.

Major improvements:

  • C++/CUDA error macros now specify the received dtype and list all valid options
  • Python assertions converted to proper exceptions (ValueError, TypeError, RuntimeError) with context
  • Error messages now include actual vs expected values for easier debugging
  • Device mismatches show which device was used instead of just "needs CUDA"
  • Shape mismatches include both dimensions and computed remainders
  • Fixed boolean comparison in distributed.py:1873 to use if async_op: instead of identity check
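
As a minimal sketch of the pattern summarized above (not code taken from this PR): the hypothetical helpers `check_divisible` and `describe_mode` below illustrate converting a bare assertion into a `ValueError` that reports the actual values and the computed remainder, and using a truthiness check in place of an identity comparison. All names, messages, and the `is True` form of the identity check are assumptions for illustration.

```python
def check_divisible(seq_len: int, world_size: int) -> int:
    """Return the per-rank chunk size, raising a descriptive error on mismatch.

    Hypothetical helper; stands in for a bare `assert seq_len % world_size == 0`.
    """
    if world_size <= 0:
        raise ValueError(f"world_size must be positive, but got world_size={world_size}")
    remainder = seq_len % world_size
    if remainder != 0:
        # Report both operands and the remainder so the failure is actionable.
        raise ValueError(
            f"seq_len must be divisible by world_size, but got seq_len={seq_len}, "
            f"world_size={world_size} (remainder={remainder})"
        )
    return seq_len // world_size


def describe_mode(async_op) -> str:
    """Branch on truthiness (`if async_op:`) rather than on identity."""
    # An identity check such as `async_op is True` (assumed form) would treat
    # truthy non-bool values like 1 as "sync"; a truthiness check does not.
    if async_op:
        return "async"
    return "sync"
```

Unlike `assert`, an explicit `raise` still fires when Python runs with `-O`, which is one reason to prefer exceptions for input validation.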

Confidence Score: 5/5

  • This PR is safe to merge with no risk
  • Changes are purely focused on improving error messages without modifying any business logic or computational behavior. All changes convert assertions to exceptions or enhance error text with variable interpolation, making debugging easier without introducing functional changes.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/common.h | Enhanced error messages in C++ macros to specify which dtype was received and list valid options |
| transformer_engine/jax/flax/transformer.py | Added descriptive messages to assertion statements in Flax transformer layers |
| transformer_engine/pytorch/cpp_extensions/fused_attn.py | Converted assertions to exceptions with detailed shape and dtype information for fused attention |
| transformer_engine/pytorch/distributed.py | Replaced assertions with exceptions providing actionable error messages and actual values; fixed async_op boolean check |
| transformer_engine/pytorch/module/base.py | Converted assertions to detailed exceptions in module base with shape, type, and configuration information |
| transformer_engine/pytorch/permutation.py | Enhanced MoE permutation error messages with device, shape, and type details |

Last reviewed commit: 221d723

greptile-apps bot (Contributor) left a comment

26 files reviewed, 1 comment

ptrendx and others added 3 commits February 25, 2026 11:52
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

greptile-apps bot (Contributor) left a comment

26 files reviewed, no comments

ptrendx (Member, Author) commented Feb 25, 2026

/te-ci

jberchtold-nvidia (Collaborator) left a comment

LGTM from the TE/JAX side! Thanks! I have not reviewed the core or PyTorch files.

- assert FusedAttnFwdPrimitive.inner_primitive is not None
+ assert (
+     FusedAttnFwdPrimitive.inner_primitive is not None
+ ), "FusedAttnFwdPrimitive.inner_primitive has not been registered"

nit SR: If you'd like, consider appending "This usually occurs when TransformerEngine was not installed properly and the shared library cannot be loaded. Please see the documentation troubleshooting for assistance", but as is it's already an improvement over no message.
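
A sketch of how the suggested wording could be appended to the assertion message. It uses a hypothetical stand-in class (`FakePrimitive`) and a hypothetical helper (`require_registered`) rather than the real `FusedAttnFwdPrimitive`, whose registration mechanism is not shown in this conversation.

```python
class FakePrimitive:
    """Hypothetical stand-in for a primitive whose inner primitive is set on registration."""

    inner_primitive = None


def require_registered(primitive_cls) -> None:
    """Assert that the class has a registered inner primitive, with a troubleshooting hint."""
    assert primitive_cls.inner_primitive is not None, (
        f"{primitive_cls.__name__}.inner_primitive has not been registered. "
        "This usually occurs when TransformerEngine was not installed properly "
        "and the shared library cannot be loaded. "
        "Please see the documentation troubleshooting for assistance."
    )
```

Building the message from `primitive_cls.__name__` would let one helper cover several primitive classes, though whether that fits the codebase is a separate design choice.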
