fix(evaluation): replace bare raises with proper exceptions and add text_generation_quality request #560
Conversation
davidberenstein1957 left a comment
Thanks for the PR. Some small changes required :)
| pruna_logger.error(f"DinoScore: device {device} not supported. Supported devices: {self.runs_on}") | ||
| raise | ||
| raise ValueError(f"DinoScore: device {device} not supported. Supported devices: {self.runs_on}") |
can we define the text once and then raise and log the same message?
```diff
  if images.ndim != 4:
      pruna_logger.error(f"Expected 4-D tensor (B, C, H, W); got shape {tuple(images.shape)}")
-     raise
+     raise ValueError(f"Expected 4-D tensor (B, C, H, W); got shape {tuple(images.shape)}")
```
can we define the text once and then raise and log the same message?
```diff
  else:
      pruna_logger.error("SharpnessMetric: unsupported channel count")
-     raise
+     raise ValueError(f"SharpnessMetric: unsupported channel count {img.shape[0]}. Expected 1 or 3 channels.")
```
can we define the text once and then raise and log the same message?
Addressed in 4c63091 - extracted all duplicated error messages into a `msg` variable so the same text is used for both `pruna_logger.error(msg)` and the `raise`.
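For reference, a minimal self-contained sketch of that pattern (the helper name and standalone logger here are illustrative; in the PR the logic lives inside the metric classes):

```python
import logging

pruna_logger = logging.getLogger("pruna")

def validate_device(device: str, runs_on: list[str]) -> None:
    """Build the message once so the log entry and the exception always match."""
    if device not in runs_on:
        msg = f"DinoScore: device {device} not supported. Supported devices: {runs_on}"
        pruna_logger.error(msg)
        raise ValueError(msg)
```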
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
@pytest.mark.cpu
def test_task_text_generation_quality_request():
    """Test that 'text_generation_quality' named request creates perplexity metric."""
    task = Task(request="text_generation_quality", datamodule=PrunaDataModule.from_string("TinyWikiText"), device="cpu")
```
Test will always fail due to missing tokenizer
High Severity
PrunaDataModule.from_string("TinyWikiText") will raise TokenizerMissingError because the text_generation_collate function requires a tokenizer argument that isn't provided. The existing data test for TinyWikiText in tests/data/test_datamodule.py confirms this by always passing tokenizer=bert_tokenizer. This test will never reach the assertions — it will fail during datamodule construction.
begumcig left a comment
Looks good to me, thanks a lot! We can merge once the tests pass :)
@begumcig Thanks for the approval! The CI failure is unrelated to this PR; it's a rate limit issue. I've pushed an empty commit to re-trigger CI. Hopefully the rate limit clears this time.
hi @zamal-db can you rebase your branch onto main one more time? We handled the error that was causing the problem and it should be fine now. Thank you so much for your patience 🥹
…ext_generation_quality request

- Fix 4 bare `raise` statements that crash with `RuntimeError: No active exception to reraise` outside except blocks:
  - metric_dino_score.py: `raise` -> `raise ValueError`
  - metric_memory.py: `raise` -> `raise RuntimeError`
  - metric_sharpness.py: 2x `raise` -> `raise ValueError`
- Fix typo in registry.py: 'dos not' -> 'does not'
- Add 'text_generation_quality' named evaluation request to Task (returns perplexity metric)
- Add tests for new named request and invalid request error handling
Force-pushed 9d70884 to c5fd0cb
Done, rebased onto latest main. CI should be clean now. Thanks for fixing the flaky test!
Force-pushed c5fd0cb to d4a435c
Perfect! Thanks a lot once again @zamal-db


Description
I was running `SharpnessMetric` on outputs from a custom diffusion pipeline and got hit with `RuntimeError: No active exception to reraise`. Took me a while to figure out what was going on since the actual issue was a tensor shape mismatch, but the bare `raise` on this line was masking the real error. The descriptive message was being logged via `pruna_logger.error()` right above it, but the `raise` itself had no exception attached.

Checked the rest of the evaluation metrics and found 3 more of the same pattern across 2 other files. Fixed all 4, and also caught a small typo in registry.py while I was in there.
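A standalone sketch of the failure mode and the fix (function names and messages are illustrative, not the actual metric code):

```python
import logging

logger = logging.getLogger("pruna")

def check_shape_before(shape: tuple) -> None:
    if len(shape) != 4:
        logger.error(f"Expected 4-D tensor (B, C, H, W); got shape {shape}")
        # There is no active exception outside an except block, so this bare
        # raise itself crashes with "RuntimeError: No active exception to
        # reraise", burying the shape error that was just logged.
        raise

def check_shape_after(shape: tuple) -> None:
    if len(shape) != 4:
        msg = f"Expected 4-D tensor (B, C, H, W); got shape {shape}"
        logger.error(msg)
        raise ValueError(msg)  # the caller now sees the real problem
```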
Separately, I noticed `Task` only supports `image_generation_quality` as a named request. Added `text_generation_quality` since perplexity is already available through `TorchMetricWrapper` and it felt like a natural complement.

Changes
Bug fixes (4 bare raises):
- `metric_sharpness.py`: 2x bare `raise` replaced with `raise ValueError(...)` for wrong tensor dimensions and unsupported channel count
- `metric_dino_score.py`: bare `raise` replaced with `raise ValueError(...)` for unsupported device
- `metric_memory.py`: bare `raise` replaced with `raise RuntimeError(...)` for multi-GPU without device map

Typo fix:
registry.py:"dos not inherit"fixed to"does not inherit"Feature:
- `text_generation_quality` named evaluation request added to `Task`, returning a perplexity metric to complement the existing `image_generation_quality` request
N/A
Type of Change
How Has This Been Tested?
- Verified each of the 4 fixed code paths raised `RuntimeError: No active exception to reraise` before the fix, and raise the correct typed exception with descriptive message after
- Added `test_task_text_generation_quality_request` to verify the named request returns a perplexity `TorchMetricWrapper`
- Added `test_task_invalid_named_request` to verify unknown requests raise `ValueError`
- Ran `ruff check` on all modified source files, all pass

Checklist
Additional Notes
None