FIX: Set default refusal scorer behavior for scenarios and initializers to STRICT, add support for underlying model name in initializers #1537
Open
fdubut wants to merge 24 commits into microsoft:main from …
…ing model in initializers
Contributor
I have another related PR; let's hold off on merging this one for a moment. I'll merge mine into this branch and consolidate.
- Remove duplicate seed_type in harms.prompt (both sides added it independently)
- Update stale REFUSAL_GPT4O docstring reference to REFUSAL_GPT4O_OBJECTIVE_ALLOW_SAFE
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion
- Add _collect_child_eval_hashes() to ComponentIdentifier for recursive child eval_hash collection
- Add find_dependents_of_tag() to BaseInstanceRegistry for auto-detecting wrapper/composite scorer dependencies via eval_hash matching
- Add 4 refusal scorer variants with REFUSAL tag in ScorerInitializer
- Add _register_best_refusal_f1() to tag the best refusal scorer by F1 from existing metrics (parallels _register_best_objective_f1)
- Refactor initialize_async into 5 phases: base refusal, best refusal selection, dependent scorers, other scorers, best objective selection
- Add --tags CLI filtering to evaluate_scorers.py via argparse
- Add comprehensive unit tests for all new functionality
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
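The recursive eval_hash collection and dependent lookup described in this commit could look roughly like the sketch below. The `Component` dataclass, its fields, and the two free functions are hypothetical stand-ins written for illustration; they are not the actual ComponentIdentifier or BaseInstanceRegistry API, whose signatures are not shown on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a component identifier (not the real class)."""

    name: str
    eval_hash: str
    children: list[Component] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)


def collect_child_eval_hashes(component: Component) -> set[str]:
    """Recursively gather the eval_hash of every descendant, excluding the component itself."""
    hashes: set[str] = set()
    for child in component.children:
        hashes.add(child.eval_hash)
        hashes |= collect_child_eval_hashes(child)
    return hashes


def find_dependents_of_tag(registry: list[Component], tag: str) -> list[Component]:
    """Return entries that wrap, at any depth, a component carrying the given tag."""
    tagged_hashes = {c.eval_hash for c in registry if tag in c.tags}
    return [
        c
        for c in registry
        if tag not in c.tags and collect_child_eval_hashes(c) & tagged_hashes
    ]


# Assumed example registry: an inverter wraps a refusal scorer, a composite wraps the inverter.
refusal = Component("refusal", eval_hash="r1", tags={"REFUSAL"})
inverter = Component("inverter", eval_hash="i1", children=[refusal])
composite = Component("composite", eval_hash="c1", children=[inverter])
unrelated = Component("likert", eval_hash="k1")

dependents = find_dependents_of_tag([refusal, inverter, composite, unrelated], "REFUSAL")
print([c.name for c in dependents])
```

Matching on hashes rather than on a declared dependency list is what lets a wrapper be picked up without any explicit depends_on entry, at the cost of walking each component's subtree.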
STRICT was removed in favor of named variants (OBJECTIVE_BLOCK_SAFE, OBJECTIVE_ALLOW_SAFE, etc.). Use the default OBJECTIVE_ALLOW_SAFE path which aligns with the dynamic best-refusal selection system. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rlundeen2
reviewed
Mar 26, 2026
# Use harm_definition from CSV headers if available (e.g., "fairness_bias.yaml"),
# otherwise fall back to deriving from harm_category (e.g., "bias" -> "bias.yaml").
# The CSV header is authoritative since the harm_category name may differ from
# the YAML filename (e.g., harm_category="bias" but file is "fairness_bias.yaml").
Contributor
This was moved; we need to check whether this fix still matters.
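For reference, the fallback rule in the quoted comment can be sketched as below. The function name `resolve_harm_definition` and the dict shape used for the CSV header are assumptions made for illustration; the actual code under review is not shown on this page.

```python
def resolve_harm_definition(csv_header: dict[str, str], harm_category: str) -> str:
    """Prefer the header's harm_definition; otherwise derive '<harm_category>.yaml'."""
    # The header value wins because the category name may not match the YAML filename.
    explicit = csv_header.get("harm_definition", "").strip()
    if explicit:
        return explicit
    # Fallback: derive the filename from the category name.
    return f"{harm_category}.yaml"


# Header present: the explicit filename is used even though the category is "bias".
print(resolve_harm_definition({"harm_definition": "fairness_bias.yaml"}, "bias"))
# Header missing: the filename is derived from the category.
print(resolve_harm_definition({}, "bias"))
```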
Metrics should be regenerated with evaluate_scorers.py after the new refusal scorer variants are finalized. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document evaluate_scorers.py usage with --tags filtering and the recommended two-step workflow: evaluate refusal scorers first, then re-run all scorers so dependents use the best refusal variant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
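A minimal sketch of how `--tags` filtering via argparse and the two-step workflow might fit together. Only the `--tags` flag name comes from the commit messages; the registry dict, the helper names, and the tag value `refusal` are assumptions, and this is not the real evaluate_scorers.py.

```python
from __future__ import annotations

import argparse


def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Evaluate scorers (illustrative sketch).")
    # nargs="+" requires at least one tag when the flag is given; None means "no filtering".
    parser.add_argument("--tags", nargs="+", default=None, help="Only evaluate scorers with one of these tags.")
    return parser.parse_args(argv)


def filter_by_tags(scorers: dict[str, set[str]], tags: list[str] | None) -> list[str]:
    """Return scorer names whose tags intersect the requested tags (all names if tags is None)."""
    if tags is None:
        return sorted(scorers)
    wanted = set(tags)
    return sorted(name for name, scorer_tags in scorers.items() if scorer_tags & wanted)


# Hypothetical registry mapping scorer names to their tags.
registry = {
    "refusal_a": {"refusal"},
    "refusal_b": {"refusal"},
    "inverted_refusal": {"objective"},
}

# Step 1: evaluate only the refusal scorers so their metrics exist.
step_one = filter_by_tags(registry, parse_args(["--tags", "refusal"]).tags)
# Step 2: re-run everything so dependents are built on the best refusal variant.
step_two = filter_by_tags(registry, parse_args([]).tags)
print(step_one, step_two)
```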
Change tags parameter type from list[str] to Sequence[str] to accept list[ScorerInitializerTags] (list is invariant, Sequence is covariant). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
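The variance point can be illustrated with a small sketch. `ScorerTag` is a hypothetical str-based enum standing in for ScorerInitializerTags (that it subclasses str is an assumption inferred from this commit message), and `normalize_tags` is an invented helper.

```python
from collections.abc import Sequence
from enum import Enum


class ScorerTag(str, Enum):
    """Hypothetical str-based enum standing in for ScorerInitializerTags."""

    REFUSAL = "refusal"
    OBJECTIVE = "objective"


def normalize_tags(tags: Sequence[str]) -> list[str]:
    """Accept any read-only sequence of str (or str subclasses) and return plain strings."""
    return [t.value if isinstance(t, Enum) else t for t in tags]


enum_tags: list[ScorerTag] = [ScorerTag.REFUSAL, ScorerTag.OBJECTIVE]

# A static type checker accepts this call because Sequence is covariant:
# Sequence[ScorerTag] is a subtype of Sequence[str].
# With a `list[str]` parameter the same call would be rejected, since list is
# invariant (the callee could append a plain str to a list[ScorerTag]).
print(normalize_tags(enum_tags))
print(normalize_tags(("custom",)))
```

The difference only shows up under a type checker such as mypy or pyright; at runtime both annotations behave identically.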
Description
- …NO_OBJECTIVE_BLOCK_SAFE, NO_OBJECTIVE_ALLOW_SAFE) and dynamically select the best one by F1 from evaluation metrics
- …scorer are automatically discovered via eval_hash matching, no explicit depends_on needed
- …TrueFalseInverterScorer) use the best refusal path
- …refusal)
- …vars with fallback to model name (credit: @fdubut)
Co-authored with Rich; merged into @fdubut's branch to pick up their changes as well.
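As a rough illustration of "select the best one by F1", the sketch below picks the variant with the highest recorded F1 and falls back to a default when no metrics exist yet. The function name, the fallback behavior, and the F1 numbers are placeholders invented for the example, not measured results or the actual _register_best_refusal_f1() logic; only the variant names come from this PR.

```python
from __future__ import annotations

# Assumed default, based on the commit note about the OBJECTIVE_ALLOW_SAFE path.
DEFAULT_VARIANT = "OBJECTIVE_ALLOW_SAFE"


def select_best_by_f1(metrics: dict[str, float | None], default: str = DEFAULT_VARIANT) -> str:
    """Return the variant with the highest F1; use the default if nothing has been evaluated."""
    # Variants without metrics (None) are skipped rather than treated as zero.
    scored = {name: f1 for name, f1 in metrics.items() if f1 is not None}
    if not scored:
        return default
    return max(scored, key=scored.__getitem__)


# Placeholder F1 values for illustration only.
example_metrics = {
    "OBJECTIVE_BLOCK_SAFE": 0.80,
    "OBJECTIVE_ALLOW_SAFE": 0.85,
    "NO_OBJECTIVE_BLOCK_SAFE": None,  # not evaluated yet
    "NO_OBJECTIVE_ALLOW_SAFE": 0.75,
}

print(select_best_by_f1(example_metrics))
print(select_best_by_f1({}))
```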