FEAT: Updating Scorer Metrics Update Workflow and SelfAskRefusalScorer update #1549

Draft

rlundeen2 wants to merge 19 commits into microsoft:main
Commits:

- Remove duplicate seed_type in harms.prompt (both sides added it independently); update stale REFUSAL_GPT4O docstring reference to REFUSAL_GPT4O_OBJECTIVE_ALLOW_SAFE. (Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>)
- …tion: add _collect_child_eval_hashes() to ComponentIdentifier for recursive child eval_hash collection; add find_dependents_of_tag() to BaseInstanceRegistry for auto-detecting wrapper/composite scorer dependencies via eval_hash matching; add 4 refusal scorer variants with a REFUSAL tag in ScorerInitializer; add _register_best_refusal_f1() to tag the best refusal scorer by F1 from existing metrics (parallels _register_best_objective_f1); refactor initialize_async into 5 phases (base refusal, best refusal selection, dependent scorers, other scorers, best objective selection); add --tags CLI filtering to evaluate_scorers.py via argparse; add comprehensive unit tests for all new functionality.
- Note that metrics should be regenerated with evaluate_scorers.py after the new refusal scorer variants are finalized.
- Document evaluate_scorers.py usage with --tags filtering and the recommended two-step workflow: evaluate refusal scorers first, then re-run all scorers so dependents use the best refusal variant.
- Change the tags parameter type from list[str] to Sequence[str] to accept list[ScorerInitializerTags] (list is invariant, Sequence is covariant).
…2_26_refusal_scorer_update
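The list[str] to Sequence[str] change in the commit list above can be illustrated with a small sketch. ScorerTag below is a hypothetical stand-in for ScorerInitializerTags, and the registry shape is invented for illustration:

```python
from enum import Enum
from typing import Sequence

class ScorerTag(str, Enum):
    # Hypothetical stand-in for ScorerInitializerTags.
    REFUSAL = "REFUSAL"
    OBJECTIVE = "OBJECTIVE"

def select_scorers(registry: dict[str, list[str]], tags: Sequence[str]) -> list[str]:
    """Return names of scorers carrying any of the requested tags.

    Annotating `tags` as Sequence[str] (read-only, covariant) lets callers
    pass list[ScorerTag]; annotating it as list[str] would make a type
    checker reject that call, because list is invariant in its element type.
    """
    return [name for name, scorer_tags in registry.items()
            if any(t in scorer_tags for t in tags)]

registry = {"refusal_v1": ["REFUSAL"], "objective_composite": ["OBJECTIVE"]}
# str-mixin enum members compare equal to their values, so this works:
print(select_scorers(registry, [ScorerTag.REFUSAL]))  # -> ['refusal_v1']
```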
Updates the refusal scorer and shows how to do end-to-end metrics updates.
Many scorers depend on some core scorer: objective scorers wrap a SelfAskRefusalScorer inside a TrueFalseInverterScorer or TrueFalseCompositeScorer. When the refusal prompt changes (or a new variant performs better), every dependent scorer's identity hash changes too, invalidating all their cached metrics. This means re-running evaluations for the entire scorer suite, a process that takes hours, just because the underlying refusal prompt was tweaked. There was no way to evaluate refusal variants in isolation, pick the best one, and then selectively re-evaluate only the scorers affected by that change.
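As a sketch of why the invalidation cascades, here eval_hash is a made-up stand-in for the real identity-hash computation:

```python
import hashlib

def eval_hash(config: str, child_hashes: tuple[str, ...] = ()) -> str:
    # Toy identity hash: fold the scorer's own config together with the
    # hashes of its child scorers, so any child change propagates upward.
    payload = config + "|" + "|".join(child_hashes)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

refusal_v1 = eval_hash("refusal_prompt_v1")
refusal_v2 = eval_hash("refusal_prompt_v2")

# A wrapper such as an inverter hashes its own config plus the child's
# hash, so tweaking only the refusal prompt changes the wrapper's hash
# and invalidates metrics cached under the old hash.
inverter_v1 = eval_hash("true_false_inverter", (refusal_v1,))
inverter_v2 = eval_hash("true_false_inverter", (refusal_v2,))
print(inverter_v1 != inverter_v2)  # -> True
```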
This PR introduces tag-based evaluation and auto-detected dependencies so you can iterate on refusal scorers quickly
without re-running everything.
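The --tags flag is wired through argparse; a minimal sketch of what that might look like (only the flag name comes from this PR, the rest is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(description="Evaluate registered scorers")
parser.add_argument(
    "--tags",
    nargs="*",
    default=None,
    help="Only evaluate scorers registered with any of these tags",
)

# Two-step workflow: first evaluate just the refusal variants...
step1 = parser.parse_args(["--tags", "REFUSAL"])
# ...then re-run everything so dependents pick up the best refusal scorer.
step2 = parser.parse_args([])

print(step1.tags)  # -> ['REFUSAL']
print(step2.tags)  # -> None
```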
It also updates the refusal scorers and refusal scorer metrics. GPT-5 phrases refusals differently from earlier models, which made our existing refusal scorers less accurate; this PR is the result of working through the full update flow.
Refusal Scorer Human Dataset Updates
First, we ran refusal tests against GPT-5 and evaluated the refusals. Many differed from previous models' behavior, so we added many of these cases to the human-labeled datasets for both refusal and objective scorers. We also trimmed the refusal dataset to a more manageable size, reducing repetitive samples.
We bumped the dataset version, invalidating all previous metrics.
Refusal Scorer Update
This PR replaces the single default refusal scorer with 4 named variants, adds auto-dependency detection so wrapper scorers automatically use the best-performing refusal prompt, and provides tag-based batch evaluation via evaluate_scorers.py.
Scorer Registry Initializer Update
Dynamic best-refusal selection: the registry now picks the best refusal scorer by F1 (registered as best_refusal_f1 / default_refusal_scorer) and uses it for all other scorers that need a refusal scorer. Dependents are detected automatically via eval_hash matching; no explicit depends_on declaration is needed.
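A toy sketch of the auto-detection idea: recursively collect each scorer's child eval hashes, then flag any scorer whose children include a hash owned by a tagged scorer. The function names mirror _collect_child_eval_hashes() and find_dependents_of_tag(), but the data model here is invented:

```python
from dataclasses import dataclass, field

@dataclass
class Scorer:
    name: str
    eval_hash: str
    tags: set[str] = field(default_factory=set)
    children: list["Scorer"] = field(default_factory=list)

def collect_child_eval_hashes(scorer: Scorer) -> set[str]:
    # Recursively gather the eval hashes of all nested child scorers.
    hashes: set[str] = set()
    for child in scorer.children:
        hashes.add(child.eval_hash)
        hashes |= collect_child_eval_hashes(child)
    return hashes

def find_dependents_of_tag(registry: list[Scorer], tag: str) -> list[str]:
    # A scorer depends on the tag if any tagged scorer's hash appears
    # among its (recursive) children: no depends_on declaration needed.
    tagged_hashes = {s.eval_hash for s in registry if tag in s.tags}
    return [s.name for s in registry
            if collect_child_eval_hashes(s) & tagged_hashes]

refusal = Scorer("refusal_v1", "aaa", tags={"REFUSAL"})
inverter = Scorer("inverter", "bbb", children=[refusal])
unrelated = Scorer("unrelated", "ccc")
print(find_dependents_of_tag([refusal, inverter, unrelated], "REFUSAL"))
# -> ['inverter']
```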
Evaluation of different components
- objective
(credit: @fdubut for help)