Diagnose cross-modal retrieval gaps #28

Merged
brianmeyer merged 1 commit into master from codex/rec-168-cross-modal-diagnostics on May 17, 2026
@brianmeyer
Owner

Summary

  • Add an offline cross-modal diagnostics CLI that reads saved ablation JSON and ranks weak or under-sampled categories.
  • Generate checked-in JSON and Markdown diagnostics for the current benchmark artifact.
  • Preserve video_query_path in benchmark per-query rows and document the diagnostics flow in README/release docs.

Research grounding

Tests

  • python3 -m pytest -q tests/test_cross_modal_diagnostics.py tests/test_cross_modal_benchmark_defs.py
  • python3 -m pytest -q

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 95529a3f97

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +190 to +191
generic_count = sum(1 for query in query_values if query in GENERIC_MEDIA_QUERIES)
return generic_count > 0 and generic_count >= max(1, len(query_values) // 2)

P2: Require a true majority before tagging generic-query artifact

The len(query_values) // 2 threshold rounds down, so odd-sized categories are misclassified. For example, with 3 media queries and only 1 placeholder, the threshold is max(1, 3 // 2) = 1, so generic_count >= 1 is true and the tool emits generic_query_artifact. That can inflate priority_score and follow-up actions even though generic prompts are not the majority, which skews the ranked diagnosis for small categories. Use a ceiling threshold (>= ceil(n/2)) to require at least half, or > n/2 for a strict majority; the two differ only when n is even, where exactly half passes the first check but fails the second.
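To make the difference between the thresholds concrete, here is a minimal sketch that compares the floor check quoted in the review with the two suggested alternatives. The contents of `GENERIC_MEDIA_QUERIES` and all function names below are hypothetical stand-ins, since the real set and the enclosing function are not shown in this thread.

```python
from math import ceil

# Hypothetical placeholder set; the project's real GENERIC_MEDIA_QUERIES
# is not shown in this thread.
GENERIC_MEDIA_QUERIES = frozenset({"image", "video"})


def count_generic(query_values):
    """Count queries that are generic placeholders."""
    return sum(1 for query in query_values if query in GENERIC_MEDIA_QUERIES)


def is_generic_floor(query_values):
    """Threshold as quoted in the review: floor(n / 2), with a minimum of 1."""
    generic_count = count_generic(query_values)
    return generic_count > 0 and generic_count >= max(1, len(query_values) // 2)


def is_generic_ceil(query_values):
    """At least half, rounding up: generic_count >= ceil(n / 2)."""
    generic_count = count_generic(query_values)
    return generic_count > 0 and generic_count >= ceil(len(query_values) / 2)


def is_generic_strict(query_values):
    """Strict majority: generic_count > n / 2, written without division."""
    return 2 * count_generic(query_values) > len(query_values)


one_of_three = ["image", "a dog on a beach", "a red car"]
two_of_four = ["image", "video", "a dog on a beach", "a red car"]

for name, check in [
    ("floor", is_generic_floor),
    ("ceil", is_generic_ceil),
    ("strict", is_generic_strict),
]:
    print(name, check(one_of_three), check(two_of_four))
```

With one placeholder out of three queries, only the floor variant tags the category. With two out of four, the floor and ceiling variants both tag it and only the strict-majority variant does not, so the choice between `>= ceil(n/2)` and `> n/2` matters for even-sized categories.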

brianmeyer merged commit a224452 into master on May 17, 2026
4 checks passed
brianmeyer deleted the codex/rec-168-cross-modal-diagnostics branch on May 17, 2026 18:01