Diagnose cross-modal retrieval gaps by brianmeyer · Pull Request #28 · brianmeyer/recallforge

brianmeyer · 2026-05-17T17:58:53Z

Summary

Add an offline cross-modal diagnostics CLI that reads saved ablation JSON and ranks weak or under-sampled categories.
Generate checked-in JSON and Markdown diagnostics for the current benchmark artifact.
Preserve video_query_path in benchmark per-query rows and document the diagnostics flow in README/release docs.
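
A minimal sketch of what ranking weak or under-sampled categories could look like. The row fields (`category`, `hit`), the two thresholds, and the function name `rank_categories` are all assumptions made for illustration; the PR excerpt does not show the real schema or scoring.

```python
# Hypothetical thresholds; the PR does not state the real values.
MIN_QUERIES = 5      # below this, a category is treated as under-sampled
WEAK_HIT_RATE = 0.5  # below this, a category is treated as weak


def rank_categories(per_query_rows):
    """Group per-query rows by category and rank flagged categories first."""
    hits_by_category = {}
    for row in per_query_rows:
        # "category" and "hit" are assumed field names, not the real schema.
        hits_by_category.setdefault(row["category"], []).append(row["hit"])

    report = []
    for category, hits in hits_by_category.items():
        hit_rate = sum(hits) / len(hits)
        flags = []
        if len(hits) < MIN_QUERIES:
            flags.append("under_sampled")
        if hit_rate < WEAK_HIT_RATE:
            flags.append("weak")
        report.append(
            {"category": category, "n": len(hits), "hit_rate": hit_rate, "flags": flags}
        )

    # Most flags first, then lowest hit rate.
    report.sort(key=lambda item: (-len(item["flags"]), item["hit_rate"]))
    return report
```

Because the sketch only reads rows that are already saved, it runs offline with no model or index, which is the property the summary describes.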

Research grounding

BEIR-style qrels/results metrics: https://github.com/beir-cellar/beir
MTEB retrieval evaluation framework: https://github.com/embeddings-benchmark/mteb
ViDoRe pipeline evaluation patterns: https://github.com/illuin-tech/vidore-benchmark
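
For readers unfamiliar with the qrels/results layout, here is a plain-Python sketch of recall@k over the nested-dict format BEIR uses (`qrels` maps query ID to document ID to relevance grade, `results` maps query ID to document ID to score). It illustrates the data shape only; it is not the metric code in this PR, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(qrels, results, k):
    """Mean recall@k over queries that have at least one relevant document."""
    per_query = []
    for query_id, judged in qrels.items():
        positives = {doc_id for doc_id, grade in judged.items() if grade > 0}
        if not positives:
            continue  # skip queries with no relevant documents
        # Rank retrieved documents by descending score and keep the top k.
        ranked = sorted(
            results.get(query_id, {}).items(), key=lambda kv: kv[1], reverse=True
        )
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(positives & top_k) / len(positives))
    return sum(per_query) / len(per_query) if per_query else 0.0
```

Computing the metric per query before averaging is what makes per-category breakdowns possible: the same per-query values can be grouped by modality pair instead of pooled.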

Tests

python3 -m pytest -q tests/test_cross_modal_diagnostics.py tests/test_cross_modal_benchmark_defs.py
python3 -m pytest -q

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 95529a3f97


chatgpt-codex-connector · 2026-05-17T18:01:40Z

+    generic_count = sum(1 for query in query_values if query in GENERIC_MEDIA_QUERIES)
+    return generic_count > 0 and generic_count >= max(1, len(query_values) // 2)


Require a true majority before tagging generic-query artifact

Tighten the threshold here: with len(query_values) // 2, odd-sized categories are misclassified. For example, with 3 media queries and only 1 placeholder, the threshold is max(1, 3 // 2) = 1, so generic_count >= 1 is true. The tool then emits generic_query_artifact and can inflate priority_score and follow-up actions even though generic prompts are not the majority. This skews the ranked diagnosis for small categories. Use > n/2 for a strict majority, or >= ceil(n/2) if exactly half should also count.
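
The difference is easy to see side by side. The sketch below copies the flagged expression into one function and puts a strict-majority variant next to it. The function names and the contents of `GENERIC_MEDIA_QUERIES` are placeholders chosen for illustration; the real set and the enclosing function are not shown in this diff.

```python
# Placeholder values; the real GENERIC_MEDIA_QUERIES set is not shown in the diff.
GENERIC_MEDIA_QUERIES = {"image", "video", "audio"}


def is_generic_floor(query_values):
    """Threshold as written in the diff: floor(n / 2), with a minimum of 1."""
    generic_count = sum(1 for query in query_values if query in GENERIC_MEDIA_QUERIES)
    return generic_count > 0 and generic_count >= max(1, len(query_values) // 2)


def is_generic_strict(query_values):
    """Strict majority: more than half of the queries are generic."""
    generic_count = sum(1 for query in query_values if query in GENERIC_MEDIA_QUERIES)
    # Integer comparison avoids float division; an empty list is never a majority.
    return 2 * generic_count > len(query_values)


sample = ["video", "a dog catching a frisbee", "sunset over a harbor"]
print(is_generic_floor(sample))   # True: 1 >= max(1, 3 // 2)
print(is_generic_strict(sample))  # False: 2 * 1 is not greater than 3
```

Writing the check as `2 * generic_count > n` keeps everything in integers and drops the need for the separate `generic_count > 0` guard.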


Diagnose cross-modal retrieval gaps

95529a3

chatgpt-codex-connector Bot reviewed May 17, 2026


brianmeyer merged commit a224452 into master May 17, 2026
4 checks passed

brianmeyer deleted the codex/rec-168-cross-modal-diagnostics branch May 17, 2026 18:01

brianmeyer merged 1 commit into master from codex/rec-168-cross-modal-diagnostics
