2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -4,7 +4,7 @@ All notable changes to RecallForge will be documented in this file.

## [Unreleased]

-*Nothing yet.*
+- Replaced the tiny UAT video clips with compact episodic-memory fixtures, richer transcript sidecars, related artifact metadata, and regression coverage for the video corpus.

## [0.2.1] — 2026-05-17

2 changes: 1 addition & 1 deletion README.md
@@ -62,7 +62,7 @@ The reranker delivers **+20.7% R@1 over RRF fusion** and pushes R@10 to 97.8%. E

*Benchmark categories: text_only (30 queries), image_only (30 queries), long_query (12 queries), typo_query (20 queries). See `benchmarks/results/pipeline_ablation_modality_results.json` for full breakdown.*

-For release validation, use `benchmarks/cross_modal_ablation.py`. It checkpoints JSON output as it runs, so long MLX benchmark sessions still leave behind a partial artifact if interrupted. To turn that artifact into a ranked fix list, run `benchmarks/cross_modal_diagnostics.py`; the current report is in [docs/research/cross-modal-diagnostics.md](docs/research/cross-modal-diagnostics.md).
+For release validation, use `benchmarks/cross_modal_ablation.py`. It checkpoints JSON output as it runs, so long MLX benchmark sessions still leave behind a partial artifact if interrupted. The UAT video corpus now uses compact episodic fixtures with searchable transcript sidecars and related artifact metadata, so video queries exercise meeting, screen-recording, walkthrough, field, and recipe-style memories. To turn a benchmark artifact into a ranked fix list, run `benchmarks/cross_modal_diagnostics.py`; the current report is in [docs/research/cross-modal-diagnostics.md](docs/research/cross-modal-diagnostics.md).

### Latency & resource usage

32 changes: 16 additions & 16 deletions benchmarks/cross_modal_ablation.py
@@ -1001,37 +1001,37 @@ def _media_query_variants(
 TEXT_TO_VIDEO = [
     # EASY (6 queries)
     GroundTruth(
-        query="architecture walkthrough building tour presentation",
+        query="office walkthrough connecting floor plan and system architecture",
         relevant_paths=["videos/architecture_walkthrough.mp4"],
         category="text_to_video",
         difficulty="easy",
     ),
     GroundTruth(
-        query="coding demonstration software development tutorial",
+        query="screen recording debugging RecallForge video search test",
         relevant_paths=["videos/coding_demo.mp4"],
         category="text_to_video",
         difficulty="easy",
     ),
     GroundTruth(
-        query="cooking tutorial recipe demonstration video",
+        query="family dinner pasta recipe video with handwritten substitutions",
         relevant_paths=["videos/cooking_tutorial.mp4"],
         category="text_to_video",
         difficulty="easy",
     ),
     GroundTruth(
-        query="nature timelapse video forest mountains",
+        query="weekend trail scouting video forest mountain coast",
         relevant_paths=["videos/nature_timelapse.mp4"],
         category="text_to_video",
         difficulty="easy",
     ),
     GroundTruth(
-        query="whiteboard session brainstorming meeting recording",
+        query="product planning whiteboard meeting memory rollups",
         relevant_paths=["videos/whiteboard_session.mp4"],
         category="text_to_video",
         difficulty="easy",
     ),
     GroundTruth(
-        query="video content with transcript about buildings",
+        query="walkthrough transcript about floor plan architecture deck",
         relevant_paths=["videos/architecture_walkthrough.mp4", "videos/architecture_walkthrough.transcript.json"],
         category="text_to_video",
         difficulty="easy",
@@ -1040,33 +1040,33 @@ def _media_query_variants(

     # MEDIUM (6 queries)
     GroundTruth(
-        query="programming and software engineering video content",
+        query="developer screen recording and meeting notes about search pipeline",
         relevant_paths=["videos/coding_demo.mp4", "videos/whiteboard_session.mp4"],
         category="text_to_video",
         difficulty="medium",
         graded_relevance={"videos/coding_demo.mp4": 2, "videos/whiteboard_session.mp4": 1},
     ),
     GroundTruth(
-        query="food preparation and culinary instruction videos",
+        query="recipe memory with pasta sauce timing and grocery planning",
         relevant_paths=["videos/cooking_tutorial.mp4"],
         category="text_to_video",
         difficulty="medium",
     ),
     GroundTruth(
-        query="natural environment scenery video footage",
+        query="outdoor field clip with route planning and park notes",
         relevant_paths=["videos/nature_timelapse.mp4"],
         category="text_to_video",
         difficulty="medium",
     ),
     GroundTruth(
-        query="meeting recordings with transcripts for review",
+        query="meeting recordings with transcript action items for review",
         relevant_paths=["videos/whiteboard_session.mp4", "videos/whiteboard_session.transcript.json"],
         category="text_to_video",
         difficulty="medium",
         graded_relevance={"videos/whiteboard_session.mp4": 2, "videos/whiteboard_session.transcript.json": 2},
     ),
     GroundTruth(
-        query="educational video content with searchable transcripts",
+        query="searchable transcript memories from kitchen and developer videos",
         relevant_paths=["videos/cooking_tutorial.mp4", "videos/cooking_tutorial.transcript.json",
                         "videos/coding_demo.mp4", "videos/coding_demo.transcript.json"],
         category="text_to_video",
@@ -1075,7 +1075,7 @@ def _media_query_variants(
                           "videos/coding_demo.mp4": 2, "videos/coding_demo.transcript.json": 2},
     ),
     GroundTruth(
-        query="visual documentation of outdoor spaces",
+        query="visual documentation of outdoor spaces and walkthrough locations",
         relevant_paths=["videos/nature_timelapse.mp4", "videos/architecture_walkthrough.mp4"],
         category="text_to_video",
         difficulty="medium",
@@ -1084,7 +1084,7 @@ def _media_query_variants(

     # HARD (3 queries)
     GroundTruth(
-        query="multimedia content for learning and development",
+        query="episodic videos with procedural learning and follow-up actions",
         relevant_paths=["videos/cooking_tutorial.mp4", "videos/coding_demo.mp4", "videos/whiteboard_session.mp4",
                         "videos/cooking_tutorial.transcript.json", "videos/coding_demo.transcript.json", "videos/whiteboard_session.transcript.json"],
         category="text_to_video",
@@ -1093,7 +1093,7 @@ def _media_query_variants(
                           "videos/cooking_tutorial.transcript.json": 2, "videos/coding_demo.transcript.json": 2, "videos/whiteboard_session.transcript.json": 2},
     ),
     GroundTruth(
-        query="archived recordings with searchable text content",
+        query="archived recordings with field notes and architecture narration",
         relevant_paths=["videos/architecture_walkthrough.mp4", "videos/architecture_walkthrough.transcript.json",
                         "videos/nature_timelapse.mp4", "videos/nature_timelapse.transcript.json"],
         category="text_to_video",
@@ -1102,7 +1102,7 @@ def _media_query_variants(
                           "videos/nature_timelapse.mp4": 2, "videos/nature_timelapse.transcript.json": 2},
     ),
     GroundTruth(
-        query="comprehensive video library with transcripts",
+        query="comprehensive episodic video library with transcripts",
         relevant_paths=["videos/cooking_tutorial.mp4", "videos/coding_demo.mp4", "videos/whiteboard_session.mp4", "videos/architecture_walkthrough.mp4", "videos/nature_timelapse.mp4"],
         category="text_to_video",
         difficulty="hard",
@@ -1434,7 +1434,7 @@ def _media_query_variants(
         query="related text",
         query_type="video",
         video_query_path="videos/coding_demo.mp4",
-        relevant_paths=["text/tech_cybersecurity.md", "text/tech_cloud_computing.md"],
+        relevant_paths=["text/ai_agents.md", "text/tech_cloud_computing.md", "text/ai_embeddings.md"],
         category="video_to_text",
         difficulty="hard",
     ),
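Reviewer note: several of the entries above carry a `graded_relevance` mapping (2 for a primary hit, 1 for a secondary one). This diff does not show how the benchmark scores those grades, so the following is only a sketch of one common option, nDCG with linear gain. The function names `dcg` and `ndcg_at_k` are hypothetical and are not taken from `cross_modal_ablation.py`.

```python
import math


def dcg(gains: list[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked_paths: list[str], graded_relevance: dict[str, int], k: int = 10) -> float:
    """Hypothetical helper: nDCG@k for one query, given path -> grade labels."""
    # Paths missing from the mapping count as grade 0 (not relevant).
    gains = [graded_relevance.get(path, 0) for path in ranked_paths[:k]]
    # The ideal ranking lists the highest grades first.
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    labels = {"videos/coding_demo.mp4": 2, "videos/whiteboard_session.mp4": 1}
    ranking = ["videos/whiteboard_session.mp4", "videos/coding_demo.mp4"]
    print(f"nDCG@10 = {ndcg_at_k(ranking, labels):.3f}")
```

Under this scheme, returning the grade-1 whiteboard video ahead of the grade-2 coding demo scores below 1.0, which is the distinction a flat `relevant_paths` list cannot express.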
12 changes: 12 additions & 0 deletions docs/RELEASE.md
@@ -43,6 +43,18 @@ Then run the expanded benchmark:
.venv/bin/python benchmarks/cross_modal_ablation.py --backend mlx --expansion-profile caption_only --output benchmarks/results/cross_modal_ablation_results.json
```

The committed video corpus is an episodic fixture set rather than generic toy clips. Before trusting video-related benchmark changes, confirm the generated sidecars still include searchable `text`, timed segments, and related image/document metadata:

```bash
.venv/bin/python -m pytest -q tests/test_video_corpus.py tests/test_video_sidecars.py
```

The shell video-quality UAT uses a deterministic backend by default so CI and local smoke runs are not gated on live model quality. To exercise the installed vision-language backend on this host, opt in explicitly:

```bash
UAT_VIDEO_LIVE=1 bash tests/uat/test_video_quality.sh
```

The benchmark now checkpoints to JSON as it runs. If the run is interrupted, the output file still contains partial results plus progress metadata.
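
Reviewer note: the checkpointing code itself is not part of this diff. As a minimal sketch of how "partial results plus progress metadata" can survive an interruption, the example below rewrites the output file atomically after every query. The names `write_checkpoint` and `run_queries` and the `results` / `progress` keys are assumptions for illustration, not the benchmark's real schema.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint(output_path: Path, results: list[dict], total: int) -> None:
    """Atomically rewrite the checkpoint so an interrupted run leaves valid JSON."""
    payload = {
        "results": results,
        # Hypothetical progress metadata; the real field names are not shown in the diff.
        "progress": {"completed": len(results), "total": total},
    }
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(output_path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, output_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def run_queries(queries: list[str], output_path: Path) -> None:
    """Stand-in benchmark loop that checkpoints after every query."""
    results: list[dict] = []
    for query in queries:
        results.append({"query": query, "hits": []})  # placeholder for real retrieval
        write_checkpoint(output_path, results, total=len(queries))
```

Writing to a temporary file and renaming it means a reader never sees a half-written JSON document, at the cost of rewriting the whole file on each checkpoint.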

After a complete or partial benchmark run, generate the cross-modal diagnosis report:
59 changes: 59 additions & 0 deletions tests/test_video_corpus.py
@@ -0,0 +1,59 @@
"""Regression tests for the committed episodic video corpus."""

from __future__ import annotations

import importlib.util
import json
import sys
import unittest
from pathlib import Path


REPO_ROOT = Path(__file__).resolve().parent.parent
GENERATOR_PATH = REPO_ROOT / "tests" / "uat" / "helpers" / "generate_video_corpus.py"
VIDEOS_DIR = REPO_ROOT / "tests" / "uat" / "corpus" / "videos"


def _load_generator():
    spec = importlib.util.spec_from_file_location("generate_video_corpus", GENERATOR_PATH)
    module = importlib.util.module_from_spec(spec)
    assert spec.loader is not None
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)
    return module


class TestEpisodicVideoCorpus(unittest.TestCase):
    def test_generator_specs_are_rich_episodic_fixtures(self):
        module = _load_generator()

        self.assertEqual(len(module.VIDEOS), 5)
        for spec in module.VIDEOS:
            with self.subTest(video=spec["name"]):
                self.assertGreaterEqual(spec["duration"], 9)
                self.assertGreaterEqual(len(spec["images"]), 2)
                self.assertGreaterEqual(len(spec["transcript"]), 3)
                self.assertTrue(spec["scenario"])
                self.assertTrue(spec["notes"])
                self.assertTrue(spec["related_images"])
                self.assertTrue(spec["related_documents"])

    def test_committed_sidecars_include_searchable_transcript_text(self):
        sidecars = sorted(VIDEOS_DIR.glob("*.transcript.json"))

        self.assertEqual(len(sidecars), 5)
        for sidecar in sidecars:
            with self.subTest(sidecar=sidecar.name):
                payload = json.loads(sidecar.read_text(encoding="utf-8"))
                self.assertEqual(payload["memory_type"], "episodic_video_fixture")
                self.assertTrue(payload["scenario"])
                self.assertTrue(payload["description"])
                self.assertTrue(payload["notes"])
                self.assertTrue(payload["text"])
                self.assertGreaterEqual(len(payload["segments"]), 3)
                self.assertTrue(payload["related_images"])
                self.assertTrue(payload["related_documents"])


if __name__ == "__main__":
    unittest.main()
10 changes: 5 additions & 5 deletions tests/uat/README.md
@@ -6,7 +6,7 @@ Manual end-to-end test suite for RecallForge. Validates correctness gates, bench

- **Python 3.12+**
- **RecallForge installed:** `pip install -e .` (from repo root)
-- **ffmpeg** (for video frame extraction and synthetic video generation)
+- **ffmpeg** (for video frame extraction and regenerating committed video fixtures)
- **Backends:** torch (CPU/CUDA) and/or MLX (macOS ARM64)
- **Memory requirements:**
- MLX 4-bit embed mode: ~1.7GB
@@ -42,7 +42,7 @@ All tests live in `tests/uat/`. Each is self-contained and can be run independen
| `test_tiered_modes.sh` | Tiered modes (embed/hybrid/full) loading and behavior |
| `test_document_ingest.sh` | Document ingest (PDF/DOCX/PPTX extraction via CLI) |
| `test_video_ingest.sh` | Video ingest (transcript fallback + ffmpeg frame extraction) |
-| `test_video_quality.sh` | Video retrieval quality (text/image/video query coverage) |
+| `test_video_quality.sh` | Video retrieval quality (deterministic by default; set `UAT_VIDEO_LIVE=1` for live model retrieval) |
| `test_video_query_contract.sh` | Raw video query smoke test |
| `test_cross_modal.sh` | ★ CROSS-MODAL SEARCH (key differentiator) |
| `test_search_quality.sh` | Search quality (recall@5, MRR, edge cases, dedup) |
@@ -85,7 +85,7 @@ Benchmark tests are **informational** — they report metrics but don't block co
Tests use a committed video corpus and built-in text/image fixtures in `tests/uat/corpus/`:

### Video Corpus
-A committed set of test videos with known transcripts and ground-truth frames. Used by `test_video_ingest.sh`, `test_video_quality.sh`, and `test_video_query_contract.sh` to validate cross-modal retrieval on temporal media.
+A committed set of compact episodic video fixtures with known transcripts, related-image/document metadata, and ground-truth frames. The clips cover a screen recording, outdoor field clip, architecture walkthrough, kitchen recipe memory, and product-planning whiteboard session. Used by `test_video_ingest.sh`, `test_video_quality.sh`, and `test_video_query_contract.sh` to validate cross-modal retrieval on temporal media.

### Text Documents (15 files)
| Topic | Files |
@@ -171,7 +171,7 @@ Each test script exits 0 on success, 1 on any failure.

1. **Torch video crash on Qwen3-VL (REC-44):** Known issue where torch backend crashes during video frame processing with Qwen3-VL models. Workaround: use MLX backend on Apple Silicon or skip video tests when using Qwen3-VL with torch.

-2. **Synthetic test images:** Generated images are simple drawings, not real photos. Cross-modal accuracy will be lower than with real-world images. This is expected.
+2. **Compact local fixtures:** Images and videos are generated/curated to stay small, deterministic, and license-safe. The video corpus now uses episodic memory scenarios with transcripts and related artifacts, but broad public benchmark claims should still be validated against larger real-world datasets.

3. **First run is slow:** Models download on first use (~4GB per model). Subsequent runs use cached models.

@@ -185,7 +185,7 @@ Each test script exits 0 on success, 1 on any failure.

8. **Video ingest depends on host capabilities:** Transcript sidecars (`.srt`, `.vtt`, `.txt`) are always supported. Frame extraction runs when `ffmpeg` and `ffprobe` are installed; otherwise video UAT validates transcript-only fallback.

-9. **Raw video query requires ffmpeg:** `test_video_query_contract.sh` and raw-video portions of CLI/MCP/video-quality UAT require `ffmpeg` to generate valid synthetic video fixtures. Without it, those checks skip cleanly.
+9. **Raw video query requires ffmpeg:** `test_video_query_contract.sh` and raw-video portions of CLI/MCP/video-quality UAT require `ffmpeg` for frame extraction and video fixture regeneration. Without it, those checks skip cleanly.

10. **Document ingest is local-first:** DOCX and PPTX fixtures are extracted through built-in OOXML parsing. PDF ingestion uses a lightweight fallback extractor by default and gets richer parsing when optional PDF tooling is installed.

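Reviewer note: item 8 above describes a capability check, where frame extraction runs only if both `ffmpeg` and `ffprobe` are installed and the UAT otherwise falls back to transcripts. The real scripts are shell and are not shown in this diff. The sketch below only illustrates the decision in Python using `shutil.which`; the function name `video_ingest_mode` and the returned mode strings are hypothetical.

```python
import shutil
from typing import Callable, Optional


def video_ingest_mode(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Pick an ingest mode from the tools found on PATH.

    `which` is injectable so the decision can be exercised without ffmpeg installed.
    """
    has_tools = all(which(tool) is not None for tool in ("ffmpeg", "ffprobe"))
    return "frames+transcript" if has_tools else "transcript-only"


if __name__ == "__main__":
    print(f"video ingest mode on this host: {video_ingest_mode()}")
```

Requiring both binaries matches the wording of item 8; a host with `ffmpeg` but no `ffprobe` would take the transcript-only path in this sketch.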
47 changes: 33 additions & 14 deletions tests/uat/corpus/CORPUS_EXPANSION.md
@@ -6,12 +6,32 @@ This document describes the expanded RecallForge benchmark corpus and what addit

- **Text documents**: 54 files (15 original + 39 new)
- **Images**: 10 files (existing)
-- **Videos**: 5 `.mp4` files plus 5 transcript JSON placeholders
+- **Videos**: 5 compact episodic `.mp4` fixtures plus 5 rich transcript JSON sidecars
- **Documents**: 8 generated `.docx` / `.pptx` / `.pdf` files
- **Total corpus documents**: 82 registered in `CORPUS_DOCS`
-- **Total indexed benchmark items**: 77 searchable items (the transcript JSON placeholders are empty and not indexed)
+- **Total indexed benchmark items**: 82 searchable top-level/sidecar items, plus derived video frame and transcript child memories during video ingest
- **Total benchmark queries**: 231 queries across all modalities

## Episodic Video Corpus

REC-153 replaced the earlier tiny toy clips with a license-safe episodic fixture set. The files are still small enough to commit, but each video now resembles a real personal or work memory: a screen recording, a field clip, a walkthrough, a kitchen note, or a product-planning meeting.

| File | Memory Scenario | Primary Signals |
|------|-----------------|-----------------|
| `coding_demo.mp4` | RecallForge debugging screen recording | code editor, architecture board, action notes, reranking and transcript discussion |
| `nature_timelapse.mp4` | Weekend trail scouting phone clip | forest, mountain, coast, route planning, park/climate notes |
| `architecture_walkthrough.mp4` | Office and system architecture walkthrough | floor plan, service diagram, model diagram, milestone narration |
| `cooking_tutorial.mp4` | Weeknight family recipe memory | pasta, recipe substitutions, handwritten cooking notes |
| `whiteboard_session.mp4` | Product planning meeting | brainstorm board, parent/child memory rollups, benchmark scoring, release actions |

Each `.transcript.json` sidecar now includes:

- timed transcript segments used by video ingest
- a top-level `text` field so the sidecar can also be indexed as a searchable transcript artifact
- `scenario`, `notes`, `related_images`, and `related_documents` metadata for benchmark and documentation provenance

The design follows the same broad shape as episodic-memory video benchmarks such as [Ego4D Episodic Memory](https://ego4d-data.org/docs/benchmarks/episodic-memory/): queries should be able to recover an event, scene, moment, transcript detail, or related artifact from a video-backed memory.
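
Reviewer note: to make the sidecar shape described above concrete, here is a small illustrative payload and a completeness check. The top-level field names come from the assertions in `tests/test_video_corpus.py` in this PR. The per-segment keys (`start`, `end`, `text`), every value, and the helper `missing_sidecar_fields` are assumptions for illustration only.

```python
import json

# Top-level fields the regression test expects to be present and non-empty.
REQUIRED_FIELDS = (
    "memory_type",
    "scenario",
    "description",
    "notes",
    "text",
    "segments",
    "related_images",
    "related_documents",
)


def missing_sidecar_fields(payload: dict) -> list[str]:
    """Return the required fields that are absent or empty in a sidecar payload."""
    return [name for name in REQUIRED_FIELDS if not payload.get(name)]


# Illustrative payload only: values and segment keys are made up, not copied from the corpus.
example_sidecar = {
    "memory_type": "episodic_video_fixture",
    "scenario": "Product planning meeting",
    "description": "Whiteboard session about memory rollups (hypothetical wording).",
    "notes": "Follow up on benchmark scoring.",
    "text": "We agreed to roll child memories up into the parent video.",
    "segments": [
        {"start": 0.0, "end": 3.0, "text": "We agreed to roll child memories up"},
        {"start": 3.0, "end": 6.0, "text": "into the parent video."},
        {"start": 6.0, "end": 9.0, "text": "Next, benchmark scoring."},
    ],
    "related_images": ["images/example_board.jpg"],
    "related_documents": ["documents/example_plan.pdf"],
}

if __name__ == "__main__":
    print(json.dumps(missing_sidecar_fields(example_sidecar)))
```

A check like this mirrors the committed regression test but can be run against a single sidecar while regenerating fixtures.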

## New Text Documents Added (39 files)

### Technology (5 files)
@@ -120,18 +140,17 @@ To further expand the corpus for more comprehensive cross-modal testing, the fol
39. **travel_yosemite.jpg** - Yosemite National Park
40. **travel_grand_canyon.jpg** - Grand Canyon landscape

-### Recommended Videos to Add
-
-1. **tech_quantum_explainer.mp4** - Quantum computing explanation
-2. **tech_security_demo.mp4** - Cybersecurity demonstration
-3. **science_lab_experiment.mp4** - Science lab experiment
-4. **cooking_masterclass.mp4** - Professional cooking demonstration
-5. **sports_highlights.mp4** - Sports highlights reel
-6. **history_documentary.mp4** - Historical documentary clip
-7. **medicine_procedure.mp4** - Medical procedure video
-8. **music_concert.mp4** - Live music performance
-9. **art_gallery_tour.mp4** - Art gallery walkthrough
-10. **travel_vlog.mp4** - Travel destination vlog
+### Future Real-World Video Additions
+
+The committed corpus is intentionally compact and license-safe. Future benchmark expansions should add opt-in downloaded fixtures or locally supplied clips in these shapes:
+
+1. **meeting_recording_with_slides.mp4** - transcript-heavy meeting with visible slide/document references
+2. **screen_recording_debug_trace.mp4** - developer workflow with code, terminal output, and spoken issue context
+3. **mobile_walkthrough_errand.mp4** - personal memory clip with objects, location changes, and follow-up tasks
+4. **cooking_or_repair_procedure.mp4** - procedural video with step ordering and recipe/tool notes
+5. **document_review_session.mp4** - video that references PDFs, decks, and handwritten annotations
+6. **travel_or_field_visit_clip.mp4** - visually rich outdoor clip with route, weather, and place notes
+7. **classroom_or_tutorial_clip.mp4** - instructional video with transcript-heavy concepts and whiteboard imagery

## Benchmark Query Distribution

Binary file modified tests/uat/corpus/videos/architecture_walkthrough.mp4
Binary file not shown.