Flaky CI: RapidOCR model download from modelscope.cn fails intermittently #634

I ran into some flaky CI tests and had Claude investigate and suggest potential solutions. I opened this issue because I was unsure which fix to apply. Since the failure is only intermittent, it's low priority.
- https://github.com/generative-computing/mellea/actions/runs/23011038058/job/66821312518
- https://github.com/generative-computing/mellea/actions/runs/23011970171/job/66824765214

## Problem

`test_richdocument` tests fail intermittently in CI with a **502 Bad Gateway** from `modelscope.cn` when RapidOCR tries to download OCR model weights during Docling's PDF pipeline initialization.

This is flaky because:
- CI runners start fresh, so models aren't cached
- `modelscope.cn` (hosted in China) is unreliable from GitHub's US-based runners
- Locally, the models are cached in `.venv/lib/.../rapidocr/models/` after the first run, so the failure never reproduces

The test fixture (`test/stdlib/components/docs/test_richdocument.py`) fetches a PDF from arXiv and processes it through Docling, which initializes RapidOCR → triggers the model download → hits `modelscope.cn` → 502.

## Suggested Fix

Disable OCR in the Docling converter for this test. The arXiv PDF (1906.04043) is text-based and doesn't need OCR. This avoids RapidOCR initialization entirely, eliminating the `modelscope.cn` dependency without changing what the test validates.

This would require threading an OCR-disable option through `RichDocument.from_document_file` or configuring the `DocumentConverter` directly in the test fixture.
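As a rough sketch of the second option (configuring the `DocumentConverter` directly), the snippet below builds a converter with OCR turned off via Docling's `PdfPipelineOptions.do_ocr` flag. The helper names `pdf_options_without_ocr` and `build_converter_without_ocr` are hypothetical. How the converter would be passed into `RichDocument.from_document_file` is not shown, since that signature isn't described in this issue. The Docling imports are deferred into the functions so the module can be imported even where Docling is not installed.

```python
def pdf_options_without_ocr():
    """Return Docling PDF pipeline options with OCR disabled (hypothetical helper)."""
    # Imported lazily so this module loads even where docling is unavailable.
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    options = PdfPipelineOptions()
    # With OCR off, the PDF pipeline should not need an OCR engine, so the
    # RapidOCR model download is expected to be skipped (assumption based on
    # the analysis in this issue).
    options.do_ocr = False
    return options


def build_converter_without_ocr():
    """Return a DocumentConverter whose PDF pipeline has OCR disabled."""
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pdf_options_without_ocr()
            )
        }
    )


# Example use in a fixture (the URL form is an assumption; the issue only
# names the arXiv ID 1906.04043):
#
#     converter = build_converter_without_ocr()
#     result = converter.convert("https://arxiv.org/pdf/1906.04043")
#     document = result.document
```

If the option is threaded through `RichDocument.from_document_file` instead, the same `PdfPipelineOptions` object could presumably be constructed inside that method behind a keyword argument, which would keep the test fixture unchanged apart from the extra argument.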

## Alternatives Considered

- **Pre-download models in CI**: Adds CI complexity and still depends on `modelscope.cn` (it just moves the failure point)
- **Skip on network failure**: Masks real regressions when tests silently skip
- **Local test PDF / pre-converted JSON**: Licensing concerns with bundling the paper; pre-converted JSON skips testing `from_document_file`
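To make the drawback of the "skip on network failure" alternative concrete, here is a minimal sketch of what such a wrapper might look like. The decorator name `skip_on_network_failure` is hypothetical, and the exception types are an assumption: the issue does not say which exception the RapidOCR download actually raises on a 502. It uses `unittest.SkipTest` from the standard library, which pytest also reports as a skip.

```python
import functools
import unittest
import urllib.error

# Assumed set of "network-ish" errors; the real download may raise something else.
NETWORK_ERRORS = (urllib.error.URLError, ConnectionError, TimeoutError)


def skip_on_network_failure(test_func):
    """Turn network errors raised by a test into a skip (hypothetical helper)."""

    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except NETWORK_ERRORS as exc:
            # Every matching error becomes a skip, including ones caused by a
            # genuine bug (for example a wrong URL introduced by a code change).
            raise unittest.SkipTest(f"network failure: {exc!r}") from exc

    return wrapper


@skip_on_network_failure
def test_with_typo_in_url():
    # Stands in for a real regression that happens to surface as a network error.
    raise ConnectionError("could not connect to wrong-host.example")
```

Here `test_with_typo_in_url` would be reported as skipped rather than failed, which is the "silently masks regressions" concern noted above.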
