Flaky CI: RapidOCR model download from modelscope.cn fails intermittently #634

@ajbozarth

Description

I ran into some flaky CI tests and had Claude investigate and suggest potential solutions. I opened this issue because I was unsure which fix to apply; since the failure is only intermittent, it's low priority.

Problem

test_richdocument tests fail intermittently in CI with a 502 Bad Gateway from modelscope.cn when RapidOCR tries to download OCR model weights during Docling's PDF pipeline initialization.

This is flaky because:

  • CI runners start fresh, so models aren't cached
  • modelscope.cn (hosted in China) is unreliable from GitHub's US-based runners
  • Locally the models are cached in .venv/lib/.../rapidocr/models/ after the first run, so the failure never reproduces on a developer machine

The test fixture (test/stdlib/components/docs/test_richdocument.py) fetches a PDF from arXiv and processes it through Docling. The failure chain is: Docling initializes RapidOCR → RapidOCR triggers a model download → the download hits modelscope.cn → 502.

Suggested Fix

Disable OCR in the Docling converter for this test. The arXiv PDF (1906.04043) is text-based and doesn't need OCR. Disabling OCR should avoid RapidOCR initialization entirely, which removes the modelscope.cn dependency without changing what the test validates.

This would require threading an OCR-disable option through RichDocument.from_document_file or configuring the DocumentConverter directly in the test fixture.
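As a rough illustration of the second option (configuring the DocumentConverter directly), here is a minimal sketch that uses Docling's PdfPipelineOptions to turn OCR off. The helper name build_text_only_converter is hypothetical and not part of this repository, and how it would be wired into RichDocument.from_document_file is left open. Whether do_ocr = False skips every model download in the Docling version pinned here has not been verified and should be confirmed in CI.

```python
def build_text_only_converter():
    """Return a Docling DocumentConverter whose PDF pipeline has OCR disabled.

    Hypothetical helper for illustration only; the name is not from this repo.
    """
    # Imported lazily so that defining the helper does not itself require Docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # Assumption: with OCR off, the PDF pipeline should not need to
    # initialize RapidOCR or fetch its model weights.
    pipeline_options.do_ocr = False

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )
```

A test fixture could then call build_text_only_converter().convert(source) and read the parsed document from the result's document attribute, where source is the same PDF location the fixture already uses.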

Alternatives Considered

  • Pre-download models in CI: Adds CI complexity, still depends on modelscope.cn (just moves the failure point)
  • Skip on network failure: Masks real regressions when tests silently skip
  • Local test PDF / pre-converted JSON: Licensing concerns with bundling the paper; pre-converted JSON skips testing from_document_file
