I ran into some flaky CI tests and had Claude investigate and suggest potential solutions. I opened this issue because I was unsure which fix to apply. Since the failure is only intermittent, it's low priority.
Problem
test_richdocument tests fail intermittently in CI with a 502 Bad Gateway from modelscope.cn when RapidOCR tries to download OCR model weights during Docling's PDF pipeline initialization.
This is flaky because:
- CI runners start fresh, so models aren't cached
- modelscope.cn (hosted in China) is unreliable from GitHub's US-based runners
- Locally, the models are cached in .venv/lib/.../rapidocr/models/ after the first run, so this never reproduces
The test fixture (test/stdlib/components/docs/test_richdocument.py) fetches a PDF from arXiv and processes it through Docling, which initializes RapidOCR → triggers the model download → hits modelscope.cn → 502.
Suggested Fix
Disable OCR in the Docling converter for this test. The arxiv PDF (1906.04043) is text-based and doesn't need OCR. This avoids RapidOCR initialization entirely, eliminating the modelscope.cn dependency without changing what the test validates.
This would require threading an OCR-disable option through RichDocument.from_document_file or configuring the DocumentConverter directly in the test fixture.
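As a rough sketch of the second option (configuring the DocumentConverter directly in the test fixture), something like the following might work. The helper name build_converter_without_ocr is hypothetical, and how the resulting converter would be passed into RichDocument.from_document_file is not shown because that signature isn't covered here. The sketch assumes a Docling version that exposes PdfPipelineOptions.do_ocr and PdfFormatOption.

```python
def build_converter_without_ocr():
    """Return a Docling DocumentConverter whose PDF pipeline skips OCR.

    Hypothetical helper for the test fixture; not part of the existing code.
    """
    # Imported lazily so that merely importing the test module does not
    # require Docling to be installed or initialized.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # With OCR disabled, the pipeline should not need to initialize RapidOCR,
    # so no OCR weights should be requested from modelscope.cn.
    pipeline_options.do_ocr = False

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

The fixture could then call build_converter_without_ocr().convert(source) with the same PDF source it uses today. Note that this only removes the OCR download; any other models the PDF pipeline relies on may still be fetched at first use.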
Alternatives Considered
- Pre-download models in CI: Adds CI complexity, still depends on modelscope.cn (just moves the failure point)
- Skip on network failure: Masks real regressions when tests silently skip
- Local test PDF / pre-converted JSON: Licensing concerns with bundling the paper; pre-converted JSON skips testing from_document_file