How to add, update, and manage shelter policy documents in Retriever.

| Format | Extension | Max Size | Notes |
|---|---|---|---|
| Markdown | .md | 1 MB | Preferred for new docs |
| Plain text | .txt | 1 MB | Simple, universal |
| PDF | .pdf | 20 MB | ML-powered layout analysis, OCR, tables |
| Word | .docx | 20 MB | Most existing docs |
| PowerPoint | .pptx | 20 MB | Slides extracted as text |
| Excel | .xlsx | 20 MB | Tables extracted |
| HTML | .html, .htm | 20 MB | Web pages |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | 20 MB | OCR text extraction |

Document processing uses Docling for ML-powered parsing with structure-aware chunking.
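
Before committing a file, it can help to check it against the format table above. The following is a minimal sketch, not part of Retriever itself: the `check_document` helper and the `MAX_SIZE_BYTES` mapping are hypothetical names, and it assumes "1 MB" means 1024 × 1024 bytes.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: the table's "MB" is 1024 * 1024 bytes

# Size limits per extension, copied from the format table.
MAX_SIZE_BYTES = {
    ".md": 1 * MB,
    ".txt": 1 * MB,
    ".pdf": 20 * MB,
    ".docx": 20 * MB,
    ".pptx": 20 * MB,
    ".xlsx": 20 * MB,
    ".html": 20 * MB,
    ".htm": 20 * MB,
    ".png": 20 * MB,
    ".jpg": 20 * MB,
    ".jpeg": 20 * MB,
    ".tiff": 20 * MB,
    ".bmp": 20 * MB,
}


def check_document(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    limit = MAX_SIZE_BYTES.get(path.suffix.lower())
    if limit is None:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")
    elif path.stat().st_size > limit:
        problems.append(f"file exceeds {limit // MB} MB limit")
    return problems
```

Calling `check_document(Path("documents/new-policy.docx"))` returns an empty list when the file has a supported extension and is within its size limit.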
Documents live in the documents/ directory at the project root:
retriever/
├── documents/
│ ├── volunteer-handbook.md
│ ├── safety-procedures.docx
│ ├── check-in-guide.txt
│ └── ...
Best practices:
- Use clear section headers (H1, H2, H3)
- Keep paragraphs focused on one topic
- Use bullet points for lists
- Avoid tables with complex formatting
Example structure:
# Volunteer Handbook
## Check-In Procedures
All volunteers must check in at the front desk upon arrival.
### What to Bring
- Valid ID
- Signed waiver (if first visit)
- Comfortable closed-toe shoes
## Safety Guidelines
### Animal Handling
...

# Add document to documents/ folder
cp ~/Downloads/new-policy.docx documents/
# Commit to git
git add documents/new-policy.docx
git commit -m "docs: add new policy document"
git push

Via Admin UI:
- Log in as admin
- Go to Admin → Documents
- Click "Reindex All Documents"
Via API:
curl -X POST https://your-app.example.com/admin/reindex \
  -H "Authorization: Bearer $ADMIN_TOKEN"

Via CLI (development):
python scripts/index_documents.py

Ask a question that should be answered by the new document to confirm it's indexed correctly.
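
The API call can also be scripted from Python, for example in a deploy hook. This is a sketch using only the standard library; it mirrors the curl example (same `/admin/reindex` path and bearer token), while the function names `build_reindex_request` and `trigger_reindex` are hypothetical and the response format is not assumed.

```python
import urllib.request


def build_reindex_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/admin/reindex",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


def trigger_reindex(base_url: str, token: str, timeout: float = 60.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    so a returned value means the server accepted the request.
    """
    request = build_reindex_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A typical call would be `trigger_reindex("https://your-app.example.com", os.environ["ADMIN_TOKEN"])`, reading the token from the environment rather than hard-coding it.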
- Edit the document in documents/
- Commit and push changes
- Trigger reindexing (same as above)
The reindex process will:
- Remove old chunks from the document
- Create new chunks from updated content
- Generate fresh embeddings
- Delete the file from documents/
- Commit and push
- Trigger reindexing
Old chunks will be automatically removed.
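
The update and delete behavior described above amounts to syncing the index with the current contents of documents/. The sketch below illustrates that idea with a toy in-memory index; `ChunkIndex` is a hypothetical stand-in, not Retriever's actual store, and it omits embedding generation entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkIndex:
    """Toy in-memory stand-in for the real vector store (hypothetical)."""

    chunks: dict[str, list[str]] = field(default_factory=dict)

    def reindex(self, documents: dict[str, list[str]]) -> None:
        """Sync the index with `documents`, a mapping of filename -> chunk texts."""
        # Deleted files: drop every chunk whose source file no longer exists.
        for filename in list(self.chunks):
            if filename not in documents:
                del self.chunks[filename]
        # New or updated files: replace old chunks wholesale, never append,
        # so stale content cannot survive an update.
        for filename, new_chunks in documents.items():
            self.chunks[filename] = list(new_chunks)
```

Replacing a document's chunks as a whole, rather than diffing them, is the simplest way to guarantee that no outdated chunk remains searchable after an edit.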
Each document is automatically tagged with:
- Source filename
- Section headers (extracted from structure)
- Chunk position
This metadata appears in answer citations:
(Source: Volunteer Handbook, Section: Check-In Procedures)
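
As an illustration of how the three metadata fields could produce a citation like the one above, here is a minimal sketch. The `ChunkMetadata` class, its field names, and the filename-to-title convention are assumptions for the example, not Retriever's actual schema.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ChunkMetadata:
    source_filename: str              # e.g. "volunteer-handbook.md"
    section_headers: tuple[str, ...]  # heading path, outermost first
    position: int                     # chunk position within the document


def display_title(filename: str) -> str:
    """Turn 'volunteer-handbook.md' into 'Volunteer Handbook' (assumed convention)."""
    return Path(filename).stem.replace("-", " ").replace("_", " ").title()


def format_citation(meta: ChunkMetadata) -> str:
    """Render a citation using the deepest heading, if the chunk has one."""
    title = display_title(meta.source_filename)
    if not meta.section_headers:
        return f"(Source: {title})"
    return f"(Source: {title}, Section: {meta.section_headers[-1]})"
```

Because the citation is built from headings, documents with clear section headers yield more useful citations than unstructured text.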
Documents are split into chunks using Docling's HybridChunker:
- Token budget: 512 tokens per chunk (aligned with text-embedding-3-small)
- Structure-aware: Respects document headings, paragraphs, and tables
- Heading context: Each chunk includes its heading hierarchy for better retrieval
- Peer merging: Small adjacent chunks under the same heading are merged
What this means:
- Large documents become multiple searchable chunks
- Heading context improves retrieval for section-level queries
- Table content is properly extracted and searchable
- Token-aware splitting keeps each chunk within the embedding model's input limit
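
To make the ideas of a token budget, heading context, and peer merging concrete, here is a toy sketch. It is not Docling's HybridChunker and does not use its API: `chunk_sections` is a hypothetical function, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def chunk_sections(sections, max_tokens=512):
    """Chunk (headings, paragraph) pairs given in document order.

    Returns a list of chunk strings, each prefixed with its heading path.
    """
    pieces = []  # (headings, body) pairs; heading path + body fits the budget
    for headings, paragraph in sections:
        # The heading path is stored with every chunk, so it uses up budget too.
        budget = max_tokens - count_tokens(" ".join(headings))
        if budget <= 0:
            raise ValueError("max_tokens is too small for the heading path")
        words = paragraph.split()
        # Token budget: split oversized paragraphs into budget-sized runs of words.
        for start in range(0, len(words), budget):
            body = " ".join(words[start:start + budget])
            # Peer merging: join with the previous piece when it shares the
            # same heading path and the merged body still fits the budget.
            if (
                pieces
                and pieces[-1][0] == headings
                and count_tokens(pieces[-1][1]) + count_tokens(body) <= budget
            ):
                pieces[-1] = (headings, pieces[-1][1] + "\n" + body)
            else:
                pieces.append((headings, body))
    # Heading context: prepend the heading hierarchy to each chunk.
    return ["\n".join([*headings, body]) for headings, body in pieces]
```

With the default budget, two short paragraphs under "Check-In Procedures" end up in one chunk that begins with the heading path, while a paragraph under a different heading starts a new chunk.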
Do:
- Use descriptive section headers
- Define acronyms and jargon
- Include common phrasings ("sign in" and "check in")
- Keep policies in one authoritative document

Don't:
- Use images for important text
- Rely on complex table layouts
- Split related info across many small files
- Use inconsistent terminology
- Check that the document is in the documents/ folder
- Verify that reindexing completed successfully
- Check the admin dashboard for the document count
- Try asking a question with exact keywords from the document
- Verify that you reindexed after updating the document
- Clear the semantic cache if enabled:
curl -X POST https://your-app.example.com/admin/cache/clear
Review document structure:
- Add more section headers
- Break up very long paragraphs
- Ensure consistent formatting