Adding and Updating Documents

How to add, update, and manage shelter policy documents in Retriever.

Supported Formats

| Format | Extension | Max Size | Notes |
| --- | --- | --- | --- |
| Markdown | `.md` | 1 MB | Preferred for new docs |
| Plain text | `.txt` | 1 MB | Simple, universal |
| PDF | `.pdf` | 20 MB | ML-powered layout analysis, OCR, tables |
| Word | `.docx` | 20 MB | Most existing docs |
| PowerPoint | `.pptx` | 20 MB | Slides extracted as text |
| Excel | `.xlsx` | 20 MB | Tables extracted |
| HTML | `.html`, `.htm` | 20 MB | Web pages |
| Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` | 20 MB | OCR text extraction |

Document processing uses Docling for ML-powered parsing with structure-aware chunking.
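The limits in the table above can be checked before a file is committed. The following is a minimal sketch, not part of Retriever: the helper name `check_document` is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which the table does not specify.

```python
from pathlib import Path
from typing import Optional

MB = 1024 * 1024  # assumption: the table's "MB" is 1024 * 1024 bytes

# Size limits per extension, copied from the Supported Formats table.
MAX_BYTES = {
    ".md": 1 * MB, ".txt": 1 * MB,
    ".pdf": 20 * MB, ".docx": 20 * MB, ".pptx": 20 * MB, ".xlsx": 20 * MB,
    ".html": 20 * MB, ".htm": 20 * MB,
    ".png": 20 * MB, ".jpg": 20 * MB, ".jpeg": 20 * MB,
    ".tiff": 20 * MB, ".bmp": 20 * MB,
}


def check_document(name: str, size_bytes: int) -> Optional[str]:
    """Return a problem description, or None if the file looks acceptable."""
    ext = Path(name).suffix.lower()
    limit = MAX_BYTES.get(ext)
    if limit is None:
        return f"unsupported extension: {ext or '(none)'}"
    if size_bytes > limit:
        return f"too large: {size_bytes} bytes exceeds {limit} bytes"
    return None
```

For a file on disk, the size can be read with `Path(path).stat().st_size` and passed in alongside the filename.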

Document Location

Documents live in the documents/ directory at the project root:

retriever/
├── documents/
│   ├── volunteer-handbook.md
│   ├── safety-procedures.docx
│   ├── check-in-guide.txt
│   └── ...

Adding New Documents

1. Prepare the Document

Best practices:

Use clear section headers (H1, H2, H3)
Keep paragraphs focused on one topic
Use bullet points for lists
Avoid tables with complex formatting

Example structure:

# Volunteer Handbook

## Check-In Procedures

All volunteers must check in at the front desk upon arrival.

### What to Bring
- Valid ID
- Signed waiver (if first visit)
- Comfortable closed-toe shoes

## Safety Guidelines

### Animal Handling
...

2. Add to Repository

# Add document to documents/ folder
cp ~/Downloads/new-policy.docx documents/

# Commit to git
git add documents/new-policy.docx
git commit -m "docs: add new policy document"
git push

3. Trigger Reindexing

Via Admin UI:

Log in as admin
Go to Admin → Documents
Click "Reindex All Documents"

Via API:

curl -X POST https://your-app.example.com/admin/reindex \
     -H "Authorization: Bearer $ADMIN_TOKEN"

Via CLI (development):

python scripts/index_documents.py
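The API call can also be made from Python. This sketch mirrors the curl command above using only the standard library. The function names are hypothetical, the base URL is the same placeholder, and the shape of the response is not documented here, so only the status code is returned.

```python
import urllib.request


def build_reindex_request(base_url: str, admin_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST /admin/reindex request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/admin/reindex",
        method="POST",
        headers={"Authorization": f"Bearer {admin_token}"},
    )


def trigger_reindex(base_url: str, admin_token: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_reindex_request(base_url, admin_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting the request construction from the network call keeps the first function easy to inspect without contacting a server.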

4. Verify

Ask a question that should be answered by the new document to confirm it's indexed correctly.

Updating Existing Documents

Edit the document in documents/
Commit and push changes
Trigger reindexing (same as above)

The reindex process will:

Remove the document's old chunks from the index
Create new chunks from the updated content
Generate fresh embeddings for the new chunks
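The three reindex steps can be illustrated with a toy in-memory index. This is a sketch of the idea only, not Retriever's implementation: the index is assumed to be a plain list of dictionaries, and `embed` is a hypothetical callable standing in for the real embedding model.

```python
from typing import Callable


def reindex_document(
    index: list[dict],
    source: str,
    new_chunks: list[str],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Return a new index in which `source` has freshly embedded chunks."""
    # Step 1: remove old chunks that came from this document.
    kept = [chunk for chunk in index if chunk["source"] != source]
    # Steps 2 and 3: create new chunk records and embed each one.
    fresh = [
        {"source": source, "position": i, "text": text, "embedding": embed(text)}
        for i, text in enumerate(new_chunks)
    ]
    return kept + fresh
```

Because the old chunks are dropped before the new ones are added, a document that shrinks does not leave stale chunks behind.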

Removing Documents

Delete the file from documents/
Commit and push
Trigger reindexing

The deleted document's chunks are removed automatically during reindexing.
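Conceptually, removal means dropping every chunk whose source file is no longer in the documents directory. A minimal sketch under the same toy assumptions (an index held as a list of dictionaries with a `source` key, and hypothetical function names):

```python
from pathlib import Path


def list_document_names(documents_dir: str = "documents") -> set[str]:
    """Names of the files currently present in the documents directory."""
    return {p.name for p in Path(documents_dir).iterdir() if p.is_file()}


def prune_removed_sources(index: list[dict], existing: set[str]) -> list[dict]:
    """Keep only chunks whose source file still exists."""
    return [chunk for chunk in index if chunk["source"] in existing]
```

A typical call would be `prune_removed_sources(index, list_document_names())`, run from the project root.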

Document Metadata

Each chunk of a document is automatically tagged with:

Source filename
Section headers (extracted from structure)
Chunk position

This metadata appears in answer citations:

(Source: Volunteer Handbook, Section: Check-In Procedures)
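As an illustration of how such a citation could be assembled from chunk metadata, here is a small sketch. It assumes the display title is derived from the filename stem, which this guide does not confirm, and `format_citation` is a hypothetical name.

```python
from pathlib import Path


def format_citation(source_filename: str, section: str) -> str:
    """Render chunk metadata in the citation style shown above."""
    # Assumption: "volunteer-handbook.md" is displayed as "Volunteer Handbook".
    stem = Path(source_filename).stem
    title = stem.replace("-", " ").replace("_", " ").title()
    return f"(Source: {title}, Section: {section})"


print(format_citation("volunteer-handbook.md", "Check-In Procedures"))
# prints "(Source: Volunteer Handbook, Section: Check-In Procedures)"
```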

Chunking Details

Documents are split into chunks using Docling's HybridChunker:

Token budget: 512 tokens per chunk (aligned with text-embedding-3-small)
Structure-aware: Respects document headings, paragraphs, and tables
Heading context: Each chunk includes its heading hierarchy for better retrieval
Peer merging: Small adjacent chunks under the same heading are merged

What this means:

Large documents become multiple searchable chunks
Heading context improves retrieval for section-level queries
Table content is properly extracted and searchable
Token-aware splitting keeps each chunk well within the embedding model's input limit

Tips for Better Results

Do:

Use descriptive section headers
Define acronyms and jargon
Include common phrasings ("sign in" and "check in")
Keep policies in one authoritative document

Don't:

Use images for important text
Rely on complex table layouts
Split related info across many small files
Use inconsistent terminology

Troubleshooting

Document not appearing in answers?

Check that the document is in the documents/ folder
Verify that reindexing completed successfully
Check the document count on the admin dashboard
Try asking a question that uses exact keywords from the document

Outdated information in answers?

Verify that you reindexed after updating the document
Clear the semantic cache, if enabled:
```
curl -X POST https://your-app.example.com/admin/cache/clear \
     -H "Authorization: Bearer $ADMIN_TOKEN"
```

Chunks seem poorly split?

Review document structure:

Add more section headers
Break up very long paragraphs
Ensure consistent formatting
