Summary
After discovering image ingestion issues in Milvus (blocked by a 64KB JSON field limit) and Weaviate (a metadata type mismatch, now fixed), we should systematically test image ingestion across all supported vector databases to identify and fix any remaining issues.
Current Status
| Database | Status | Issues Found | Resolution |
| --- | --- | --- | --- |
| Milvus | ❌ Blocked | 64KB JSON field limit (#19) | Unfixable (Milvus limitation) |
| Weaviate | ✅ Working | Metadata type mismatch (#20) | ✅ Fixed (v0.8.2-71-g596e682) |
| Chroma | ❓ Untested | Unknown | TBD |
| Qdrant | ❓ Untested | Unknown | TBD |
| Supabase | ❓ Untested | Unknown | TBD |
| MongoDB | ❓ Untested | Unknown | TBD |
| Neo4j | ❓ Untested | Unknown | TBD |
Test Plan
Test Dataset
File: data/tamarkin/test/2022-tamarkin-auction-catalogue.pdf
- Location: auctionsmax-ai repo
- Size: 4.9MB
- Content: 26 text chunks + 253 images
- Image sizes: 5KB to 81KB (tests size limit handling)
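To see why the upper end of that size range matters, it helps to estimate the encoded size of an image. The following is a minimal sketch, assuming images are stored as base64 strings inside a size-limited field and that 1KB means 1024 bytes; `b64_len` and `fits_limit` are hypothetical helper names for illustration, not part of weave.

```shell
#!/usr/bin/env bash
# Base64 emits 4 output bytes for every 3 input bytes, rounding up (padding).
b64_len() {
  local raw_bytes=$1
  echo $(( (raw_bytes + 2) / 3 * 4 ))
}

# Succeeds when the base64 form of an image of the given raw size fits the limit.
fits_limit() {
  local raw_bytes=$1 limit_bytes=$2
  [ "$(b64_len "$raw_bytes")" -le "$limit_bytes" ]
}

limit=$(( 64 * 1024 ))   # the 64KB JSON field limit reported for Milvus (#19)
for kb in 5 48 49 81; do
  raw=$(( kb * 1024 ))
  if fits_limit "$raw" "$limit"; then
    verdict="fits"
  else
    verdict="exceeds limit"
  fi
  echo "${kb}KB raw -> $(b64_len "$raw") bytes base64: ${verdict}"
done
```

Under these assumptions an 81KB image encodes to 110592 bytes, and anything above 48KB raw no longer fits a 64KB field even before any surrounding JSON is counted. The real threshold depends on how each database stores the field.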
Test Procedure (Per Database)
1. Setup
# Ensure database is running (Docker or cloud)
# Configure in config.yaml
2. Create Collections
weave cols create AuctionListings --text --flat-metadata --<db-name>
weave cols create AuctionImages --image --flat-metadata --<db-name>
Expected: Collections created successfully
3. Ingest PDF
weave docs create AuctionListings \
  data/tamarkin/test/2022-tamarkin-auction-catalogue.pdf \
  --image-collection AuctionImages \
  --<db-name> \
  --batch-size 10
Expected:
- ✅ 26/26 text chunks created
- ✅ 253/253 images created
4. Verify Storage
weave docs count AuctionListings --<db-name>
# Expected: 26 documents
weave docs count AuctionImages --<db-name>
# Expected: 253 documents
5. Test Search
weave cols query AuctionImages "Leica M3 camera" --<db-name>
Expected: Returns relevant image results
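The five steps above can be wrapped in a small script so that every database is exercised the same way. This is a sketch, not a tested harness: `run` and `test_database` are hypothetical helpers, the database flag is passed through verbatim because the concrete spelling of `--<db-name>` for each database is not given here, and the default is a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Runs (or, by default, only prints) the per-database procedure.
# Assumption: the database selector is a single flag, passed through as-is.

PDF="data/tamarkin/test/2022-tamarkin-auction-catalogue.pdf"

# Print the command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

test_database() {
  local db_flag=$1
  # Step 2: create the text and image collections.
  run weave cols create AuctionListings --text --flat-metadata "$db_flag" || return 1
  run weave cols create AuctionImages --image --flat-metadata "$db_flag" || return 1
  # Step 3: ingest the PDF, routing images to the image collection.
  run weave docs create AuctionListings "$PDF" \
    --image-collection AuctionImages "$db_flag" --batch-size 10 || return 1
  # Step 4: verify storage (expected: 26 and 253 documents).
  run weave docs count AuctionListings "$db_flag" || return 1
  run weave docs count AuctionImages "$db_flag" || return 1
  # Step 5: test search.
  run weave cols query AuctionImages "Leica M3 camera" "$db_flag" || return 1
}
```

Calling `test_database "--<db-name>"` with the placeholder replaced by the real flag prints the six commands; setting `DRY_RUN=0` executes them and stops at the first failure.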
Issues to Look For
Based on Milvus and Weaviate findings, check for:
- Size Limits
  - JSON field limits
  - VARCHAR/String field limits
  - Binary field limits
  - Does it handle 81KB images?
- Schema Issues
  - Metadata format (string vs object)
  - Vector configuration
  - Index type specification
  - Field type mismatches
- Storage Issues
  - Base64 encoding overhead
  - Image data truncation
  - Silent failures
- Performance
  - Ingestion speed
  - Memory usage
  - Batch processing
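Silent failures are easiest to catch by comparing the number of stored documents against the number expected from the test dataset. The sketch below assumes a plain integer can be extracted from the output of `weave docs count` (its output format is not shown here, so the extraction is left to the caller); `check_count` is a hypothetical helper and the value 250 is an illustrative input, not a measured result.

```shell
#!/usr/bin/env bash
# Compare an expected document count against the count a database reports.
# Returns non-zero on mismatch so that a calling script can stop or log it.
check_count() {
  local label=$1 expected=$2 actual=$3
  if [ "$actual" -eq "$expected" ]; then
    echo "PASS ${label}: ${actual}/${expected}"
  else
    echo "FAIL ${label}: ${actual}/${expected}"
    return 1
  fi
}

# Illustrative inputs only: 26 text chunks stored, 250 of 253 images stored.
check_count "text chunks" 26 26
check_count "images" 253 250 || echo "possible silent failure or size-limit rejection"
```

A mismatch on the image collection alone would point toward the size-limit and truncation items in the list above rather than a general ingestion problem.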
Databases to Test
Priority 1 (Common Choices)
Chroma (local + cloud)
- Docker available: Yes
- Cloud available: Yes
- Expected issues: Unknown
Qdrant (local + cloud)
- Docker available: Yes
- Cloud available: Yes
- Expected issues: Unknown
Priority 2 (Enterprise)
Supabase (PGVector)
- Docker available: Yes
- Cloud available: Yes
- Expected issues: Possible PG JSON limits
MongoDB Atlas (Vector Search)
- Docker available: Yes
- Cloud available: Yes
- Expected issues: BSON size limits?
Neo4j (Vector + Graph)
- Docker available: Yes
- Cloud available: Yes
- Expected issues: Property size limits?
Success Criteria
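For each database, the expected results of the test procedure should all hold: both collections are created, 26/26 text chunks and 253/253 images are ingested, the stored counts match, and the search returns relevant image results.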
Deliverables
- Test Results Document showing status for each DB
- Issues created for any bugs found
- Documentation updates with DB recommendations
- Schema fixes where needed
Timeline
- Phase 1 (Priority 1 DBs): Test Chroma + Qdrant
- Phase 2 (Priority 2 DBs): Test Supabase + MongoDB + Neo4j
- Phase 3 (Documentation): Update docs with recommendations
Recommendation Matrix (Goal)
After testing, create a recommendation matrix:
| Use Case | Recommended DB | Why |
| --- | --- | --- |
| Small images (<64KB) | Milvus | Fast, mature, widely used |
| Large images (>64KB) | Weaviate | No size limits, proven working |
| PG-based stack | Supabase | Native PG integration |
| Graph + Vectors | Neo4j | Hybrid graph+vector queries |
| Cloud-first | Qdrant/Chroma | Managed cloud options |
Status: Planning phase
Next Action: Begin testing Chroma (Priority 1)