
test: image ingestion across all supported vector databases #21

@maximilien

Summary

After discovering and fixing image ingestion issues in Milvus and Weaviate, we should systematically test image ingestion across all supported vector databases to identify and fix any remaining issues.

Current Status

| Database | Status | Issues Found | Resolution |
|----------|--------|--------------|------------|
| Milvus | ❌ Blocked | 64KB JSON field limit (#19) | Unfixable (Milvus limitation) |
| Weaviate | ✅ Working | Metadata type mismatch (#20) | ✅ Fixed (v0.8.2-71-g596e682) |
| Chroma | ❓ Untested | Unknown | TBD |
| Qdrant | ❓ Untested | Unknown | TBD |
| Supabase | ❓ Untested | Unknown | TBD |
| MongoDB | ❓ Untested | Unknown | TBD |
| Neo4j | ❓ Untested | Unknown | TBD |

Test Plan

Test Dataset

File: data/tamarkin/test/2022-tamarkin-auction-catalogue.pdf

  • Location: auctionsmax-ai repo
  • Size: 4.9MB
  • Content: 26 text chunks + 253 images
  • Image sizes: 5KB to 81KB (tests size limit handling)

Test Procedure (Per Database)

1. Setup

# Ensure database is running (Docker or cloud)
# Configure in config.yaml

2. Create Collections

weave cols create AuctionListings --text --flat-metadata --<db-name>
weave cols create AuctionImages --image --flat-metadata --<db-name>

Expected: Collections created successfully

3. Ingest PDF

weave docs create AuctionListings \
  data/tamarkin/test/2022-tamarkin-auction-catalogue.pdf \
  --image-collection AuctionImages \
  --<db-name> \
  --batch-size 10

Expected:

  • ✅ 26/26 text chunks created
  • ✅ 253/253 images created

4. Verify Storage

weave docs count AuctionListings --<db-name>
# Expected: 26 documents

weave docs count AuctionImages --<db-name>
# Expected: 253 documents
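
To make count mismatches hard to miss, the verification step could be wrapped in a small check. This is only a sketch: `check_count` and `first_int` are hypothetical helpers, not part of weave, and the output format of `weave docs count` is not shown in this issue, so the parsing below simply assumes the count is the first integer printed.

```shell
# Compare an observed document count with the expected one.
# $1 = label, $2 = expected count, $3 = observed count
# Returns 0 on match and 1 on mismatch so a calling script can stop early.
check_count() {
  if [ "$3" = "$2" ]; then
    echo "OK   $1: $3/$2"
  else
    echo "FAIL $1: got '$3', expected $2"
    return 1
  fi
}

# Extract the first integer from whatever is piped in.
# ASSUMPTION: the count is the first number in the command output;
# adjust this if the real output starts with other digits.
first_int() {
  grep -oE '[0-9]+' | head -n 1
}

# Hypothetical usage (needs weave and a running database):
#   n=$(weave docs count AuctionImages --<db-name> | first_int)
#   check_count AuctionImages 253 "$n"
check_count AuctionListings 26 26   # prints "OK   AuctionListings: 26/26"
```

An empty or partial count then shows up as an explicit FAIL line rather than a number that has to be compared by eye.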

5. Test Search

weave cols query AuctionImages "Leica M3 camera" --<db-name>

Expected: Returns relevant image results
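
Since the same steps repeat for every database, it may help to generate them from one place. The sketch below only prints the commands from steps 2 to 5 for a given flag and does not run anything. `print_test_plan` is a hypothetical helper, and the real flag for each database (whatever replaces `--<db-name>`) is not given in this issue.

```shell
# Print (not run) the per-database test commands for one database flag.
# ASSUMPTION: $1 is whatever replaces --<db-name> in the steps above.
PDF="data/tamarkin/test/2022-tamarkin-auction-catalogue.pdf"

print_test_plan() {
  db_flag="$1"
  cat <<EOF
weave cols create AuctionListings --text --flat-metadata $db_flag
weave cols create AuctionImages --image --flat-metadata $db_flag
weave docs create AuctionListings $PDF --image-collection AuctionImages $db_flag --batch-size 10
weave docs count AuctionListings $db_flag
weave docs count AuctionImages $db_flag
weave cols query AuctionImages "Leica M3 camera" $db_flag
EOF
}

# The placeholder must be replaced with a real flag before executing anything.
print_test_plan "--<db-name>"
```

Printing first keeps the run reviewable; once the flag is filled in, the output can be saved per database as a record of exactly what was executed.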

Issues to Look For

Based on Milvus and Weaviate findings, check for:

  1. Size Limits

    • JSON field limits
    • VARCHAR/String field limits
    • Binary field limits
    • Does it handle 81KB images?
  2. Schema Issues

    • Metadata format (string vs object)
    • Vector configuration
    • Index type specification
    • Field type mismatches
  3. Storage Issues

    • Base64 encoding overhead
    • Image data truncation
    • Silent failures
  4. Performance

    • Ingestion speed
    • Memory usage
    • Batch processing
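
The base64 overhead listed above can be estimated before running any test. The arithmetic below is a rough sketch under stated assumptions: images are stored base64-encoded in a string or JSON field, 1KB means 1024 bytes, and the JSON wrapper and other metadata are ignored. `b64_len` and `max_raw_for_limit` are hypothetical helper names.

```shell
# Base64 length in bytes for $1 input bytes (with padding):
# every group of up to 3 input bytes becomes 4 output characters.
b64_len() {
  echo $(( ($1 + 2) / 3 * 4 ))
}

# Largest raw payload whose base64 form fits in a field of $1 bytes.
max_raw_for_limit() {
  echo $(( $1 / 4 * 3 ))
}

b64_len $((81 * 1024))            # 81KB image -> 110592 bytes (108KB) encoded
max_raw_for_limit $((64 * 1024))  # 64KB field -> 49152 bytes (48KB) raw
```

Under these assumptions the largest test image grows to about 108KB once encoded, and a 64KB field would hold only about 48KB of raw image data, so the effective raw-size threshold is lower than the field limit suggests.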

Databases to Test

Priority 1 (Common Choices)

Chroma (local + cloud)

  • Docker available: Yes
  • Cloud available: Yes
  • Expected issues: Unknown

Qdrant (local + cloud)

  • Docker available: Yes
  • Cloud available: Yes
  • Expected issues: Unknown

Priority 2 (Enterprise)

Supabase (PGVector)

  • Docker available: Yes
  • Cloud available: Yes
  • Expected issues: Possible PostgreSQL JSON/JSONB size limits

MongoDB Atlas (Vector Search)

  • Docker available: Yes
  • Cloud available: Yes
  • Expected issues: BSON document size limit (16MB per document)?

Neo4j (Vector + Graph)

  • Docker available: Yes
  • Cloud available: Yes
  • Expected issues: Property size limits?

Success Criteria

For each database:

  • Collections create successfully
  • All 26 text chunks ingest
  • All 253 images ingest (including 81KB image)
  • Document counts match expected
  • Search queries work
  • No silent failures or truncation

Deliverables

  1. Test Results Document showing status for each DB
  2. Issues created for any bugs found
  3. Documentation updates with DB recommendations
  4. Schema fixes where needed

Timeline

  • Phase 1 (Priority 1 DBs): Test Chroma + Qdrant
  • Phase 2 (Priority 2 DBs): Test Supabase + MongoDB + Neo4j
  • Phase 3 (Documentation): Update docs with recommendations

Recommendation Matrix (Goal)

After testing, create a recommendation matrix:

| Use Case | Recommended DB | Why |
|----------|----------------|-----|
| Small images (<64KB) | Milvus | Fast, mature, widely used |
| Large images (>64KB) | Weaviate | No size limits, proven working |
| PG-based stack | Supabase | Native PG integration |
| Graph + Vectors | Neo4j | Hybrid graph+vector queries |
| Cloud-first | Qdrant/Chroma | Managed cloud options |

Status: Planning phase
Next Action: Begin testing Chroma (Priority 1)

Labels: enhancement, stale
