feat: batch file ingestion (weave docs create-batch) #38

@maximilien

Problem

There is no native multi-file ingestion command. Client0 wrote a 255-line bash script to
process 20 PDFs sequentially with retry, Milvus restart, and skip/resume logic. This
complexity belongs inside weave-cli.

Proposed Solution

New subcommand `weave docs create-batch`:

```bash
weave docs create-batch AuctionListings "data/pdfs/*-catalogue.pdf" \
  --milvus-local \
  --embedding text-embedding-3-small \
  --skip-existing \
  --timeout 30m \
  --delay 10s \
  --max-retries 3 \
  --log-file logs/ingestion.log \
  --checkpoint-file .ingestion-checkpoint.json
```

Key Features

  1. Glob expansion — already in v0.9.27
  2. Configurable delay between files (`--delay 10s`) — prevents VDB memory buildup
  3. Retry with backoff (`--max-retries 3`, `--retry-delay 30s`) — auto-retry failed files
  4. Checkpoint/resume (`--checkpoint-file`) — save state after each file, resume on crash
  5. Structured log file (`--log-file`) — timestamped append-mode log
  6. Auto-create image collection — if `--image-collection` doesn't exist, create it
  7. Batch summary at completion:
    ```
    ✅ Batch complete: 9/9 succeeded, 0 failed, 2 skipped
    Duration: 1h 23m | Log: logs/ingestion.log
    ```
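To make features 2 and 3 concrete, here is a minimal Python sketch of the per-file loop. It is an illustration of the intended behavior, not the weave-cli implementation. The names `run_batch` and `ingest` are hypothetical. Two behaviors are assumptions, since the issue does not specify them: "backoff" is taken to mean doubling `--retry-delay` on each retry, and `--max-retries 3` is taken to mean three retries after the first attempt.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_batch(
    files: Iterable[str],
    ingest: Callable[[str], None],   # hypothetical per-file ingestion callback
    max_retries: int = 3,            # mirrors --max-retries
    retry_delay: float = 30.0,       # mirrors --retry-delay, in seconds
    delay: float = 10.0,             # mirrors --delay, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[str]]:
    """Ingest files one by one; return (succeeded, failed)."""
    succeeded: List[str] = []
    failed: List[str] = []
    file_list = list(files)
    for index, path in enumerate(file_list):
        for attempt in range(max_retries + 1):
            try:
                ingest(path)
                succeeded.append(path)
                break
            except Exception:
                if attempt == max_retries:
                    # Retries exhausted: record the failure and move on.
                    failed.append(path)
                else:
                    # Assumed exponential backoff: 30s, 60s, 120s, ...
                    sleep(retry_delay * 2 ** attempt)
        # Pause between files (not after the last one) so the vector
        # database has time to release memory.
        if index < len(file_list) - 1:
            sleep(delay)
    return succeeded, failed
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. A failed file does not abort the batch, which matches the summary line reporting succeeded and failed counts together.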

Checkpoint File Format

```json
{
  "collection": "AuctionListings",
  "started": "2026-02-17T10:00:00Z",
  "completed": [
    {"file": "2017-catalogue.pdf", "chunks": 28, "at": "2026-02-17T10:12:33Z"}
  ],
  "failed": [],
  "skipped": []
}
```
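A rough Python sketch of how this checkpoint could drive resume, offered as an illustration only and not as the weave-cli implementation. All function names are hypothetical. It assumes completed files are keyed by base name, as in the example above, and that writing through a temporary file followed by a rename is an acceptable way to avoid a half-written checkpoint after a crash. The shape of `failed` and `skipped` entries is not specified in this issue, so the sketch leaves them untouched.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def _now() -> str:
    # UTC timestamp in the same format as the example checkpoint.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def load_checkpoint(path: str, collection: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {
        "collection": collection,
        "started": _now(),
        "completed": [],
        "failed": [],
        "skipped": [],
    }


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)


def mark_completed(state: dict, file: str, chunks: int) -> None:
    """Record one successfully ingested file."""
    state["completed"].append({"file": file, "chunks": chunks, "at": _now()})


def pending(state: dict, files: list) -> list:
    """Files not yet recorded as completed (compared by base name)."""
    done = {entry["file"] for entry in state["completed"]}
    return [f for f in files if os.path.basename(f) not in done]
```

On startup the command would call `load_checkpoint`, ingest only the files returned by `pending`, and call `mark_completed` followed by `save_checkpoint` after each file, so a crash loses at most the file in progress.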

Client0 Impact

Two `weave docs create-batch` calls replace the entire 255-line bash pipeline script.

Priority

P1 — v0.9.29 target (~10 hours)

Labels

enhancement, stale