feat: vision model support — image-to-text analysis capability #35

@stackbilt-admin

Description

Summary

FoodFiles needs to route vision/image analysis calls through llm-providers for its core photo-to-recipe workflow: a user uploads a photograph of a recipe card or dish, and the system extracts structured recipe data (ingredients, instructions, nutrition estimates) from the image.

Currently the FoodFiles demo uses static/hardcoded recipes. The real processing path needs vision model support in llm-providers so we can send an image (base64 or URL) and get structured text back.

Requested capability

A way to send an image (or an image plus a text prompt) through the existing llm-providers routing/fallback infrastructure and receive a text completion back. Essentially, either extend generate() or add a new analyzeImage() method that accepts a multimodal message (image + prompt) and routes it to a vision-capable model.
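To make the request concrete, here is one possible shape for a multimodal message. Everything here (ContentPart, MultimodalMessage, buildImageMessage) is a hypothetical sketch of what the caller-facing type could look like, not the actual llm-providers interface:

```typescript
// Hypothetical multimodal message shape for llm-providers.
// The real type names and fields are up to the llm-providers team.

type ContentPart =
  | { type: "text"; text: string }
  | { type: "image"; source: { kind: "base64"; mediaType: string; data: string } }
  | { type: "image"; source: { kind: "url"; url: string } };

interface MultimodalMessage {
  role: "user" | "assistant";
  content: ContentPart[];
}

// Convenience helper: pair one image with an extraction prompt.
function buildImageMessage(
  prompt: string,
  imageBase64: string,
  mediaType: string,
): MultimodalMessage {
  return {
    role: "user",
    content: [
      { type: "image", source: { kind: "base64", mediaType, data: imageBase64 } },
      { type: "text", text: prompt },
    ],
  };
}
```

A text-only call would just pass a single text part, so existing generate() callers would be unaffected if the content field accepted `string | ContentPart[]`.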

Use case

User uploads photo of handwritten recipe card
  → edge-auth: authn + quota check
  → llm-providers: vision model extracts text + structure from image
  → foodfiles-v2: formats into structured recipe, stores in D1
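The flow above can be sketched as a single orchestration function, with the two upstream services modeled as injected interfaces. All names here (EdgeAuth, LlmProviders, analyzeImage, extractRecipe) are illustrative, not the real service bindings:

```typescript
// Illustrative photo-to-recipe pipeline for foodfiles-v2.
// EdgeAuth and LlmProviders stand in for the real service bindings.

interface EdgeAuth {
  check(userToken: string): Promise<{ ok: boolean; reason?: string }>;
}

interface LlmProviders {
  // Assumed vision entry point; see "Requested capability" above.
  analyzeImage(prompt: string, imageBase64: string): Promise<string>;
}

async function extractRecipe(
  auth: EdgeAuth,
  llm: LlmProviders,
  userToken: string,
  imageBase64: string,
): Promise<string> {
  const session = await auth.check(userToken); // authn + quota check
  if (!session.ok) throw new Error(`auth failed: ${session.reason}`);

  // Vision model turns the photo into structured text; foodfiles-v2
  // would then validate this output and store the recipe in D1.
  return llm.analyzeImage(
    "Extract ingredients, instructions, and nutrition estimates as JSON.",
    imageBase64,
  );
}
```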

Models to consider

  • Claude (Anthropic) — vision via messages API, strong at structured extraction
  • GPT-4o (OpenAI) — vision via chat completions, good at OCR-like tasks
  • Gemini (Google) — multimodal native
  • Cloudflare Workers AI — if any vision models are available on the platform

The choice of model/provider can be opaque to FoodFiles — we just need to send an image and a prompt and get structured text back. The fallback chain and cost routing should work the same as text-only calls.
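One way to keep the fallback chain and cost routing untouched is to filter the existing chain down to vision-capable entries when a request contains an image. The ProviderEntry shape below is an assumption about how llm-providers models its chain:

```typescript
// Sketch of capability-aware fallback: reuse the existing chain, but
// restrict it to vision-capable entries for image requests.
// ProviderEntry and its fields are hypothetical.

interface ProviderEntry {
  name: string;
  model: string;
  vision: boolean;      // does this model accept image inputs?
  costPerMTok: number;  // consumed by the existing cost routing
}

function visionChain(chain: ProviderEntry[]): ProviderEntry[] {
  const capable = chain.filter((p) => p.vision);
  if (capable.length === 0) {
    throw new Error("no vision-capable provider configured in fallback chain");
  }
  // Preserve the chain's existing order so cost routing behaves
  // exactly as it does for text-only calls.
  return capable;
}
```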

Questions for the llm-providers team

  1. Does the current generate() interface already support multimodal messages (image content parts), so that we only need to ensure at least one provider in the chain offers a vision-capable model?
  2. Or does this need a new method/interface for image inputs?
  3. Any considerations around image size limits, base64 vs URL, or preprocessing that should happen caller-side vs provider-side?
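On question 3: the providers accept images in different wire shapes, which is an argument for a neutral caller-side format translated per-provider inside llm-providers. The shapes below reflect the public Anthropic Messages and OpenAI Chat Completions APIs; the NeutralImage type is an assumption about the caller-facing format:

```typescript
// Per-provider translation of a neutral image reference.
// NeutralImage is a hypothetical caller-facing type; the output shapes
// follow the documented Anthropic and OpenAI request formats.

type NeutralImage =
  | { kind: "base64"; mediaType: string; data: string }
  | { kind: "url"; url: string };

function toAnthropicPart(img: NeutralImage) {
  // Anthropic: image block with a base64 or url source.
  if (img.kind === "base64") {
    return {
      type: "image",
      source: { type: "base64", media_type: img.mediaType, data: img.data },
    };
  }
  return { type: "image", source: { type: "url", url: img.url } };
}

function toOpenAIPart(img: NeutralImage) {
  // OpenAI: image_url part; base64 goes in as a data URL.
  const url =
    img.kind === "url" ? img.url : `data:${img.mediaType};base64,${img.data}`;
  return { type: "image_url", image_url: { url } };
}
```

Either way, size limits differ per provider, so FoodFiles would likely still want caller-side downscaling of large photos before upload.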

Context

  • Consumer: foodfiles-v2 worker (Stackbilt-dev/foodfiles)
  • Auth: edge-auth service binding (already wired)
  • Current demo: static recipes in RecipeGeneratorDemo.tsx — no live inference
  • The old implementation used a direct Groq API key in the worker, which was flagged as a security issue (Stackbilt-dev/foodfiles#65) and removed

Filed from: Stackbilt-dev/foodfiles context (editorial design Phase 1 + demo section work)
