Summary
FoodFiles needs to route vision/image analysis calls through llm-providers for its core photo-to-recipe workflow: a user uploads a photograph of a recipe card or dish, and the system extracts structured recipe data (ingredients, instructions, nutrition estimates) from the image.
Currently the FoodFiles demo uses static/hardcoded recipes. The real processing path needs vision model support in llm-providers so we can send an image (base64 or URL) and get structured text back.
Requested capability
A way to send an image (or image + text prompt) through the existing llm-providers routing/fallback infrastructure and receive a text completion. Essentially: either the existing generate() method or a new analyzeImage() method that accepts a multimodal message (image + prompt) and routes it to a vision-capable model.
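To make the request shape concrete, here is a minimal sketch of what a caller-side multimodal message might look like. All names here (ContentPart, GenerateRequest, buildRecipeExtractionRequest) are hypothetical illustrations, not the current llm-providers API:

```typescript
// Hypothetical content-part types for a multimodal message.
type TextPart = { type: "text"; text: string };
type ImagePart = {
  type: "image";
  // Either a base64 payload with its media type, or a fetchable URL.
  source:
    | { kind: "base64"; mediaType: string; data: string }
    | { kind: "url"; url: string };
};
type ContentPart = TextPart | ImagePart;

interface GenerateRequest {
  messages: { role: "user" | "assistant"; content: ContentPart[] }[];
}

// Build the request FoodFiles would send for photo-to-recipe extraction.
function buildRecipeExtractionRequest(imageBase64: string): GenerateRequest {
  return {
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { kind: "base64", mediaType: "image/jpeg", data: imageBase64 },
          },
          {
            type: "text",
            text: "Extract the recipe from this image as JSON with ingredients, instructions, and estimated nutrition.",
          },
        ],
      },
    ],
  };
}
```

If generate() already accepts an array of content parts per message (as Anthropic and OpenAI chat APIs do), FoodFiles only needs a shape like this; if it accepts plain strings today, that is the interface change being requested.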
Use case
User uploads photo of handwritten recipe card
→ edge-auth: authn + quota check
→ llm-providers: vision model extracts text + structure from image
→ foodfiles-v2: formats into structured recipe, stores in D1
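The pipeline above can be sketched as a handler with injected dependencies. The interface and function names are hypothetical; the real worker would call service bindings (e.g. an edge-auth binding and an llm-providers binding) and a D1 binding rather than these stand-ins:

```typescript
// Sketch of the foodfiles-v2 photo-upload flow. All names are illustrative.
interface Deps {
  checkQuota(userId: string): Promise<boolean>;          // edge-auth: authn + quota
  extractFromImage(imageBase64: string): Promise<string>; // llm-providers: vision call
  storeRecipe(userId: string, recipeJson: string): Promise<void>; // D1 write
}

async function handlePhotoUpload(
  userId: string,
  imageBase64: string,
  deps: Deps
): Promise<{ ok: boolean; recipe?: string }> {
  // 1. edge-auth: reject if the user is unauthenticated or over quota.
  if (!(await deps.checkQuota(userId))) return { ok: false };
  // 2. llm-providers: vision model extracts text + structure from the image.
  const recipe = await deps.extractFromImage(imageBase64);
  // 3. foodfiles-v2: persist the structured recipe.
  await deps.storeRecipe(userId, recipe);
  return { ok: true, recipe };
}
```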
Models to consider
- Claude (Anthropic) — vision via messages API, strong at structured extraction
- GPT-4o (OpenAI) — vision via chat completions, good at OCR-like tasks
- Gemini (Google) — multimodal native
- Cloudflare Workers AI — if any vision models are available on the platform
The choice of model/provider can be opaque to FoodFiles — we just need to send an image and a prompt and get structured text back. The fallback chain and cost routing should work the same as text-only calls.
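One way the vision-aware fallback could work, as a sketch only (ProviderLike and its fields are invented for illustration, not the llm-providers internals): filter the configured chain down to vision-capable providers, then try them in priority order exactly as text-only calls do.

```typescript
// Hypothetical provider shape; the real chain presumably carries cost and
// priority metadata as well.
interface ProviderLike {
  name: string;
  supportsVision: boolean;
  generate(prompt: string, imageBase64: string): Promise<string>;
}

// Try each vision-capable provider in order until one succeeds.
async function generateWithFallback(
  providers: ProviderLike[],
  prompt: string,
  imageBase64: string
): Promise<string> {
  const errors: string[] = [];
  for (const p of providers.filter((p) => p.supportsVision)) {
    try {
      return await p.generate(prompt, imageBase64);
    } catch (e) {
      errors.push(`${p.name}: ${String(e)}`);
    }
  }
  throw new Error(`All vision providers failed: ${errors.join("; ")}`);
}
```

The only new requirement over text-only routing is the supportsVision filter, which is why the question below about the existing generate() interface matters.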
Questions for the llm-providers team
- Does the current generate() interface already support multimodal messages (image content parts), so that we only need to ensure at least one provider in the chain has a vision-capable model?
- Or does this need a new method/interface for image inputs?
- Any considerations around image size limits, base64 vs URL, or preprocessing that should happen caller-side vs provider-side?
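On the size-limit question: vision APIs typically cap per-image payloads at a few megabytes, so one cheap caller-side precaution is rejecting oversized uploads before the network call. A minimal sketch (the 5 MB default is illustrative, not any provider's documented limit):

```typescript
// Estimate the decoded byte size of a base64 string without decoding it:
// every 4 base64 chars encode 3 bytes, minus '=' padding.
function decodedBase64Bytes(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return (b64.length / 4) * 3 - padding;
}

// Caller-side guard before sending the image to llm-providers.
function checkImageSize(b64: string, maxBytes = 5 * 1024 * 1024): boolean {
  return decodedBase64Bytes(b64) <= maxBytes;
}
```

Whether this kind of validation (and any resizing/recompression) belongs in foodfiles-v2 or inside llm-providers is exactly the caller-side vs provider-side question above.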
Context
- Consumer: foodfiles-v2 worker (Stackbilt-dev/foodfiles)
- Auth: edge-auth service binding (already wired)
- Current demo: static recipes in RecipeGeneratorDemo.tsx — no live inference
- The old implementation used a direct Groq API key in the worker, which was flagged as a security issue (Stackbilt-dev/foodfiles#65) and removed
Filed from: Stackbilt-dev/foodfiles context (editorial design Phase 1 + demo section work)