# Scale-to-Zero RAG Architecture

A production-ready RAG chatbot that costs $0/month when idle and scales instantly. Built with Next.js 16, Gemini, and Upstash.
Note: This is the first iteration (v1.0) focused on PDF document Q&A. Future versions will expand to support multiple file formats (Word, Markdown, text files), web scraping, and more advanced RAG capabilities.
## Features

- True Scale-to-Zero: No monthly minimums (unlike Pinecone's $50 floor) — pay only for what you use
- Edge-Optimized: Chat runs on Vercel Edge with sub-50ms cold starts
- Client-Side PDF Parsing: Parse PDFs in the browser to avoid Edge runtime limits
- Semantic Caching: 30-60% cost reduction via intelligent query caching
- Streaming Citations: Citations stream before the answer for instant transparency
- Hybrid Runtime: Edge for chat, Node.js for ingestion (optimal performance)
- Observability: OpenTelemetry (ingest/retrieval spans), Langfuse (LLM traces, token usage)
- CI/CD & Testing: GitHub Actions (lint, build, unit + E2E), Vitest, Playwright
- Docker: Standalone image + docker-compose for local/self-hosted run
- Modern UI: Beautiful, responsive chat interface with real-time streaming
- gRPC Ingestion Gateway: Next.js calls `ingest.IngestService/BulkIngest` (e.g. on Cloud Run) for production ingest; direct Upstash upsert remains a fallback
## Architecture

```mermaid
graph TB
    A["User uploads PDF"] --> B["Browser: PDF.js"]
    B --> C["Client-side chunking"]
    C --> D["POST /api/ingest"]
    D -->|"INGEST_GRPC_URL set"| G["gRPC BulkIngest"]
    D -->|fallback| E["Node direct upsert"]
    G --> F["Upstash Vector"]
    E --> F
    Q["User query"] --> P["Proxy: semantic cache"]
    P -->|Hit| R["Upstash Redis"]
    P -->|Miss| J["Edge /api/chat"]
    J --> K["Vector search"]
    K --> F
    J --> L["Stream citations"]
    L --> M["Gemini stream"]
    M --> N["Cache result"]
    N --> R
    N --> F
```
## Quick Start

### Prerequisites

- Node.js 18+
- npm or yarn
- Upstash Vector & Redis accounts (free tier available)
- Google Gemini API key
### Setup

1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/serverless-rag-chatbot.git
   cd serverless-rag-chatbot
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Create an Upstash Vector index with built-in embedding

   Go to Upstash Console → Vector → Create Index:

   - Name: `serverless-rag` (or your choice)
   - Region: choose the one closest to your users
   - Embedding Model: select `BAAI/bge-base-en-v1.5` (768 dimensions, FREE!)

   ⚠️ Important: You MUST select an embedding model when creating the index. This enables Upstash's built-in embedding, which is FREE and avoids external API quotas!
4. Set up environment variables

   ```bash
   cp env.example .env.local
   ```

   Fill in your `.env.local`:

   ```bash
   GOOGLE_GENERATIVE_AI_API_KEY=your_gemini_api_key
   UPSTASH_REDIS_REST_URL=your_redis_url
   UPSTASH_REDIS_REST_TOKEN=your_redis_token
   UPSTASH_VECTOR_REST_URL=your_vector_url
   UPSTASH_VECTOR_REST_TOKEN=your_vector_token
   # Production ingest (recommended): Cloud Run gRPC gateway hostname:443
   # INGEST_GRPC_URL=your-ingest-grpc-xxxxx.run.app:443
   # INGEST_GRPC_TLS=1
   ```

5. Run the development server

   ```bash
   npm run dev
   ```
## Testing

- Scripts: `npm run test:e2e` (starts the app on port 3100 by default so it does not clash with `next dev` on 3000); `npm run test:e2e:ui` for the Playwright UI mode. Override with `PLAYWRIGHT_E2E_PORT` / `PLAYWRIGHT_BASE_URL`.
- CI: After `npm run build`, the workflow installs Chromium and runs `npm run test:e2e`, which starts the standalone server (`node .next/standalone/server.js` on port 3100) because this app uses `output: "standalone"`. This includes smoke UI tests plus a mocked `/api/ingest` fetch test (no real Upstash required).
- Full RAG E2E: Set `E2E_FULL=1` and real API keys in the environment used by the Next server, add `e2e/fixtures/sample.pdf` (or `E2E_FIXTURE_PDF`), then run `npx playwright test e2e/optional-full-flow.spec.ts`.
- Preview deployments: Run with `PLAYWRIGHT_SKIP_WEB_SERVER=1` and `PLAYWRIGHT_BASE_URL=https://your-preview.vercel.app`.
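Those overrides map onto a Playwright config along these lines (a sketch, not the repo's exact `playwright.config.ts`):

```ts
// Sketch of the port/baseURL overrides described above (not the repo's exact config).
import { defineConfig } from "@playwright/test";

const port = Number(process.env.PLAYWRIGHT_E2E_PORT ?? 3100);
const baseURL = process.env.PLAYWRIGHT_BASE_URL ?? `http://localhost:${port}`;

export default defineConfig({
  testDir: "e2e",
  use: { baseURL },
  // PLAYWRIGHT_SKIP_WEB_SERVER=1 targets an already-deployed preview instead.
  webServer: process.env.PLAYWRIGHT_SKIP_WEB_SERVER
    ? undefined
    : {
        // CI runs the standalone server built by `next build` (output: "standalone").
        command: process.env.CI
          ? `PORT=${port} node .next/standalone/server.js`
          : `npm run dev -- --port ${port}`,
        url: baseURL,
        reuseExistingServer: !process.env.CI,
      },
});
```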
## How It Works

### Ingestion

- User uploads a PDF via the browser UI
- PDF.js (with its worker loaded from `public/pdf.worker.min.mjs`) extracts text page-by-page
- Text is chunked locally (900 chars, 150 overlap) and sent as JSON to `/api/ingest` (sketched below)
- `/api/ingest` prefers gRPC (`BulkIngest`) when `INGEST_GRPC_URL` is configured (see `services/ingest-grpc`); otherwise it upserts directly to Upstash from Node (`transport: "direct"` in the JSON response)
- Upstash Vector's built-in embedding handles embeddings (FREE - no external API calls!)
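A minimal sketch of that client-side path (the `{ chunks, source }` payload shape and function names here are illustrative assumptions; the real implementation lives in `PdfUploader.tsx` and `src/lib/chunking.ts`):

```ts
// Browser-side sketch: extract text with PDF.js, chunk it, POST to /api/ingest.
// The { chunks, source } payload shape is an assumption for illustration.
import * as pdfjs from "pdfjs-dist";

pdfjs.GlobalWorkerOptions.workerSrc = "/pdf.worker.min.mjs"; // served from public/

async function extractText(file: File): Promise<string> {
  const pdf = await pdfjs.getDocument({ data: await file.arrayBuffer() }).promise;
  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const content = await (await pdf.getPage(i)).getTextContent();
    pages.push(content.items.map((it) => ("str" in it ? it.str : "")).join(" "));
  }
  return pages.join("\n");
}

// Sliding window: 900 chars per chunk, 150 chars of overlap between neighbors.
function chunkText(text: string, size = 900, overlap = 150): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

export async function ingestPdf(file: File) {
  const chunks = chunkText(await extractText(file));
  const res = await fetch("/api/ingest", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ chunks, source: file.name }),
  });
  return res.json(); // reports transport: "grpc" | "direct" per the notes above
}
```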
### Semantic Cache (Fast Path)

- `src/proxy.ts` intercepts `/api/chat` requests
- Uses Upstash Vector's built-in embedding to search for similar cached queries (threshold: 0.95)
- If found, returns the cached answer from Upstash Redis instantly (skips Gemini, saving tokens on hits)
- Adds timing/debug headers on cache hits (`X-Cache-Hit`, `X-Response-Time`, …)
Modeling savings: cache hit rate is workload-dependent. Use `npm run benchmark:cache` for a simple scenario model; measure production savings with Langfuse + your traffic mix.
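A simplified sketch of the fast path (the `cacheKey` metadata field and the Redis key handoff are illustrative assumptions; see `src/proxy.ts` for the actual logic):

```ts
// Sketch of the semantic-cache lookup (see src/proxy.ts for the real logic).
// The cacheKey metadata field and key handoff are illustrative assumptions.
import { Index } from "@upstash/vector";
import { Redis } from "@upstash/redis";

const index = Index.fromEnv();
const redis = Redis.fromEnv();
const SIMILARITY_THRESHOLD = 0.95;

export async function cachedAnswer(query: string): Promise<string | null> {
  // Built-in embedding: pass the raw query text, Upstash embeds it server-side.
  const [match] = await index.query({
    data: query,
    topK: 1,
    includeMetadata: true,
  });
  if (!match || match.score < SIMILARITY_THRESHOLD) return null; // cache miss

  // The vector hit points at the full cached answer stored in Redis.
  const cacheKey = match.metadata?.cacheKey as string | undefined;
  return cacheKey ? await redis.get<string>(cacheKey) : null;
}
```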
### RAG Pipeline (Cache Miss)

- Uses Upstash's built-in embedding → searches Upstash Vector (Top-K=8)
- Streams citations first (a custom `data-citations` part)
- Streams the Gemini answer via Server-Sent Events
- Caches the result for future similar queries (route sketched below)
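In code, the miss path looks roughly like this (a sketch: citation streaming and the caching step are omitted, and the `text` metadata field is an assumption; the real route is `src/app/api/chat/route.ts`):

```ts
// Sketch of the cache-miss path in the Edge chat route (simplified:
// the custom data-citations part and the caching step are omitted here).
import { streamText } from "ai";
import { google } from "@ai-sdk/google";
import { Index } from "@upstash/vector";

export const runtime = "edge";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const question: string = messages.at(-1)?.content ?? "";

  // Top-K=8 retrieval; Upstash's built-in embedding lets us query by raw text.
  const hits = await Index.fromEnv().query({
    data: question,
    topK: 8,
    includeMetadata: true,
  });
  const context = hits
    .map((h) => h.metadata?.text)
    .filter(Boolean)
    .join("\n---\n");

  const result = streamText({
    model: google("gemini-2.5-flash"),
    system: `Answer strictly from this context:\n${context}`,
    messages,
  });
  return result.toDataStreamResponse(); // SSE stream consumed by useChat
}
```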
### UI Components

- `PdfUploader`: Client-side PDF parsing with progress indicators
- `Sources`: Real-time citation display with document sources
- Chat interface: Streaming messages with `@ai-sdk/react` hooks (minimal hook usage sketched below)
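The chat surface reduces to a few lines with the hook (a minimal sketch assuming the AI SDK v4-style `useChat` API; the real UI adds sources and upload handling):

```tsx
// Minimal chat surface with @ai-sdk/react (AI SDK v4-style hook API).
"use client";

import { useChat } from "@ai-sdk/react";

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat", // the Edge streaming route
  });

  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m) => (
        <p key={m.id}>
          <b>{m.role}:</b> {m.content}
        </p>
      ))}
      <input
        value={input}
        onChange={handleInputChange}
        placeholder="Ask the document…"
      />
    </form>
  );
}
```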
## Tech Stack

| Component | Technology | Why |
|---|---|---|
| Framework | Next.js 16 (App Router) | Native Edge support, React Server Components |
| Ingest transport | gRPC `BulkIngest` + Node fallback | Production path calls a scale-to-zero gateway (e.g. Cloud Run) when `INGEST_GRPC_URL` is set |
| LLM | Google Gemini (configurable) | Default: gemini-2.5-flash (free tier, 10 RPM / 250 RPD) |
| Embeddings | Upstash Vector (built-in) | FREE! Uses BAAI/bge-base-en-v1.5 - no external API costs |
| Vector DB | Upstash Vector | True scale-to-zero, HTTP-native |
| Cache | Upstash Redis | Semantic caching for cost reduction |
| PDF Parsing | pdfjs-dist | Client-side to avoid Edge limits |
| UI | React + `@ai-sdk/react` | Streaming chat with citation support |
| Observability | OpenTelemetry, Langfuse | Traces (ingest/retrieval), LLM generations & token usage |
| CI/CD | GitHub Actions | Lint, build, unit + E2E on push/PR |
| Testing | Vitest, Playwright | Unit tests; browser E2E (smoke + optional full RAG) |
| Containers | Docker, docker-compose | Local/self-hosted run |
## Project Structure

```
serverless-rag/
├── e2e/                        # Playwright tests (smoke + optional full flow)
├── playwright.config.ts        # Playwright + webServer (dev local / start in CI)
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   ├── chat/route.ts   # Edge chat route (streaming)
│   │   │   └── ingest/route.ts # Node.js ingestion route
│   │   ├── components/
│   │   │   ├── PdfUploader.tsx # Client-side PDF parsing
│   │   │   └── Sources.tsx     # Citation display
│   │   └── page.tsx            # Main chat UI
│   ├── lib/
│   │   ├── chunking.ts         # Text chunking utilities
│   │   ├── observability.ts    # OpenTelemetry spans (Node)
│   │   ├── retrieval.ts        # Vector search & cache logic
│   │   └── upstash.ts          # Upstash client initialization
│   ├── instrumentation.ts      # Next.js OTel registration
│   └── proxy.ts                # Semantic cache proxy
├── .github/workflows/ci.yml    # CI: lint, build, test
├── services/vector-grpc/       # Optional gRPC vector utilities (UpsertChunks / QueryChunks)
├── services/ingest-grpc/       # gRPC ingest gateway (BulkIngest) + Dockerfile for Cloud Run
├── infra/cloud-run/            # Cloud Run notes + optional Terraform scaffold
├── public/
│   └── pdf.worker.min.mjs      # PDF.js worker (static asset)
├── Dockerfile                  # Standalone image
├── docker-compose.yml          # Local run with env
└── env.example                 # Environment variable template
```
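The `src/instrumentation.ts` entry above is Next.js's standard OpenTelemetry hook; registration can be as small as this sketch (assuming `@vercel/otel`; the repo's exporter wiring may differ):

```ts
// src/instrumentation.ts: Next.js calls register() on server startup.
// Sketch assuming @vercel/otel; the repo's exporter wiring may differ.
import { registerOTel } from "@vercel/otel";

export function register() {
  registerOTel({ serviceName: "serverless-rag" });
}
```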
## Why Upstash Vector?

- True Scale-to-Zero: No $50/month minimum (Pinecone Serverless has a pricing floor)
- HTTP-Native: No gRPC connection issues in Edge environments
- Lower Latency: P99 latencies in the 26-50ms range
## Why Hybrid Runtime?

- Edge for chat: Sub-50ms cold starts, perfect for streaming responses
- Node.js for ingestion: Supports larger bundles, better for batch embedding operations
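In the App Router, this split is just a per-route segment config:

```ts
// src/app/api/chat/route.ts: streaming chat stays on the Edge runtime
export const runtime = "edge";

// src/app/api/ingest/route.ts: ingestion runs on Node.js (bigger deps, gRPC client)
export const runtime = "nodejs";
```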
## Deployment

### Deploy to Vercel

1. Push your code to GitHub
2. Import the project in Vercel
3. Add environment variables in the Vercel dashboard
4. Deploy!

The app will automatically:

- Use the Edge runtime for `/api/chat`
- Use the Node.js runtime for `/api/ingest` (which calls your gRPC ingest gateway when `INGEST_GRPC_URL` is configured)
- Serve static assets (including the PDF.js worker) from the CDN
### gRPC Ingest Gateway (Cloud Run)

See `infra/cloud-run/README.md` for build/push/deploy instructions (`min_instances = 0`).

Then set the Vercel env:

```bash
INGEST_GRPC_URL=your-service-xxxx.run.app:443
INGEST_GRPC_TLS=1
```

### Local Run (docker-compose)

`docker-compose.yml` runs `ingest-grpc` on port 50051 and sets:

```bash
INGEST_GRPC_URL=ingest-grpc:50051
INGEST_GRPC_TLS=0
```

### Optional: Vector gRPC Gateway

For internal services that prefer gRPC/Protobuf over HTTP+JSON, this repo includes a small vector gateway:
- Proto: `services/vector-grpc/vector.proto` (`VectorService` with `UpsertChunks` and `QueryChunks`)
- Server: `services/vector-grpc/server.ts` (Node.js, wraps Upstash Vector over HTTP)
To run it locally:

```bash
UPSTASH_VECTOR_REST_URL=... \
UPSTASH_VECTOR_REST_TOKEN=... \
VECTOR_GRPC_PORT=50051 \
npm run vector-grpc:server
```

This starts a gRPC server exposing a binary-efficient, schema-safe API for document upsert and similarity search. You can point other backend services or batch jobs at this endpoint instead of calling `/api/ingest` or Upstash HTTP directly.
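For example, a backend job could call it like this (a hedged sketch with `@grpc/grpc-js`; the proto package name and request fields are assumptions, and `vector.proto` is the authoritative schema):

```ts
// Sketch of a Node.js client for the VectorService gateway. The proto package
// name ("vector") and request fields here are assumptions; vector.proto is
// the authoritative schema.
import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

const definition = protoLoader.loadSync("services/vector-grpc/vector.proto");
const proto = grpc.loadPackageDefinition(definition) as any;

const client = new proto.vector.VectorService(
  process.env.VECTOR_GRPC_URL ?? "localhost:50051",
  grpc.credentials.createInsecure() // use createSsl() when talking to Cloud Run on :443
);

client.QueryChunks(
  { query: "What does scale-to-zero mean?", topK: 8 },
  (err: grpc.ServiceError | null, res: unknown) => {
    if (err) throw err;
    console.log(res); // matched chunks with scores
  }
);
```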
## Cost Breakdown

Idle (no traffic):

- Vercel: $0 (Hobby plan)
- Upstash Vector: $0 (no requests)
- Upstash Redis: $0 (no requests)
- Total: $0.00/month

Active (~10k requests/month):

- Vercel: $0 (within free tier)
- Upstash Vector: ~$0.40 (10k requests)
- Upstash Redis: ~$0.20 (10k requests)
- LLM: ~$10-30 (Gemini usage)
- Total: ~$10-30/month

Compare this to a containerized setup: $24-200/month for always-on servers.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Vercel AI SDK for the excellent streaming primitives
- Upstash for true serverless infrastructure
- PDF.js for client-side PDF parsing
- The Next.js team for Edge runtime support
If this project helped you, please consider giving it a star! ⭐
Built for the serverless community while drinking a lot of 🧃
