# Scale-to-Zero RAG Architecture

A production-ready RAG chatbot that costs $0/month when idle and scales instantly. Built with Next.js 16, Gemini, and Upstash.
Note: This is the first iteration (v1.0) focused on PDF document Q&A. Future versions will expand to support multiple file formats (Word, Markdown, text files), web scraping, and more advanced RAG capabilities.
## Features

- True Scale-to-Zero: No monthly minimums (unlike Pinecone's $50 floor) — pay only for what you use
- Edge-Optimized: Chat runs on Vercel Edge with sub-50ms cold starts
- Client-Side PDF Parsing: Parse PDFs in the browser to avoid Edge runtime limits
- Semantic Caching: 30-60% cost reduction via intelligent query caching
- Streaming Citations: Citations stream before the answer for instant transparency
- Hybrid Runtime: Edge for chat, Node.js for ingestion (optimal performance)
- Observability: OpenTelemetry (ingest/retrieval spans), Langfuse (LLM traces, token usage)
- CI/CD & Testing: GitHub Actions (lint, build, unit + E2E), Vitest, Playwright
- Docker: Standalone image + docker-compose for local/self-hosted run
- Modern UI: Beautiful, responsive chat interface with real-time streaming
- gRPC Ingestion Gateway: Next.js calls `ingest.IngestService/BulkIngest` (e.g. on Cloud Run) for production ingest; direct Upstash upsert remains a fallback
## Architecture

```mermaid
graph TB
    A["User uploads PDF"] --> B["Browser: PDF.js"]
    B --> C["Client-side chunking"]
    C --> D["POST /api/ingest"]
    D -->|"INGEST_GRPC_URL set"| G["gRPC BulkIngest"]
    D -->|fallback| E["Node direct upsert"]
    G --> F["Upstash Vector"]
    E --> F
    Q["User query"] --> P["Proxy: semantic cache"]
    P -->|Hit| R["Upstash Redis"]
    P -->|Miss| J["Edge /api/chat"]
    J --> K["Vector search"]
    K --> F
    J --> L["Stream citations"]
    L --> M["Gemini stream"]
    M --> N["Cache result"]
    N --> R
    N --> F
```
## Quick Start

### Prerequisites

- Node.js 18+
- npm or yarn
- Upstash Vector & Redis accounts (free tier available)
- Google Gemini API key
### Setup

1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/serverless-rag-chatbot.git
   cd serverless-rag-chatbot
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Create an Upstash Vector index with built-in embedding

   Go to Upstash Console → Vector → Create Index:

   - Name: `serverless-rag` (or your choice)
   - Region: choose the one closest to your users
   - Embedding Model: select `BAAI/bge-base-en-v1.5` (768 dimensions, FREE!)

   ⚠️ Important: You MUST select an embedding model when creating the index. This enables Upstash's built-in embedding, which is FREE and avoids external API quotas!
4. Set up environment variables

   ```bash
   cp env.example .env.local
   ```

   Fill in your `.env.local`:

   ```bash
   GOOGLE_GENERATIVE_AI_API_KEY=your_gemini_api_key
   UPSTASH_REDIS_REST_URL=your_redis_url
   UPSTASH_REDIS_REST_TOKEN=your_redis_token
   UPSTASH_VECTOR_REST_URL=your_vector_url
   UPSTASH_VECTOR_REST_TOKEN=your_vector_token
   # Production ingest (recommended): Cloud Run gRPC gateway hostname:443
   # INGEST_GRPC_URL=your-ingest-grpc-xxxxx.run.app:443
   # INGEST_GRPC_TLS=1
   ```

5. Run the development server

   ```bash
   npm run dev
   ```
## Testing

- Scripts: `npm run test:e2e` (starts the app on port 3100 by default so it does not clash with `next dev` on 3000); `npm run test:e2e:ui` for the Playwright UI mode. Override with `PLAYWRIGHT_E2E_PORT` / `PLAYWRIGHT_BASE_URL`.
- CI: After `npm run build`, the workflow installs Chromium and runs `npm run test:e2e`, which starts the standalone server (`node .next/standalone/server.js` on port 3100) because this app uses `output: "standalone"`. This includes smoke UI tests plus a mocked `/api/ingest` fetch test (no real Upstash required).
- Full RAG E2E: Set `E2E_FULL=1` and real API keys in the environment used by the Next server, add `e2e/fixtures/sample.pdf` (or `E2E_FIXTURE_PDF`), then run `npx playwright test e2e/optional-full-flow.spec.ts`.
- Preview deployments: Run with `PLAYWRIGHT_SKIP_WEB_SERVER=1` and `PLAYWRIGHT_BASE_URL=https://your-preview.vercel.app`.
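Those overrides map onto a Playwright config along these lines (a sketch, not the repo's exact `playwright.config.ts`):

```ts
// Sketch of the port/baseURL overrides described above (not the repo's exact config).
import { defineConfig } from "@playwright/test";

const port = Number(process.env.PLAYWRIGHT_E2E_PORT ?? 3100);
const baseURL = process.env.PLAYWRIGHT_BASE_URL ?? `http://localhost:${port}`;

export default defineConfig({
  testDir: "e2e",
  use: { baseURL },
  // PLAYWRIGHT_SKIP_WEB_SERVER=1 targets an already-deployed preview instead.
  webServer: process.env.PLAYWRIGHT_SKIP_WEB_SERVER
    ? undefined
    : {
        // CI runs the standalone server built by `next build` (output: "standalone").
        command: process.env.CI
          ? `PORT=${port} node .next/standalone/server.js`
          : `npm run dev -- --port ${port}`,
        url: baseURL,
        reuseExistingServer: !process.env.CI,
      },
});
```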
## How It Works

### Ingestion

- User uploads a PDF via the browser UI
- PDF.js (with its worker loaded from `public/pdf.worker.min.mjs`) extracts text page-by-page
- Text is chunked locally (900 chars, 150 overlap) and sent as JSON to `/api/ingest` (sketched below)
- `/api/ingest` prefers gRPC (`BulkIngest`) when `INGEST_GRPC_URL` is configured (see `services/ingest-grpc`); otherwise it upserts directly to Upstash from Node (`transport: "direct"` in the JSON response)
- Upstash Vector's built-in embedding handles embeddings (FREE - no external API calls!)
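A minimal sketch of that client-side path (the `{ chunks, source }` payload shape and function names here are illustrative assumptions; the real implementation lives in `PdfUploader.tsx` and `src/lib/chunking.ts`):

```ts
// Browser-side sketch: extract text with PDF.js, chunk it, POST to /api/ingest.
// The { chunks, source } payload shape is an assumption for illustration.
import * as pdfjs from "pdfjs-dist";

pdfjs.GlobalWorkerOptions.workerSrc = "/pdf.worker.min.mjs"; // served from public/

async function extractText(file: File): Promise<string> {
  const pdf = await pdfjs.getDocument({ data: await file.arrayBuffer() }).promise;
  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const content = await (await pdf.getPage(i)).getTextContent();
    pages.push(content.items.map((it) => ("str" in it ? it.str : "")).join(" "));
  }
  return pages.join("\n");
}

// Sliding window: 900 chars per chunk, 150 chars of overlap between neighbors.
function chunkText(text: string, size = 900, overlap = 150): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
  }
  return chunks;
}

export async function ingestPdf(file: File) {
  const chunks = chunkText(await extractText(file));
  const res = await fetch("/api/ingest", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ chunks, source: file.name }),
  });
  return res.json(); // reports transport: "grpc" | "direct" per the notes above
}
```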
### Semantic Cache (Fast Path)

- `src/proxy.ts` intercepts `/api/chat` requests
- Uses Upstash Vector's built-in embedding to search for similar cached queries (threshold: 0.95)
- If found, returns the cached answer from Upstash Redis instantly (skips Gemini, saving tokens on hits)
- Adds timing/debug headers on cache hits (`X-Cache-Hit`, `X-Response-Time`, …)
Modeling savings: cache hit rate is workload-dependent. Use `npm run benchmark:cache` for a simple scenario model; measure production savings with Langfuse + your traffic mix.
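A simplified sketch of the fast path (the `cacheKey` metadata field and the Redis key handoff are illustrative assumptions; see `src/proxy.ts` for the actual logic):

```ts
// Sketch of the semantic-cache lookup (see src/proxy.ts for the real logic).
// The cacheKey metadata field and key handoff are illustrative assumptions.
import { Index } from "@upstash/vector";
import { Redis } from "@upstash/redis";

const index = Index.fromEnv();
const redis = Redis.fromEnv();
const SIMILARITY_THRESHOLD = 0.95;

export async function cachedAnswer(query: string): Promise<string | null> {
  // Built-in embedding: pass the raw query text, Upstash embeds it server-side.
  const [match] = await index.query({
    data: query,
    topK: 1,
    includeMetadata: true,
  });
  if (!match || match.score < SIMILARITY_THRESHOLD) return null; // cache miss

  // The vector hit points at the full cached answer stored in Redis.
  const cacheKey = match.metadata?.cacheKey as string | undefined;
  return cacheKey ? await redis.get<string>(cacheKey) : null;
}
```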
### RAG Pipeline (Cache Miss)

- Uses Upstash's built-in embedding → searches Upstash Vector (Top-K=8)
- Streams citations first (a custom `data-citations` part)
- Streams the Gemini answer via Server-Sent Events
- Caches the result for future similar queries (route sketched below)
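In code, the miss path looks roughly like this (a sketch: citation streaming and the caching step are omitted, and the `text` metadata field is an assumption; the real route is `src/app/api/chat/route.ts`):

```ts
// Sketch of the cache-miss path in the Edge chat route (simplified:
// the custom data-citations part and the caching step are omitted here).
import { streamText } from "ai";
import { google } from "@ai-sdk/google";
import { Index } from "@upstash/vector";

export const runtime = "edge";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const question: string = messages.at(-1)?.content ?? "";

  // Top-K=8 retrieval; Upstash's built-in embedding lets us query by raw text.
  const hits = await Index.fromEnv().query({
    data: question,
    topK: 8,
    includeMetadata: true,
  });
  const context = hits
    .map((h) => h.metadata?.text)
    .filter(Boolean)
    .join("\n---\n");

  const result = streamText({
    model: google("gemini-2.5-flash"),
    system: `Answer strictly from this context:\n${context}`,
    messages,
  });
  return result.toDataStreamResponse(); // SSE stream consumed by useChat
}
```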
### UI Components

- `PdfUploader`: Client-side PDF parsing with progress indicators
- `Sources`: Real-time citation display with document sources
- Chat interface: Streaming messages with `@ai-sdk/react` hooks (minimal hook usage sketched below)
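The chat surface reduces to a few lines with the hook (a minimal sketch assuming the AI SDK v4-style `useChat` API; the real UI adds sources and upload handling):

```tsx
// Minimal chat surface with @ai-sdk/react (AI SDK v4-style hook API).
"use client";

import { useChat } from "@ai-sdk/react";

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: "/api/chat", // the Edge streaming route
  });

  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m) => (
        <p key={m.id}>
          <b>{m.role}:</b> {m.content}
        </p>
      ))}
      <input
        value={input}
        onChange={handleInputChange}
        placeholder="Ask the document…"
      />
    </form>
  );
}
```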
## Tech Stack

| Component | Technology | Why |
|---|---|---|
| Framework | Next.js 16 (App Router) | Native Edge support, React Server Components |
| Ingest transport | gRPC `BulkIngest` + Node fallback | Production path calls a scale-to-zero gateway (e.g. Cloud Run) when `INGEST_GRPC_URL` is set |
| LLM | Google Gemini (configurable) | Default: gemini-2.5-flash (free tier, 10 RPM / 250 RPD) |
| Embeddings | Upstash Vector (built-in) | FREE! Uses BAAI/bge-base-en-v1.5 - no external API costs |
| Vector DB | Upstash Vector | True scale-to-zero, HTTP-native |
| Cache | Upstash Redis | Semantic caching for cost reduction |
| PDF Parsing | pdfjs-dist | Client-side to avoid Edge limits |
| UI | React + `@ai-sdk/react` | Streaming chat with citation support |
| Observability | OpenTelemetry, Langfuse | Traces (ingest/retrieval), LLM generations & token usage |
| CI/CD | GitHub Actions | Lint, build, unit + E2E on push/PR |
| Testing | Vitest, Playwright | Unit tests; browser E2E (smoke + optional full RAG) |
| Containers | Docker, docker-compose | Local/self-hosted run |
## Project Structure

```
serverless-rag/
├── e2e/                        # Playwright tests (smoke + optional full flow)
├── playwright.config.ts        # Playwright + webServer (dev local / start in CI)
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   ├── chat/route.ts   # Edge chat route (streaming)
│   │   │   └── ingest/route.ts # Node.js ingestion route
│   │   ├── components/
│   │   │   ├── PdfUploader.tsx # Client-side PDF parsing
│   │   │   └── Sources.tsx     # Citation display
│   │   └── page.tsx            # Main chat UI
│   ├── lib/
│   │   ├── chunking.ts         # Text chunking utilities
│   │   ├── observability.ts    # OpenTelemetry spans (Node)
│   │   ├── retrieval.ts        # Vector search & cache logic
│   │   └── upstash.ts          # Upstash client initialization
│   ├── instrumentation.ts      # Next.js OTel registration
│   └── proxy.ts                # Semantic cache proxy
├── .github/workflows/ci.yml    # CI: lint, build, test
├── services/vector-grpc/       # Optional gRPC vector utilities (UpsertChunks / QueryChunks)
├── services/ingest-grpc/       # gRPC ingest gateway (BulkIngest) + Dockerfile for Cloud Run
├── infra/cloud-run/            # Cloud Run notes + optional Terraform scaffold
├── public/
│   └── pdf.worker.min.mjs      # PDF.js worker (static asset)
├── Dockerfile                  # Standalone image
├── docker-compose.yml          # Local run with env
└── env.example                 # Environment variable template
```
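The `src/instrumentation.ts` entry above is Next.js's standard OpenTelemetry hook; registration can be as small as this sketch (assuming `@vercel/otel`; the repo's exporter wiring may differ):

```ts
// src/instrumentation.ts: Next.js calls register() on server startup.
// Sketch assuming @vercel/otel; the repo's exporter wiring may differ.
import { registerOTel } from "@vercel/otel";

export function register() {
  registerOTel({ serviceName: "serverless-rag" });
}
```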
## Why Upstash Vector?

- True Scale-to-Zero: No $50/month minimum (Pinecone Serverless has a pricing floor)
- HTTP-Native: No gRPC connection issues in Edge environments
- Lower Latency: P99 latencies in the 26-50ms range
## Why Hybrid Runtime?

- Edge for chat: Sub-50ms cold starts, perfect for streaming responses
- Node.js for ingestion: Supports larger bundles, better for batch embedding operations
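In the App Router, this split is just a per-route segment config:

```ts
// src/app/api/chat/route.ts: streaming chat stays on the Edge runtime
export const runtime = "edge";

// src/app/api/ingest/route.ts: ingestion runs on Node.js (bigger deps, gRPC client)
export const runtime = "nodejs";
```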
## Deployment

### Deploy to Vercel

1. Push your code to GitHub
2. Import the project in Vercel
3. Add environment variables in the Vercel dashboard
4. Deploy!

The app will automatically:

- Use the Edge runtime for `/api/chat`
- Use the Node.js runtime for `/api/ingest` (which calls your gRPC ingest gateway when `INGEST_GRPC_URL` is configured)
- Serve static assets (including the PDF.js worker) from the CDN
### gRPC Ingest Gateway (Cloud Run)

See `infra/cloud-run/README.md` for build/push/deploy instructions (`min_instances = 0`).

Then set the Vercel env:

```bash
INGEST_GRPC_URL=your-service-xxxx.run.app:443
INGEST_GRPC_TLS=1
```

### Local Run (docker-compose)

`docker-compose.yml` runs `ingest-grpc` on port 50051 and sets:

```bash
INGEST_GRPC_URL=ingest-grpc:50051
INGEST_GRPC_TLS=0
```

### Optional: Vector gRPC Gateway

For internal services that prefer gRPC/Protobuf over HTTP+JSON, this repo includes a small vector gateway:
- Proto: `services/vector-grpc/vector.proto` (`VectorService` with `UpsertChunks` and `QueryChunks`)
- Server: `services/vector-grpc/server.ts` (Node.js, wraps Upstash Vector over HTTP)
To run it locally:

```bash
UPSTASH_VECTOR_REST_URL=... \
UPSTASH_VECTOR_REST_TOKEN=... \
VECTOR_GRPC_PORT=50051 \
npm run vector-grpc:server
```

This starts a gRPC server exposing a binary-efficient, schema-safe API for document upsert and similarity search. You can point other backend services or batch jobs at this endpoint instead of calling `/api/ingest` or Upstash HTTP directly.
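For example, a backend job could call it like this (a hedged sketch with `@grpc/grpc-js`; the proto package name and request fields are assumptions, and `vector.proto` is the authoritative schema):

```ts
// Sketch of a Node.js client for the VectorService gateway. The proto package
// name ("vector") and request fields here are assumptions; vector.proto is
// the authoritative schema.
import * as grpc from "@grpc/grpc-js";
import * as protoLoader from "@grpc/proto-loader";

const definition = protoLoader.loadSync("services/vector-grpc/vector.proto");
const proto = grpc.loadPackageDefinition(definition) as any;

const client = new proto.vector.VectorService(
  process.env.VECTOR_GRPC_URL ?? "localhost:50051",
  grpc.credentials.createInsecure() // use createSsl() when talking to Cloud Run on :443
);

client.QueryChunks(
  { query: "What does scale-to-zero mean?", topK: 8 },
  (err: grpc.ServiceError | null, res: unknown) => {
    if (err) throw err;
    console.log(res); // matched chunks with scores
  }
);
```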
## Cost Breakdown

Idle (no traffic):

- Vercel: $0 (Hobby plan)
- Upstash Vector: $0 (no requests)
- Upstash Redis: $0 (no requests)
- Total: $0.00/month

Active (~10k requests/month):

- Vercel: $0 (within free tier)
- Upstash Vector: ~$0.40 (10k requests)
- Upstash Redis: ~$0.20 (10k requests)
- LLM: ~$10-30 (Gemini usage)
- Total: ~$10-30/month

Compare this to a containerized setup: $24-200/month for always-on servers.
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Vercel AI SDK for the excellent streaming primitives
- Upstash for true serverless infrastructure
- PDF.js for client-side PDF parsing
- The Next.js team for Edge runtime support
If this project helped you, please consider giving it a star! ⭐
Built for the serverless community while drinking a lot of 🧃
