Classifies user queries by complexity and routes them to the right LLM automatically. Fast model for simple questions, heavy model for complex reasoning. Exposes per-request telemetry: model used, latency, and estimated cost.
The default approach to LLM routing is either "send everything to GPT-4" (expensive) or "send everything to a small model" (bad quality). Neither works at scale. This router sits in front of your LLM calls and makes the routing decision for you — a fast classifier determines query complexity, then routes to the appropriate model. You get the quality of a heavy model where it matters and the speed/cost of a light model everywhere else.
I built this while working on AI-powered workflows where inference cost and latency were real constraints in production. The telemetry panel was critical — you can't optimize what you can't measure.
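The classify-then-route decision described above can be sketched as a small routing table plus a label-normalizing picker. This is an illustrative sketch, not the repo's actual API: `Complexity`, `ROUTES`, and `pickModel` are hypothetical names, and the classifier prompt is an assumption about how a one-word complexity label might be requested.

```typescript
// Hypothetical sketch of the routing decision; names are illustrative.
type Complexity = "simple" | "complex";

// Prompt sent to the fast classifier model (a non-streaming call).
const CLASSIFIER_PROMPT =
  'Label the user query as "simple" or "complex". Reply with one word.';

// Routing table: which model serves each complexity class.
const ROUTES: Record<Complexity, string> = {
  simple: "gpt-4o-mini",
  complex: "gpt-4o",
};

// Map the classifier's (possibly noisy) text output to a model name,
// falling back to the heavy model when the label is unrecognized —
// a deliberate bias toward quality over cost on ambiguous input.
function pickModel(label: string): string {
  const normalized = label.trim().toLowerCase();
  return normalized === "simple" ? ROUTES.simple : ROUTES.complex;
}
```

Falling back to the heavy model on an unrecognized label is one reasonable default; an alternative is retrying the classification.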
- Automatic query classification — fast model (gpt-4o-mini) classifies complexity before routing
- Smart routing — simple queries go to gpt-4o-mini, complex queries go to gpt-4o
- Streaming responses — real-time streamed output from whichever model handles the query
- Per-request telemetry — see which model was used, response latency, and estimated API cost
- Interactive chat UI — full message history with clean interface
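The per-request cost estimate in the telemetry panel can be sketched as a lookup of per-token prices times token counts. The shape of `Telemetry` and the prices in `PRICES` are assumptions for illustration, not values taken from the repo; check OpenAI's current pricing before relying on them.

```typescript
// Hypothetical per-request telemetry record.
interface Telemetry {
  model: string;
  latencyMs: number;
  estimatedCostUsd: number;
}

// Assumed USD prices per 1M tokens (input / output) for each model —
// illustrative numbers, not authoritative pricing.
const PRICES: Record<string, { input: number; output: number }> = {
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
  "gpt-4o": { input: 2.5, output: 10 },
};

function estimateCost(
  model: string,
  promptTokens: number,
  completionTokens: number,
): number {
  const price = PRICES[model];
  if (!price) return 0; // unknown model: report zero rather than guess
  return (promptTokens * price.input + completionTokens * price.output) / 1_000_000;
}
```

A request that used 1,000 prompt tokens and 500 completion tokens on the fast model would come out to roughly $0.00045 under these assumed prices.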
git clone https://github.com/gautamgb/Context-Aware-Semantic-Router.git
cd Context-Aware-Semantic-Router
npm install
cp .env.example .env.local # Add your OPENAI_API_KEY
npm run dev

Open http://localhost:3000 and start chatting. The telemetry panel shows routing decisions in real time.
- Push to GitHub and import in Vercel
- Set `OPENAI_API_KEY` in environment variables
- Optionally customize `FAST_MODEL` and `HEAVY_MODEL`
- Deploy
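Reading the model overrides with the defaults named earlier in this README might look like the following. `resolveModels` is a hypothetical helper, not code from the repo:

```typescript
// Hypothetical helper: resolve FAST_MODEL / HEAVY_MODEL overrides,
// falling back to the defaults used throughout this README.
interface ModelConfig {
  fast: string;
  heavy: string;
}

function resolveModels(env: Record<string, string | undefined>): ModelConfig {
  return {
    fast: env.FAST_MODEL ?? "gpt-4o-mini",
    heavy: env.HEAVY_MODEL ?? "gpt-4o",
  };
}
```

In a Next.js route handler this would typically be called as `resolveModels(process.env)`.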
- Framework: Next.js 16 (App Router)
- AI: OpenAI API (non-streaming classification, streaming responses)
- Styling: Tailwind CSS
- Language: TypeScript
MIT