Intel NPU LLM Server

Run Large Language Models on your Intel Core Ultra NPU with an OpenAI-compatible API.

🎯 Features

  • NPU Acceleration: Leverage Intel's Neural Processing Unit for power-efficient AI
  • OpenAI-Compatible API: Works with any OpenAI client (Open WebUI, LangChain, N8N)
  • Built-in Chat UI: Beautiful dark-mode interface at http://localhost:8000 - no Docker needed
  • Multi-Model Support: Load and switch between multiple models from the UI
  • Conversation History: Full multi-turn context management
  • Markdown Rendering: Clean formatting for code blocks, lists, and structured output
  • Real-time Monitoring: Live NPU status, memory usage, and system telemetry
  • Tool Calling: Function calling support for building AI agents
  • Local & Private: All processing happens on your device; nothing leaves your machine
  • Power Efficient: Roughly 3-5x lower power draw than CPU inference

🎥 Demos

📋 Requirements

  • Processor: Intel Core Ultra (Meteor Lake, Arrow Lake, or Lunar Lake)
  • OS: Windows 11
  • NPU Driver: Version 32.0.100.3104 or newer
  • Python: 3.11 (managed via Miniconda)
  • Docker Desktop: For Open WebUI frontend (optional)

🚀 Quick Start

1. Install Dependencies (First Time Only)

# Install Miniconda (if not installed)
winget install Anaconda.Miniconda3

# Create Python environment
conda create -n ipex-npu python=3.11 -y
conda activate ipex-npu

# Install ipex-llm with NPU support
pip install --pre --upgrade ipex-llm[npu]

# Install server dependencies
pip install -r intel-npu-llm/requirements.txt
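
To confirm the environment installed correctly, a quick import check can be run inside the ipex-npu environment (a minimal sanity check; it only verifies that the package imports, not that the NPU is usable):

# Should print the package name without errors
python -c "import ipex_llm; print(ipex_llm.__name__, 'imported OK')"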

1b. HuggingFace Authentication (For Gated Models)

Some models (Llama 2, Llama 3, Llama 3.2) require HuggingFace authentication:

  1. Create a HuggingFace account at huggingface.co
  2. Accept the model license - Visit the model page (e.g., meta-llama/Llama-3.2-3B-Instruct) and accept the terms
  3. Generate an access token at huggingface.co/settings/tokens
  4. Create a .env file in the project root:
# Create .env file with your token (UTF-8 encoding is important!)
'HF_TOKEN=hf_your_token_here' | Out-File -FilePath .env -Encoding utf8

Or manually create npu-windows/.env:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Note: Without this, gated models will fail to download. Non-gated models (Qwen, DeepSeek, MiniCPM, GLM-Edge, Baichuan2) work without authentication.
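
To verify the token before starting the server, you can query Hugging Face with the huggingface_hub library that is installed alongside transformers (a minimal sketch; substitute your real token):

# Prints your HuggingFace username if the token is valid
python -c "from huggingface_hub import whoami; print(whoami(token='hf_your_token_here')['name'])"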

2. Start the NPU Backend (Multiple Models)

# From the project root - loads 2 models by default
.\start_backend.bat

Note: start_backend.bat automatically detects your processor (Meteor Lake vs Arrow/Lunar Lake) and configures the IPEX_LLM_NPU_MTL variable for you.

Or load specific models (comma-separated):

.\start_backend.bat --models "qwen1.5-1.8b,llama3.2-1b,qwen1.5-4b"

Or load a single model:

.\start_backend.bat --models "qwen1.5-4b"

List all available models:

.\start_backend.bat --list

Change the server port (if 8000 is occupied):

.\start_backend.bat --port 8001

Or combine options:

.\start_backend.bat --models "qwen1.5-4b" --port 8080

Or manually:

$env:IPEX_LLM_NPU_MTL = "1"  # For Meteor Lake (Core Ultra Series 1)
conda activate ipex-npu
cd intel-npu-llm
python npu_server.py

3. Start Open WebUI (Optional)

cd intel-npu-llm
docker compose up -d

4. Access the Interface

Built-in Chat UI (No Docker Required)

Open http://localhost:8000 in your browser for a full-featured chat interface:

  • Real-time NPU status with animated indicators (Connecting, Busy, Idle)
  • Model selector dropdown (loaded models populate automatically)
  • Conversation history with multi-turn context
  • Markdown rendering (code blocks, lists, bold/italic)
  • Keyboard shortcuts: Enter to send, Shift+Enter for newline, Ctrl+L to clear
  • Live Telemetry: Real-time NPU busy state, system RAM usage, and model disk footprint (NPU + HuggingFace cache)
  • Live token counter for the entire session

API Endpoints

  • / – Built-in Chat UI
  • /v1/chat/completions – OpenAI Chat Completions API (Open WebUI, LangChain, curl)
  • /v1/responses – OpenAI Responses API (N8N)
  • /v1/models – List loaded models
  • /v1/system/status – System telemetry (memory, CPU, NPU busy state)
  • /health – Health check
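
A quick smoke test from PowerShell (a minimal sketch; swap the model ID for one returned by /v1/models):

# List loaded models and check server health
Invoke-RestMethod http://localhost:8000/v1/models
Invoke-RestMethod http://localhost:8000/health

# Minimal chat completion request
$body = @{
    model    = "qwen1.5-1.8b"
    messages = @(
        @{ role = "system"; content = "You are a helpful assistant." },
        @{ role = "user";   content = "Say hello in one sentence." }
    )
} | ConvertTo-Json -Depth 5
$response = Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body
$response.choices[0].message.content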

5. Connect Your Own Open WebUI (Optional)

If you already have Open WebUI running elsewhere (e.g., on a homelab server), configure it to use your NPU server:

  1. In Open WebUI: Go to Settings → Connections → OpenAI API
  2. Add a new connection with these settings:
    • API Base URL: http://<YOUR-WINDOWS-PC-IP>:8000/v1
    • API Key: sk-dummy (any value works, the NPU server doesn't validate keys)
  3. Save and your NPU models will appear in the model dropdown

Tip: Find your Windows IP with ipconfig in PowerShell. Use your local network IP (e.g., 192.168.1.x).

Firewall Note: You may need to allow port 8000 through Windows Firewall for remote connections.
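
A sketch of opening the port with PowerShell (run as Administrator; the rule name is just an example) and then confirming the API is reachable from another machine:

# On the Windows PC running the server (elevated PowerShell)
New-NetFirewallRule -DisplayName "NPU LLM Server" -Direction Inbound -Protocol TCP -LocalPort 8000 -Action Allow

# From the remote machine, confirm the server answers
Invoke-RestMethod http://<YOUR-WINDOWS-PC-IP>:8000/v1/models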

6. Connect N8N (Optional)

To use your NPU server with N8N workflows:

  1. In N8N: Add an OpenAI node to your workflow
  2. Configure credentials:
    • API Key: sk-dummy (any value)
    • Base URL: http://<YOUR-WINDOWS-PC-IP>:8000/v1
  3. Select model: Use one of the loaded model IDs (e.g., qwen1.5-1.8b)

Note: N8N uses the /v1/responses API endpoint, which is fully supported.
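
To sanity-check the Responses endpoint outside of N8N, a minimal request follows the OpenAI Responses API shape (a model plus an input string); this sketch just dumps the raw reply, since the exact response fields depend on the server:

$body = @{ model = "qwen1.5-1.8b"; input = "Write a haiku about NPUs." } | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:8000/v1/responses" -Method Post -ContentType "application/json" -Body $body | ConvertTo-Json -Depth 10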

7. Tool Calling / Function Calling (Agents)

The server supports OpenAI-compatible tool/function calling for building AI agents:

{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "What's the weather in NYC?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"]
            }
        }
    }],
    "tool_choice": "auto"
}
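
A sketch of posting that request body (saved locally as a hypothetical tools_request.json) and reading back any tool calls, assuming the server mirrors the standard OpenAI response shape:

# Post the JSON request above to the chat endpoint
$response = Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -ContentType "application/json" -Body (Get-Content .\tools_request.json -Raw)

# Each tool call carries the function name and a JSON string of arguments
foreach ($call in $response.choices[0].message.tool_calls) {
    "$($call.function.name) -> $($call.function.arguments)"
}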

Tool Choice Options

| tool_choice | Behavior |
|---|---|
| "auto" | Model decides when to use tools (default) |
| "none" | Disable tool calling, respond normally |
| "required" | Force the model to call at least one tool |
| {"type": "function", "function": {"name": "get_weather"}} | Force specific tool |

Advanced Features

  • Parallel tool calls: Model can call multiple tools in one response
  • Streaming tool calls: Tool calls are detected and emitted at end of stream
  • Retry logic: Malformed tool calls are automatically retried (max 2 attempts)
  • Tool validation: Only defined tools are parsed, invalid calls are ignored

Recommended models: qwen2.5-7b, qwen2.5-3b (larger models work better)

Note: Tool calling works best with 3B+ parameter models. Smaller models may struggle.


🤖 Supported Models

All models below are officially verified for Intel NPU via ipex-llm:

Qwen Series (Recommended)

| Model ID | Size | NPU Speed | Notes |
|---|---|---|---|
| qwen1.5-1.8b | 1.8B | ~8 tok/s | ✅ Default - Verified working |
| qwen1.5-4b | 4B | ~5 tok/s | Better quality |
| qwen1.5-7b | 7B | ~3 tok/s | Best Qwen1.5 |
| qwen2-1.5b | 1.5B | ~10 tok/s | Official NPU verified |
| qwen2-7b | 7B | ~3 tok/s | Official NPU verified |
| qwen2.5-3b | 3B | ~8 tok/s | 🔥 Latest Qwen |
| qwen2.5-7b | 7B | ~3 tok/s | 🔥 Best Qwen 2.5 |

Llama Series

| Model ID | Size | NPU Speed | Notes |
|---|---|---|---|
| llama2-7b | 7B | ~3 tok/s | Classic, requires HF login |
| llama3-8b | 8B | ~2 tok/s | Powerful, requires HF login |
| llama3.2-1b | 1B | ~15 tok/s | ⚡ Fastest Llama, requires HF login |
| llama3.2-3b | 3B | ~10 tok/s | Fast & capable, requires HF login |

DeepSeek R1 (Reasoning)

| Model ID | Size | NPU Speed | Notes |
|---|---|---|---|
| deepseek-1.5b | 1.5B | ~10 tok/s | Fast reasoning |
| deepseek-7b | 7B | ~3 tok/s | Best reasoning |

GLM-Edge (Bilingual)

| Model ID | Size | NPU Speed | Notes |
|---|---|---|---|
| glm-edge-1.5b | 1.5B | ~10 tok/s | Chinese/English bilingual |
| glm-edge-4b | 4B | ~5 tok/s | Larger bilingual model |

MiniCPM (Ultra-Compact)

| Model ID | Size | NPU Speed | Notes |
|---|---|---|---|
| minicpm-1b | 1B | ~15 tok/s | Ultra-compact, efficient |
| minicpm-2b | 2B | ~10 tok/s | Small but capable |

Baichuan2 (Chinese)

| Model ID | Size | NPU Speed | Notes |
|---|---|---|---|
| baichuan2-7b | 7B | ~3 tok/s | Chinese-focused LLM |

Load Multiple Models

.\start_backend.bat --models "qwen2.5-3b,llama3.2-1b,minicpm-2b"

Note: First run downloads and compiles each model (1-3 min). Subsequent loads are instant from cache.


⚡ NPU vs CPU/GPU

| Metric | NPU | CPU | iGPU |
|---|---|---|---|
| Power Draw | ~5-10W | 15-45W | 20-35W |
| TOPS (INT8) | 11 TOPS | ~2-3 TOPS | ~8 TOPS |
| Battery Life | Hours | ~1 hour | ~2 hours |
| Best For | Efficiency | Fallback | Larger models |

🔧 Configuration

Environment Variables

| Variable | Value | Description |
|---|---|---|
| IPEX_LLM_NPU_MTL | 1 | Required for Meteor Lake (Core Ultra Series 1) |
| HF_HOME | path | Hugging Face cache directory |
| PORT | 8001 | Server port override (default: 8000) |

Processor-Specific Settings

| Processor Series | Environment Variable |
|---|---|
| Core Ultra Series 1 (Meteor Lake) | IPEX_LLM_NPU_MTL=1 |
| Core Ultra Series 2 (Arrow Lake) | None required |
| Core Ultra (Lunar Lake) | None required |

πŸ› Troubleshooting

NPU Not Detected

  1. Check Device Manager → Neural processors → Intel(R) AI Boost (or run the PowerShell check after this list)
  2. Update the NPU driver to the latest version
  3. Ensure IPEX_LLM_NPU_MTL=1 is set for Meteor Lake
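
A quick check from PowerShell (the friendly name below matches current Intel NPU drivers, but may vary slightly between driver versions):

# The NPU should appear with Status "OK"
Get-PnpDevice -FriendlyName "*AI Boost*" | Select-Object FriendlyName, Status, Class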

Generation Hangs

  • First generation takes 1-3 minutes for NPU warmup
  • Subsequent generations are fast (~1 second)

Port Already in Use

# Kill existing Python processes
Get-Process python* | Stop-Process -Force
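
If you'd rather not kill every Python process, first identify the process that owns port 8000 (adjust the port if you started the server with --port):

# Find and stop only the process listening on port 8000
$owner = (Get-NetTCPConnection -LocalPort 8000 -State Listen).OwningProcess
Get-Process -Id $owner
Stop-Process -Id $owner -Force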

.env File Encoding Error (ValueError: embedded null character)

This happens when the .env file was saved in UTF-16 (the default for PowerShell's > redirect).

Fix: Re-create the file using the UTF-8 safe command:

'HF_TOKEN=hf_your_token_here' | Out-File -FilePath .env -Encoding utf8

Or open the file in Notepad → File > Save As → set Encoding: UTF-8.


💾 Model Storage

Models are stored in two locations:

| Location | Contents | Path |
|---|---|---|
| HuggingFace Cache | Original downloaded models | %USERPROFILE%\.cache\huggingface\hub\ |
| NPU Cache | Compiled NPU-optimized models | intel-npu-llm\npu_model_cache\ |

Tip: The built-in chat UI at http://localhost:8000 shows total model disk usage live in the header (disk icon chip).

Space Usage (Approximate)

| Model Size | HF Cache | NPU Cache | Total |
|---|---|---|---|
| 1-2B models | ~2-4 GB | ~1-2 GB | ~3-6 GB |
| 3-4B models | ~6-8 GB | ~2-4 GB | ~8-12 GB |
| 7-8B models | ~14-16 GB | ~4-8 GB | ~18-24 GB |

🧹 Cache Management Commands

Check How Much Space Caches Are Using

# NPU cache size (compiled models)
"{0:N2} GB" -f ((Get-ChildItem -Recurse .\intel-npu-llm\npu_model_cache\ -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum / 1GB)

# HuggingFace cache size (downloaded weights)
"{0:N2} GB" -f ((Get-ChildItem -Recurse "$env:USERPROFILE\.cache\huggingface\hub\" -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum / 1GB)

# Both combined
$npu = (Get-ChildItem -Recurse .\intel-npu-llm\npu_model_cache\ -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum / 1GB
$hf  = (Get-ChildItem -Recurse "$env:USERPROFILE\.cache\huggingface\hub\" -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum / 1GB
"NPU cache: {0:N2} GB  |  HF cache: {1:N2} GB  |  Total: {2:N2} GB" -f $npu, $hf, ($npu + $hf)

Clear NPU Cache (Keeps HF Downloads – Fastest to Rebuild)

# Clear ALL compiled NPU models (recompiles on next run, no re-download needed)
Remove-Item -Recurse -Force .\intel-npu-llm\npu_model_cache\

Clear a Single Model's NPU Cache

# Example: remove only Qwen2.5-7B compiled cache
Remove-Item -Recurse -Force ".\intel-npu-llm\npu_model_cache\Qwen_Qwen2.5-7B-Instruct\"

# List all compiled NPU model folders to find the right name
Get-ChildItem .\intel-npu-llm\npu_model_cache\

Clear HuggingFace Download Cache

⚠️ This will force a full re-download on next use. Only do this if you need to free maximum disk space.

# Clear ALL HuggingFace downloads
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\"

# Clear a specific model from HF cache (example: Qwen2.5-7B)
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--Qwen--Qwen2.5-7B-Instruct\"

# List all downloaded HF models
Get-ChildItem "$env:USERPROFILE\.cache\huggingface\hub\" -Directory

Nuclear Option – Clear Everything

# Remove both NPU compiled cache AND HuggingFace downloads
Remove-Item -Recurse -Force .\intel-npu-llm\npu_model_cache\ -ErrorAction SilentlyContinue
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\" -ErrorAction SilentlyContinue
Write-Host "All model caches cleared. Models will re-download and recompile on next run."

Custom Cache Location

Set HF_HOME in your .env file to store HuggingFace models on a different drive (useful when the C: drive is low on space):

HF_HOME=D:\models\huggingface
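
For example, appending the setting to the existing .env from PowerShell (UTF-8, as noted above; the drive path is just an example):

Add-Content -Path .\.env -Value "HF_HOME=D:\models\huggingface" -Encoding utf8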

The NPU cache location is fixed at intel-npu-llm\npu_model_cache\ relative to the project directory.


πŸ“ Project Structure

npu-windows/
├── start_backend.bat             # One-click startup with auto CPU detection
├── QUICKSTART.md                 # 5-minute getting started guide
├── README.md                     # Full documentation
└── intel-npu-llm/
    ├── npu_server.py             # NPU-accelerated LLM server (FastAPI)
    ├── index.html                # Built-in dark-mode chat UI
    ├── models.json               # Model registry (add custom models here)
    ├── docker-compose.yml        # Open WebUI frontend (optional)
    ├── requirements.txt          # Python dependencies
    ├── .env.example              # Environment variable template
    └── npu_model_cache/          # Compiled NPU models (auto-created on first run)

📄 License

MIT License
