AirLLM Docker Setup

A complete Dockerized environment for AirLLM, optimized to run massive Large Language Models (LLMs) on a single GPU using layer swapping, 4-bit quantization, and NVMe optimization.

This repository is specifically tailored to run on consumer hardware while providing a seamless, OpenAI-compatible streaming API.

Features

  • OpenAI-Compatible API: Streaming endpoint (/v1/chat/completions) and models list (/v1/models) ready for drop-in integration with various UIs and CLI tools (see the example request after this list).
  • NVMe Layer Swapping: Built-in support for fast model inference using an NVMe drive for layer swapping when massive models won't fit entirely in GPU VRAM.
  • Optimized for Consumer Hardware: Built-in environment optimizations including BitsAndBytes (nf4) quantization, Flash Attention 2 (where supported), and OMP CPU pre-fetching logic.
  • Graceful Loading: The API responds immediately on startup with a loading notice while the model initialises in the background, so clients do not see connection refused errors.
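
As a quick illustration of the drop-in compatibility, you can send a streaming request with plain curl once the server is ready. This is a minimal sketch; the model name matches the one used in the Continue configuration further below and may differ on your setup.

# Example streaming chat completion request against the local server
curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-7b",
        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
        "stream": true
      }'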

Prerequisites

  • Docker Engine
  • NVIDIA GPU Drivers (CUDA 12.4+ supported)
  • NVIDIA Container Toolkit (Required to pass GPUs into the container)
  • Optional but Highly Recommended: An NVMe Drive for faster layer swapping.

Quick Start Configuration

1. Model Configuration

By default, this repository is configured to serve the Qwen/Qwen2.5-Coder-7B-Instruct model (as defined in config.json). You can customize this by editing the config.json file inside the repository before starting the server.

2. NVMe Drive Setup (Recommended)

To achieve the best inference performance via layer swapping, it is highly recommended to dedicate an NVMe drive on your host machine and mount it at /mnt/nvme_ram.

If you have a dedicated drive available (e.g., /dev/nvme0n1 or /dev/sde), you can format and mount it using the following commands:

WARNING: Formatting a drive will erase all its existing data! Be absolutely sure you have the correct drive identifier.
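
To double-check which device you are about to wipe, you can list the block devices on the host first (lsblk is part of util-linux and available on most distributions):

# List block devices with size, model, and current mount points to confirm the identifier
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT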

# Wipe any existing filesystem signatures
sudo wipefs -a <your_nvme_device>

# Format the drive
sudo mkfs.ext4 -F <your_nvme_device>

# Create mount point
sudo mkdir -p /mnt/nvme_ram

# Mount the drive
sudo mount -o noatime <your_nvme_device> /mnt/nvme_ram

# Set permissions
sudo chown -R $USER:$USER /mnt/nvme_ram
sudo chmod -R 755 /mnt/nvme_ram
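
If you want the mount to survive reboots, you can optionally add it to /etc/fstab. This step is not covered by the repository scripts; it is just standard Linux practice.

# Find the filesystem UUID of the freshly formatted drive
sudo blkid <your_nvme_device>

# Then add a line like the following to /etc/fstab (using the UUID reported above)
# UUID=<your_uuid>  /mnt/nvme_ram  ext4  noatime  0  2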

Place your config.json inside /mnt/nvme_ram. When starting the server, if the script detects /mnt/nvme_ram/config.json, it will use the NVMe drive automatically. Otherwise, it will fall back to using a local ./models directory for the model cache.
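
A quick way to confirm which cache location the start script will pick up (this one-liner just mirrors the detection logic described above; it is not part of airllm.sh):

# Prints the NVMe path if the config is found there, otherwise the local fallback
test -f /mnt/nvme_ram/config.json && echo "/mnt/nvme_ram" || echo "./models"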

3. Usage

Use the provided airllm.sh control script to manage the container lifecycle.

Command               Description
./airllm.sh start     Build image (if needed) and start the container
./airllm.sh stop      Stop and remove the running container
./airllm.sh restart   Stop and restart without rebuilding the image
./airllm.sh rebuild   Stop, rebuild the image, and restart
./airllm.sh logs      Follow live container logs
./airllm.sh status    Check NVMe mount and container running state
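
A typical first run, based on the commands above, looks like this (the initial start also builds the Docker image, so expect it to take a while):

# Build (if needed) and start the container, then follow the logs
./airllm.sh start
./airllm.sh logs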

Warning

Startup Delay: Loading a 7B model from NVMe into memory with 4-bit quantization takes 5 to 7 minutes on consumer hardware. During this time, the API (port 11434) will refuse connections (e.g. from the Continue CLI). Use ./airllm.sh logs and wait for the message Application startup complete before trying to connect.
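
Once the logs report Application startup complete, you can verify the API is reachable by querying the models list endpoint mentioned above:

# Should return a JSON list of available models once the server is ready
curl http://localhost:11434/v1/models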

Continue CLI Integration

This server is specifically designed to be fully compatible with the Continue CLI as an OpenAI-compatible custom provider.

Installing Continue CLI

To use the Continue CLI, you can install it via npm:

npm install -g @continuedev/cli

Configuration

To connect the Continue CLI to your AirLLM server, create or modify your ~/.continue/config.json with the following entry:

{
  "models": [
    {
      "title": "AirLLM Qwen2.5-Coder",
      "provider": "openai",
      "model": "qwen2.5-coder-7b",
      "apiBase": "http://localhost:11434/v1"
    }
  ]
}

Now you can run the CLI tool and use your locally hosted LLM!

cn --config ~/.continue/config.json
