AirLLM Docker Setup

A complete Dockerized environment for AirLLM, optimized to run massive Large Language Models (LLMs) on a single GPU using layer swapping, 4-bit quantization, and NVMe optimization.

This repository is specifically tailored to run on consumer hardware while providing a seamless, OpenAI-compatible streaming API.

Features

  • OpenAI-Compatible API: Streaming endpoint (/v1/chat/completions) and models list (/v1/models) ready for drop-in integration with various UIs and CLI tools (see the example request after this list).
  • NVMe Layer Swapping: Built-in support for fast model inference using an NVMe drive for layer swapping when massive models won't fit entirely in GPU VRAM.
  • Optimized for Consumer Hardware: Built-in environment optimizations including BitsAndBytes (nf4) quantization, Flash Attention 2 (where supported), and OMP CPU pre-fetching logic.
  • Graceful Loading: The API responds immediately on startup with a loading notice while the model initialises in the background, so clients do not see connection refused errors.
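
As a quick illustration of the drop-in compatibility, you can send a streaming request with plain curl once the server is ready. This is a minimal sketch; the model name matches the one used in the Continue configuration further below and may differ on your setup.

# Example streaming chat completion request against the local server
curl -N http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-coder-7b",
        "messages": [{"role": "user", "content": "Write a hello world in Python."}],
        "stream": true
      }'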

Prerequisites

  • Docker Engine
  • NVIDIA GPU Drivers (CUDA 12.4+ supported)
  • NVIDIA Container Toolkit (Required to pass GPUs into the container)
  • Optional but Highly Recommended: An NVMe Drive for faster layer swapping.

Quick Start Configuration

1. Model Configuration

By default, this repository is configured to serve the Qwen/Qwen2.5-Coder-7B-Instruct model (as defined in config.json). You can customize this by editing the config.json file inside the repository before starting the server.

2. NVMe Drive Setup (Recommended)

To achieve the best inference performance via layer swapping, it is highly recommended to dedicate an NVMe drive on your host machine and mount it at /mnt/nvme_ram.

If you have a dedicated drive available (e.g., /dev/nvme0n1 or /dev/sde), you can format and mount it using the following commands:

WARNING: Formatting a drive will erase all its existing data! Be absolutely sure you have the correct drive identifier.
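
To double-check which device you are about to wipe, you can list the block devices on the host first (lsblk is part of util-linux and available on most distributions):

# List block devices with size, model, and current mount points to confirm the identifier
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT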

# Wipe any existing filesystem signatures
sudo wipefs -a <your_nvme_device>

# Format the drive
sudo mkfs.ext4 -F <your_nvme_device>

# Create mount point
sudo mkdir -p /mnt/nvme_ram

# Mount the drive
sudo mount -o noatime <your_nvme_device> /mnt/nvme_ram

# Set permissions
sudo chown -R $USER:$USER /mnt/nvme_ram
sudo chmod -R 755 /mnt/nvme_ram
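
If you want the mount to survive reboots, you can optionally add it to /etc/fstab. This step is not covered by the repository scripts; it is just standard Linux practice.

# Find the filesystem UUID of the freshly formatted drive
sudo blkid <your_nvme_device>

# Then add a line like the following to /etc/fstab (using the UUID reported above)
# UUID=<your_uuid>  /mnt/nvme_ram  ext4  noatime  0  2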

Place your config.json inside /mnt/nvme_ram. When starting the server, if the script detects /mnt/nvme_ram/config.json, it will use the NVMe drive automatically. Otherwise, it will fall back to using a local ./models directory for the model cache.
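
A quick way to confirm which cache location the start script will pick up (this one-liner just mirrors the detection logic described above; it is not part of airllm.sh):

# Prints the NVMe path if the config is found there, otherwise the local fallback
test -f /mnt/nvme_ram/config.json && echo "/mnt/nvme_ram" || echo "./models"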

3. Usage

Use the provided airllm.sh control script to manage the container lifecycle.

Command               Description
./airllm.sh start     Build image (if needed) and start the container
./airllm.sh stop      Stop and remove the running container
./airllm.sh restart   Stop and restart without rebuilding the image
./airllm.sh rebuild   Stop, rebuild the image, and restart
./airllm.sh logs      Follow live container logs
./airllm.sh status    Check NVMe mount and container running state
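
A typical first run, based on the commands above, looks like this (the initial start also builds the Docker image, so expect it to take a while):

# Build (if needed) and start the container, then follow the logs
./airllm.sh start
./airllm.sh logs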

Warning

Startup Delay: Loading a 7B model from NVMe into memory with 4-bit quantization takes 5 to 7 minutes on consumer hardware. During this time, the API (port 11434) will refuse connections (e.g. from the Continue CLI). Use ./airllm.sh logs and wait for the message Application startup complete before trying to connect.
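
Once the logs report Application startup complete, you can verify the API is reachable by querying the models list endpoint mentioned above:

# Should return a JSON list of available models once the server is ready
curl http://localhost:11434/v1/models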

Continue CLI Integration

This server is specifically designed to be fully compatible with the Continue CLI as an OpenAI-compatible custom provider.

Installing Continue CLI

To use the Continue CLI, you can install it via npm:

npm install -g @continuedev/cli

Configuration

To connect the Continue CLI to your AirLLM server, create or modify your ~/.continue/config.json with the following entry:

{
  "models": [
    {
      "title": "AirLLM Qwen2.5-Coder",
      "provider": "openai",
      "model": "qwen2.5-coder-7b",
      "apiBase": "http://localhost:11434/v1"
    }
  ]
}

Now you can run the CLI tool and use your locally hosted LLM!

cn --config ~/.continue/config.json
