From c9c96007a6679aebedd1664f39adaffc1ab1091f Mon Sep 17 00:00:00 2001 From: WashingtonKK Date: Fri, 20 Feb 2026 13:11:19 +0300 Subject: [PATCH 1/4] CD-29 - Add deploying custom models docs Signed-off-by: WashingtonKK --- .../custom-model-deployment.md | 406 ++++++++++++++++++ docs/developer-guide/index.md | 1 + sidebars.ts | 1 + 3 files changed, 408 insertions(+) create mode 100644 docs/developer-guide/custom-model-deployment.md diff --git a/docs/developer-guide/custom-model-deployment.md b/docs/developer-guide/custom-model-deployment.md new file mode 100644 index 0000000..a73d17d --- /dev/null +++ b/docs/developer-guide/custom-model-deployment.md @@ -0,0 +1,406 @@ +--- +id: custom-model-deployment +title: Deploying Custom Models +sidebar_position: 6 +--- + +## Deploying Custom Models with HAL and Cloud-Init + +Cube AI supports deploying custom LLM models into Confidential VMs (CVMs) through two approaches: **Buildroot HAL images** and **Ubuntu cloud-init**. This guide covers both paths for Ollama and vLLM backends. + +:::info +For basic model file transfer into a running CVM, see [Private Model Upload](/developer-guide/private-model-upload). This guide covers full deployment workflows including build-time configuration and automated provisioning. +::: + +--- + +## Approach 1: Buildroot HAL (Build-Time) + +The Buildroot HAL embeds model configuration directly into the CVM image. Models are pulled automatically on first boot based on settings configured during the build. 
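A first-boot pull has to tolerate networking and services that are still coming up, so each pull is worth wrapping in a retry loop. A minimal POSIX-sh sketch of that pattern (illustrative only — the model names, retry count, and delay are examples, not the shipped script):

```shell
#!/bin/sh
# retry_pull CMD [MAX] [DELAY] — run CMD until it succeeds, up to MAX
# attempts, sleeping DELAY seconds between attempts. Non-zero if all fail.
retry_pull() {
    _cmd=$1 _max=${2:-20} _delay=${3:-5}
    _i=1
    while [ "$_i" -le "$_max" ]; do
        if $_cmd; then
            return 0
        fi
        _i=$((_i + 1))
        sleep "$_delay"
    done
    return 1
}

# Example: pull each configured model with retries (model list is illustrative)
# for model in llama2:7b mistral:7b; do
#     retry_pull "ollama pull $model" 20 5 || echo "failed: $model" >&2
# done
```

The same loop shape works for any boot-time dependency that may not be ready yet, such as waiting on the backend's version endpoint before pulling.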
+ +### Ollama Custom Models + +#### Configure via Menuconfig + +During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: + +**Target packages → Cube packages → ollama** + +Enable these options: + +- **Install default models** — Pulls `llama3.2:3b`, `starcoder2:3b`, and `nomic-embed-text:v1.5` on first boot +- **Custom models to install** — Space-separated list of additional Ollama models + +For example, to add `llama2:7b` and `mistral:7b`: + +``` +Custom models to install: llama2:7b mistral:7b codellama:13b +``` + +#### Configure via Defconfig + +Alternatively, set models directly in the Buildroot defconfig or via `make menuconfig` save: + +```bash +BR2_PACKAGE_OLLAMA_MODELS=y +BR2_PACKAGE_OLLAMA_CUSTOM_MODELS="llama2:7b mistral:7b codellama:13b" +``` + +Then rebuild the image: + +```bash +make -j$(nproc) +``` + +#### How It Works + +The Ollama package installs a model-pull script at `/usr/libexec/ollama/pull-models.sh` that runs after the Ollama service starts. The script retries each model pull up to 20 times with 5-second intervals, handling temporary network issues during boot. + +#### GPU Support + +Enable GPU acceleration in menuconfig under **ollama → Enable GPU support**, then select the GPU type: + +- **NVIDIA GPU** — Requires NVIDIA drivers and CUDA +- **AMD GPU (ROCm)** — Requires ROCm drivers + +### vLLM Custom Models + +#### HuggingFace Models + +Set the model identifier in menuconfig under **Target packages → Cube packages → vllm**: + +- **Model to use** — HuggingFace model ID (e.g., `meta-llama/Llama-2-7b-hf`) +- **GPU Memory Utilization** — Fraction of GPU memory (default: `0.85`) +- **Maximum Model Length** — Max sequence length (default: `1024`) + +Or via defconfig: + +```bash +BR2_PACKAGE_VLLM_MODEL="meta-llama/Llama-2-7b-hf" +BR2_PACKAGE_VLLM_GPU_MEMORY="0.90" +BR2_PACKAGE_VLLM_MAX_MODEL_LEN="2048" +``` + +The model is downloaded from HuggingFace on first boot and cached at `/var/lib/vllm/`. 
+ +#### Local Model Files + +To embed model files directly into the image instead of downloading them: + +1. Set the **Custom model path** to a directory on your build machine containing the model files: + +```bash +BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/local/model" +``` + +2. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. + +### Cube Agent Backend Selection + +The Cube Agent must be configured to point to the correct backend. In menuconfig under **Target packages → Cube packages → cube-agent → LLM Backend**: + +| Backend | Target URL | When to Use | +| --- | --- | --- | +| Ollama | `http://localhost:11434` | Default, lightweight models | +| vLLM | `http://localhost:8000` | GPU-accelerated production workloads | +| Custom URL | User-defined | External or custom backend | + +--- + +## Approach 2: Ubuntu Cloud-Init + +The Ubuntu cloud-init approach uses a `user-data` configuration file to provision a VM with custom models during first boot. This is the recommended path for development and when using Ubuntu-based CVMs. + +### Overview + +The cloud-init script in `hal/ubuntu/qemu.sh` generates a full VM that: + +1. Installs Ollama from the official installer +2. Builds the Cube Agent from source +3. Creates systemd services for both +4. Pulls configured models on first boot + +### Customizing Models in Cloud-Init + +Edit the `user-data` section in `hal/ubuntu/qemu.sh` to change which models are pulled. 
+ +#### Default Models + +The default configuration pulls these models: + +```yaml +write_files: + - path: /usr/local/bin/pull-ollama-models.sh + content: | + #!/bin/bash + for i in $(seq 1 60); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 + done + /usr/local/bin/ollama pull tinyllama:1.1b + /usr/local/bin/ollama pull starcoder2:3b + /usr/local/bin/ollama pull nomic-embed-text:v1.5 + permissions: '0755' +``` + +#### Adding Custom Models + +To deploy different models, modify the `pull-ollama-models.sh` content in the `write_files` section: + +```yaml +write_files: + - path: /usr/local/bin/pull-ollama-models.sh + content: | + #!/bin/bash + for i in $(seq 1 60); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 + done + # Default models + /usr/local/bin/ollama pull tinyllama:1.1b + # Custom models + /usr/local/bin/ollama pull llama2:7b + /usr/local/bin/ollama pull mistral:7b + /usr/local/bin/ollama pull codellama:13b + permissions: '0755' +``` + +#### Using a Custom Modelfile + +To deploy a model from a custom Modelfile (for fine-tuned or customized models), add it to the `write_files` section and create it during `runcmd`: + +```yaml +write_files: + - path: /etc/cube/custom-model/Modelfile + content: | + FROM llama2:7b + PARAMETER temperature 0.7 + PARAMETER top_p 0.9 + SYSTEM "You are a helpful coding assistant." + permissions: '0644' + +runcmd: + # ... 
(after ollama is installed and running) + - /usr/local/bin/ollama create my-custom-model -f /etc/cube/custom-model/Modelfile +``` + +#### Configuring the Cube Agent + +The agent configuration is set via cloud-init in `/etc/cube/agent.env`: + +```yaml +write_files: + - path: /etc/cube/agent.env + content: | + UV_CUBE_AGENT_LOG_LEVEL=info + UV_CUBE_AGENT_HOST=0.0.0.0 + UV_CUBE_AGENT_PORT=7001 + UV_CUBE_AGENT_INSTANCE_ID=cube-agent-01 + UV_CUBE_AGENT_TARGET_URL=http://localhost:11434 + UV_CUBE_AGENT_SERVER_CERT=/etc/cube/certs/server.crt + UV_CUBE_AGENT_SERVER_KEY=/etc/cube/certs/server.key + UV_CUBE_AGENT_SERVER_CA_CERTS=/etc/cube/certs/ca.crt + UV_CUBE_AGENT_CA_URL=https://prism.ultraviolet.rs/am-certs + permissions: '0644' +``` + +To use vLLM instead of Ollama, change the target URL: + +```bash +UV_CUBE_AGENT_TARGET_URL=http://localhost:8000 +``` + +### Launching the Cloud-Init VM + +Run the script from the `hal/ubuntu/` directory: + +```bash +cd cube/hal/ubuntu +sudo bash qemu.sh +``` + +The script: + +1. Downloads the Ubuntu Noble cloud image (if not already present) +2. Creates a QCOW2 overlay disk +3. Generates a cloud-init seed image from the `user-data` configuration +4. Detects TDX support and launches the VM accordingly + +**Port mappings:** + +| Host Port | Guest Port | Service | +| --- | --- | --- | +| 6190 | 22 | SSH | +| 6191 | 80 | HTTP | +| 6192 | 443 | HTTPS | +| 6193 | 7001 | Cube Agent | + +**TDX mode control:** + +```bash +# Auto-detect (default) +sudo ENABLE_CVM=auto bash qemu.sh + +# Force TDX +sudo ENABLE_CVM=tdx bash qemu.sh + +# Disable CVM (regular VM) +sudo ENABLE_CVM=none bash qemu.sh +``` + +--- + +## Runtime Model Deployment + +After a CVM is running (regardless of which approach was used to create it), you can deploy additional models at runtime. 
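When scripting such runtime deployments, an idempotent wrapper avoids re-pulling models that are already installed. A sketch (an assumption for illustration: it relies on `ollama list` printing one model per line with the name in the first column):

```shell
# ensure_model NAME — pull NAME only if `ollama list` does not already
# show it. Sketch only: assumes the model name is the first column.
ensure_model() {
    if ollama list | awk '{print $1}' | grep -qx "$1"; then
        echo "already present: $1"
    else
        ollama pull "$1"
    fi
}
```

Run it inside the CVM (see the SSH instructions below), e.g. `ensure_model llama2:7b`, before pointing workloads at the new model.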
+ +### SSH into the CVM + +```bash +# Buildroot CVM +ssh -p 6190 root@localhost + +# Ubuntu cloud-init CVM +ssh -p 6190 ultraviolet@localhost +# Password: password +``` + +### Pull Ollama Models at Runtime + +```bash +# List current models +ollama list + +# Pull a new model +ollama pull llama2:7b + +# Create a model from a Modelfile +cat > /tmp/Modelfile << 'EOF' +FROM llama2:7b +PARAMETER temperature 0.8 +SYSTEM "You are a domain-specific assistant." +EOF +ollama create my-model -f /tmp/Modelfile + +# Verify the model is available +ollama list +``` + +### Upload Model Files via SCP + +For models not available in registries: + +```bash +# From the host, copy model files into the CVM +scp -P 6190 model-weights.tar.gz root@localhost:~ + +# Inside the CVM, extract and register +tar -xzf model-weights.tar.gz +# For Ollama, copy to model directory +cp -r extracted-model /var/lib/ollama/models/ +``` + +### Verify Model Availability + +Test that the model is accessible through the Cube Agent: + +```bash +# From the host +curl http://localhost:6193/v1/models + +# Or make a chat completion request +curl http://localhost:6193/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "llama2:7b", + "messages": [{"role": "user", "content": "Hello"}] + }' +``` + +--- + +## Comparison of Deployment Approaches + +| Feature | Buildroot HAL | Cloud-Init (Ubuntu) | +| --- | --- | --- | +| Base OS | Minimal Buildroot Linux | Ubuntu Noble | +| Image size | Small (~hundreds of MB) | Larger (~GB+) | +| Build time | ~1 hour | Minutes (download-based) | +| Model config | Build-time via menuconfig | Cloud-init user-data | +| Model pull | On first boot (auto) | On first boot (auto) | +| Customization | Requires rebuild | Edit user-data file | +| GPU support | Via Buildroot packages | Via Ubuntu packages | +| Best for | Production, minimal images | Development, rapid iteration | +| TEE support | AMD SEV-SNP, Intel TDX | Intel TDX | +| Init system | SysV or systemd | 
systemd | + +--- + +## Troubleshooting + +### Models Fail to Pull on Boot + +Check network connectivity inside the CVM: + +```bash +# Test DNS resolution +ping -c 1 ollama.com + +# Check Ollama service status +systemctl status ollama +# or +/etc/init.d/S96ollama status +``` + +For Buildroot images, the pull script retries 20 times. Check the logs: + +```bash +journalctl -u ollama -f +``` + +### Ollama Reports Insufficient Disk Space + +The default Buildroot rootfs is limited in size. Increase it during the build: + +```bash +# In menuconfig: Filesystem images → ext4 root filesystem → size +# Or in defconfig: +BR2_TARGET_ROOTFS_EXT2_SIZE="30G" +``` + +For cloud-init VMs, the disk is controlled by `DISK_SIZE` in `qemu.sh` (default: `35G`). + +### vLLM Fails to Load Model + +Verify GPU is available and the model fits in memory: + +```bash +# Check GPU +nvidia-smi + +# Check vLLM config +cat /etc/vllm/vllm.env + +# Restart with adjusted settings +systemctl restart vllm +``` + +### Agent Cannot Reach Backend + +Verify the backend service is running and the agent's target URL matches: + +```bash +# Check agent config +cat /etc/cube/agent.env + +# Test backend directly +curl http://localhost:11434/api/tags # Ollama +curl http://localhost:8000/v1/models # vLLM + +# Restart agent +systemctl restart cube-agent +``` diff --git a/docs/developer-guide/index.md b/docs/developer-guide/index.md index db8d483..3fba229 100644 --- a/docs/developer-guide/index.md +++ b/docs/developer-guide/index.md @@ -18,6 +18,7 @@ as private model upload and fine-tuning. 
- **Chat UI** - **Hardware Abstraction Layer (HAL)** - **CVM Management** +- **Deploying Custom Models** - Deploy custom models via HAL build-time config or cloud-init - **Private Model Upload** - **Fine-Tuning Models** - **Guardrails** - AI safety controls for input validation and output sanitization diff --git a/sidebars.ts b/sidebars.ts index 1c9eeb2..f6f1dba 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -59,6 +59,7 @@ const sidebars: SidebarsConfig = { 'developer-guide/private-model-upload', 'developer-guide/hal', 'developer-guide/cvm-management', + 'developer-guide/custom-model-deployment', 'developer-guide/fine-tuning', 'developer-guide/auth-and-request-flow', ], From 83341686551d389cc42b57cb0594deefc6593ee4 Mon Sep 17 00:00:00 2001 From: WashingtonKK Date: Fri, 20 Feb 2026 13:24:50 +0300 Subject: [PATCH 2/4] lint doc Signed-off-by: WashingtonKK --- docs/developer-guide/custom-model-deployment.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/developer-guide/custom-model-deployment.md b/docs/developer-guide/custom-model-deployment.md index a73d17d..1db0268 100644 --- a/docs/developer-guide/custom-model-deployment.md +++ b/docs/developer-guide/custom-model-deployment.md @@ -24,7 +24,7 @@ The Buildroot HAL embeds model configuration directly into the CVM image. Models During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: -**Target packages → Cube packages → ollama** +Menu path: **Target packages → Cube packages → ollama** Enable these options: @@ -33,7 +33,7 @@ Enable these options: For example, to add `llama2:7b` and `mistral:7b`: -``` +```bash Custom models to install: llama2:7b mistral:7b codellama:13b ``` @@ -93,7 +93,7 @@ To embed model files directly into the image instead of downloading them: BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/local/model" ``` -2. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. 
+1. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. ### Cube Agent Backend Selection From 3a2a49b0a3017e2f6f3c5895402c926de5c6ae2d Mon Sep 17 00:00:00 2001 From: WashingtonKK Date: Tue, 24 Feb 2026 11:35:42 +0300 Subject: [PATCH 3/4] update custom model upload Signed-off-by: WashingtonKK --- .../custom-model-deployment.md | 406 ------------------ docs/developer-guide/index.md | 3 +- docs/developer-guide/private-model-upload.md | 276 +++++++++++- sidebars.ts | 1 - 4 files changed, 268 insertions(+), 418 deletions(-) delete mode 100644 docs/developer-guide/custom-model-deployment.md diff --git a/docs/developer-guide/custom-model-deployment.md b/docs/developer-guide/custom-model-deployment.md deleted file mode 100644 index 1db0268..0000000 --- a/docs/developer-guide/custom-model-deployment.md +++ /dev/null @@ -1,406 +0,0 @@ ---- -id: custom-model-deployment -title: Deploying Custom Models -sidebar_position: 6 ---- - -## Deploying Custom Models with HAL and Cloud-Init - -Cube AI supports deploying custom LLM models into Confidential VMs (CVMs) through two approaches: **Buildroot HAL images** and **Ubuntu cloud-init**. This guide covers both paths for Ollama and vLLM backends. - -:::info -For basic model file transfer into a running CVM, see [Private Model Upload](/developer-guide/private-model-upload). This guide covers full deployment workflows including build-time configuration and automated provisioning. -::: - ---- - -## Approach 1: Buildroot HAL (Build-Time) - -The Buildroot HAL embeds model configuration directly into the CVM image. Models are pulled automatically on first boot based on settings configured during the build. 
- -### Ollama Custom Models - -#### Configure via Menuconfig - -During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: - -Menu path: **Target packages → Cube packages → ollama** - -Enable these options: - -- **Install default models** — Pulls `llama3.2:3b`, `starcoder2:3b`, and `nomic-embed-text:v1.5` on first boot -- **Custom models to install** — Space-separated list of additional Ollama models - -For example, to add `llama2:7b` and `mistral:7b`: - -```bash -Custom models to install: llama2:7b mistral:7b codellama:13b -``` - -#### Configure via Defconfig - -Alternatively, set models directly in the Buildroot defconfig or via `make menuconfig` save: - -```bash -BR2_PACKAGE_OLLAMA_MODELS=y -BR2_PACKAGE_OLLAMA_CUSTOM_MODELS="llama2:7b mistral:7b codellama:13b" -``` - -Then rebuild the image: - -```bash -make -j$(nproc) -``` - -#### How It Works - -The Ollama package installs a model-pull script at `/usr/libexec/ollama/pull-models.sh` that runs after the Ollama service starts. The script retries each model pull up to 20 times with 5-second intervals, handling temporary network issues during boot. - -#### GPU Support - -Enable GPU acceleration in menuconfig under **ollama → Enable GPU support**, then select the GPU type: - -- **NVIDIA GPU** — Requires NVIDIA drivers and CUDA -- **AMD GPU (ROCm)** — Requires ROCm drivers - -### vLLM Custom Models - -#### HuggingFace Models - -Set the model identifier in menuconfig under **Target packages → Cube packages → vllm**: - -- **Model to use** — HuggingFace model ID (e.g., `meta-llama/Llama-2-7b-hf`) -- **GPU Memory Utilization** — Fraction of GPU memory (default: `0.85`) -- **Maximum Model Length** — Max sequence length (default: `1024`) - -Or via defconfig: - -```bash -BR2_PACKAGE_VLLM_MODEL="meta-llama/Llama-2-7b-hf" -BR2_PACKAGE_VLLM_GPU_MEMORY="0.90" -BR2_PACKAGE_VLLM_MAX_MODEL_LEN="2048" -``` - -The model is downloaded from HuggingFace on first boot and cached at `/var/lib/vllm/`. 
- -#### Local Model Files - -To embed model files directly into the image instead of downloading them: - -1. Set the **Custom model path** to a directory on your build machine containing the model files: - -```bash -BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/local/model" -``` - -1. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. - -### Cube Agent Backend Selection - -The Cube Agent must be configured to point to the correct backend. In menuconfig under **Target packages → Cube packages → cube-agent → LLM Backend**: - -| Backend | Target URL | When to Use | -| --- | --- | --- | -| Ollama | `http://localhost:11434` | Default, lightweight models | -| vLLM | `http://localhost:8000` | GPU-accelerated production workloads | -| Custom URL | User-defined | External or custom backend | - ---- - -## Approach 2: Ubuntu Cloud-Init - -The Ubuntu cloud-init approach uses a `user-data` configuration file to provision a VM with custom models during first boot. This is the recommended path for development and when using Ubuntu-based CVMs. - -### Overview - -The cloud-init script in `hal/ubuntu/qemu.sh` generates a full VM that: - -1. Installs Ollama from the official installer -2. Builds the Cube Agent from source -3. Creates systemd services for both -4. Pulls configured models on first boot - -### Customizing Models in Cloud-Init - -Edit the `user-data` section in `hal/ubuntu/qemu.sh` to change which models are pulled. 
- -#### Default Models - -The default configuration pulls these models: - -```yaml -write_files: - - path: /usr/local/bin/pull-ollama-models.sh - content: | - #!/bin/bash - for i in $(seq 1 60); do - if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then - break - fi - sleep 2 - done - /usr/local/bin/ollama pull tinyllama:1.1b - /usr/local/bin/ollama pull starcoder2:3b - /usr/local/bin/ollama pull nomic-embed-text:v1.5 - permissions: '0755' -``` - -#### Adding Custom Models - -To deploy different models, modify the `pull-ollama-models.sh` content in the `write_files` section: - -```yaml -write_files: - - path: /usr/local/bin/pull-ollama-models.sh - content: | - #!/bin/bash - for i in $(seq 1 60); do - if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then - break - fi - sleep 2 - done - # Default models - /usr/local/bin/ollama pull tinyllama:1.1b - # Custom models - /usr/local/bin/ollama pull llama2:7b - /usr/local/bin/ollama pull mistral:7b - /usr/local/bin/ollama pull codellama:13b - permissions: '0755' -``` - -#### Using a Custom Modelfile - -To deploy a model from a custom Modelfile (for fine-tuned or customized models), add it to the `write_files` section and create it during `runcmd`: - -```yaml -write_files: - - path: /etc/cube/custom-model/Modelfile - content: | - FROM llama2:7b - PARAMETER temperature 0.7 - PARAMETER top_p 0.9 - SYSTEM "You are a helpful coding assistant." - permissions: '0644' - -runcmd: - # ... 
(after ollama is installed and running) - - /usr/local/bin/ollama create my-custom-model -f /etc/cube/custom-model/Modelfile -``` - -#### Configuring the Cube Agent - -The agent configuration is set via cloud-init in `/etc/cube/agent.env`: - -```yaml -write_files: - - path: /etc/cube/agent.env - content: | - UV_CUBE_AGENT_LOG_LEVEL=info - UV_CUBE_AGENT_HOST=0.0.0.0 - UV_CUBE_AGENT_PORT=7001 - UV_CUBE_AGENT_INSTANCE_ID=cube-agent-01 - UV_CUBE_AGENT_TARGET_URL=http://localhost:11434 - UV_CUBE_AGENT_SERVER_CERT=/etc/cube/certs/server.crt - UV_CUBE_AGENT_SERVER_KEY=/etc/cube/certs/server.key - UV_CUBE_AGENT_SERVER_CA_CERTS=/etc/cube/certs/ca.crt - UV_CUBE_AGENT_CA_URL=https://prism.ultraviolet.rs/am-certs - permissions: '0644' -``` - -To use vLLM instead of Ollama, change the target URL: - -```bash -UV_CUBE_AGENT_TARGET_URL=http://localhost:8000 -``` - -### Launching the Cloud-Init VM - -Run the script from the `hal/ubuntu/` directory: - -```bash -cd cube/hal/ubuntu -sudo bash qemu.sh -``` - -The script: - -1. Downloads the Ubuntu Noble cloud image (if not already present) -2. Creates a QCOW2 overlay disk -3. Generates a cloud-init seed image from the `user-data` configuration -4. Detects TDX support and launches the VM accordingly - -**Port mappings:** - -| Host Port | Guest Port | Service | -| --- | --- | --- | -| 6190 | 22 | SSH | -| 6191 | 80 | HTTP | -| 6192 | 443 | HTTPS | -| 6193 | 7001 | Cube Agent | - -**TDX mode control:** - -```bash -# Auto-detect (default) -sudo ENABLE_CVM=auto bash qemu.sh - -# Force TDX -sudo ENABLE_CVM=tdx bash qemu.sh - -# Disable CVM (regular VM) -sudo ENABLE_CVM=none bash qemu.sh -``` - ---- - -## Runtime Model Deployment - -After a CVM is running (regardless of which approach was used to create it), you can deploy additional models at runtime. 
- -### SSH into the CVM - -```bash -# Buildroot CVM -ssh -p 6190 root@localhost - -# Ubuntu cloud-init CVM -ssh -p 6190 ultraviolet@localhost -# Password: password -``` - -### Pull Ollama Models at Runtime - -```bash -# List current models -ollama list - -# Pull a new model -ollama pull llama2:7b - -# Create a model from a Modelfile -cat > /tmp/Modelfile << 'EOF' -FROM llama2:7b -PARAMETER temperature 0.8 -SYSTEM "You are a domain-specific assistant." -EOF -ollama create my-model -f /tmp/Modelfile - -# Verify the model is available -ollama list -``` - -### Upload Model Files via SCP - -For models not available in registries: - -```bash -# From the host, copy model files into the CVM -scp -P 6190 model-weights.tar.gz root@localhost:~ - -# Inside the CVM, extract and register -tar -xzf model-weights.tar.gz -# For Ollama, copy to model directory -cp -r extracted-model /var/lib/ollama/models/ -``` - -### Verify Model Availability - -Test that the model is accessible through the Cube Agent: - -```bash -# From the host -curl http://localhost:6193/v1/models - -# Or make a chat completion request -curl http://localhost:6193/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llama2:7b", - "messages": [{"role": "user", "content": "Hello"}] - }' -``` - ---- - -## Comparison of Deployment Approaches - -| Feature | Buildroot HAL | Cloud-Init (Ubuntu) | -| --- | --- | --- | -| Base OS | Minimal Buildroot Linux | Ubuntu Noble | -| Image size | Small (~hundreds of MB) | Larger (~GB+) | -| Build time | ~1 hour | Minutes (download-based) | -| Model config | Build-time via menuconfig | Cloud-init user-data | -| Model pull | On first boot (auto) | On first boot (auto) | -| Customization | Requires rebuild | Edit user-data file | -| GPU support | Via Buildroot packages | Via Ubuntu packages | -| Best for | Production, minimal images | Development, rapid iteration | -| TEE support | AMD SEV-SNP, Intel TDX | Intel TDX | -| Init system | SysV or systemd | 
systemd | - ---- - -## Troubleshooting - -### Models Fail to Pull on Boot - -Check network connectivity inside the CVM: - -```bash -# Test DNS resolution -ping -c 1 ollama.com - -# Check Ollama service status -systemctl status ollama -# or -/etc/init.d/S96ollama status -``` - -For Buildroot images, the pull script retries 20 times. Check the logs: - -```bash -journalctl -u ollama -f -``` - -### Ollama Reports Insufficient Disk Space - -The default Buildroot rootfs is limited in size. Increase it during the build: - -```bash -# In menuconfig: Filesystem images → ext4 root filesystem → size -# Or in defconfig: -BR2_TARGET_ROOTFS_EXT2_SIZE="30G" -``` - -For cloud-init VMs, the disk is controlled by `DISK_SIZE` in `qemu.sh` (default: `35G`). - -### vLLM Fails to Load Model - -Verify GPU is available and the model fits in memory: - -```bash -# Check GPU -nvidia-smi - -# Check vLLM config -cat /etc/vllm/vllm.env - -# Restart with adjusted settings -systemctl restart vllm -``` - -### Agent Cannot Reach Backend - -Verify the backend service is running and the agent's target URL matches: - -```bash -# Check agent config -cat /etc/cube/agent.env - -# Test backend directly -curl http://localhost:11434/api/tags # Ollama -curl http://localhost:8000/v1/models # vLLM - -# Restart agent -systemctl restart cube-agent -``` diff --git a/docs/developer-guide/index.md b/docs/developer-guide/index.md index 3fba229..1628285 100644 --- a/docs/developer-guide/index.md +++ b/docs/developer-guide/index.md @@ -18,8 +18,7 @@ as private model upload and fine-tuning. 
- **Chat UI** - **Hardware Abstraction Layer (HAL)** - **CVM Management** -- **Deploying Custom Models** - Deploy custom models via HAL build-time config or cloud-init -- **Private Model Upload** +- **Private Model Upload** - Deploy custom models via HAL build-time config, cloud-init, or runtime upload - **Fine-Tuning Models** - **Guardrails** - AI safety controls for input validation and output sanitization diff --git a/docs/developer-guide/private-model-upload.md b/docs/developer-guide/private-model-upload.md index c20afc8..5cf549d 100644 --- a/docs/developer-guide/private-model-upload.md +++ b/docs/developer-guide/private-model-upload.md @@ -6,24 +6,282 @@ sidebar_position: 3 ## Uploading Private Models to Cube AI -This guide explains how to upload private models into the Ollama runtime inside a confidential VM. +This guide explains how to upload and deploy private or custom models into a Cube AI Confidential VM (CVM). Private models are models that are not available in public registries (Ollama library, HuggingFace) — for example, fine-tuned models, proprietary weights, or models with restricted access. -## 1. Package Model Files +--- + +## Ollama Backend + +### Upload Model Files to a Running CVM + +#### 1. Package Model Files + +Prepare your model weights and any associated files into an archive: + +```bash +tar -czvf my-model.tar.gz /path/to/model/files +``` + +#### 2. Transfer to the CVM + +Copy the archive into the CVM via SCP using the forwarded SSH port: + +```bash +# Buildroot CVM +scp -P 6190 my-model.tar.gz root@localhost:/var/lib/ollama/ + +# Ubuntu cloud-init CVM +scp -P 6190 my-model.tar.gz ultraviolet@localhost:/var/lib/ollama/ +``` + +#### 3. 
Extract and Register the Model + +SSH into the CVM and create an Ollama model from the uploaded files using a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md): + +```bash +ssh -p 6190 root@localhost + +cd /var/lib/ollama +tar -xzvf my-model.tar.gz +``` + +Create a Modelfile that references the uploaded weights: + +```bash +cat > /tmp/Modelfile << 'EOF' +FROM /var/lib/ollama/my-model/weights.gguf +PARAMETER temperature 0.7 +PARAMETER top_p 0.9 +SYSTEM "You are a helpful assistant." +EOF + +ollama create my-custom-model -f /tmp/Modelfile +``` + +#### 4. Verify the Model ```bash -tar -czvf model-name.tar.gz /path/to/model/files +ollama list ``` -## 2. Transfer and Extract Model in CVM +Test inference: ```bash -scp model-name.tar.gz user@:~ -gunzip model-name.tar.gz -tar -xvf model-name.tar +curl http://localhost:7001/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "my-custom-model", + "messages": [{"role": "user", "content": "Hello"}] + }' ``` -## 3. Copy Into Ollama +### Embed a Private Model in a Buildroot HAL Image + +To include a private model directly in the HAL image at build time, use the Buildroot filesystem overlay: + +#### 1. Place Model Files in the Overlay + +Create a directory for the model in the overlay structure: + +```bash +mkdir -p cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ +cp /path/to/weights.gguf cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ +``` + +#### 2. Add a Modelfile to the Overlay + +```bash +mkdir -p cube/hal/buildroot/linux/board/cube/overlay/etc/cube/modelfiles/ +cat > cube/hal/buildroot/linux/board/cube/overlay/etc/cube/modelfiles/my-model.Modelfile << 'EOF' +FROM /var/lib/ollama/custom-models/weights.gguf +PARAMETER temperature 0.7 +SYSTEM "You are a domain-specific assistant." +EOF +``` + +#### 3. 
Register the Model on First Boot + +Add a startup script in the overlay that creates the Ollama model after the service starts: + +```bash +mkdir -p cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/ +cat > cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/register-custom-models.sh << 'SCRIPT' +#!/bin/sh +# Wait for Ollama to be ready +for i in $(seq 1 30); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 +done + +# Register custom models from Modelfiles +for mf in /etc/cube/modelfiles/*.Modelfile; do + [ -f "$mf" ] || continue + name=$(basename "$mf" .Modelfile) + ollama create "$name" -f "$mf" +done +SCRIPT +chmod +x cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/register-custom-models.sh +``` + +#### 4. Build the Image ```bash -docker cp /path/to/extracted ollama:/models/ +cd buildroot +make -j$(nproc) ``` + +The model weights are embedded in the rootfs and registered automatically on first boot. + +### Embed a Private Model via Cloud-Init + +To deploy a private model in an Ubuntu cloud-init CVM, modify the `user-data` section in `hal/ubuntu/qemu.sh`: + +#### 1. Pre-Stage Model Files on the Host + +Place model files in a directory accessible to the QEMU VM. The simplest approach is to transfer them after boot via the `runcmd` section. + +#### 2. Add a Modelfile and Registration to Cloud-Init + +Add the Modelfile and a registration command to the `write_files` and `runcmd` sections: + +```yaml +write_files: + - path: /etc/cube/modelfiles/my-model.Modelfile + content: | + FROM /var/lib/ollama/custom-models/weights.gguf + PARAMETER temperature 0.7 + SYSTEM "You are a domain-specific assistant." + permissions: '0644' + +runcmd: + # ... 
(existing commands) + # After ollama is installed and running, register the custom model + - | + for i in $(seq 1 60); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 + done + ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile +``` + +If the model weights need to be downloaded from a private source during provisioning, add a download step before registration: + +```yaml +runcmd: + # Download private model weights (e.g., from a private S3 bucket or internal server) + - mkdir -p /var/lib/ollama/custom-models + - curl -o /var/lib/ollama/custom-models/weights.gguf https://internal-server/models/weights.gguf + # Then register + - ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile +``` + +--- + +## vLLM Backend + +### Upload Custom Model Files to a Running CVM + +#### 1. Transfer Model Directory + +vLLM expects a HuggingFace-format model directory. Transfer the entire directory: + +```bash +scp -r -P 6190 /path/to/my-hf-model/ root@localhost:/var/lib/vllm/models/ +``` + +#### 2. Update vLLM Configuration + +SSH into the CVM and update the vLLM environment to point to the uploaded model: + +```bash +ssh -p 6190 root@localhost + +# Edit the vLLM config +sed -i 's|^VLLM_MODEL=.*|VLLM_MODEL=/var/lib/vllm/models/my-hf-model|' /etc/vllm/vllm.env + +# Restart vLLM +systemctl restart vllm +# or for SysV init: +/etc/init.d/S96vllm restart +``` + +#### 3. Verify + +```bash +curl http://localhost:8000/v1/models +``` + +### Embed a Custom Model in a Buildroot HAL Image + +Use the `BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH` option to embed model files at build time. + +#### 1. Configure the Model Path + +In `menuconfig`, navigate to **Target packages → Cube packages → vllm** and set: + +- **Custom model path** — Absolute path to the model directory on your build machine + +Or set it in the defconfig: + +```bash +BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/my-hf-model" +``` + +#### 2. 
Build + +```bash +make -j$(nproc) +``` + +The build system copies the model files into `/var/lib/vllm/models/` in the image and configures vLLM to use the local path automatically. + +### Embed a Custom Model via Cloud-Init + +Add a model download or transfer step to the `runcmd` section in `hal/ubuntu/qemu.sh`: + +```yaml +runcmd: + # Install vLLM + - pip install vllm + # Download private model + - mkdir -p /var/lib/vllm/models + - | + # Option A: Download from a private registry (requires HF token for gated models) + HF_TOKEN="your-token-here" + huggingface-cli download my-org/my-private-model \ + --local-dir /var/lib/vllm/models/my-private-model \ + --token "$HF_TOKEN" + # Configure and start vLLM + - | + cat > /etc/vllm/vllm.env << 'ENVEOF' + VLLM_MODEL=/var/lib/vllm/models/my-private-model + VLLM_GPU_MEMORY_UTILIZATION=0.85 + VLLM_MAX_MODEL_LEN=2048 + ENVEOF + - systemctl restart vllm +``` + +--- + +## Verifying Model Availability Through the Proxy + +After deploying a custom model, verify it is accessible end-to-end through the Cube Agent: + +```bash +# List available models +curl http://localhost:6193/v1/models + +# Test chat completions +curl http://localhost:6193/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "my-custom-model", + "messages": [{"role": "user", "content": "Hello"}] + }' +``` + +Port `6193` is the default host-side forwarded port for the Cube Agent (maps to port `7001` inside the CVM). 
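When deployments are automated (for example in CI), the verification calls above can be wrapped in a small gate that waits for the model to actually appear before running further checks. The sketch below is illustrative only: `wait_for_model`, `model_listed`, and `CUBE_AGENT_URL` are hypothetical helpers, not part of the Cube tooling, and the default URL assumes the `6193` host-side port mapping described above.

```shell
#!/bin/sh
# Hypothetical helper: gate later deployment steps on a model actually
# appearing in the Cube Agent's /v1/models listing.

CUBE_AGENT_URL="${CUBE_AGENT_URL:-http://localhost:6193}"

# model_listed NAME JSON: succeed if NAME appears as a model id in the
# /v1/models response body passed as JSON.
model_listed() {
  printf '%s' "$2" | grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# wait_for_model NAME: poll the agent every 2 seconds, up to ~2 minutes.
wait_for_model() {
  i=0
  while [ "$i" -lt 60 ]; do
    body=$(curl -s "$CUBE_AGENT_URL/v1/models") || body=""
    if model_listed "$1" "$body"; then
      echo "model $1 is available"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "timed out waiting for model $1" >&2
  return 1
}
```

Calling `wait_for_model my-custom-model` before issuing the first chat completion avoids racing the model pull or registration that happens on first boot.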
diff --git a/sidebars.ts b/sidebars.ts
index f6f1dba..1c9eeb2 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -59,6 +59,7 @@ const sidebars: SidebarsConfig = {
      'developer-guide/private-model-upload',
      'developer-guide/hal',
      'developer-guide/cvm-management',
+     'developer-guide/custom-model-deployment',
      'developer-guide/fine-tuning',
      'developer-guide/auth-and-request-flow',
    ],

From cfe4dc85727b11c70b42291086f65f9a84e26b1e Mon Sep 17 00:00:00 2001
From: WashingtonKK
Date: Thu, 26 Feb 2026 13:50:36 +0300
Subject: [PATCH 4/4] update docs

Signed-off-by: WashingtonKK
---
 docs/developer-guide/private-model-upload.md | 264 ++++++++++---------
 1 file changed, 133 insertions(+), 131 deletions(-)

diff --git a/docs/developer-guide/private-model-upload.md b/docs/developer-guide/private-model-upload.md
index 5cf549d..0315231 100644
--- a/docs/developer-guide/private-model-upload.md
+++ b/docs/developer-guide/private-model-upload.md
@@ -8,87 +8,70 @@ sidebar_position: 3

This guide explains how to upload and deploy private or custom models into a Cube AI Confidential VM (CVM). Private models are models that are not available in public registries (Ollama library, HuggingFace) — for example, fine-tuned models, proprietary weights, or models with restricted access.

----
-
-## Ollama Backend
+### Port Reference

-### Upload Model Files to a Running CVM

-#### 1. 
Package Model Files +| Host Port | Guest Port | Service | +| --- | --- | --- | +| 6190 | 22 | SSH | +| 6193 | 7001 | Cube Agent API | -Prepare your model weights and any associated files into an archive: +Inside the CVM, the LLM backends listen on their own ports (not directly exposed to the host): -```bash -tar -czvf my-model.tar.gz /path/to/model/files -``` +| Port | Service | +| --- | --- | +| 11434 | Ollama API | +| 8000 | vLLM OpenAI-compatible API | -#### 2. Transfer to the CVM +The Cube Agent (port 7001 inside the CVM, 6193 on the host) acts as a reverse proxy to whichever LLM backend is configured, so all model inference requests go through the agent. -Copy the archive into the CVM via SCP using the forwarded SSH port: +--- -```bash -# Buildroot CVM -scp -P 6190 my-model.tar.gz root@localhost:/var/lib/ollama/ +## Build-Time Model Embedding (Buildroot HAL) -# Ubuntu cloud-init CVM -scp -P 6190 my-model.tar.gz ultraviolet@localhost:/var/lib/ollama/ -``` +The Buildroot HAL supports embedding custom model configuration directly into the CVM image via `menuconfig`. This is the recommended approach for production deployments where models should be available immediately after boot. -#### 3. Extract and Register the Model +### Ollama -SSH into the CVM and create an Ollama model from the uploaded files using a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md): +#### Using menuconfig -```bash -ssh -p 6190 root@localhost +During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: -cd /var/lib/ollama -tar -xzvf my-model.tar.gz -``` +**Target packages → Cube packages → ollama** -Create a Modelfile that references the uploaded weights: +Set the **Custom models to install** field with a space-separated list of Ollama model tags: -```bash -cat > /tmp/Modelfile << 'EOF' -FROM /var/lib/ollama/my-model/weights.gguf -PARAMETER temperature 0.7 -PARAMETER top_p 0.9 -SYSTEM "You are a helpful assistant." 
-EOF - -ollama create my-custom-model -f /tmp/Modelfile +```text +llama2:7b mistral:7b codellama:13b ``` -#### 4. Verify the Model +These models are pulled automatically on first boot by a script installed at `/usr/libexec/ollama/pull-models.sh`. + +Or set it directly in the Buildroot defconfig: ```bash -ollama list +BR2_PACKAGE_OLLAMA_CUSTOM_MODELS="llama2:7b mistral:7b codellama:13b" ``` -Test inference: +Then rebuild: ```bash -curl http://localhost:7001/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "my-custom-model", - "messages": [{"role": "user", "content": "Hello"}] - }' +make -j$(nproc) ``` -### Embed a Private Model in a Buildroot HAL Image - -To include a private model directly in the HAL image at build time, use the Buildroot filesystem overlay: +#### Embedding GGUF Weights in the Image -#### 1. Place Model Files in the Overlay +For models not available in the Ollama registry (e.g., your own fine-tuned GGUF weights), use the Buildroot filesystem overlay to embed the files directly: -Create a directory for the model in the overlay structure: +1. Place the model weights in the overlay: ```bash mkdir -p cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ cp /path/to/weights.gguf cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ ``` -#### 2. Add a Modelfile to the Overlay +2. Add a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md) to the overlay: ```bash mkdir -p cube/hal/buildroot/linux/board/cube/overlay/etc/cube/modelfiles/ @@ -99,9 +82,7 @@ SYSTEM "You are a domain-specific assistant." EOF ``` -#### 3. Register the Model on First Boot - -Add a startup script in the overlay that creates the Ollama model after the service starts: +3. 
Add a startup script in the overlay to register the model after Ollama starts: ```bash mkdir -p cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/ @@ -125,26 +106,52 @@ SCRIPT chmod +x cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/register-custom-models.sh ``` -#### 4. Build the Image +4. Build the image: ```bash cd buildroot make -j$(nproc) ``` -The model weights are embedded in the rootfs and registered automatically on first boot. +### vLLM -### Embed a Private Model via Cloud-Init +#### Using menuconfig -To deploy a private model in an Ubuntu cloud-init CVM, modify the `user-data` section in `hal/ubuntu/qemu.sh`: +Navigate to **Target packages → Cube packages → vllm** and set: -#### 1. Pre-Stage Model Files on the Host +- **Custom model path** — Absolute path to a HuggingFace-format model directory on your build machine -Place model files in a directory accessible to the QEMU VM. The simplest approach is to transfer them after boot via the `runcmd` section. +Or in the defconfig: -#### 2. Add a Modelfile and Registration to Cloud-Init +```bash +BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/my-hf-model" +``` + +The build system copies the model files into `/var/lib/vllm/models/` in the image and configures vLLM to serve from that local path automatically. The vLLM service configuration is written to `/etc/vllm/vllm.env`. + +You can also configure inference parameters at build time: + +```bash +BR2_PACKAGE_VLLM_MODEL="meta-llama/Llama-2-7b-hf" +BR2_PACKAGE_VLLM_GPU_MEMORY="0.90" +BR2_PACKAGE_VLLM_MAX_MODEL_LEN="2048" +``` + +Then rebuild: + +```bash +make -j$(nproc) +``` + +--- -Add the Modelfile and a registration command to the `write_files` and `runcmd` sections: +## Cloud-Init Model Provisioning (Ubuntu) + +For Ubuntu-based CVMs using cloud-init, custom models are configured in the `user-data` section of `hal/ubuntu/qemu.sh`. Models are provisioned during the first boot. 
+ +### Ollama + +Add a Modelfile and registration commands to the `write_files` and `runcmd` sections of the cloud-init `user-data`: ```yaml write_files: @@ -156,8 +163,11 @@ write_files: permissions: '0644' runcmd: - # ... (existing commands) - # After ollama is installed and running, register the custom model + # ... (existing commands that install ollama and start it) + # Download private model weights from an internal server + - mkdir -p /var/lib/ollama/custom-models + - curl -o /var/lib/ollama/custom-models/weights.gguf https://internal-server/models/weights.gguf + # Wait for Ollama and register the custom model - | for i in $(seq 1 60); do if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then @@ -168,114 +178,108 @@ runcmd: ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile ``` -If the model weights need to be downloaded from a private source during provisioning, add a download step before registration: +### vLLM + +Add a model download and vLLM configuration step to `runcmd`: ```yaml runcmd: - # Download private model weights (e.g., from a private S3 bucket or internal server) - - mkdir -p /var/lib/ollama/custom-models - - curl -o /var/lib/ollama/custom-models/weights.gguf https://internal-server/models/weights.gguf - # Then register - - ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile + - pip install vllm + - mkdir -p /var/lib/vllm/models + # Download from a private HuggingFace registry (requires token for gated models) + - | + HF_TOKEN="your-token-here" + huggingface-cli download my-org/my-private-model \ + --local-dir /var/lib/vllm/models/my-private-model \ + --token "$HF_TOKEN" + # Configure vLLM to use the downloaded model + - | + cat > /etc/vllm/vllm.env << 'ENVEOF' + VLLM_MODEL=/var/lib/vllm/models/my-private-model + VLLM_GPU_MEMORY_UTILIZATION=0.85 + VLLM_MAX_MODEL_LEN=2048 + ENVEOF + - systemctl restart vllm ``` --- -## vLLM Backend +## Runtime Model Upload -### Upload Custom Model Files to a Running 
CVM +After a CVM is running (regardless of which approach was used to create it), you can deploy additional models over SSH. -#### 1. Transfer Model Directory +### Ollama -vLLM expects a HuggingFace-format model directory. Transfer the entire directory: +#### 1. Transfer and Register ```bash -scp -r -P 6190 /path/to/my-hf-model/ root@localhost:/var/lib/vllm/models/ -``` - -#### 2. Update vLLM Configuration +# Package model files on the host +tar -czvf my-model.tar.gz /path/to/model/files -SSH into the CVM and update the vLLM environment to point to the uploaded model: +# Copy into the CVM (port 6190 forwards to SSH port 22 inside the CVM) +scp -P 6190 my-model.tar.gz root@localhost:/var/lib/ollama/ -```bash +# SSH into the CVM and register the model ssh -p 6190 root@localhost +cd /var/lib/ollama && tar -xzvf my-model.tar.gz -# Edit the vLLM config -sed -i 's|^VLLM_MODEL=.*|VLLM_MODEL=/var/lib/vllm/models/my-hf-model|' /etc/vllm/vllm.env +cat > /tmp/Modelfile << 'EOF' +FROM /var/lib/ollama/my-model/weights.gguf +PARAMETER temperature 0.7 +PARAMETER top_p 0.9 +SYSTEM "You are a helpful assistant." +EOF -# Restart vLLM -systemctl restart vllm -# or for SysV init: -/etc/init.d/S96vllm restart +ollama create my-custom-model -f /tmp/Modelfile ``` -#### 3. Verify +:::note +For Ubuntu cloud-init CVMs, the default SSH user is `ultraviolet` (password: `password`). For Buildroot CVMs, the default user is `root`. +::: + +#### 2. Verify ```bash -curl http://localhost:8000/v1/models +ollama list ``` -### Embed a Custom Model in a Buildroot HAL Image - -Use the `BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH` option to embed model files at build time. +### vLLM -#### 1. Configure the Model Path +#### 1. 
Transfer and Configure -In `menuconfig`, navigate to **Target packages → Cube packages → vllm** and set: - -- **Custom model path** — Absolute path to the model directory on your build machine - -Or set it in the defconfig: +vLLM expects a HuggingFace-format model directory: ```bash -BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/my-hf-model" -``` +# Copy the model directory into the CVM +scp -r -P 6190 /path/to/my-hf-model/ root@localhost:/var/lib/vllm/models/ -#### 2. Build +# SSH in and update the vLLM config to point to the new model +ssh -p 6190 root@localhost +sed -i 's|^VLLM_MODEL=.*|VLLM_MODEL=/var/lib/vllm/models/my-hf-model|' /etc/vllm/vllm.env -```bash -make -j$(nproc) +# Restart vLLM to load the new model +systemctl restart vllm +# or for SysV init: +/etc/init.d/S96vllm restart ``` -The build system copies the model files into `/var/lib/vllm/models/` in the image and configures vLLM to use the local path automatically. - -### Embed a Custom Model via Cloud-Init +#### 2. Verify -Add a model download or transfer step to the `runcmd` section in `hal/ubuntu/qemu.sh`: - -```yaml -runcmd: - # Install vLLM - - pip install vllm - # Download private model - - mkdir -p /var/lib/vllm/models - - | - # Option A: Download from a private registry (requires HF token for gated models) - HF_TOKEN="your-token-here" - huggingface-cli download my-org/my-private-model \ - --local-dir /var/lib/vllm/models/my-private-model \ - --token "$HF_TOKEN" - # Configure and start vLLM - - | - cat > /etc/vllm/vllm.env << 'ENVEOF' - VLLM_MODEL=/var/lib/vllm/models/my-private-model - VLLM_GPU_MEMORY_UTILIZATION=0.85 - VLLM_MAX_MODEL_LEN=2048 - ENVEOF - - systemctl restart vllm +```bash +curl http://localhost:8000/v1/models ``` --- -## Verifying Model Availability Through the Proxy +## Verifying Model Availability -After deploying a custom model, verify it is accessible end-to-end through the Cube Agent: +After deploying a custom model, verify it is accessible end-to-end through the Cube 
Agent. From the host: ```bash -# List available models +# List available models (port 6193 forwards to the Cube Agent on port 7001 inside the CVM) curl http://localhost:6193/v1/models -# Test chat completions +# Test a chat completion request curl http://localhost:6193/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -283,5 +287,3 @@ curl http://localhost:6193/v1/chat/completions \ "messages": [{"role": "user", "content": "Hello"}] }' ``` - -Port `6193` is the default host-side forwarded port for the Cube Agent (maps to port `7001` inside the CVM).
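
A note on scripting the chat-completion check above: the assistant reply can be pulled out of the response without `jq`. The `extract_reply` helper below is a hypothetical sed-based sketch; it assumes the standard OpenAI-compatible response shape and a reply containing no escaped quotes or newlines.

```shell
#!/bin/sh
# Hypothetical extractor for the assistant reply in an OpenAI-compatible
# /v1/chat/completions response. Assumes the reply text contains no
# escaped quotes or newlines; prefer jq when it is available.
extract_reply() {
  printf '%s' "$1" | sed -n 's/.*"content"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

Usage: `extract_reply "$(curl -s http://localhost:6193/v1/chat/completions -H 'Content-Type: application/json' -d "$payload")"`.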