From c9c96007a6679aebedd1664f39adaffc1ab1091f Mon Sep 17 00:00:00 2001 From: WashingtonKK Date: Fri, 20 Feb 2026 13:11:19 +0300 Subject: [PATCH 1/4] CD-29 - Add deploying custom models docs Signed-off-by: WashingtonKK --- .../custom-model-deployment.md | 406 ++++++++++++++++++ docs/developer-guide/index.md | 1 + sidebars.ts | 1 + 3 files changed, 408 insertions(+) create mode 100644 docs/developer-guide/custom-model-deployment.md diff --git a/docs/developer-guide/custom-model-deployment.md b/docs/developer-guide/custom-model-deployment.md new file mode 100644 index 0000000..a73d17d --- /dev/null +++ b/docs/developer-guide/custom-model-deployment.md @@ -0,0 +1,406 @@ +--- +id: custom-model-deployment +title: Deploying Custom Models +sidebar_position: 6 +--- + +## Deploying Custom Models with HAL and Cloud-Init + +Cube AI supports deploying custom LLM models into Confidential VMs (CVMs) through two approaches: **Buildroot HAL images** and **Ubuntu cloud-init**. This guide covers both paths for Ollama and vLLM backends. + +:::info +For basic model file transfer into a running CVM, see [Private Model Upload](/developer-guide/private-model-upload). This guide covers full deployment workflows including build-time configuration and automated provisioning. +::: + +--- + +## Approach 1: Buildroot HAL (Build-Time) + +The Buildroot HAL embeds model configuration directly into the CVM image. Models are pulled automatically on first boot based on settings configured during the build. 
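A first-boot pull has to tolerate networking and services that are still coming up, so each pull is worth wrapping in a retry loop. A minimal POSIX-sh sketch of that pattern (illustrative only — the model names, retry count, and delay are examples, not the shipped script):

```shell
#!/bin/sh
# retry_pull CMD [MAX] [DELAY] — run CMD until it succeeds, up to MAX
# attempts, sleeping DELAY seconds between attempts. Non-zero if all fail.
retry_pull() {
    _cmd=$1 _max=${2:-20} _delay=${3:-5}
    _i=1
    while [ "$_i" -le "$_max" ]; do
        if $_cmd; then
            return 0
        fi
        _i=$((_i + 1))
        sleep "$_delay"
    done
    return 1
}

# Example: pull each configured model with retries (model list is illustrative)
# for model in llama2:7b mistral:7b; do
#     retry_pull "ollama pull $model" 20 5 || echo "failed: $model" >&2
# done
```

The same loop shape works for any boot-time dependency that may not be ready yet, such as waiting on the backend's version endpoint before pulling.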
+ +### Ollama Custom Models + +#### Configure via Menuconfig + +During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: + +**Target packages → Cube packages → ollama** + +Enable these options: + +- **Install default models** — Pulls `llama3.2:3b`, `starcoder2:3b`, and `nomic-embed-text:v1.5` on first boot +- **Custom models to install** — Space-separated list of additional Ollama models + +For example, to add `llama2:7b` and `mistral:7b`: + +``` +Custom models to install: llama2:7b mistral:7b codellama:13b +``` + +#### Configure via Defconfig + +Alternatively, set models directly in the Buildroot defconfig or via `make menuconfig` save: + +```bash +BR2_PACKAGE_OLLAMA_MODELS=y +BR2_PACKAGE_OLLAMA_CUSTOM_MODELS="llama2:7b mistral:7b codellama:13b" +``` + +Then rebuild the image: + +```bash +make -j$(nproc) +``` + +#### How It Works + +The Ollama package installs a model-pull script at `/usr/libexec/ollama/pull-models.sh` that runs after the Ollama service starts. The script retries each model pull up to 20 times with 5-second intervals, handling temporary network issues during boot. + +#### GPU Support + +Enable GPU acceleration in menuconfig under **ollama → Enable GPU support**, then select the GPU type: + +- **NVIDIA GPU** — Requires NVIDIA drivers and CUDA +- **AMD GPU (ROCm)** — Requires ROCm drivers + +### vLLM Custom Models + +#### HuggingFace Models + +Set the model identifier in menuconfig under **Target packages → Cube packages → vllm**: + +- **Model to use** — HuggingFace model ID (e.g., `meta-llama/Llama-2-7b-hf`) +- **GPU Memory Utilization** — Fraction of GPU memory (default: `0.85`) +- **Maximum Model Length** — Max sequence length (default: `1024`) + +Or via defconfig: + +```bash +BR2_PACKAGE_VLLM_MODEL="meta-llama/Llama-2-7b-hf" +BR2_PACKAGE_VLLM_GPU_MEMORY="0.90" +BR2_PACKAGE_VLLM_MAX_MODEL_LEN="2048" +``` + +The model is downloaded from HuggingFace on first boot and cached at `/var/lib/vllm/`. 
+ +#### Local Model Files + +To embed model files directly into the image instead of downloading them: + +1. Set the **Custom model path** to a directory on your build machine containing the model files: + +```bash +BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/local/model" +``` + +2. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. + +### Cube Agent Backend Selection + +The Cube Agent must be configured to point to the correct backend. In menuconfig under **Target packages → Cube packages → cube-agent → LLM Backend**: + +| Backend | Target URL | When to Use | +| --- | --- | --- | +| Ollama | `http://localhost:11434` | Default, lightweight models | +| vLLM | `http://localhost:8000` | GPU-accelerated production workloads | +| Custom URL | User-defined | External or custom backend | + +--- + +## Approach 2: Ubuntu Cloud-Init + +The Ubuntu cloud-init approach uses a `user-data` configuration file to provision a VM with custom models during first boot. This is the recommended path for development and when using Ubuntu-based CVMs. + +### Overview + +The cloud-init script in `hal/ubuntu/qemu.sh` generates a full VM that: + +1. Installs Ollama from the official installer +2. Builds the Cube Agent from source +3. Creates systemd services for both +4. Pulls configured models on first boot + +### Customizing Models in Cloud-Init + +Edit the `user-data` section in `hal/ubuntu/qemu.sh` to change which models are pulled. 
+ +#### Default Models + +The default configuration pulls these models: + +```yaml +write_files: + - path: /usr/local/bin/pull-ollama-models.sh + content: | + #!/bin/bash + for i in $(seq 1 60); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 + done + /usr/local/bin/ollama pull tinyllama:1.1b + /usr/local/bin/ollama pull starcoder2:3b + /usr/local/bin/ollama pull nomic-embed-text:v1.5 + permissions: '0755' +``` + +#### Adding Custom Models + +To deploy different models, modify the `pull-ollama-models.sh` content in the `write_files` section: + +```yaml +write_files: + - path: /usr/local/bin/pull-ollama-models.sh + content: | + #!/bin/bash + for i in $(seq 1 60); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 + done + # Default models + /usr/local/bin/ollama pull tinyllama:1.1b + # Custom models + /usr/local/bin/ollama pull llama2:7b + /usr/local/bin/ollama pull mistral:7b + /usr/local/bin/ollama pull codellama:13b + permissions: '0755' +``` + +#### Using a Custom Modelfile + +To deploy a model from a custom Modelfile (for fine-tuned or customized models), add it to the `write_files` section and create it during `runcmd`: + +```yaml +write_files: + - path: /etc/cube/custom-model/Modelfile + content: | + FROM llama2:7b + PARAMETER temperature 0.7 + PARAMETER top_p 0.9 + SYSTEM "You are a helpful coding assistant." + permissions: '0644' + +runcmd: + # ... 
(after ollama is installed and running) + - /usr/local/bin/ollama create my-custom-model -f /etc/cube/custom-model/Modelfile +``` + +#### Configuring the Cube Agent + +The agent configuration is set via cloud-init in `/etc/cube/agent.env`: + +```yaml +write_files: + - path: /etc/cube/agent.env + content: | + UV_CUBE_AGENT_LOG_LEVEL=info + UV_CUBE_AGENT_HOST=0.0.0.0 + UV_CUBE_AGENT_PORT=7001 + UV_CUBE_AGENT_INSTANCE_ID=cube-agent-01 + UV_CUBE_AGENT_TARGET_URL=http://localhost:11434 + UV_CUBE_AGENT_SERVER_CERT=/etc/cube/certs/server.crt + UV_CUBE_AGENT_SERVER_KEY=/etc/cube/certs/server.key + UV_CUBE_AGENT_SERVER_CA_CERTS=/etc/cube/certs/ca.crt + UV_CUBE_AGENT_CA_URL=https://prism.ultraviolet.rs/am-certs + permissions: '0644' +``` + +To use vLLM instead of Ollama, change the target URL: + +```bash +UV_CUBE_AGENT_TARGET_URL=http://localhost:8000 +``` + +### Launching the Cloud-Init VM + +Run the script from the `hal/ubuntu/` directory: + +```bash +cd cube/hal/ubuntu +sudo bash qemu.sh +``` + +The script: + +1. Downloads the Ubuntu Noble cloud image (if not already present) +2. Creates a QCOW2 overlay disk +3. Generates a cloud-init seed image from the `user-data` configuration +4. Detects TDX support and launches the VM accordingly + +**Port mappings:** + +| Host Port | Guest Port | Service | +| --- | --- | --- | +| 6190 | 22 | SSH | +| 6191 | 80 | HTTP | +| 6192 | 443 | HTTPS | +| 6193 | 7001 | Cube Agent | + +**TDX mode control:** + +```bash +# Auto-detect (default) +sudo ENABLE_CVM=auto bash qemu.sh + +# Force TDX +sudo ENABLE_CVM=tdx bash qemu.sh + +# Disable CVM (regular VM) +sudo ENABLE_CVM=none bash qemu.sh +``` + +--- + +## Runtime Model Deployment + +After a CVM is running (regardless of which approach was used to create it), you can deploy additional models at runtime. 
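When scripting such runtime deployments, an idempotent wrapper avoids re-pulling models that are already installed. A sketch (an assumption for illustration: it relies on `ollama list` printing one model per line with the name in the first column):

```shell
# ensure_model NAME — pull NAME only if `ollama list` does not already
# show it. Sketch only: assumes the model name is the first column.
ensure_model() {
    if ollama list | awk '{print $1}' | grep -qx "$1"; then
        echo "already present: $1"
    else
        ollama pull "$1"
    fi
}
```

Run it inside the CVM (see the SSH instructions below), e.g. `ensure_model llama2:7b`, before pointing workloads at the new model.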
+ +### SSH into the CVM + +```bash +# Buildroot CVM +ssh -p 6190 root@localhost + +# Ubuntu cloud-init CVM +ssh -p 6190 ultraviolet@localhost +# Password: password +``` + +### Pull Ollama Models at Runtime + +```bash +# List current models +ollama list + +# Pull a new model +ollama pull llama2:7b + +# Create a model from a Modelfile +cat > /tmp/Modelfile << 'EOF' +FROM llama2:7b +PARAMETER temperature 0.8 +SYSTEM "You are a domain-specific assistant." +EOF +ollama create my-model -f /tmp/Modelfile + +# Verify the model is available +ollama list +``` + +### Upload Model Files via SCP + +For models not available in registries: + +```bash +# From the host, copy model files into the CVM +scp -P 6190 model-weights.tar.gz root@localhost:~ + +# Inside the CVM, extract and register +tar -xzf model-weights.tar.gz +# For Ollama, copy to model directory +cp -r extracted-model /var/lib/ollama/models/ +``` + +### Verify Model Availability + +Test that the model is accessible through the Cube Agent: + +```bash +# From the host +curl http://localhost:6193/v1/models + +# Or make a chat completion request +curl http://localhost:6193/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "llama2:7b", + "messages": [{"role": "user", "content": "Hello"}] + }' +``` + +--- + +## Comparison of Deployment Approaches + +| Feature | Buildroot HAL | Cloud-Init (Ubuntu) | +| --- | --- | --- | +| Base OS | Minimal Buildroot Linux | Ubuntu Noble | +| Image size | Small (~hundreds of MB) | Larger (~GB+) | +| Build time | ~1 hour | Minutes (download-based) | +| Model config | Build-time via menuconfig | Cloud-init user-data | +| Model pull | On first boot (auto) | On first boot (auto) | +| Customization | Requires rebuild | Edit user-data file | +| GPU support | Via Buildroot packages | Via Ubuntu packages | +| Best for | Production, minimal images | Development, rapid iteration | +| TEE support | AMD SEV-SNP, Intel TDX | Intel TDX | +| Init system | SysV or systemd | 
systemd | + +--- + +## Troubleshooting + +### Models Fail to Pull on Boot + +Check network connectivity inside the CVM: + +```bash +# Test DNS resolution +ping -c 1 ollama.com + +# Check Ollama service status +systemctl status ollama +# or +/etc/init.d/S96ollama status +``` + +For Buildroot images, the pull script retries 20 times. Check the logs: + +```bash +journalctl -u ollama -f +``` + +### Ollama Reports Insufficient Disk Space + +The default Buildroot rootfs is limited in size. Increase it during the build: + +```bash +# In menuconfig: Filesystem images → ext4 root filesystem → size +# Or in defconfig: +BR2_TARGET_ROOTFS_EXT2_SIZE="30G" +``` + +For cloud-init VMs, the disk is controlled by `DISK_SIZE` in `qemu.sh` (default: `35G`). + +### vLLM Fails to Load Model + +Verify GPU is available and the model fits in memory: + +```bash +# Check GPU +nvidia-smi + +# Check vLLM config +cat /etc/vllm/vllm.env + +# Restart with adjusted settings +systemctl restart vllm +``` + +### Agent Cannot Reach Backend + +Verify the backend service is running and the agent's target URL matches: + +```bash +# Check agent config +cat /etc/cube/agent.env + +# Test backend directly +curl http://localhost:11434/api/tags # Ollama +curl http://localhost:8000/v1/models # vLLM + +# Restart agent +systemctl restart cube-agent +``` diff --git a/docs/developer-guide/index.md b/docs/developer-guide/index.md index db8d483..3fba229 100644 --- a/docs/developer-guide/index.md +++ b/docs/developer-guide/index.md @@ -18,6 +18,7 @@ as private model upload and fine-tuning. 
- **Chat UI** - **Hardware Abstraction Layer (HAL)** - **CVM Management** +- **Deploying Custom Models** - Deploy custom models via HAL build-time config or cloud-init - **Private Model Upload** - **Fine-Tuning Models** - **Guardrails** - AI safety controls for input validation and output sanitization diff --git a/sidebars.ts b/sidebars.ts index 1c9eeb2..f6f1dba 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -59,6 +59,7 @@ const sidebars: SidebarsConfig = { 'developer-guide/private-model-upload', 'developer-guide/hal', 'developer-guide/cvm-management', + 'developer-guide/custom-model-deployment', 'developer-guide/fine-tuning', 'developer-guide/auth-and-request-flow', ], From 83341686551d389cc42b57cb0594deefc6593ee4 Mon Sep 17 00:00:00 2001 From: WashingtonKK Date: Fri, 20 Feb 2026 13:24:50 +0300 Subject: [PATCH 2/4] lint doc Signed-off-by: WashingtonKK --- docs/developer-guide/custom-model-deployment.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/developer-guide/custom-model-deployment.md b/docs/developer-guide/custom-model-deployment.md index a73d17d..1db0268 100644 --- a/docs/developer-guide/custom-model-deployment.md +++ b/docs/developer-guide/custom-model-deployment.md @@ -24,7 +24,7 @@ The Buildroot HAL embeds model configuration directly into the CVM image. Models During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: -**Target packages → Cube packages → ollama** +Menu path: **Target packages → Cube packages → ollama** Enable these options: @@ -33,7 +33,7 @@ Enable these options: For example, to add `llama2:7b` and `mistral:7b`: -``` +```bash Custom models to install: llama2:7b mistral:7b codellama:13b ``` @@ -93,7 +93,7 @@ To embed model files directly into the image instead of downloading them: BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/local/model" ``` -2. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. 
+1. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. ### Cube Agent Backend Selection From 3a2a49b0a3017e2f6f3c5895402c926de5c6ae2d Mon Sep 17 00:00:00 2001 From: WashingtonKK Date: Tue, 24 Feb 2026 11:35:42 +0300 Subject: [PATCH 3/4] update custom model upload Signed-off-by: WashingtonKK --- .../custom-model-deployment.md | 406 ------------------ docs/developer-guide/index.md | 3 +- docs/developer-guide/private-model-upload.md | 276 +++++++++++- sidebars.ts | 1 - 4 files changed, 268 insertions(+), 418 deletions(-) delete mode 100644 docs/developer-guide/custom-model-deployment.md diff --git a/docs/developer-guide/custom-model-deployment.md b/docs/developer-guide/custom-model-deployment.md deleted file mode 100644 index 1db0268..0000000 --- a/docs/developer-guide/custom-model-deployment.md +++ /dev/null @@ -1,406 +0,0 @@ ---- -id: custom-model-deployment -title: Deploying Custom Models -sidebar_position: 6 ---- - -## Deploying Custom Models with HAL and Cloud-Init - -Cube AI supports deploying custom LLM models into Confidential VMs (CVMs) through two approaches: **Buildroot HAL images** and **Ubuntu cloud-init**. This guide covers both paths for Ollama and vLLM backends. - -:::info -For basic model file transfer into a running CVM, see [Private Model Upload](/developer-guide/private-model-upload). This guide covers full deployment workflows including build-time configuration and automated provisioning. -::: - ---- - -## Approach 1: Buildroot HAL (Build-Time) - -The Buildroot HAL embeds model configuration directly into the CVM image. Models are pulled automatically on first boot based on settings configured during the build. 
- -### Ollama Custom Models - -#### Configure via Menuconfig - -During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: - -Menu path: **Target packages → Cube packages → ollama** - -Enable these options: - -- **Install default models** — Pulls `llama3.2:3b`, `starcoder2:3b`, and `nomic-embed-text:v1.5` on first boot -- **Custom models to install** — Space-separated list of additional Ollama models - -For example, to add `llama2:7b` and `mistral:7b`: - -```bash -Custom models to install: llama2:7b mistral:7b codellama:13b -``` - -#### Configure via Defconfig - -Alternatively, set models directly in the Buildroot defconfig or via `make menuconfig` save: - -```bash -BR2_PACKAGE_OLLAMA_MODELS=y -BR2_PACKAGE_OLLAMA_CUSTOM_MODELS="llama2:7b mistral:7b codellama:13b" -``` - -Then rebuild the image: - -```bash -make -j$(nproc) -``` - -#### How It Works - -The Ollama package installs a model-pull script at `/usr/libexec/ollama/pull-models.sh` that runs after the Ollama service starts. The script retries each model pull up to 20 times with 5-second intervals, handling temporary network issues during boot. - -#### GPU Support - -Enable GPU acceleration in menuconfig under **ollama → Enable GPU support**, then select the GPU type: - -- **NVIDIA GPU** — Requires NVIDIA drivers and CUDA -- **AMD GPU (ROCm)** — Requires ROCm drivers - -### vLLM Custom Models - -#### HuggingFace Models - -Set the model identifier in menuconfig under **Target packages → Cube packages → vllm**: - -- **Model to use** — HuggingFace model ID (e.g., `meta-llama/Llama-2-7b-hf`) -- **GPU Memory Utilization** — Fraction of GPU memory (default: `0.85`) -- **Maximum Model Length** — Max sequence length (default: `1024`) - -Or via defconfig: - -```bash -BR2_PACKAGE_VLLM_MODEL="meta-llama/Llama-2-7b-hf" -BR2_PACKAGE_VLLM_GPU_MEMORY="0.90" -BR2_PACKAGE_VLLM_MAX_MODEL_LEN="2048" -``` - -The model is downloaded from HuggingFace on first boot and cached at `/var/lib/vllm/`. 
- -#### Local Model Files - -To embed model files directly into the image instead of downloading them: - -1. Set the **Custom model path** to a directory on your build machine containing the model files: - -```bash -BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/local/model" -``` - -1. The build system copies the model files into `/var/lib/vllm/models/` in the image and automatically configures vLLM to use the local path. - -### Cube Agent Backend Selection - -The Cube Agent must be configured to point to the correct backend. In menuconfig under **Target packages → Cube packages → cube-agent → LLM Backend**: - -| Backend | Target URL | When to Use | -| --- | --- | --- | -| Ollama | `http://localhost:11434` | Default, lightweight models | -| vLLM | `http://localhost:8000` | GPU-accelerated production workloads | -| Custom URL | User-defined | External or custom backend | - ---- - -## Approach 2: Ubuntu Cloud-Init - -The Ubuntu cloud-init approach uses a `user-data` configuration file to provision a VM with custom models during first boot. This is the recommended path for development and when using Ubuntu-based CVMs. - -### Overview - -The cloud-init script in `hal/ubuntu/qemu.sh` generates a full VM that: - -1. Installs Ollama from the official installer -2. Builds the Cube Agent from source -3. Creates systemd services for both -4. Pulls configured models on first boot - -### Customizing Models in Cloud-Init - -Edit the `user-data` section in `hal/ubuntu/qemu.sh` to change which models are pulled. 
- -#### Default Models - -The default configuration pulls these models: - -```yaml -write_files: - - path: /usr/local/bin/pull-ollama-models.sh - content: | - #!/bin/bash - for i in $(seq 1 60); do - if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then - break - fi - sleep 2 - done - /usr/local/bin/ollama pull tinyllama:1.1b - /usr/local/bin/ollama pull starcoder2:3b - /usr/local/bin/ollama pull nomic-embed-text:v1.5 - permissions: '0755' -``` - -#### Adding Custom Models - -To deploy different models, modify the `pull-ollama-models.sh` content in the `write_files` section: - -```yaml -write_files: - - path: /usr/local/bin/pull-ollama-models.sh - content: | - #!/bin/bash - for i in $(seq 1 60); do - if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then - break - fi - sleep 2 - done - # Default models - /usr/local/bin/ollama pull tinyllama:1.1b - # Custom models - /usr/local/bin/ollama pull llama2:7b - /usr/local/bin/ollama pull mistral:7b - /usr/local/bin/ollama pull codellama:13b - permissions: '0755' -``` - -#### Using a Custom Modelfile - -To deploy a model from a custom Modelfile (for fine-tuned or customized models), add it to the `write_files` section and create it during `runcmd`: - -```yaml -write_files: - - path: /etc/cube/custom-model/Modelfile - content: | - FROM llama2:7b - PARAMETER temperature 0.7 - PARAMETER top_p 0.9 - SYSTEM "You are a helpful coding assistant." - permissions: '0644' - -runcmd: - # ... 
(after ollama is installed and running) - - /usr/local/bin/ollama create my-custom-model -f /etc/cube/custom-model/Modelfile -``` - -#### Configuring the Cube Agent - -The agent configuration is set via cloud-init in `/etc/cube/agent.env`: - -```yaml -write_files: - - path: /etc/cube/agent.env - content: | - UV_CUBE_AGENT_LOG_LEVEL=info - UV_CUBE_AGENT_HOST=0.0.0.0 - UV_CUBE_AGENT_PORT=7001 - UV_CUBE_AGENT_INSTANCE_ID=cube-agent-01 - UV_CUBE_AGENT_TARGET_URL=http://localhost:11434 - UV_CUBE_AGENT_SERVER_CERT=/etc/cube/certs/server.crt - UV_CUBE_AGENT_SERVER_KEY=/etc/cube/certs/server.key - UV_CUBE_AGENT_SERVER_CA_CERTS=/etc/cube/certs/ca.crt - UV_CUBE_AGENT_CA_URL=https://prism.ultraviolet.rs/am-certs - permissions: '0644' -``` - -To use vLLM instead of Ollama, change the target URL: - -```bash -UV_CUBE_AGENT_TARGET_URL=http://localhost:8000 -``` - -### Launching the Cloud-Init VM - -Run the script from the `hal/ubuntu/` directory: - -```bash -cd cube/hal/ubuntu -sudo bash qemu.sh -``` - -The script: - -1. Downloads the Ubuntu Noble cloud image (if not already present) -2. Creates a QCOW2 overlay disk -3. Generates a cloud-init seed image from the `user-data` configuration -4. Detects TDX support and launches the VM accordingly - -**Port mappings:** - -| Host Port | Guest Port | Service | -| --- | --- | --- | -| 6190 | 22 | SSH | -| 6191 | 80 | HTTP | -| 6192 | 443 | HTTPS | -| 6193 | 7001 | Cube Agent | - -**TDX mode control:** - -```bash -# Auto-detect (default) -sudo ENABLE_CVM=auto bash qemu.sh - -# Force TDX -sudo ENABLE_CVM=tdx bash qemu.sh - -# Disable CVM (regular VM) -sudo ENABLE_CVM=none bash qemu.sh -``` - ---- - -## Runtime Model Deployment - -After a CVM is running (regardless of which approach was used to create it), you can deploy additional models at runtime. 
- -### SSH into the CVM - -```bash -# Buildroot CVM -ssh -p 6190 root@localhost - -# Ubuntu cloud-init CVM -ssh -p 6190 ultraviolet@localhost -# Password: password -``` - -### Pull Ollama Models at Runtime - -```bash -# List current models -ollama list - -# Pull a new model -ollama pull llama2:7b - -# Create a model from a Modelfile -cat > /tmp/Modelfile << 'EOF' -FROM llama2:7b -PARAMETER temperature 0.8 -SYSTEM "You are a domain-specific assistant." -EOF -ollama create my-model -f /tmp/Modelfile - -# Verify the model is available -ollama list -``` - -### Upload Model Files via SCP - -For models not available in registries: - -```bash -# From the host, copy model files into the CVM -scp -P 6190 model-weights.tar.gz root@localhost:~ - -# Inside the CVM, extract and register -tar -xzf model-weights.tar.gz -# For Ollama, copy to model directory -cp -r extracted-model /var/lib/ollama/models/ -``` - -### Verify Model Availability - -Test that the model is accessible through the Cube Agent: - -```bash -# From the host -curl http://localhost:6193/v1/models - -# Or make a chat completion request -curl http://localhost:6193/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llama2:7b", - "messages": [{"role": "user", "content": "Hello"}] - }' -``` - ---- - -## Comparison of Deployment Approaches - -| Feature | Buildroot HAL | Cloud-Init (Ubuntu) | -| --- | --- | --- | -| Base OS | Minimal Buildroot Linux | Ubuntu Noble | -| Image size | Small (~hundreds of MB) | Larger (~GB+) | -| Build time | ~1 hour | Minutes (download-based) | -| Model config | Build-time via menuconfig | Cloud-init user-data | -| Model pull | On first boot (auto) | On first boot (auto) | -| Customization | Requires rebuild | Edit user-data file | -| GPU support | Via Buildroot packages | Via Ubuntu packages | -| Best for | Production, minimal images | Development, rapid iteration | -| TEE support | AMD SEV-SNP, Intel TDX | Intel TDX | -| Init system | SysV or systemd | 
systemd | - ---- - -## Troubleshooting - -### Models Fail to Pull on Boot - -Check network connectivity inside the CVM: - -```bash -# Test DNS resolution -ping -c 1 ollama.com - -# Check Ollama service status -systemctl status ollama -# or -/etc/init.d/S96ollama status -``` - -For Buildroot images, the pull script retries 20 times. Check the logs: - -```bash -journalctl -u ollama -f -``` - -### Ollama Reports Insufficient Disk Space - -The default Buildroot rootfs is limited in size. Increase it during the build: - -```bash -# In menuconfig: Filesystem images → ext4 root filesystem → size -# Or in defconfig: -BR2_TARGET_ROOTFS_EXT2_SIZE="30G" -``` - -For cloud-init VMs, the disk is controlled by `DISK_SIZE` in `qemu.sh` (default: `35G`). - -### vLLM Fails to Load Model - -Verify GPU is available and the model fits in memory: - -```bash -# Check GPU -nvidia-smi - -# Check vLLM config -cat /etc/vllm/vllm.env - -# Restart with adjusted settings -systemctl restart vllm -``` - -### Agent Cannot Reach Backend - -Verify the backend service is running and the agent's target URL matches: - -```bash -# Check agent config -cat /etc/cube/agent.env - -# Test backend directly -curl http://localhost:11434/api/tags # Ollama -curl http://localhost:8000/v1/models # vLLM - -# Restart agent -systemctl restart cube-agent -``` diff --git a/docs/developer-guide/index.md b/docs/developer-guide/index.md index 3fba229..1628285 100644 --- a/docs/developer-guide/index.md +++ b/docs/developer-guide/index.md @@ -18,8 +18,7 @@ as private model upload and fine-tuning. 
- **Chat UI** - **Hardware Abstraction Layer (HAL)** - **CVM Management** -- **Deploying Custom Models** - Deploy custom models via HAL build-time config or cloud-init -- **Private Model Upload** +- **Private Model Upload** - Deploy custom models via HAL build-time config, cloud-init, or runtime upload - **Fine-Tuning Models** - **Guardrails** - AI safety controls for input validation and output sanitization diff --git a/docs/developer-guide/private-model-upload.md b/docs/developer-guide/private-model-upload.md index c20afc8..5cf549d 100644 --- a/docs/developer-guide/private-model-upload.md +++ b/docs/developer-guide/private-model-upload.md @@ -6,24 +6,282 @@ sidebar_position: 3 ## Uploading Private Models to Cube AI -This guide explains how to upload private models into the Ollama runtime inside a confidential VM. +This guide explains how to upload and deploy private or custom models into a Cube AI Confidential VM (CVM). Private models are models that are not available in public registries (Ollama library, HuggingFace) — for example, fine-tuned models, proprietary weights, or models with restricted access. -## 1. Package Model Files +--- + +## Ollama Backend + +### Upload Model Files to a Running CVM + +#### 1. Package Model Files + +Prepare your model weights and any associated files into an archive: + +```bash +tar -czvf my-model.tar.gz /path/to/model/files +``` + +#### 2. Transfer to the CVM + +Copy the archive into the CVM via SCP using the forwarded SSH port: + +```bash +# Buildroot CVM +scp -P 6190 my-model.tar.gz root@localhost:/var/lib/ollama/ + +# Ubuntu cloud-init CVM +scp -P 6190 my-model.tar.gz ultraviolet@localhost:/var/lib/ollama/ +``` + +#### 3. 
Extract and Register the Model + +SSH into the CVM and create an Ollama model from the uploaded files using a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md): + +```bash +ssh -p 6190 root@localhost + +cd /var/lib/ollama +tar -xzvf my-model.tar.gz +``` + +Create a Modelfile that references the uploaded weights: + +```bash +cat > /tmp/Modelfile << 'EOF' +FROM /var/lib/ollama/my-model/weights.gguf +PARAMETER temperature 0.7 +PARAMETER top_p 0.9 +SYSTEM "You are a helpful assistant." +EOF + +ollama create my-custom-model -f /tmp/Modelfile +``` + +#### 4. Verify the Model ```bash -tar -czvf model-name.tar.gz /path/to/model/files +ollama list ``` -## 2. Transfer and Extract Model in CVM +Test inference: ```bash -scp model-name.tar.gz user@:~ -gunzip model-name.tar.gz -tar -xvf model-name.tar +curl http://localhost:7001/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "my-custom-model", + "messages": [{"role": "user", "content": "Hello"}] + }' ``` -## 3. Copy Into Ollama +### Embed a Private Model in a Buildroot HAL Image + +To include a private model directly in the HAL image at build time, use the Buildroot filesystem overlay: + +#### 1. Place Model Files in the Overlay + +Create a directory for the model in the overlay structure: + +```bash +mkdir -p cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ +cp /path/to/weights.gguf cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ +``` + +#### 2. Add a Modelfile to the Overlay + +```bash +mkdir -p cube/hal/buildroot/linux/board/cube/overlay/etc/cube/modelfiles/ +cat > cube/hal/buildroot/linux/board/cube/overlay/etc/cube/modelfiles/my-model.Modelfile << 'EOF' +FROM /var/lib/ollama/custom-models/weights.gguf +PARAMETER temperature 0.7 +SYSTEM "You are a domain-specific assistant." +EOF +``` + +#### 3. 
Register the Model on First Boot + +Add a startup script in the overlay that creates the Ollama model after the service starts: + +```bash +mkdir -p cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/ +cat > cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/register-custom-models.sh << 'SCRIPT' +#!/bin/sh +# Wait for Ollama to be ready +for i in $(seq 1 30); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 +done + +# Register custom models from Modelfiles +for mf in /etc/cube/modelfiles/*.Modelfile; do + [ -f "$mf" ] || continue + name=$(basename "$mf" .Modelfile) + ollama create "$name" -f "$mf" +done +SCRIPT +chmod +x cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/register-custom-models.sh +``` + +#### 4. Build the Image ```bash -docker cp /path/to/extracted ollama:/models/ +cd buildroot +make -j$(nproc) ``` + +The model weights are embedded in the rootfs and registered automatically on first boot. + +### Embed a Private Model via Cloud-Init + +To deploy a private model in an Ubuntu cloud-init CVM, modify the `user-data` section in `hal/ubuntu/qemu.sh`: + +#### 1. Pre-Stage Model Files on the Host + +Place model files in a directory accessible to the QEMU VM. The simplest approach is to transfer them after boot via the `runcmd` section. + +#### 2. Add a Modelfile and Registration to Cloud-Init + +Add the Modelfile and a registration command to the `write_files` and `runcmd` sections: + +```yaml +write_files: + - path: /etc/cube/modelfiles/my-model.Modelfile + content: | + FROM /var/lib/ollama/custom-models/weights.gguf + PARAMETER temperature 0.7 + SYSTEM "You are a domain-specific assistant." + permissions: '0644' + +runcmd: + # ... 
(existing commands) + # After ollama is installed and running, register the custom model + - | + for i in $(seq 1 60); do + if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then + break + fi + sleep 2 + done + ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile +``` + +If the model weights need to be downloaded from a private source during provisioning, add a download step before registration: + +```yaml +runcmd: + # Download private model weights (e.g., from a private S3 bucket or internal server) + - mkdir -p /var/lib/ollama/custom-models + - curl -o /var/lib/ollama/custom-models/weights.gguf https://internal-server/models/weights.gguf + # Then register + - ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile +``` + +--- + +## vLLM Backend + +### Upload Custom Model Files to a Running CVM + +#### 1. Transfer Model Directory + +vLLM expects a HuggingFace-format model directory. Transfer the entire directory: + +```bash +scp -r -P 6190 /path/to/my-hf-model/ root@localhost:/var/lib/vllm/models/ +``` + +#### 2. Update vLLM Configuration + +SSH into the CVM and update the vLLM environment to point to the uploaded model: + +```bash +ssh -p 6190 root@localhost + +# Edit the vLLM config +sed -i 's|^VLLM_MODEL=.*|VLLM_MODEL=/var/lib/vllm/models/my-hf-model|' /etc/vllm/vllm.env + +# Restart vLLM +systemctl restart vllm +# or for SysV init: +/etc/init.d/S96vllm restart +``` + +#### 3. Verify + +```bash +curl http://localhost:8000/v1/models +``` + +### Embed a Custom Model in a Buildroot HAL Image + +Use the `BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH` option to embed model files at build time. + +#### 1. Configure the Model Path + +In `menuconfig`, navigate to **Target packages → Cube packages → vllm** and set: + +- **Custom model path** — Absolute path to the model directory on your build machine + +Or set it in the defconfig: + +```bash +BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/my-hf-model" +``` + +#### 2. 
Build + +```bash +make -j$(nproc) +``` + +The build system copies the model files into `/var/lib/vllm/models/` in the image and configures vLLM to use the local path automatically. + +### Embed a Custom Model via Cloud-Init + +Add a model download or transfer step to the `runcmd` section in `hal/ubuntu/qemu.sh`: + +```yaml +runcmd: + # Install vLLM + - pip install vllm + # Download private model + - mkdir -p /var/lib/vllm/models + - | + # Option A: Download from a private registry (requires HF token for gated models) + HF_TOKEN="your-token-here" + huggingface-cli download my-org/my-private-model \ + --local-dir /var/lib/vllm/models/my-private-model \ + --token "$HF_TOKEN" + # Configure and start vLLM + - | + cat > /etc/vllm/vllm.env << 'ENVEOF' + VLLM_MODEL=/var/lib/vllm/models/my-private-model + VLLM_GPU_MEMORY_UTILIZATION=0.85 + VLLM_MAX_MODEL_LEN=2048 + ENVEOF + - systemctl restart vllm +``` + +--- + +## Verifying Model Availability Through the Proxy + +After deploying a custom model, verify it is accessible end-to-end through the Cube Agent: + +```bash +# List available models +curl http://localhost:6193/v1/models + +# Test chat completions +curl http://localhost:6193/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "my-custom-model", + "messages": [{"role": "user", "content": "Hello"}] + }' +``` + +Port `6193` is the default host-side forwarded port for the Cube Agent (maps to port `7001` inside the CVM). 
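When deployments are automated (for example in CI), the verification calls above can be wrapped in a small gate that waits for the model to actually appear before running further checks. The sketch below is illustrative only: `wait_for_model`, `model_listed`, and `CUBE_AGENT_URL` are hypothetical helpers, not part of the Cube tooling, and the default URL assumes the `6193` host-side port mapping described above.

```shell
#!/bin/sh
# Hypothetical helper: gate later deployment steps on a model actually
# appearing in the Cube Agent's /v1/models listing.

CUBE_AGENT_URL="${CUBE_AGENT_URL:-http://localhost:6193}"

# model_listed NAME JSON: succeed if NAME appears as a model id in the
# /v1/models response body passed as JSON.
model_listed() {
  printf '%s' "$2" | grep -q "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# wait_for_model NAME: poll the agent every 2 seconds, up to ~2 minutes.
wait_for_model() {
  i=0
  while [ "$i" -lt 60 ]; do
    body=$(curl -s "$CUBE_AGENT_URL/v1/models") || body=""
    if model_listed "$1" "$body"; then
      echo "model $1 is available"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "timed out waiting for model $1" >&2
  return 1
}
```

Calling `wait_for_model my-custom-model` before issuing the first chat completion avoids racing the model pull or registration that happens on first boot.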
diff --git a/sidebars.ts b/sidebars.ts
index f6f1dba..1c9eeb2 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -59,6 +59,7 @@ const sidebars: SidebarsConfig = {
      'developer-guide/private-model-upload',
      'developer-guide/hal',
      'developer-guide/cvm-management',
+     'developer-guide/custom-model-deployment',
      'developer-guide/fine-tuning',
      'developer-guide/auth-and-request-flow',
    ],

From cfe4dc85727b11c70b42291086f65f9a84e26b1e Mon Sep 17 00:00:00 2001
From: WashingtonKK
Date: Thu, 26 Feb 2026 13:50:36 +0300
Subject: [PATCH 4/4] update docs

Signed-off-by: WashingtonKK
---
 docs/developer-guide/private-model-upload.md | 264 ++++++++++---------
 1 file changed, 133 insertions(+), 131 deletions(-)

diff --git a/docs/developer-guide/private-model-upload.md b/docs/developer-guide/private-model-upload.md
index 5cf549d..0315231 100644
--- a/docs/developer-guide/private-model-upload.md
+++ b/docs/developer-guide/private-model-upload.md
@@ -8,87 +8,70 @@ sidebar_position: 3

This guide explains how to upload and deploy private or custom models into a Cube AI Confidential VM (CVM). Private models are models that are not available in public registries (Ollama library, HuggingFace) — for example, fine-tuned models, proprietary weights, or models with restricted access.

----
-
-## Ollama Backend
+### Port Reference

-### Upload Model Files to a Running CVM

-#### 1. 
Package Model Files +| Host Port | Guest Port | Service | +| --- | --- | --- | +| 6190 | 22 | SSH | +| 6193 | 7001 | Cube Agent API | -Prepare your model weights and any associated files into an archive: +Inside the CVM, the LLM backends listen on their own ports (not directly exposed to the host): -```bash -tar -czvf my-model.tar.gz /path/to/model/files -``` +| Port | Service | +| --- | --- | +| 11434 | Ollama API | +| 8000 | vLLM OpenAI-compatible API | -#### 2. Transfer to the CVM +The Cube Agent (port 7001 inside the CVM, 6193 on the host) acts as a reverse proxy to whichever LLM backend is configured, so all model inference requests go through the agent. -Copy the archive into the CVM via SCP using the forwarded SSH port: +--- -```bash -# Buildroot CVM -scp -P 6190 my-model.tar.gz root@localhost:/var/lib/ollama/ +## Build-Time Model Embedding (Buildroot HAL) -# Ubuntu cloud-init CVM -scp -P 6190 my-model.tar.gz ultraviolet@localhost:/var/lib/ollama/ -``` +The Buildroot HAL supports embedding custom model configuration directly into the CVM image via `menuconfig`. This is the recommended approach for production deployments where models should be available immediately after boot. -#### 3. Extract and Register the Model +### Ollama -SSH into the CVM and create an Ollama model from the uploaded files using a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md): +#### Using menuconfig -```bash -ssh -p 6190 root@localhost +During HAL image configuration (see [HAL guide](/developer-guide/hal)), navigate to: -cd /var/lib/ollama -tar -xzvf my-model.tar.gz -``` +**Target packages → Cube packages → ollama** -Create a Modelfile that references the uploaded weights: +Set the **Custom models to install** field with a space-separated list of Ollama model tags: -```bash -cat > /tmp/Modelfile << 'EOF' -FROM /var/lib/ollama/my-model/weights.gguf -PARAMETER temperature 0.7 -PARAMETER top_p 0.9 -SYSTEM "You are a helpful assistant." 
-EOF - -ollama create my-custom-model -f /tmp/Modelfile +```text +llama2:7b mistral:7b codellama:13b ``` -#### 4. Verify the Model +These models are pulled automatically on first boot by a script installed at `/usr/libexec/ollama/pull-models.sh`. + +Or set it directly in the Buildroot defconfig: ```bash -ollama list +BR2_PACKAGE_OLLAMA_CUSTOM_MODELS="llama2:7b mistral:7b codellama:13b" ``` -Test inference: +Then rebuild: ```bash -curl http://localhost:7001/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "my-custom-model", - "messages": [{"role": "user", "content": "Hello"}] - }' +make -j$(nproc) ``` -### Embed a Private Model in a Buildroot HAL Image - -To include a private model directly in the HAL image at build time, use the Buildroot filesystem overlay: +#### Embedding GGUF Weights in the Image -#### 1. Place Model Files in the Overlay +For models not available in the Ollama registry (e.g., your own fine-tuned GGUF weights), use the Buildroot filesystem overlay to embed the files directly: -Create a directory for the model in the overlay structure: +1. Place the model weights in the overlay: ```bash mkdir -p cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ cp /path/to/weights.gguf cube/hal/buildroot/linux/board/cube/overlay/var/lib/ollama/custom-models/ ``` -#### 2. Add a Modelfile to the Overlay +2. Add a [Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md) to the overlay: ```bash mkdir -p cube/hal/buildroot/linux/board/cube/overlay/etc/cube/modelfiles/ @@ -99,9 +82,7 @@ SYSTEM "You are a domain-specific assistant." EOF ``` -#### 3. Register the Model on First Boot - -Add a startup script in the overlay that creates the Ollama model after the service starts: +3. 
Add a startup script in the overlay to register the model after Ollama starts: ```bash mkdir -p cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/ @@ -125,26 +106,52 @@ SCRIPT chmod +x cube/hal/buildroot/linux/board/cube/overlay/usr/libexec/ollama/register-custom-models.sh ``` -#### 4. Build the Image +4. Build the image: ```bash cd buildroot make -j$(nproc) ``` -The model weights are embedded in the rootfs and registered automatically on first boot. +### vLLM -### Embed a Private Model via Cloud-Init +#### Using menuconfig -To deploy a private model in an Ubuntu cloud-init CVM, modify the `user-data` section in `hal/ubuntu/qemu.sh`: +Navigate to **Target packages → Cube packages → vllm** and set: -#### 1. Pre-Stage Model Files on the Host +- **Custom model path** — Absolute path to a HuggingFace-format model directory on your build machine -Place model files in a directory accessible to the QEMU VM. The simplest approach is to transfer them after boot via the `runcmd` section. +Or in the defconfig: -#### 2. Add a Modelfile and Registration to Cloud-Init +```bash +BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/my-hf-model" +``` + +The build system copies the model files into `/var/lib/vllm/models/` in the image and configures vLLM to serve from that local path automatically. The vLLM service configuration is written to `/etc/vllm/vllm.env`. + +You can also configure inference parameters at build time: + +```bash +BR2_PACKAGE_VLLM_MODEL="meta-llama/Llama-2-7b-hf" +BR2_PACKAGE_VLLM_GPU_MEMORY="0.90" +BR2_PACKAGE_VLLM_MAX_MODEL_LEN="2048" +``` + +Then rebuild: + +```bash +make -j$(nproc) +``` + +--- -Add the Modelfile and a registration command to the `write_files` and `runcmd` sections: +## Cloud-Init Model Provisioning (Ubuntu) + +For Ubuntu-based CVMs using cloud-init, custom models are configured in the `user-data` section of `hal/ubuntu/qemu.sh`. Models are provisioned during the first boot. 
+ +### Ollama + +Add a Modelfile and registration commands to the `write_files` and `runcmd` sections of the cloud-init `user-data`: ```yaml write_files: @@ -156,8 +163,11 @@ write_files: permissions: '0644' runcmd: - # ... (existing commands) - # After ollama is installed and running, register the custom model + # ... (existing commands that install ollama and start it) + # Download private model weights from an internal server + - mkdir -p /var/lib/ollama/custom-models + - curl -o /var/lib/ollama/custom-models/weights.gguf https://internal-server/models/weights.gguf + # Wait for Ollama and register the custom model - | for i in $(seq 1 60); do if curl -s http://localhost:11434/api/version > /dev/null 2>&1; then @@ -168,114 +178,108 @@ runcmd: ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile ``` -If the model weights need to be downloaded from a private source during provisioning, add a download step before registration: +### vLLM + +Add a model download and vLLM configuration step to `runcmd`: ```yaml runcmd: - # Download private model weights (e.g., from a private S3 bucket or internal server) - - mkdir -p /var/lib/ollama/custom-models - - curl -o /var/lib/ollama/custom-models/weights.gguf https://internal-server/models/weights.gguf - # Then register - - ollama create my-model -f /etc/cube/modelfiles/my-model.Modelfile + - pip install vllm + - mkdir -p /var/lib/vllm/models + # Download from a private HuggingFace registry (requires token for gated models) + - | + HF_TOKEN="your-token-here" + huggingface-cli download my-org/my-private-model \ + --local-dir /var/lib/vllm/models/my-private-model \ + --token "$HF_TOKEN" + # Configure vLLM to use the downloaded model + - | + cat > /etc/vllm/vllm.env << 'ENVEOF' + VLLM_MODEL=/var/lib/vllm/models/my-private-model + VLLM_GPU_MEMORY_UTILIZATION=0.85 + VLLM_MAX_MODEL_LEN=2048 + ENVEOF + - systemctl restart vllm ``` --- -## vLLM Backend +## Runtime Model Upload -### Upload Custom Model Files to a Running 
CVM +After a CVM is running (regardless of which approach was used to create it), you can deploy additional models over SSH. -#### 1. Transfer Model Directory +### Ollama -vLLM expects a HuggingFace-format model directory. Transfer the entire directory: +#### 1. Transfer and Register ```bash -scp -r -P 6190 /path/to/my-hf-model/ root@localhost:/var/lib/vllm/models/ -``` - -#### 2. Update vLLM Configuration +# Package model files on the host +tar -czvf my-model.tar.gz /path/to/model/files -SSH into the CVM and update the vLLM environment to point to the uploaded model: +# Copy into the CVM (port 6190 forwards to SSH port 22 inside the CVM) +scp -P 6190 my-model.tar.gz root@localhost:/var/lib/ollama/ -```bash +# SSH into the CVM and register the model ssh -p 6190 root@localhost +cd /var/lib/ollama && tar -xzvf my-model.tar.gz -# Edit the vLLM config -sed -i 's|^VLLM_MODEL=.*|VLLM_MODEL=/var/lib/vllm/models/my-hf-model|' /etc/vllm/vllm.env +cat > /tmp/Modelfile << 'EOF' +FROM /var/lib/ollama/my-model/weights.gguf +PARAMETER temperature 0.7 +PARAMETER top_p 0.9 +SYSTEM "You are a helpful assistant." +EOF -# Restart vLLM -systemctl restart vllm -# or for SysV init: -/etc/init.d/S96vllm restart +ollama create my-custom-model -f /tmp/Modelfile ``` -#### 3. Verify +:::note +For Ubuntu cloud-init CVMs, the default SSH user is `ultraviolet` (password: `password`). For Buildroot CVMs, the default user is `root`. +::: + +#### 2. Verify ```bash -curl http://localhost:8000/v1/models +ollama list ``` -### Embed a Custom Model in a Buildroot HAL Image - -Use the `BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH` option to embed model files at build time. +### vLLM -#### 1. Configure the Model Path +#### 1. 
Transfer and Configure -In `menuconfig`, navigate to **Target packages → Cube packages → vllm** and set: - -- **Custom model path** — Absolute path to the model directory on your build machine - -Or set it in the defconfig: +vLLM expects a HuggingFace-format model directory: ```bash -BR2_PACKAGE_VLLM_CUSTOM_MODEL_PATH="/path/to/my-hf-model" -``` +# Copy the model directory into the CVM +scp -r -P 6190 /path/to/my-hf-model/ root@localhost:/var/lib/vllm/models/ -#### 2. Build +# SSH in and update the vLLM config to point to the new model +ssh -p 6190 root@localhost +sed -i 's|^VLLM_MODEL=.*|VLLM_MODEL=/var/lib/vllm/models/my-hf-model|' /etc/vllm/vllm.env -```bash -make -j$(nproc) +# Restart vLLM to load the new model +systemctl restart vllm +# or for SysV init: +/etc/init.d/S96vllm restart ``` -The build system copies the model files into `/var/lib/vllm/models/` in the image and configures vLLM to use the local path automatically. - -### Embed a Custom Model via Cloud-Init +#### 2. Verify -Add a model download or transfer step to the `runcmd` section in `hal/ubuntu/qemu.sh`: - -```yaml -runcmd: - # Install vLLM - - pip install vllm - # Download private model - - mkdir -p /var/lib/vllm/models - - | - # Option A: Download from a private registry (requires HF token for gated models) - HF_TOKEN="your-token-here" - huggingface-cli download my-org/my-private-model \ - --local-dir /var/lib/vllm/models/my-private-model \ - --token "$HF_TOKEN" - # Configure and start vLLM - - | - cat > /etc/vllm/vllm.env << 'ENVEOF' - VLLM_MODEL=/var/lib/vllm/models/my-private-model - VLLM_GPU_MEMORY_UTILIZATION=0.85 - VLLM_MAX_MODEL_LEN=2048 - ENVEOF - - systemctl restart vllm +```bash +curl http://localhost:8000/v1/models ``` --- -## Verifying Model Availability Through the Proxy +## Verifying Model Availability -After deploying a custom model, verify it is accessible end-to-end through the Cube Agent: +After deploying a custom model, verify it is accessible end-to-end through the Cube 
Agent. From the host: ```bash -# List available models +# List available models (port 6193 forwards to the Cube Agent on port 7001 inside the CVM) curl http://localhost:6193/v1/models -# Test chat completions +# Test a chat completion request curl http://localhost:6193/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -283,5 +287,3 @@ curl http://localhost:6193/v1/chat/completions \ "messages": [{"role": "user", "content": "Hello"}] }' ``` - -Port `6193` is the default host-side forwarded port for the Cube Agent (maps to port `7001` inside the CVM).
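
A note on scripting the chat-completion check above: the assistant reply can be pulled out of the response without `jq`. The `extract_reply` helper below is a hypothetical sed-based sketch; it assumes the standard OpenAI-compatible response shape and a reply containing no escaped quotes or newlines.

```shell
#!/bin/sh
# Hypothetical extractor for the assistant reply in an OpenAI-compatible
# /v1/chat/completions response. Assumes the reply text contains no
# escaped quotes or newlines; prefer jq when it is available.
extract_reply() {
  printf '%s' "$1" | sed -n 's/.*"content"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

Usage: `extract_reply "$(curl -s http://localhost:6193/v1/chat/completions -H 'Content-Type: application/json' -d "$payload")"`.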