
Private AI Deployment Guide | By Petronella Technology Group

A comprehensive guide to deploying private, on-premise AI and Large Language Models (LLMs) for organizations that need AI capabilities without sending sensitive data to cloud providers. Includes hardware selection, model comparison, security hardening, and governance frameworks.

License: MIT




Why Private AI?

The AI revolution is transforming every industry, but for organizations handling sensitive data -- healthcare providers, defense contractors, financial institutions, legal firms -- sending confidential information to cloud AI providers creates unacceptable risks.

Private AI deployment keeps your data within your control while delivering the full power of modern language models, image generation, and AI-assisted workflows.

Key Benefits

  • Data sovereignty: Your data never leaves your infrastructure. No third-party access, no cloud provider data retention policies to worry about
  • Regulatory compliance: Meet HIPAA, CMMC, SOC 2, ITAR, and other frameworks that restrict data handling to authorized systems
  • Zero data leakage: Eliminate the risk of sensitive information being used to train external models or appearing in other users' outputs
  • Predictable costs: No per-token pricing surprises. One-time hardware investment plus electricity delivers unlimited inference
  • Customization: Fine-tune models on your proprietary data for domain-specific accuracy
  • Availability: No dependency on external API availability, rate limits, or service outages
  • Air-gap capable: Deploy in fully air-gapped environments for classified or highly sensitive workloads

The Risk of Cloud AI

Organizations using cloud AI services face these documented risks:

| Risk | Impact | Example |
| --- | --- | --- |
| Data retention | Provider stores your prompts and outputs | Samsung engineers leaked proprietary code via ChatGPT (2023) |
| Model training | Your data may train the provider's models | OpenAI's default data usage policy includes training |
| Breach exposure | Provider breach exposes your conversations | ChatGPT conversation history leak (March 2023) |
| Compliance violations | Sending CUI/PHI to unauthorized systems | HIPAA violation for PHI sent to cloud AI without a BAA |
| Vendor lock-in | Dependency on a single provider's pricing and availability | OpenAI API pricing changes, Anthropic rate limits |
| Shadow AI | Employees use unauthorized cloud AI tools | 60% of employees admit using AI tools without IT approval |

Use Cases for On-Premise AI

Healthcare

  • Clinical documentation: AI-assisted note generation from patient encounters
  • Medical coding: Automated ICD-10/CPT code suggestion from clinical notes
  • Research analysis: Processing research data with PHI without cloud exposure
  • Patient communication: AI-generated summaries and explanations in plain language

Defense and Government

  • Document analysis: Processing classified or CUI documents with AI assistance
  • Intelligence analysis: Pattern recognition across large document sets
  • Code generation: AI-assisted development for controlled environments
  • Training materials: Generating scenario-based training content

Legal

  • Contract analysis: Reviewing and summarizing legal documents
  • Discovery support: Processing large document sets for relevant information
  • Brief generation: AI-assisted legal writing with client confidentiality
  • Compliance review: Automated policy and regulation analysis

Financial Services

  • Risk analysis: Processing proprietary financial models and data
  • Report generation: Automated financial reporting from internal data
  • Customer communication: Personalized communication drafting
  • Fraud detection: Pattern analysis on transaction data

Architecture Overview

A production private AI deployment typically follows this architecture:

```
                          +------------------+
                          |   Load Balancer  |
                          +--------+---------+
                                   |
                    +--------------+--------------+
                    |                             |
            +-------+--------+            +-------+--------+
            | Inference      |            | Inference      |
            | Server 1       |            | Server 2       |
            | (GPU Node)     |            | (GPU Node)     |
            +-------+--------+            +-------+--------+
                    |                             |
            +-------+--------+            +-------+--------+
            | Model Storage  |            | Vector DB      |
            | (NFS/S3)       |            | (RAG Pipeline) |
            +----------------+            +----------------+
                    |                             |
            +-------+-----------------------------+-------+
            |              Internal Network               |
            |         (Isolated VLAN / Air-gapped)        |
            +---------------------------------------------+
```

Core Components

  1. Inference server(s): GPU-equipped machines running the LLM inference engine
  2. Model storage: Shared storage for model weights (10-100+ GB per model)
  3. API gateway: Manages authentication, rate limiting, and request routing
  4. Vector database: Enables Retrieval-Augmented Generation (RAG) for domain-specific knowledge
  5. Monitoring: GPU utilization, latency, throughput, and error tracking
  6. Network isolation: Dedicated VLAN or air-gapped network for the AI infrastructure
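Component 4 can be illustrated with a stripped-down retrieval step. This is a minimal sketch, not any particular vector database's API: toy three-dimensional vectors stand in for real embeddings, and the `retrieve` helper is hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, k=2):
    # Rank document chunks by similarity to the query embedding
    # and return the top-k as context to prepend to the prompt.
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

# Toy 3-dimensional "embeddings" stand in for real model output
# (e.g. from an embedding model such as Nomic Embed Text).
corpus = [
    {"text": "HIPAA policy excerpt", "vec": [0.9, 0.1, 0.0]},
    {"text": "Quarterly sales recap", "vec": [0.0, 0.2, 0.9]},
    {"text": "PHI handling procedure", "vec": [0.8, 0.3, 0.1]},
]
context = retrieve([1.0, 0.2, 0.0], corpus, k=2)
print(context)
```

A production RAG pipeline adds chunking, an embedding model, and a persistent index, but the ranking step is this simple at its core.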

Hardware Requirements

Hardware selection is the most critical decision in private AI deployment. The right hardware depends on your model size, concurrent user count, and latency requirements.

See guides/hardware-selection.md for the complete hardware selection guide.

Quick Reference: GPU Recommendations

| Use Case | Model Size | Recommended GPU | VRAM | Est. Cost |
| --- | --- | --- | --- | --- |
| Small team (5-10 users) | 7-8B params | NVIDIA RTX 4090 | 24 GB | $1,600 |
| Department (10-30 users) | 13-14B params | NVIDIA RTX 5090 | 32 GB | $2,000 |
| Department (30-50 users) | 32-70B params | NVIDIA A6000 | 48 GB | $4,500 |
| Enterprise (50-200 users) | 70B+ params | NVIDIA A100 | 80 GB | $15,000 |
| Enterprise (200+ users) | 70B+ params | Multi-GPU (2-8x A100/H100) | 160-640 GB | $30,000-$250,000 |
| Budget option | 7-8B params | AMD RX 7900 XTX | 24 GB | $900 |

Minimum System Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 8 cores, AVX2 support | 16+ cores, AVX-512 |
| RAM | 32 GB | 64-128 GB |
| GPU VRAM | 16 GB | 24-48 GB |
| Storage | 500 GB NVMe SSD | 2+ TB NVMe SSD |
| Network | 1 Gbps | 10 Gbps (multi-GPU) |
| Power | 700W PSU | 1000W+ PSU |
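A rough rule of thumb behind these numbers: VRAM needed is parameters times bytes per weight, plus overhead for KV cache and activations. The sketch below encodes that heuristic; the 1.2 overhead factor is an assumption, and real usage varies with context length, batch size, and framework.

```python
def estimate_vram_gb(params_billion, bits_per_weight=16, overhead=1.2):
    """Rough VRAM estimate: weight size plus ~20% for KV cache and
    activations. A planning heuristic only, not a guarantee."""
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

# An 8B model in FP16 needs roughly 19 GB; quantized to ~5 bits
# (Q5_K_M territory) it fits a much smaller budget.
print(round(estimate_vram_gb(8, 16), 1))
print(round(estimate_vram_gb(8, 5), 1))
```

This is why an 8B model that will not fit a 16 GB card at FP16 runs comfortably once quantized, as the model table below reflects.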

Inference Frameworks

Three major open-source frameworks dominate private AI inference. See guides/model-comparison.md for detailed comparison.

Ollama

Best for: Getting started quickly, small teams, development environments

Ollama provides the simplest deployment experience with a Docker-like pull-and-run model.

```shell
# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b "Summarize this contract clause..."

# Serve API
ollama serve  # Runs on localhost:11434
```

Pros: Easy setup, model management, automatic GPU detection, cross-platform
Cons: Single-GPU only, limited batching, higher latency at scale

vLLM

Best for: Production deployments, high-throughput serving, multi-GPU

vLLM uses PagedAttention for efficient memory management and supports continuous batching for maximum throughput.

```shell
# Install
pip install vllm

# Serve a model
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 2  # Multi-GPU
```

Pros: Highest throughput, multi-GPU support, OpenAI-compatible API, production-ready
Cons: More complex setup, NVIDIA-focused (ROCm support improving)
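Because vLLM serves an OpenAI-compatible API, clients send standard chat-completions requests. A minimal sketch of the request body for the server started above (no running server is needed to build it; the commented lines show how it would be sent with only the standard library):

```python
import json

# Build an OpenAI-style chat completion request for a vLLM server.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize this contract clause..."},
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}
body = json.dumps(payload).encode()

# To send it against a running server:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(json.loads(body)["model"])
```

The same request shape works against Ollama and llama.cpp's server as well, which makes switching frameworks a configuration change rather than a rewrite.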

llama.cpp

Best for: CPU inference, edge deployment, maximum hardware flexibility

llama.cpp is a highly optimized C++ implementation supporting CPU, CUDA, ROCm, Metal, and Vulkan backends.

```shell
# Build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build

# Serve a model
./build/bin/llama-server \
  -m models/llama-3.1-8b-instruct.gguf \
  --host 0.0.0.0 --port 8080
```

Pros: Runs everywhere (CPU, any GPU), smallest memory footprint (quantization), fastest single-user latency
Cons: Lower throughput for concurrent users, manual model conversion (GGUF format)

Framework Comparison

| Feature | Ollama | vLLM | llama.cpp |
| --- | --- | --- | --- |
| Setup complexity | Very easy | Moderate | Moderate |
| Multi-GPU | No | Yes | Limited |
| Throughput (concurrent) | Low-Med | High | Low-Med |
| Single-user latency | Good | Good | Best |
| Memory efficiency | Good | Best | Good (quantization) |
| CPU inference | Limited | No | Excellent |
| AMD GPU support | Yes (ROCm) | Improving | Yes (ROCm/Vulkan) |
| Apple Silicon | Yes (Metal) | No | Yes (Metal) |
| OpenAI-compatible API | Yes | Yes | Yes |
| Model management | Built-in | External | External |

Model Selection Guide

Recommended Models by Use Case (2025-2026)

| Use Case | Model | Parameters | VRAM Required | Quality |
| --- | --- | --- | --- | --- |
| General assistant | Llama 3.1 8B Instruct | 8B | 6-8 GB | Good |
| General assistant (better) | Llama 3.1 70B Instruct | 70B | 40-48 GB | Excellent |
| Code generation | DeepSeek Coder V2 | 16B/236B | 12-48 GB | Excellent |
| Code generation | Qwen 2.5 Coder 32B | 32B | 20-24 GB | Excellent |
| Document analysis | Mixtral 8x22B | 141B (sparse) | 48-80 GB | Excellent |
| Medical/Clinical | Meditron (Llama fine-tune) | Varies | Varies | Domain-specific |
| Summarization | Llama 3.1 8B + fine-tune | 8B | 6-8 GB | Good |
| Embedding/RAG | Nomic Embed Text | 137M | 1 GB | Excellent |
| Reasoning | QwQ 32B | 32B | 20-24 GB | Excellent |

Quantization Options

Quantization reduces model size and VRAM requirements at a small quality cost:

| Quantization | Size Reduction | Quality Impact | VRAM Savings |
| --- | --- | --- | --- |
| FP16 (default) | 0% | None | Baseline |
| Q8 | ~50% | Negligible | ~50% |
| Q6_K | ~60% | Very minor | ~60% |
| Q5_K_M | ~65% | Minor | ~65% |
| Q4_K_M | ~75% | Noticeable for complex tasks | ~75% |
| Q3_K_M | ~80% | Moderate | ~80% |

Recommendation: Start with Q5_K_M or Q6_K for the best balance of quality and efficiency.
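One way to apply this guidance is to pick the highest-precision level that fits your VRAM budget. The sketch below walks down the table using approximate bits-per-weight figures for common GGUF levels (illustrative values; actual file sizes vary by model architecture, and the 1.2 overhead factor is an assumption):

```python
# Approximate bits-per-weight for common GGUF quantization levels.
QUANT_BITS = {"FP16": 16, "Q8": 8.5, "Q6_K": 6.6,
              "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def pick_quant(params_billion, vram_gb, overhead=1.2):
    # Walk from highest to lowest precision; return the first
    # level whose estimated footprint fits the budget.
    for name, bits in sorted(QUANT_BITS.items(), key=lambda kv: -kv[1]):
        needed = params_billion * bits / 8 * overhead
        if needed <= vram_gb:
            return name, round(needed, 1)
    return None, None  # model does not fit at any level

print(pick_quant(70, 48))   # 70B model on a 48 GB A6000
print(pick_quant(8, 24))    # 8B model on a 24 GB RTX 4090
```

For the 70B/48 GB case this lands on an aggressive quant, which is why the GPU table above pairs larger models with 80 GB cards when quality matters.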


Deployment Options

Option 1: Single-Server Deployment

Best for small teams (5-20 users) with straightforward needs.

```
+-------------------+
| Single Server     |
| - Ollama or vLLM  |
| - 1x GPU          |
| - Model storage   |
| - API endpoint    |
+-------------------+
```

Option 2: Docker-Based Deployment

Best for teams familiar with containers who want reproducible deployments.

```yaml
# docker-compose.yml example
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  webui_data:
```

Option 3: Kubernetes Deployment

Best for enterprise scale with multiple GPU nodes and high availability requirements.

Option 4: Air-Gapped Deployment

For classified or highly sensitive environments with no internet connectivity:

  1. Download models on an internet-connected system
  2. Transfer model files via approved media (encrypted USB, optical disc)
  3. Deploy inference framework from pre-built packages
  4. Validate model integrity via checksums
  5. Monitor via internal-only dashboards
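Step 4 can be scripted with nothing but the standard library: record a SHA-256 checksum on the connected side before transfer, then recompute and compare it on the air-gapped side. A minimal sketch (`sha256_file` and `verify` are illustrative helper names):

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    # Stream the file so multi-GB model weights never need to fit in RAM.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    # Compare against the checksum recorded on the connected side
    # before the media transfer.
    return sha256_file(path) == expected_hex.lower()
```

Carry the expected checksums on the same approved media as the weights (or, better, through a separate channel), and refuse to load any model file that fails verification.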

Security and Compliance

See guides/security-hardening.md for the complete security hardening guide.

Security Architecture Principles

  1. Network isolation: Deploy AI infrastructure on a dedicated VLAN with strict firewall rules
  2. Authentication: Require authentication for all API access (API keys, OAuth 2.0, or mTLS)
  3. Authorization: Implement RBAC to control which users can access which models and data
  4. Encryption: TLS for all API communication, encryption at rest for model storage and logs
  5. Audit logging: Log all prompts and responses (with PII redaction where required)
  6. Input validation: Sanitize and validate all inputs to prevent prompt injection attacks
  7. Output filtering: Implement guardrails to prevent the model from generating harmful content
  8. Model integrity: Verify model checksums and maintain a software bill of materials (SBOM)
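The PII redaction in principle 5 can be prototyped with a few regular expressions. The patterns below are illustrative only; a production deployment should use a dedicated PII/PHI detection tool rather than this sketch:

```python
import re

# Minimal redaction pass for audit logs. Patterns and the
# replacement token are illustrative, not exhaustive.
PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email addresses
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # US SSN format
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),        # card-number-like runs
]

def redact(text, token="[REDACTED]"):
    for pat in PATTERNS:
        text = pat.sub(token, text)
    return text

print(redact("Contact jdoe@example.com, SSN 123-45-6789."))
```

Run the pass on both prompts and responses before they reach the audit log, so operators can review usage without re-exposing the sensitive data the deployment exists to protect.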

Compliance Mapping

| Framework | Key AI-Related Requirements | How Private AI Helps |
| --- | --- | --- |
| HIPAA | PHI must be on authorized systems; BAA required for cloud services | PHI never leaves your infrastructure |
| CMMC | CUI must be on assessed systems within the boundary | AI processing stays within the CMMC boundary |
| SOC 2 | Data processing controls, vendor management | No third-party data processor for AI |
| ITAR | Technical data cannot leave US/authorized persons | Air-gapped deployment ensures compliance |
| GDPR | Data processing agreements, data residency | Full control over data location and retention |
| PCI DSS | Cardholder data protection | Payment data processed locally |

AI Governance

Every organization deploying AI should establish a governance framework. See templates/ai-governance-policy.md for a complete policy template covering:

  • Acceptable use policies for AI tools
  • Data classification and handling rules
  • Model evaluation and approval processes
  • Bias monitoring and fairness requirements
  • Incident response for AI-specific failures
  • Training requirements for AI users

Performance Optimization

Key Metrics to Monitor

| Metric | Target | Tool |
| --- | --- | --- |
| Tokens/second (generation) | 30-80 tok/s per user | Framework metrics |
| Time to first token (TTFT) | < 500 ms | API latency monitoring |
| GPU utilization | 70-90% during peak | nvidia-smi, Prometheus |
| VRAM usage | < 90% of capacity | nvidia-smi |
| Request queue depth | < 10 requests | Framework metrics |
| Error rate | < 0.1% | API monitoring |
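The targets above can be wired into a simple threshold check on whatever your collector exposes. A sketch (field names and thresholds mirror the table; the collection side, e.g. Prometheus or nvidia-smi, is out of scope here):

```python
# Map each metric to a predicate encoding its target range.
TARGETS = {
    "ttft_ms":        lambda v: v < 500,
    "tokens_per_sec": lambda v: v >= 30,
    "gpu_util_pct":   lambda v: 70 <= v <= 90,
    "vram_used_pct":  lambda v: v < 90,
    "queue_depth":    lambda v: v < 10,
    "error_rate_pct": lambda v: v < 0.1,
}

def check(metrics):
    # Return the names of sampled metrics outside their target range.
    return [name for name, ok in TARGETS.items()
            if name in metrics and not ok(metrics[name])]

sample = {"ttft_ms": 740, "tokens_per_sec": 42, "vram_used_pct": 95}
print(check(sample))
```

Feeding the returned names into your existing alerting keeps AI infrastructure monitoring consistent with the rest of your operations.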

Optimization Techniques

  1. KV-cache optimization: Use PagedAttention (vLLM) to maximize concurrent requests
  2. Quantization: Use Q5_K_M or Q6_K for 60-65% memory savings with minimal quality loss
  3. Speculative decoding: Use a small draft model to accelerate generation from larger models
  4. Flash Attention: Enable FlashAttention 2 for significant memory and speed improvements
  5. Continuous batching: vLLM and TGI support batching multiple requests for higher throughput
  6. Model sharding: Distribute large models across multiple GPUs with tensor parallelism
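The payoff of technique 3 can be estimated with the standard speculative-decoding yield formula: if each of k drafted tokens is accepted independently with probability a, the expected tokens emitted per target-model pass is (1 - a^(k+1)) / (1 - a). A sketch of that idealized calculation (it ignores the draft model's own cost, so treat it as an upper bound):

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Idealized speculative-decoding yield: expected tokens per
    target-model forward pass, assuming each drafted token is
    accepted independently with probability accept_rate (< 1).
    Ignores the draft model's own cost."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# With an 80% acceptance rate and 4 drafted tokens, each expensive
# target pass yields ~3.4 tokens instead of 1.
print(round(expected_tokens_per_pass(0.8, 4), 2))
```

The acceptance rate depends on how well the draft model mimics the target, which is why draft models are usually small members of the same family.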

Cost Analysis

On-Premise vs Cloud Cost Comparison

Scenario: 50 users, ~100,000 tokens/user/day, Llama 3.1 70B equivalent quality

| Cost Factor | Cloud AI (GPT-4o) | Private AI (On-Prem) |
| --- | --- | --- |
| Monthly API/inference cost | $15,000-$30,000 | $200 (electricity) |
| Hardware (amortized/mo over 3 yr) | $0 | $1,400 |
| Staff time | Minimal | 10 hrs/mo ($2,000) |
| Total monthly cost | $15,000-$30,000 | $3,600 |
| Annual cost | $180,000-$360,000 | $43,200 |
| 3-year total | $540,000-$1,080,000 | $129,600 + $50,000 hardware |

Break-even: Private AI typically breaks even within 3-6 months for organizations with moderate usage.
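The break-even claim follows directly from the table: divide the hardware outlay by the monthly savings. A minimal calculator (figures are the planning estimates from the scenario above, not quotes):

```python
def breakeven_months(cloud_monthly, private_monthly, hardware_capex):
    """Months until cumulative cloud spend exceeds the hardware
    outlay plus cumulative private operating cost."""
    monthly_savings = cloud_monthly - private_monthly
    if monthly_savings <= 0:
        return float("inf")  # private deployment never pays back
    return hardware_capex / monthly_savings

# Low and high ends of the cloud estimate against $50,000 of hardware
print(round(breakeven_months(15_000, 3_600, 50_000), 1))
print(round(breakeven_months(30_000, 3_600, 50_000), 1))
```

Plugging in your own usage and hardware numbers is the fastest way to sanity-check whether private AI makes financial sense for your organization.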


Implementation Roadmap

Phase 1: Proof of Concept (Weeks 1-4)

  • Define use cases and success criteria
  • Assess existing infrastructure
  • Deploy single-server Ollama with a 7-8B model
  • Pilot with 5-10 users from a single team
  • Measure quality, latency, and user satisfaction

Phase 2: Production Pilot (Weeks 5-12)

  • Select production hardware based on POC findings
  • Deploy production inference framework (vLLM recommended)
  • Implement security controls (authentication, encryption, logging)
  • Deploy web UI for non-technical users (Open WebUI)
  • Expand to 20-50 users across multiple teams
  • Implement RAG pipeline for domain-specific knowledge

Phase 3: Enterprise Rollout (Weeks 13-24)

  • Scale infrastructure for full user base
  • Implement AI governance policy
  • Deploy monitoring and alerting
  • Integrate with existing workflows (API connections)
  • Train users and administrators
  • Establish ongoing operations procedures

Phase 4: Optimization (Ongoing)

  • Fine-tune models on organization-specific data
  • Evaluate new model releases for quality improvements
  • Optimize hardware utilization and costs
  • Expand use cases based on user feedback
  • Conduct regular security assessments

Templates and Resources

| Resource | Purpose | Location |
| --- | --- | --- |
| Hardware Selection Guide | Detailed GPU and system recommendations | guides/hardware-selection.md |
| Model Comparison | Ollama vs vLLM vs llama.cpp deep dive | guides/model-comparison.md |
| Security Hardening Guide | Complete security checklist for AI deployments | guides/security-hardening.md |
| AI Governance Policy | Template policy for organizational AI use | templates/ai-governance-policy.md |
| GPU Benchmark Script | Automated GPU performance testing | scripts/gpu-benchmark.sh |

About Petronella Technology Group

Petronella Technology Group has been providing cybersecurity and IT infrastructure services for over 23 years. Founded by Craig Petronella, a 15x published author, PTG helps organizations deploy private AI solutions that meet strict compliance requirements while delivering real business value.

Our Private AI Services

  • AI readiness assessments -- Evaluate your infrastructure and use cases
  • Hardware specification and procurement -- Right-sized GPU infrastructure
  • Deployment and configuration -- Production-ready private AI environments
  • Security hardening -- Compliance-aligned AI security controls
  • RAG pipeline development -- Domain-specific knowledge integration
  • Ongoing management -- Monitoring, updates, and optimization
  • AI governance program development -- Policies, training, and oversight



Contributing

We welcome contributions from the AI and security communities. Submit pull requests with hardware benchmarks, deployment guides, or security recommendations.

License

This project is licensed under the MIT License -- see the LICENSE file for details.


Built with hands-on private AI deployment experience by Petronella Technology Group -- Securing businesses for over 23 years.
