
Deployment

Production deployment guide for LayerScale.

Docker Compose

Single GPU

docker-compose.yml

```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - ./models:/models
      - model-cache:/home/layerscale/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 32G
    command: >
      --model /models/llama-3.1-8b-instruct
      --ctx-size 16384
      --pool-size 4
      --kv-cache-type q8_0
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Start with:

```sh
LAYERSCALE_LICENSE_KEY=your-key docker compose up -d
```

With HuggingFace Model

```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - model-cache:/home/layerscale/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --ctx-size 8192
      --pool-size 4
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 300s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Note the longer start_period (300s) to account for model download time on first run.

GPU Passthrough

NVIDIA Container Toolkit

Install the NVIDIA Container Toolkit on the host:

```sh
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access:

```sh
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
```

AMD ROCm

For AMD GPUs, use the ROCm image and pass the GPU device:

```sh
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Verify GPU access:

```sh
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/dev-ubuntu-22.04 rocm-smi
```

Apple Silicon (Coming Soon)

Native Metal support for Apple Silicon Macs is in development. This will enable running LayerScale directly on M-series hardware without Docker or an NVIDIA/AMD GPU.

Selecting a Specific GPU

On multi-GPU hosts, select a GPU by device index:

```sh
docker run --gpus '"device=0"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Or by UUID:

```sh
docker run --gpus '"device=GPU-abc123"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Multiple Instances on Multiple GPUs

Run separate LayerScale instances on different GPUs:

```yaml
services:
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: --model /models/llama-3.1-8b-instruct --ctx-size 16384
    volumes:
      - ./models:/models

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    ports:
      - "8081:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: --model /models/mistral-7b-instruct --ctx-size 16384
    volumes:
      - ./models:/models
```

Load Balancing

Use a reverse proxy to distribute requests across multiple instances:

```yaml
services:
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - layerscale-gpu0
      - layerscale-gpu1

  layerscale-gpu0:
    image: layerscale/layerscale:latest
    # ... (as above)

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    # ... (as above)
```
nginx.conf (mounted at /etc/nginx/nginx.conf, so it needs the full `events`/`http` structure):

```nginx
events {}

http {
    upstream layerscale {
        least_conn;
        server layerscale-gpu0:8080;
        server layerscale-gpu1:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header Host $host;
            proxy_read_timeout 300s;
        }

        # SSE and WebSocket require sticky sessions
        location ~ /v1/sessions/.*/events {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 86400s;
        }

        location ~ /v1/sessions/.*/ws {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 86400s;
        }
    }
}
```

Note: Session-based endpoints (streaming, Flash Queries, WebSocket) must always route to the same instance. Use sticky sessions or route session requests to specific instances.
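One way to approximate stickiness with open-source nginx is to hash on the session ID extracted from the URI, so every request for a given session (events, WebSocket, and so on) lands on the same upstream. A sketch, assuming the session ID is the path segment after `/v1/sessions/` (the variable and upstream names are illustrative):

```nginx
# Extract the session ID from /v1/sessions/<id>/... paths.
map $uri $layerscale_session {
    ~^/v1/sessions/(?<sid>[^/]+) $sid;
    default $uri;
}

upstream layerscale_sticky {
    # Consistent hashing pins a given session to one instance.
    hash $layerscale_session consistent;
    server layerscale-gpu0:8080;
    server layerscale-gpu1:8080;
}

# Then point the session locations at this upstream:
#   location ~ ^/v1/sessions/ { proxy_pass http://layerscale_sticky; ... }
```

Consistent hashing also limits session reshuffling when an instance is added or removed from the pool.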

Resource Limits

Memory

Set memory limits to prevent the container from consuming all host memory:

```yaml
deploy:
  resources:
    limits:
      memory: 32G
    reservations:
      memory: 16G
```

A rule of thumb: allocate 1.5x the model’s VRAM requirement as container memory to account for CPU-side buffers, tokenizer, and HTTP server overhead.
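As a quick sanity check, the rule of thumb is just a multiplication (the 10 GB figure below is an illustrative VRAM estimate, not a measured value):

```python
def container_memory_gb(model_vram_gb: float, factor: float = 1.5) -> float:
    """Rule of thumb from above: container memory = model VRAM x 1.5."""
    return model_vram_gb * factor

# e.g. a model needing ~10 GB of VRAM suggests a ~15 GB memory limit
print(container_memory_gb(10))  # 15.0
```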

Shared Memory

If you see errors about shared memory, increase the shared memory size:

```yaml
shm_size: "2g"
```

Logging

JSON Logs

Use --log-format json for structured logging compatible with log aggregation tools:

```sh
--log-format json
```

Output:

```json
{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"ctx_size":16384}
{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}
```
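Because each line is a self-contained JSON object, the logs are straightforward to post-process. A minimal Python sketch using the two sample lines above:

```python
import json

log_lines = [
    '{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"ctx_size":16384}',
    '{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}',
]

# Parse each line and pull out the info-level messages.
events = [json.loads(line) for line in log_lines]
messages = [e["msg"] for e in events if e["level"] == "info"]
print(messages)  # ['model loaded', 'server listening']
```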

Docker Logging Drivers

Configure Docker to forward logs:

```yaml
logging:
  driver: json-file
  options:
    max-size: "100m"
    max-file: "5"
```

Or forward to a log aggregator:

```yaml
logging:
  driver: fluentd
  options:
    fluentd-address: "localhost:24224"
    tag: "layerscale"
```

AWS Deployment

EC2 with GPU

Recommended instances:

| Instance | GPU | VRAM | Good For |
| --- | --- | --- | --- |
| g6.xlarge | 1x L4 | 24 GB | 7B Q8, 13B Q4. Best cost-efficiency for small models. |
| g6e.xlarge | 1x L40S | 48 GB | 13B Q8, 30B Q4. Good balance of VRAM and price. |
| g7e.2xlarge | 1x RTX PRO 6000 | 96 GB | 70B Q4 on a single GPU. No tensor splitting overhead. |
| g7e.12xlarge | 4x RTX PRO 6000 | 384 GB | 70B Q8 or 100B+ models. |
| p5.48xlarge | 8x H100 | 640 GB | Largest models at high quantization. NVLink interconnect. |

The g7e instances are the best fit for LayerScale. 96 GB per GPU means a 70B model fits on a single device without sharding, which gives significantly better token throughput than splitting across multiple smaller GPUs.

Launch with the NVIDIA AMI:

```sh
# Use the Deep Learning AMI (includes NVIDIA drivers + Container Toolkit)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type g6e.xlarge \
  --key-name your-key \
  --security-group-ids sg-your-group
```

Then SSH in and run:

```sh
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

ECS with GPU

Task definition:

```json
{
  "family": "layerscale",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "layerscale",
      "image": "layerscale/layerscale:latest",
      "portMappings": [
        {"containerPort": 8080, "hostPort": 8080}
      ],
      "environment": [
        {"name": "LAYERSCALE_LICENSE_KEY", "value": "your-key"}
      ],
      "command": [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--ctx-size", "16384",
        "--log-format", "json"
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "memory": 32768,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "startPeriod": 120,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/layerscale",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "layerscale"
        }
      }
    }
  ]
}
```

GCP Deployment

Compute Engine with GPU

```sh
gcloud compute instances create layerscale-server \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --image-family=common-cu124-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=200GB \
  --metadata=install-nvidia-driver=True
```

Then SSH in and run Docker as above.

GKE with GPU

Node pool:

```sh
gcloud container node-pools create gpu-pool \
  --cluster=your-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: layerscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: layerscale
  template:
    metadata:
      labels:
        app: layerscale
    spec:
      containers:
        - name: layerscale
          image: layerscale/layerscale:latest
          ports:
            - containerPort: 8080
          env:
            - name: LAYERSCALE_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: layerscale-secrets
                  key: license-key
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--ctx-size"
            - "16384"
            - "--log-format"
            - "json"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: layerscale
spec:
  selector:
    app: layerscale
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```

Health Checks

LayerScale exposes two health endpoints:

| Endpoint | Returns 200 When |
| --- | --- |
| GET /health | Server is running and model is loaded |
| GET /v1/health | Same (OpenAI-compatible path) |

Both return HTTP 503 while the model is still loading. Configure your orchestrator’s health check with a generous start period (120-300 seconds) to allow for model loading.
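A deployment script can use the same behavior to gate traffic: poll the health endpoint until it returns 200. A small sketch with an injectable check function (the `requests`-based check shown in the comment is a hypothetical usage, not part of LayerScale):

```python
import time
from typing import Callable

def wait_until_ready(check: Callable[[], bool],
                     interval: float = 5.0,
                     max_wait: float = 300.0) -> bool:
    """Poll `check` until it returns True or `max_wait` seconds elapse."""
    deadline = time.monotonic() + max_wait
    while True:
        if check():
            return True
        if time.monotonic() + interval > deadline:
            return False  # gave up: model never finished loading in time
        time.sleep(interval)

# Usage sketch (hypothetical):
#   import requests
#   ok = wait_until_ready(
#       lambda: requests.get("http://localhost:8080/health").status_code == 200)
```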

Security

Network

  • Bind to 0.0.0.0 inside the container (default) but expose only on trusted networks
  • Use a reverse proxy (nginx, Caddy) for TLS termination
  • Restrict access with firewall rules or security groups
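For TLS termination in front of LayerScale, a minimal nginx server block might look like this (the hostname and certificate paths are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name layerscale.example.com;   # placeholder hostname

    ssl_certificate     /etc/nginx/certs/fullchain.pem;   # placeholder path
    ssl_certificate_key /etc/nginx/certs/privkey.pem;     # placeholder path

    location / {
        proxy_pass http://layerscale:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
        proxy_read_timeout 300s;
    }
}
```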

License Key

Store the license key securely:

```sh
# Docker: use environment variable
-e LAYERSCALE_LICENSE_KEY=your-key

# Docker Compose: use .env file
echo "LAYERSCALE_LICENSE_KEY=your-key" >> .env

# Kubernetes: use a Secret
kubectl create secret generic layerscale-secrets \
  --from-literal=license-key=your-key
```

Never commit the license key to version control.