Deployment

Production deployment guide for LayerScale. You’ll need a license key before proceeding.

Docker Compose

Single GPU

services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - ./models:/models
      - model-cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 32G
    command: >
      --model /models/llama-3.1-8b-instruct
      --context 16384
      --sessions 8
      --kv-cache-type int8
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:

Start with:

LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY docker compose up -d

With HuggingFace Model

services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - model-cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --context 8192
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 300s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:

Note the longer start_period (300s) to account for model download time on first run.

GPU Passthrough

NVIDIA Container Toolkit

Install the NVIDIA Container Toolkit on the host:

# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify GPU access:

docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi

AMD ROCm

For AMD GPUs, use the ROCm image and pass the GPU devices (/dev/kfd and /dev/dri) through to the container:

docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct

Verify GPU access:

docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/dev-ubuntu-22.04 rocm-smi

Apple Silicon (Coming Soon)

Native Metal support for Apple Silicon Macs is in development. This will enable running LayerScale directly on M-series hardware without Docker or an NVIDIA/AMD GPU.

Selecting a Specific GPU

On multi-GPU hosts, select a GPU by device index:

docker run --gpus '"device=0"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest \
  --model /models/your-model

Or by UUID:

docker run --gpus '"device=GPU-abc123"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest \
  --model /models/your-model
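
To find the index and UUID values for the --gpus flag, you can list the host's GPUs with nvidia-smi -L. The helper below is a minimal sketch: gpu_uuids is a hypothetical name, and it assumes the usual "GPU <index>: <name> (UUID: <uuid>)" line format.

```shell
# gpu_uuids: read `nvidia-smi -L` output on stdin and print "<index> <uuid>" per GPU.
# Hypothetical helper; assumes lines like "GPU 0: NVIDIA L4 (UUID: GPU-1234...)".
# Lines that do not start with "GPU <n>:" (for example MIG devices) are skipped.
gpu_uuids() {
  sed -n 's/^GPU \([0-9][0-9]*\):.*(UUID: \(GPU-[^)]*\)).*$/\1 \2/p'
}

# Usage on a GPU host:
#   nvidia-smi -L | gpu_uuids
```

Either column can then be used as the device= value in the commands above.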

Multiple Instances on Multiple GPUs

Run separate LayerScale instances on different GPUs:

services:
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: --model /models/llama-3.1-8b-instruct --context 16384
    volumes:
      - ./models:/models

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    ports:
      - "8081:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: --model /models/mistral-7b-instruct --context 16384
    volumes:
      - ./models:/models

Load Balancing

Use a reverse proxy to distribute requests across multiple instances:

services:
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - layerscale-gpu0
      - layerscale-gpu1

  layerscale-gpu0:
    image: layerscale/layerscale:latest
    # ... (as above)

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    # ... (as above)

upstream layerscale {
    hash $session_id consistent;
    server layerscale-gpu0:8080;
    server layerscale-gpu1:8080;
}

map $uri $session_id {
    ~^/v1/sessions/(?<sid>[^/]+) $sid;
    default                      $remote_addr;
}

server {
    listen 80;

    # Stateless endpoints (chat/completions, messages, session init)
    # Fall back to hashing on remote_addr
    location / {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }

    # SSE (Flash Query events)
    # Must come before the generic session location: nginx uses the first matching regex
    location ~ ^/v1/sessions/.*/events {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 86400s;
    }

    # WebSocket
    location ~ ^/v1/sessions/.*/ws {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }

    # Session-bound requests (append, generate, push, flash)
    # Hash on session ID so they always hit the same GPU
    location ~ ^/v1/sessions/ {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}

Session state (KV cache) is pinned to the GPU instance that created it. The map block extracts the session ID from the URL path into $session_id, and the hash $session_id consistent directive routes all requests for the same session to the same backend. Stateless endpoints (like /v1/chat/completions) fall back to hashing on the client IP. The consistent flag enables consistent hashing, so adding or removing a backend remaps only a small fraction of sessions.
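
The key selection can be sketched in shell to make the routing rule concrete. This is only an illustration: routing_key is a hypothetical helper, and nginx performs this step internally through the map block.

```shell
# routing_key: print the value nginx would hash on for a request.
# Mirrors the map block: the session ID for /v1/sessions/<id>/..., else the client IP.
# Hypothetical helper for illustration; not part of LayerScale or nginx.
routing_key() {
  uri=$1
  remote_addr=$2
  sid=
  case "$uri" in
    /v1/sessions/*)
      rest=${uri#/v1/sessions/}   # strip the fixed prefix
      sid=${rest%%/*}             # keep everything up to the next slash
      ;;
  esac
  # Fall back to the client address when no session ID is present
  printf '%s\n' "${sid:-$remote_addr}"
}

# Usage:
#   routing_key /v1/sessions/abc123/generate 10.0.0.5
#   routing_key /v1/chat/completions 10.0.0.5
```

The first call yields the session ID, so every request for that session hashes to the same backend. The second yields the client IP.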

Resource Limits

Memory

Set memory limits to prevent the container from consuming all host memory:

deploy:
  resources:
    limits:
      memory: 32G
    reservations:
      memory: 16G

A rule of thumb: allocate 1.5x the model’s VRAM requirement as container memory to account for CPU-side buffers, tokenizer, and HTTP server overhead.
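
As a worked example of the 1.5x rule, the sketch below computes a container memory limit from a VRAM figure in whole gigabytes. container_memory_gb is a hypothetical helper name, and the result is rounded up to the next whole GB.

```shell
# container_memory_gb: 1.5x the model's VRAM requirement, rounded up to a whole GB.
# Hypothetical helper; the argument is the VRAM requirement in whole GB.
container_memory_gb() {
  vram_gb=$1
  # Integer form of ceil(vram_gb * 1.5)
  echo $(( (vram_gb * 3 + 1) / 2 ))
}

# Usage:
#   container_memory_gb 16   # prints 24
```

A model that needs 16 GB of VRAM would therefore get a 24G memory limit.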

Shared Memory

If you see errors about shared memory, increase the shared memory size:

shm_size: "2g"

Logging

JSON Logs

Use --log-format json for structured logging compatible with log aggregation tools:

--log-format json

Output:

{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"context":16384}
{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}

Docker Logging Drivers

Configure log rotation for the default json-file driver:

logging:
  driver: json-file
  options:
    max-size: "100m"
    max-file: "5"

Or forward to a log aggregator:

logging:
  driver: fluentd
  options:
    fluentd-address: "localhost:24224"
    tag: "layerscale"

AWS Deployment

EC2 with GPU

Recommended instances:

Instance	GPU	VRAM	Good For
g6.xlarge	1x L4	24 GB	7B Q8, 13B Q4. Best cost-efficiency for small models.
g6e.xlarge	1x L40S	48 GB	13B Q8, 30B Q4. Good balance of VRAM and price.
g7e.2xlarge	1x RTX PRO 6000	96 GB	70B Q4 on a single GPU. No tensor splitting overhead.
g7e.12xlarge	4x RTX PRO 6000	384 GB	70B Q8 or 100B+ models.
p5.48xlarge	8x H100	640 GB	Largest models at high-precision quantization. NVLink interconnect.

The g7e instances are the best fit for LayerScale. 96 GB per GPU means a 70B model fits on a single device without sharding, which gives significantly better token throughput than splitting across multiple smaller GPUs.
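
A rough way to check whether a model fits on one GPU is to estimate the weight size from parameter count and bits per weight. The sketch below assumes Q4 is about 4 bits and Q8 about 8 bits per weight. Real quantization formats carry some extra metadata, and the KV cache and runtime overhead come on top. weights_gb is a hypothetical helper name.

```shell
# weights_gb: approximate weight size in GB for a model with <params_b> billion
# parameters stored at <bits> bits per weight, rounded up to a whole GB.
# Weights only: KV cache, activations, and runtime overhead are not included.
weights_gb() {
  params_b=$1
  bits=$2
  # billions of parameters * bits / 8 bits per byte = gigabytes
  echo $(( (params_b * bits + 7) / 8 ))
}

# Usage:
#   weights_gb 70 4   # prints 35
#   weights_gb 70 8   # prints 70
```

Under these assumptions a 70B model at Q4 needs roughly 35 GB for weights, which leaves headroom on a 96 GB GPU for the KV cache.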

Launch with the AWS Deep Learning AMI:

# Use the Deep Learning AMI (includes NVIDIA drivers + Container Toolkit).
# The image ID below is a placeholder; look up the AMI ID for your region.
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type g6e.xlarge \
  --key-name your-key \
  --security-group-ids sg-your-group

Then SSH in and run:

docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct

ECS with GPU

Task definition:

{
  "family": "layerscale",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "layerscale",
      "image": "layerscale/layerscale:latest",
      "portMappings": [
        {"containerPort": 8080, "hostPort": 8080}
      ],
      "environment": [
        {"name": "LAYERSCALE_LICENSE_KEY", "value": "your-key"}
      ],
      "command": [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--context", "16384",
        "--log-format", "json"
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "memory": 32768,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "startPeriod": 120,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/layerscale",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "layerscale"
        }
      }
    }
  ]
}

GCP Deployment

Compute Engine with GPU

gcloud compute instances create layerscale-server \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --image-family=common-cu124-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=200GB \
  --metadata=install-nvidia-driver=True

Then SSH in and start the container with the same docker run command shown in the EC2 section.

GKE with GPU

Node pool:

gcloud container node-pools create gpu-pool \
  --cluster=your-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1

Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: layerscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: layerscale
  template:
    metadata:
      labels:
        app: layerscale
    spec:
      containers:
        - name: layerscale
          image: layerscale/layerscale:latest
          ports:
            - containerPort: 8080
          env:
            - name: LAYERSCALE_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: layerscale-secrets
                  key: license-key
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--context"
            - "16384"
            - "--log-format"
            - "json"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: layerscale
spec:
  selector:
    app: layerscale
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer

Health Checks

LayerScale exposes two health endpoints:

Endpoint	Returns 200 When
`GET /health`	Server is running and model is loaded
`GET /v1/health`	Same (OpenAI-compatible path)

Both return HTTP 503 while the model is still loading. Configure your orchestrator’s health check with a generous start period (120-300 seconds) to allow for model loading.
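
Outside an orchestrator, the same behavior can be scripted. The sketch below polls a health URL until it returns 200. It treats 503 as "still loading" and retries on any other result, including connection failures. The function names and default timings are assumptions for illustration, and curl must be installed.

```shell
# health_state: map an HTTP status code from /health to a readiness state.
health_state() {
  case "$1" in
    200) echo ready ;;
    503) echo loading ;;
    *)   echo error ;;
  esac
}

# wait_for_ready: poll a health URL until it returns 200 or the timeout passes.
# Arguments: URL, timeout in seconds (default 300), poll interval (default 5).
# Returns 0 once the server is ready, 1 if the timeout is reached first.
wait_for_ready() {
  url=$1
  timeout=${2:-300}
  interval=${3:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s silent, -o discard body, -w print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code=000
    [ "$(health_state "$code")" = ready ] && return 0
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}

# Usage:
#   wait_for_ready http://localhost:8080/health 300 5 && echo "model loaded"
```

The 300-second default mirrors the upper end of the start period suggested above.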

Security

Network

Bind to 0.0.0.0 inside the container (default) but expose only on trusted networks
Use a reverse proxy (nginx, Caddy) for TLS termination
Restrict access with firewall rules or security groups

License Key

Store the license key securely:

# Docker: use environment variable
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY

# Docker Compose: use .env file
echo "LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY" >> .env

# Kubernetes: use a Secret
kubectl create secret generic layerscale-secrets --from-literal=license-key=your-key

Never commit the license key to version control.
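
One way to guard against this with the .env approach is to confirm that the file is ignored before committing. The check below is a simplified sketch: env_is_ignored is a hypothetical helper that only looks for an exact ".env" line in a directory's .gitignore. Inside a repository, git check-ignore -q .env is the more complete check.

```shell
# env_is_ignored: succeed if the .gitignore in <dir> contains an exact ".env" line.
# Simplification: ignores wildcard patterns and nested or global ignore files.
env_is_ignored() {
  dir=${1:-.}
  [ -f "$dir/.gitignore" ] && grep -qxF '.env' "$dir/.gitignore"
}

# Usage:
#   env_is_ignored . || echo ".env is not listed in .gitignore"
```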