Deployment

Production deployment guide for LayerScale. You’ll need a license key before proceeding.

Docker Compose

Single GPU

docker-compose.yml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - ./models:/models
      - model-cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 32G
    command: >
      --model /models/llama-3.1-8b-instruct
      --context 16384
      --sessions 8
      --kv-cache-type int8
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3
    restart: unless-stopped
volumes:
  model-cache:

Start with:

LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY docker compose up -d

With HuggingFace Model

services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - model-cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --context 8192
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 300s
      retries: 3
    restart: unless-stopped
volumes:
  model-cache:

Note the longer start_period (300s) to account for model download time on first run.

GPU Passthrough

NVIDIA Container Toolkit

Install the NVIDIA Container Toolkit on the host:

# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify GPU access:

docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi

AMD ROCm

For AMD GPUs, use the ROCm image and pass the GPU device:

docker run --device=/dev/kfd --device=/dev/dri --group-add video \
-p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
layerscale/layerscale:latest-rocm \
--model meta-llama/Llama-3.1-8B-Instruct

Verify GPU access:

docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
rocm/dev-ubuntu-22.04 rocm-smi

Apple Silicon (Coming Soon)

Native Metal support for Apple Silicon Macs is in development. This will enable running LayerScale directly on M-series hardware without Docker or an NVIDIA/AMD GPU.

Selecting a Specific GPU

On multi-GPU hosts, select a GPU by device index:

docker run --gpus '"device=0"' -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
layerscale/layerscale:latest \
--model /models/your-model

Or by UUID:

docker run --gpus '"device=GPU-abc123"' -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
layerscale/layerscale:latest \
--model /models/your-model

Multiple Instances on Multiple GPUs

Run separate LayerScale instances on different GPUs:

services:
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: --model /models/llama-3.1-8b-instruct --context 16384
    volumes:
      - ./models:/models
  layerscale-gpu1:
    image: layerscale/layerscale:latest
    ports:
      - "8081:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: --model /models/mistral-7b-instruct --context 16384
    volumes:
      - ./models:/models

Load Balancing

Use a reverse proxy to distribute requests across multiple instances:

services:
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - layerscale-gpu0
      - layerscale-gpu1
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    # ... (as above)
  layerscale-gpu1:
    image: layerscale/layerscale:latest
    # ... (as above)

nginx.conf
# This file replaces the main /etc/nginx/nginx.conf, so it needs
# the top-level events and http blocks.
events {}

http {
    # Extract the session ID from the URL path; fall back to the client IP
    map $uri $session_id {
        ~^/v1/sessions/(?<sid>[^/]+) $sid;
        default $remote_addr;
    }

    upstream layerscale {
        hash $session_id consistent;
        server layerscale-gpu0:8080;
        server layerscale-gpu1:8080;
    }

    server {
        listen 80;

        # Stateless endpoints (chat/completions, messages, session init)
        # Fall back to hashing on remote_addr
        location / {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header Host $host;
            proxy_read_timeout 300s;
        }

        # Regex locations are tested in order of appearance and the first
        # match wins, so the SSE and WebSocket locations must come before
        # the generic /v1/sessions/ location.

        # SSE (Flash Query events)
        location ~ ^/v1/sessions/.*/events {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 86400s;
        }

        # WebSocket
        location ~ ^/v1/sessions/.*/ws {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 86400s;
        }

        # Session-bound requests (append, generate, push, flash)
        # Hash on session ID so they always hit the same GPU
        location ~ ^/v1/sessions/ {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header Host $host;
            proxy_read_timeout 300s;
        }
    }
}

Session state (KV cache) is pinned to the GPU instance that created it. The map block extracts the session ID from the URL path into $session_id, and the hash $session_id consistent directive routes all requests for the same session to the same backend. Stateless endpoints (like /v1/chat/completions) fall back to hashing on the client IP. The consistent flag enables consistent hashing, so adding or removing a backend remaps only a small fraction of sessions.
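
To make the routing behaviour concrete, here is a minimal Python sketch of the same idea: derive a routing key from the request path, then place it on a hash ring. It is an illustration only, not nginx's actual implementation; the HashRing class, the MD5-based hash, and the 160 points per backend are assumptions chosen for the example.

```python
import bisect
import hashlib
import re

# Same pattern as the nginx map block: capture the path segment after /v1/sessions/
SESSION_RE = re.compile(r"^/v1/sessions/([^/]+)")


def routing_key(path: str, client_ip: str) -> str:
    """Return the session ID if the path contains one, else the client IP."""
    match = SESSION_RE.match(path)
    return match.group(1) if match else client_ip


class HashRing:
    """Hypothetical consistent-hash ring with virtual points per backend."""

    def __init__(self, backends, points_per_backend=160):
        # Each backend owns many points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(points_per_backend)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of MD5 as an integer; any stable hash would do here.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        """Return the backend owning the first ring point after the key's hash."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


if __name__ == "__main__":
    two = HashRing(["layerscale-gpu0:8080", "layerscale-gpu1:8080"])
    three = HashRing(
        ["layerscale-gpu0:8080", "layerscale-gpu1:8080", "layerscale-gpu2:8080"]
    )
    sessions = [f"sess-{i}" for i in range(1000)]
    moved = sum(two.lookup(s) != three.lookup(s) for s in sessions)
    print(f"{moved} of {len(sessions)} sessions remapped after adding a backend")
```

With a plain modulo hash, adding a third backend would remap roughly two thirds of the sessions. On a ring, only the sessions that land on the new backend move, which is about one third in this setup.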

Resource Limits

Memory

Set memory limits to prevent the container from consuming all host memory:

deploy:
  resources:
    limits:
      memory: 32G
    reservations:
      memory: 16G

A rule of thumb: allocate 1.5x the model’s VRAM requirement as container memory to account for CPU-side buffers, tokenizer, and HTTP server overhead.
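
The rule of thumb is simple arithmetic. The helper below is a hypothetical convenience (not part of LayerScale) that applies the 1.5x factor and rounds up to a whole number suitable for a Compose memory limit.

```python
import math


def container_memory_limit(model_vram_gib: float, factor: float = 1.5) -> str:
    """Return a Compose-style memory limit, e.g. '24G', for a given VRAM need.

    The 1.5x factor is the rule of thumb from this guide; the rounding
    to a whole number is an assumption made for readability.
    """
    return f"{math.ceil(model_vram_gib * factor)}G"


if __name__ == "__main__":
    # A model that needs 16 GiB of VRAM gets a 24G container limit.
    print(container_memory_limit(16))
```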

Shared Memory

If you see errors about shared memory, increase the shared memory size:

shm_size: "2g"

Logging

JSON Logs

Use --log-format json for structured logging compatible with log aggregation tools:

--log-format json

Output:

{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"context":16384}
{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}
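
Because each log line is a standalone JSON object, downstream tooling can parse the stream line by line. The sketch below reads the two sample lines above and skips anything that is not valid JSON. The function name is hypothetical, and the field names are taken from the sample output only, not from a documented schema.

```python
import json

SAMPLE_LINES = [
    '{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded",'
    '"model":"llama-3.1-8b-instruct","layers":32,"context":16384}',
    '{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening",'
    '"host":"0.0.0.0","port":8080}',
]


def parse_log_lines(lines):
    """Yield one dict per JSON log line, ignoring blank or non-JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray plain-text line from another process
        if isinstance(record, dict):
            yield record


if __name__ == "__main__":
    for record in parse_log_lines(SAMPLE_LINES):
        print(record["timestamp"], record["level"], record["msg"])
```

The same function works on a live stream, for example by passing it sys.stdin when piping in the output of docker logs.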

Docker Logging Drivers

Configure Docker to forward logs:

logging:
  driver: json-file
  options:
    max-size: "100m"
    max-file: "5"

Or forward to a log aggregator:

logging:
  driver: fluentd
  options:
    fluentd-address: "localhost:24224"
    tag: "layerscale"

AWS Deployment

EC2 with GPU

Recommended instances:

Instance       GPU                VRAM     Good For
g6.xlarge      1x L4              24 GB    7B Q8, 13B Q4. Best cost-efficiency for small models.
g6e.xlarge     1x L40S            48 GB    13B Q8, 30B Q4. Good balance of VRAM and price.
g7e.2xlarge    1x RTX PRO 6000    96 GB    70B Q4 on a single GPU. No tensor splitting overhead.
g7e.12xlarge   4x RTX PRO 6000    384 GB   70B Q8 or 100B+ models.
p5.48xlarge    8x H100            640 GB   Largest models at high quantization. NVLink interconnect.

The g7e instances are the best fit for LayerScale. With 96 GB per GPU, a 70B model at Q4 fits on a single device without sharding, which gives significantly better token throughput than splitting it across multiple smaller GPUs.
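
As a rough sanity check on these pairings, the weight footprint can be estimated as parameters times bits per weight divided by 8. The sketch below covers weights only and assumes nominal bit widths (Q4 = 4 bits, Q8 = 8 bits). Real quantization formats carry some extra overhead, and the KV cache and activations need additional VRAM on top, so treat the results as lower bounds.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # (label, parameters in billions, nominal bits per weight, GPU VRAM in GB)
    cases = [
        ("7B Q8 on L4", 7, 8, 24),
        ("30B Q4 on L40S", 30, 4, 48),
        ("70B Q4 on RTX PRO 6000", 70, 4, 96),
    ]
    for label, params, bits, vram in cases:
        size = weights_gb(params, bits)
        print(f"{label}: ~{size:.1f} GB of weights, {vram - size:.1f} GB left over")
```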

Launch with a Deep Learning AMI (replace the placeholder AMI ID with the one for your region):

# Use the Deep Learning AMI (includes NVIDIA drivers + Container Toolkit)
aws ec2 run-instances \
--image-id ami-0abcdef1234567890 \
--instance-type g6e.xlarge \
--key-name your-key \
--security-group-ids sg-your-group

Then SSH in and run:

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
-e HF_TOKEN=$HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface \
layerscale/layerscale:latest \
--model meta-llama/Llama-3.1-8B-Instruct

ECS with GPU

Task definition:

{
  "family": "layerscale",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "layerscale",
      "image": "layerscale/layerscale:latest",
      "portMappings": [
        {"containerPort": 8080, "hostPort": 8080}
      ],
      "environment": [
        {"name": "LAYERSCALE_LICENSE_KEY", "value": "your-key"}
      ],
      "command": [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--context", "16384",
        "--log-format", "json"
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "memory": 32768,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "startPeriod": 120,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/layerscale",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "layerscale"
        }
      }
    }
  ]
}

GCP Deployment

Compute Engine with GPU

gcloud compute instances create layerscale-server \
--zone=us-central1-a \
--machine-type=g2-standard-8 \
--accelerator=type=nvidia-l4,count=1 \
--image-family=common-cu124-debian-11 \
--image-project=deeplearning-platform-release \
--boot-disk-size=200GB \
--metadata=install-nvidia-driver=True

Then SSH in and run Docker as above.

GKE with GPU

Node pool:

gcloud container node-pools create gpu-pool \
--cluster=your-cluster \
--zone=us-central1-a \
--machine-type=g2-standard-8 \
--accelerator=type=nvidia-l4,count=1 \
--num-nodes=1

Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: layerscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: layerscale
  template:
    metadata:
      labels:
        app: layerscale
    spec:
      containers:
        - name: layerscale
          image: layerscale/layerscale:latest
          ports:
            - containerPort: 8080
          env:
            - name: LAYERSCALE_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: layerscale-secrets
                  key: license-key
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--context"
            - "16384"
            - "--log-format"
            - "json"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: layerscale
spec:
  selector:
    app: layerscale
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer

Health Checks

LayerScale exposes two health endpoints:

Endpoint         Returns 200 When
GET /health      Server is running and model is loaded
GET /v1/health   Same (OpenAI-compatible path)

Both return HTTP 503 while the model is still loading. Configure your orchestrator’s health check with a generous start period (120-300 seconds) to allow for model loading.
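
For deploy scripts that run outside an orchestrator, the same wait-for-ready logic can be sketched in Python. The helper below polls /health until it returns 200 and treats 503 or a refused connection as "still starting". The function names, the 300-second deadline, and the 5-second interval are illustrative assumptions, not LayerScale defaults.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return 0  # connection refused or timed out: container not up yet


def wait_until_healthy(url, deadline_s=300.0, interval_s=5.0,
                       probe=http_status, sleep=time.sleep,
                       clock=time.monotonic) -> bool:
    """Poll url until it returns 200; give up once deadline_s has elapsed."""
    end = clock() + deadline_s
    while True:
        if probe(url) == 200:
            return True
        if clock() >= end:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ready = wait_until_healthy("http://localhost:8080/health")
    print("ready" if ready else "timed out waiting for model load")
```

The probe, sleep, and clock parameters exist only so the loop can be exercised without a running server.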

Security

Network

  • Bind to 0.0.0.0 inside the container (default) but expose only on trusted networks
  • Use a reverse proxy (nginx, Caddy) for TLS termination
  • Restrict access with firewall rules or security groups

License Key

Store the license key securely:

# Docker: use environment variable
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY
# Docker Compose: use .env file
echo "LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY" >> .env
# Kubernetes: use a Secret
kubectl create secret generic layerscale-secrets --from-literal=license-key=your-key

Never commit the license key to version control.