
Deployment

Production deployment guide for LayerScale.

Docker Compose

Single GPU

docker-compose.yml

```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - ./models:/models
      - model-cache:/home/layerscale/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 32G
    command: >
      --model /models/llama-3.1-8b-instruct
      --ctx-size 16384
      --pool-size 4
      --kv-cache-type q8_0
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Start with:

```sh
LAYERSCALE_LICENSE_KEY=your-key docker compose up -d
```

With HuggingFace Model

```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - model-cache:/home/layerscale/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --ctx-size 8192
      --pool-size 4
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 300s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Note the longer start_period (300s) to account for model download time on first run.

GPU Passthrough

NVIDIA Container Toolkit

Install the NVIDIA Container Toolkit on the host:

```sh
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access:

```sh
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
```

AMD ROCm

For AMD GPUs, use the ROCm image and pass the GPU device:

```sh
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Verify GPU access:

```sh
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/dev-ubuntu-22.04 rocm-smi
```

Apple Silicon (Coming Soon)

Native Metal support for Apple Silicon Macs is in development. This will enable running LayerScale directly on M-series hardware without Docker or an NVIDIA/AMD GPU.

Selecting a Specific GPU

On multi-GPU hosts, select a GPU by device index:

```sh
docker run --gpus '"device=0"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Or by UUID:

```sh
docker run --gpus '"device=GPU-abc123"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Multiple Instances on Multiple GPUs

Run separate LayerScale instances on different GPUs:

```yaml
services:
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: --model /models/llama-3.1-8b-instruct --ctx-size 16384
    volumes:
      - ./models:/models

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    ports:
      - "8081:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: --model /models/mistral-7b-instruct --ctx-size 16384
    volumes:
      - ./models:/models
```

Load Balancing

Use a reverse proxy to distribute requests across multiple instances:

```yaml
services:
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - layerscale-gpu0
      - layerscale-gpu1

  layerscale-gpu0:
    image: layerscale/layerscale:latest
    # ... (as above)

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    # ... (as above)
```
nginx.conf (mounted at /etc/nginx/nginx.conf, so it needs the full `events`/`http` structure):

```nginx
events {}

http {
    upstream layerscale {
        least_conn;
        server layerscale-gpu0:8080;
        server layerscale-gpu1:8080;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header Host $host;
            proxy_read_timeout 300s;
        }

        # SSE and WebSocket require sticky sessions
        location ~ /v1/sessions/.*/events {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 86400s;
        }

        location ~ /v1/sessions/.*/ws {
            proxy_pass http://layerscale;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 86400s;
        }
    }
}
```

Note: Session-based endpoints (streaming, Flash Queries, WebSocket) must always route to the same instance. Use sticky sessions or route session requests to specific instances.
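One way to approximate stickiness with open-source nginx is to hash on the session ID extracted from the URI, so every request for a given session (events, WebSocket, and so on) lands on the same upstream. A sketch, assuming the session ID is the path segment after `/v1/sessions/` (the variable and upstream names are illustrative):

```nginx
# Extract the session ID from /v1/sessions/<id>/... paths.
map $uri $layerscale_session {
    ~^/v1/sessions/(?<sid>[^/]+) $sid;
    default $uri;
}

upstream layerscale_sticky {
    # Consistent hashing pins a given session to one instance.
    hash $layerscale_session consistent;
    server layerscale-gpu0:8080;
    server layerscale-gpu1:8080;
}

# Then point the session locations at this upstream:
#   location ~ ^/v1/sessions/ { proxy_pass http://layerscale_sticky; ... }
```

Consistent hashing also limits session reshuffling when an instance is added or removed from the pool.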

Resource Limits

Memory

Set memory limits to prevent the container from consuming all host memory:

```yaml
deploy:
  resources:
    limits:
      memory: 32G
    reservations:
      memory: 16G
```

A rule of thumb: allocate 1.5x the model’s VRAM requirement as container memory to account for CPU-side buffers, tokenizer, and HTTP server overhead.
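As a quick sanity check, the rule of thumb is just a multiplication (the 10 GB figure below is an illustrative VRAM estimate, not a measured value):

```python
def container_memory_gb(model_vram_gb: float, factor: float = 1.5) -> float:
    """Rule of thumb from above: container memory = model VRAM x 1.5."""
    return model_vram_gb * factor

# e.g. a model needing ~10 GB of VRAM suggests a ~15 GB memory limit
print(container_memory_gb(10))  # 15.0
```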

Shared Memory

If you see errors about shared memory, increase the shared memory size:

```yaml
shm_size: "2g"
```

Logging

JSON Logs

Use --log-format json for structured logging compatible with log aggregation tools:

```sh
--log-format json
```

Output:

```json
{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"ctx_size":16384}
{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}
```
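Because each line is a self-contained JSON object, the logs are straightforward to post-process. A minimal Python sketch using the two sample lines above:

```python
import json

log_lines = [
    '{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"ctx_size":16384}',
    '{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}',
]

# Parse each line and pull out the info-level messages.
events = [json.loads(line) for line in log_lines]
messages = [e["msg"] for e in events if e["level"] == "info"]
print(messages)  # ['model loaded', 'server listening']
```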

Docker Logging Drivers

Configure Docker to forward logs:

```yaml
logging:
  driver: json-file
  options:
    max-size: "100m"
    max-file: "5"
```

Or forward to a log aggregator:

```yaml
logging:
  driver: fluentd
  options:
    fluentd-address: "localhost:24224"
    tag: "layerscale"
```

AWS Deployment

EC2 with GPU

Recommended instances:

| Instance | GPU | VRAM | Good For |
| --- | --- | --- | --- |
| g6.xlarge | 1x L4 | 24 GB | 7B Q8, 13B Q4. Best cost-efficiency for small models. |
| g6e.xlarge | 1x L40S | 48 GB | 13B Q8, 30B Q4. Good balance of VRAM and price. |
| g7e.2xlarge | 1x RTX PRO 6000 | 96 GB | 70B Q4 on a single GPU. No tensor splitting overhead. |
| g7e.12xlarge | 4x RTX PRO 6000 | 384 GB | 70B Q8 or 100B+ models. |
| p5.48xlarge | 8x H100 | 640 GB | Largest models at high quantization. NVLink interconnect. |

The g7e instances are the best fit for LayerScale. 96 GB per GPU means a 70B model fits on a single device without sharding, which gives significantly better token throughput than splitting across multiple smaller GPUs.

Launch with the NVIDIA AMI:

```sh
# Use the Deep Learning AMI (includes NVIDIA drivers + Container Toolkit)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type g6e.xlarge \
  --key-name your-key \
  --security-group-ids sg-your-group
```

Then SSH in and run:

```sh
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

ECS with GPU

Task definition:

```json
{
  "family": "layerscale",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "layerscale",
      "image": "layerscale/layerscale:latest",
      "portMappings": [
        {"containerPort": 8080, "hostPort": 8080}
      ],
      "environment": [
        {"name": "LAYERSCALE_LICENSE_KEY", "value": "your-key"}
      ],
      "command": [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--ctx-size", "16384",
        "--log-format", "json"
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "memory": 32768,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "startPeriod": 120,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/layerscale",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "layerscale"
        }
      }
    }
  ]
}
```

GCP Deployment

Compute Engine with GPU

```sh
gcloud compute instances create layerscale-server \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --image-family=common-cu124-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=200GB \
  --metadata=install-nvidia-driver=True
```

Then SSH in and run Docker as above.

GKE with GPU

Node pool:

```sh
gcloud container node-pools create gpu-pool \
  --cluster=your-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: layerscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: layerscale
  template:
    metadata:
      labels:
        app: layerscale
    spec:
      containers:
        - name: layerscale
          image: layerscale/layerscale:latest
          ports:
            - containerPort: 8080
          env:
            - name: LAYERSCALE_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: layerscale-secrets
                  key: license-key
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--ctx-size"
            - "16384"
            - "--log-format"
            - "json"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: layerscale
spec:
  selector:
    app: layerscale
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```

Health Checks

LayerScale exposes two health endpoints:

| Endpoint | Returns 200 When |
| --- | --- |
| GET /health | Server is running and model is loaded |
| GET /v1/health | Same (OpenAI-compatible path) |

Both return HTTP 503 while the model is still loading. Configure your orchestrator’s health check with a generous start period (120-300 seconds) to allow for model loading.
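A deployment script can use the same behavior to gate traffic: poll the health endpoint until it returns 200. A small sketch with an injectable check function (the `requests`-based check shown in the comment is a hypothetical usage, not part of LayerScale):

```python
import time
from typing import Callable

def wait_until_ready(check: Callable[[], bool],
                     interval: float = 5.0,
                     max_wait: float = 300.0) -> bool:
    """Poll `check` until it returns True or `max_wait` seconds elapse."""
    deadline = time.monotonic() + max_wait
    while True:
        if check():
            return True
        if time.monotonic() + interval > deadline:
            return False  # gave up: model never finished loading in time
        time.sleep(interval)

# Usage sketch (hypothetical):
#   import requests
#   ok = wait_until_ready(
#       lambda: requests.get("http://localhost:8080/health").status_code == 200)
```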

Security

Network

  • Bind to 0.0.0.0 inside the container (default) but expose only on trusted networks
  • Use a reverse proxy (nginx, Caddy) for TLS termination
  • Restrict access with firewall rules or security groups
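For TLS termination in front of LayerScale, a minimal nginx server block might look like this (the hostname and certificate paths are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name layerscale.example.com;   # placeholder hostname

    ssl_certificate     /etc/nginx/certs/fullchain.pem;   # placeholder path
    ssl_certificate_key /etc/nginx/certs/privkey.pem;     # placeholder path

    location / {
        proxy_pass http://layerscale:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
        proxy_read_timeout 300s;
    }
}
```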

License Key

Store the license key securely:

```sh
# Docker: use environment variable
-e LAYERSCALE_LICENSE_KEY=your-key

# Docker Compose: use .env file
echo "LAYERSCALE_LICENSE_KEY=your-key" >> .env

# Kubernetes: use a Secret
kubectl create secret generic layerscale-secrets \
  --from-literal=license-key=your-key
```

Never commit the license key to version control.