Deployment
Production deployment guide for LayerScale. You’ll need a license key before proceeding.
Docker Compose
Single GPU
```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - ./models:/models
      - model-cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 32G
    command: >
      --model /models/llama-3.1-8b-instruct
      --context 16384
      --sessions 8
      --kv-cache-type int8
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Start with:

```bash
LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY docker compose up -d
```

With HuggingFace Model
```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - model-cache:/root/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --context 8192
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 300s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Note the longer start_period (300s), which allows time for the model download on first run.
GPU Passthrough
NVIDIA Container Toolkit
Install the NVIDIA Container Toolkit on the host:
```bash
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access:

```bash
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
```

AMD ROCm
For AMD GPUs, use the ROCm image and pass the GPU device:
```bash
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Verify GPU access:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/dev-ubuntu-22.04 rocm-smi
```

Apple Silicon (Coming Soon)
Native Metal support for Apple Silicon Macs is in development. This will enable running LayerScale directly on M-series hardware without Docker or an NVIDIA/AMD GPU.
Selecting a Specific GPU
On multi-GPU hosts, select a GPU by device index:
```bash
docker run --gpus '"device=0"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Or by UUID:

```bash
docker run --gpus '"device=GPU-abc123"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Multiple Instances on Multiple GPUs
Run separate LayerScale instances on different GPUs:
```yaml
services:
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: --model /models/llama-3.1-8b-instruct --context 16384
    volumes:
      - ./models:/models

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    ports:
      - "8081:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: --model /models/mistral-7b-instruct --context 16384
    volumes:
      - ./models:/models
```

Load Balancing
Use a reverse proxy to distribute requests across multiple instances:
```yaml
services:
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      # Mounted into conf.d so the stock nginx.conf includes it inside its http {} block
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      - layerscale-gpu0
      - layerscale-gpu1

  layerscale-gpu0:
    image: layerscale/layerscale:latest
    # ... (as above)

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    # ... (as above)
```

nginx.conf:

```nginx
upstream layerscale {
    hash $session_id consistent;
    server layerscale-gpu0:8080;
    server layerscale-gpu1:8080;
}

map $uri $session_id {
    ~^/v1/sessions/(?<sid>[^/]+) $sid;
    default $remote_addr;
}

server {
    listen 80;

    # Regex locations are tested in order and the first match wins,
    # so the SSE and WebSocket blocks must precede the general session block.

    # SSE (Flash Query events)
    location ~ ^/v1/sessions/.*/events {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 86400s;
    }

    # WebSocket
    location ~ ^/v1/sessions/.*/ws {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }

    # Session-bound requests (append, generate, push, flash)
    # Hash on session ID so they always hit the same GPU
    location ~ ^/v1/sessions/ {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }

    # Stateless endpoints (chat/completions, messages, session init)
    # Fall back to hashing on remote_addr
    location / {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}
```

Session state (KV cache) is pinned to the GPU instance that created it. The map block extracts the session ID from the URL path, and the hash $session_id consistent directive routes all requests for the same session to the same backend. Stateless endpoints (like /v1/chat/completions) fall back to hashing on the client IP. The consistent flag ensures that adding or removing a backend remaps only a minimal number of sessions.
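The routing idea can be illustrated outside nginx. The following is a minimal sketch, not nginx's actual implementation: it mirrors the session-ID extraction with a regular expression and uses a simple SHA-256 hash ring in place of nginx's ketama algorithm. The `HashRing` class, the `routing_key` helper, the replica count, and the example paths are assumptions made for illustration.

```python
import bisect
import hashlib
import re

# Same pattern as the nginx map block: capture the path segment after /v1/sessions/.
SESSION_RE = re.compile(r"^/v1/sessions/([^/]+)")


def routing_key(path: str, client_ip: str) -> str:
    """Return the session ID if the path has one, else fall back to the client IP."""
    match = SESSION_RE.match(path)
    return match.group(1) if match else client_ip


class HashRing:
    """Illustrative consistent-hash ring (hypothetical; nginx uses ketama internally)."""

    def __init__(self, backends, replicas=160):
        # Each backend is placed at many points on the ring to spread load evenly.
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["layerscale-gpu0:8080", "layerscale-gpu1:8080"])
    for path in ("/v1/sessions/abc123/append", "/v1/sessions/abc123/generate"):
        key = routing_key(path, "10.0.0.5")
        print(path, "->", ring.pick(key))
```

Both session paths resolve to the same key, so they land on the same backend. When a backend is added to the ring, only the keys whose nearest ring point now belongs to the new backend move, which is the property the consistent flag relies on.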
Resource Limits
Memory
Set memory limits to prevent the container from consuming all host memory:
```yaml
deploy:
  resources:
    limits:
      memory: 32G
    reservations:
      memory: 16G
```

A rule of thumb: allocate 1.5x the model’s VRAM requirement as container memory to account for CPU-side buffers, tokenizer, and HTTP server overhead.
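As a quick worked example of the 1.5x rule, the sketch below converts a model's VRAM requirement into a Compose-style memory limit. The function names are hypothetical, and rounding up to a whole gigabyte is an assumption made for convenience rather than a LayerScale requirement.

```python
import math


def container_memory_gb(model_vram_gb: float, factor: float = 1.5) -> int:
    """Apply the 1.5x rule of thumb and round up to a whole gigabyte."""
    if model_vram_gb <= 0:
        raise ValueError("model_vram_gb must be positive")
    return math.ceil(model_vram_gb * factor)


def compose_memory_limit(model_vram_gb: float) -> str:
    """Format the result the way a Compose `memory:` limit expects, e.g. '24G'."""
    return f"{container_memory_gb(model_vram_gb)}G"


if __name__ == "__main__":
    # A model that needs 16 GB of VRAM gets a 24G container limit.
    print(compose_memory_limit(16))
```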
Shared Memory
If you see errors about shared memory, increase the shared memory size:
```yaml
shm_size: "2g"
```

Logging
JSON Logs
Use --log-format json for structured logging compatible with log aggregation tools:
```bash
--log-format json
```

Output:

```json
{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"context":16384}
{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}
```

Docker Logging Drivers
Configure Docker to forward logs:
```yaml
logging:
  driver: json-file
  options:
    max-size: "100m"
    max-file: "5"
```

Or forward to a log aggregator:

```yaml
logging:
  driver: fluentd
  options:
    fluentd-address: "localhost:24224"
    tag: "layerscale"
```

AWS Deployment
EC2 with GPU
Recommended instances:
| Instance | GPU | VRAM | Good For |
|---|---|---|---|
| g6.xlarge | 1x L4 | 24 GB | 7B Q8, 13B Q4. Best cost-efficiency for small models. |
| g6e.xlarge | 1x L40S | 48 GB | 13B Q8, 30B Q4. Good balance of VRAM and price. |
| g7e.2xlarge | 1x RTX PRO 6000 | 96 GB | 70B Q4 on a single GPU. No tensor splitting overhead. |
| g7e.12xlarge | 4x RTX PRO 6000 | 384 GB | 70B Q8 or 100B+ models. |
| p5.48xlarge | 8x H100 | 640 GB | Largest models at high quantization. NVLink interconnect. |
The g7e instances are the best fit for LayerScale. 96 GB per GPU means a 70B model fits on a single device without sharding, which gives significantly better token throughput than splitting across multiple smaller GPUs.
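The sizing logic behind the table can be approximated with simple arithmetic: weight memory is roughly the parameter count times the bits per weight. The sketch below is a rough estimate only. It uses nominal bit widths (real quantization formats carry some extra overhead), and the 20% headroom reserved for KV cache and activations is an assumed figure, not a LayerScale specification.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def fits_on_gpu(params_billion: float, bits_per_weight: float,
                vram_gb: float, headroom: float = 0.2) -> bool:
    """Check whether the weights fit after reserving an assumed headroom fraction."""
    return weight_gb(params_billion, bits_per_weight) <= vram_gb * (1 - headroom)


if __name__ == "__main__":
    # 70B at 4 bits is about 35 GB of weights, which fits a 96 GB GPU
    # but not a 24 GB one.
    print(weight_gb(70, 4), fits_on_gpu(70, 4, 96), fits_on_gpu(70, 4, 24))
```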
Launch with the NVIDIA AMI:
```bash
# Use the Deep Learning AMI (includes NVIDIA drivers + Container Toolkit)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type g6e.xlarge \
  --key-name your-key \
  --security-group-ids sg-your-group
```

Then SSH in and run:

```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

ECS with GPU
Task definition:
```json
{
  "family": "layerscale",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "layerscale",
      "image": "layerscale/layerscale:latest",
      "portMappings": [
        {"containerPort": 8080, "hostPort": 8080}
      ],
      "environment": [
        {"name": "LAYERSCALE_LICENSE_KEY", "value": "your-key"}
      ],
      "command": [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--context", "16384",
        "--log-format", "json"
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "memory": 32768,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "startPeriod": 120,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/layerscale",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "layerscale"
        }
      }
    }
  ]
}
```

GCP Deployment
Compute Engine with GPU
```bash
gcloud compute instances create layerscale-server \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --image-family=common-cu124-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=200GB \
  --metadata=install-nvidia-driver=True
```

Then SSH in and run Docker as above.
GKE with GPU
Node pool:
```bash
gcloud container node-pools create gpu-pool \
  --cluster=your-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: layerscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: layerscale
  template:
    metadata:
      labels:
        app: layerscale
    spec:
      containers:
        - name: layerscale
          image: layerscale/layerscale:latest
          ports:
            - containerPort: 8080
          env:
            - name: LAYERSCALE_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: layerscale-secrets
                  key: license-key
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--context"
            - "16384"
            - "--log-format"
            - "json"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: layerscale
spec:
  selector:
    app: layerscale
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```

Health Checks
LayerScale exposes two health endpoints:
| Endpoint | Returns 200 When |
|---|---|
| GET /health | Server is running and model is loaded |
| GET /v1/health | Same (OpenAI-compatible path) |
Both return HTTP 503 while the model is still loading. Configure your orchestrator’s health check with a generous start period (120-300 seconds) to allow for model loading.
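For scripts that need to wait for the server outside an orchestrator, the same behavior can be polled directly. This is a minimal sketch using only the Python standard library; the function names, the 300-second deadline, and the 5-second interval are illustrative assumptions. It treats HTTP 503 and connection errors alike as "not ready yet".

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as 503 while the model loads) arrive here.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def wait_until_ready(url, deadline_s=300.0, interval_s=5.0,
                     probe=http_status, sleep=time.sleep, clock=time.monotonic):
    """Poll the health endpoint until it returns 200 or the deadline passes."""
    start = clock()
    while True:
        if probe(url) == 200:
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8080/health")
    print("ready" if ready else "timed out waiting for model load")
```

The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running server.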
Security
Network
- Bind to 0.0.0.0 inside the container (default) but expose only on trusted networks
- Use a reverse proxy (nginx, Caddy) for TLS termination
- Restrict access with firewall rules or security groups
License Key
Store the license key securely:
```bash
# Docker: use environment variable
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY

# Docker Compose: use .env file
echo "LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY" >> .env

# Kubernetes: use a Secret
kubectl create secret generic layerscale-secrets --from-literal=license-key=your-key
```

Never commit the license key to version control.