# Deployment
Production deployment guide for LayerScale.
## Docker Compose
### Single GPU
```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - ./models:/models
      - model-cache:/home/layerscale/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 32G
    command: >
      --model /models/llama-3.1-8b-instruct
      --ctx-size 16384
      --pool-size 4
      --kv-cache-type q8_0
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 120s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Start with:

```bash
LAYERSCALE_LICENSE_KEY=your-key docker compose up -d
```

### With HuggingFace Model
```yaml
services:
  layerscale:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
    volumes:
      - model-cache:/home/layerscale/.cache
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --ctx-size 8192
      --pool-size 4
      --log-format json
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      start_period: 300s
      retries: 3
    restart: unless-stopped

volumes:
  model-cache:
```

Note the longer `start_period` (300s) to account for model download time on first run.
## GPU Passthrough
### NVIDIA Container Toolkit
Install the NVIDIA Container Toolkit on the host:
```bash
# Ubuntu/Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify GPU access:

```bash
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi
```

### AMD ROCm
For AMD GPUs, use the ROCm image and pass the GPU device:
```bash
docker run --device=/dev/kfd --device=/dev/dri --group-add video \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Verify GPU access:

```bash
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/dev-ubuntu-22.04 rocm-smi
```

### Apple Silicon (Coming Soon)
Native Metal support for Apple Silicon Macs is in development. This will enable running LayerScale directly on M-series hardware without Docker or an NVIDIA/AMD GPU.
### Selecting a Specific GPU
On multi-GPU hosts, select a GPU by device index:
```bash
docker run --gpus '"device=0"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model /models/your-model
```

Or by UUID:

```bash
docker run --gpus '"device=GPU-abc123"' -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model /models/your-model
```

### Multiple Instances on Multiple GPUs
Run separate LayerScale instances on different GPUs:
```yaml
services:
  layerscale-gpu0:
    image: layerscale/layerscale:latest
    ports:
      - "8080:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: --model /models/llama-3.1-8b-instruct --ctx-size 16384
    volumes:
      - ./models:/models

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    ports:
      - "8081:8080"
    environment:
      - LAYERSCALE_LICENSE_KEY=${LAYERSCALE_LICENSE_KEY}
      - NVIDIA_VISIBLE_DEVICES=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: --model /models/mistral-7b-instruct --ctx-size 16384
    volumes:
      - ./models:/models
```

### Load Balancing
Use a reverse proxy to distribute requests across multiple instances:
```yaml
services:
  nginx:
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - layerscale-gpu0
      - layerscale-gpu1

  layerscale-gpu0:
    image: layerscale/layerscale:latest
    # ... (as above)

  layerscale-gpu1:
    image: layerscale/layerscale:latest
    # ... (as above)
```

`nginx.conf`:

```nginx
upstream layerscale {
    least_conn;
    server layerscale-gpu0:8080;
    server layerscale-gpu1:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }

    # SSE and WebSocket require sticky sessions
    location ~ /v1/sessions/.*/events {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 86400s;
    }

    location ~ /v1/sessions/.*/ws {
        proxy_pass http://layerscale;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 86400s;
    }
}
```

Note: Session-based endpoints (streaming, Flash Queries, WebSocket) must always route to the same instance. Use sticky sessions or route session requests to specific instances.
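One way to get sticky sessions in nginx itself is to hash on the session ID extracted from the request path. This is a sketch, not the documented LayerScale setup: it assumes session endpoints carry the session ID as the path segment after `/v1/sessions/`, and reuses the service names from the compose file above.

```nginx
# Extract the session ID from /v1/sessions/<id>/... paths; fall back
# to the full URI for everything else.
map $uri $ls_session {
    ~^/v1/sessions/(?<sid>[^/]+) $sid;
    default $uri;
}

# Consistent hashing keeps each session pinned to one backend, and
# minimizes remapping when instances are added or removed.
upstream layerscale_sessions {
    hash $ls_session consistent;
    server layerscale-gpu0:8080;
    server layerscale-gpu1:8080;
}

# Then route session endpoints through the sticky upstream, e.g.:
#   location ~ ^/v1/sessions/ { proxy_pass http://layerscale_sessions; }
```

Requests for the same session always hash to the same backend, while non-session traffic can continue to use the `least_conn` upstream.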
## Resource Limits
### Memory
Set memory limits to prevent the container from consuming all host memory:
```yaml
deploy:
  resources:
    limits:
      memory: 32G
    reservations:
      memory: 16G
```

A rule of thumb: allocate 1.5x the model’s VRAM requirement as container memory to account for CPU-side buffers, tokenizer, and HTTP server overhead.
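The rule of thumb can be expressed as a quick calculation. The 1.5x factor and the example figure below are illustrative assumptions to tune per workload, not measured values:

```python
import math

# Rule of thumb: container memory ≈ 1.5x the model's VRAM requirement,
# rounded up to a whole GB. The factor is an assumption; tune per workload.
def container_memory_gb(model_vram_gb: float, overhead_factor: float = 1.5) -> int:
    return math.ceil(model_vram_gb * overhead_factor)

# e.g. a model using ~16 GB of VRAM (weights + KV cache) → reserve ~24 GB
print(container_memory_gb(16))  # → 24
```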
### Shared Memory
If you see errors about shared memory, increase the shared memory size:
```yaml
shm_size: "2g"
```

## Logging
### JSON Logs
Use `--log-format json` for structured logging compatible with log aggregation tools:
```bash
--log-format json
```

Output:

```json
{"timestamp":"2024-03-15T10:30:00Z","level":"info","msg":"model loaded","model":"llama-3.1-8b-instruct","layers":32,"ctx_size":16384}
{"timestamp":"2024-03-15T10:30:01Z","level":"info","msg":"server listening","host":"0.0.0.0","port":8080}
```

### Docker Logging Drivers
Configure Docker to forward logs:
```yaml
logging:
  driver: json-file
  options:
    max-size: "100m"
    max-file: "5"
```

Or forward to a log aggregator:

```yaml
logging:
  driver: fluentd
  options:
    fluentd-address: "localhost:24224"
    tag: "layerscale"
```

## AWS Deployment
### EC2 with GPU
Recommended instances:
| Instance | GPU | VRAM | Good For |
|---|---|---|---|
| g6.xlarge | 1x L4 | 24 GB | 7B Q8, 13B Q4. Best cost-efficiency for small models. |
| g6e.xlarge | 1x L40S | 48 GB | 13B Q8, 30B Q4. Good balance of VRAM and price. |
| g7e.2xlarge | 1x RTX PRO 6000 | 96 GB | 70B Q4 on a single GPU. No tensor splitting overhead. |
| g7e.12xlarge | 4x RTX PRO 6000 | 384 GB | 70B Q8 or 100B+ models. |
| p5.48xlarge | 8x H100 | 640 GB | Largest models at high quantization. NVLink interconnect. |
The g7e instances are the best fit for LayerScale. 96 GB per GPU means a 70B model fits on a single device without sharding, which gives significantly better token throughput than splitting across multiple smaller GPUs.
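A rough weights-only size estimate shows why. The bits-per-weight figures below are ballpark assumptions for typical Q4/Q8 quantization formats (real formats carry per-block scales, so effective bits exceed nominal), and KV cache and activation buffers come on top:

```python
# Assumed effective bits per weight for common quantizations (ballpark).
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5}

def weights_gb(params_billions: float, quant: str) -> float:
    """Weights-only size in GB; KV cache and buffers are extra."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for q in ("Q4", "Q8"):
    print(f"70B {q}: ~{weights_gb(70, q):.0f} GB")  # → ~39 GB, ~74 GB
```

At roughly 39 GB for 70B Q4, a 96 GB GPU holds the weights plus a generous KV cache on a single device, with no tensor-split communication overhead.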
Launch with the NVIDIA AMI:
```bash
# Use the Deep Learning AMI (includes NVIDIA drivers + Container Toolkit)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type g6e.xlarge \
  --key-name your-key \
  --security-group-ids sg-your-group
```

Then SSH in and run:

```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-key \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

### ECS with GPU
Task definition:
```json
{
  "family": "layerscale",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "layerscale",
      "image": "layerscale/layerscale:latest",
      "portMappings": [
        {"containerPort": 8080, "hostPort": 8080}
      ],
      "environment": [
        {"name": "LAYERSCALE_LICENSE_KEY", "value": "your-key"}
      ],
      "command": [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--ctx-size", "16384",
        "--log-format", "json"
      ],
      "resourceRequirements": [
        {"type": "GPU", "value": "1"}
      ],
      "memory": 32768,
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "startPeriod": 120,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/layerscale",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "layerscale"
        }
      }
    }
  ]
}
```

## GCP Deployment
### Compute Engine with GPU
```bash
gcloud compute instances create layerscale-server \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --image-family=common-cu124-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=200GB \
  --metadata=install-nvidia-driver=True
```

Then SSH in and run Docker as above.
### GKE with GPU
Node pool:
```bash
gcloud container node-pools create gpu-pool \
  --cluster=your-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --num-nodes=1
```

Kubernetes deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: layerscale
spec:
  replicas: 1
  selector:
    matchLabels:
      app: layerscale
  template:
    metadata:
      labels:
        app: layerscale
    spec:
      containers:
        - name: layerscale
          image: layerscale/layerscale:latest
          ports:
            - containerPort: 8080
          env:
            - name: LAYERSCALE_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: layerscale-secrets
                  key: license-key
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--ctx-size"
            - "16384"
            - "--log-format"
            - "json"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: layerscale
spec:
  selector:
    app: layerscale
  ports:
    - port: 80
      targetPort: 8080
  type: LoadBalancer
```

## Health Checks
LayerScale exposes two health endpoints:
| Endpoint | Returns 200 When |
|---|---|
| `GET /health` | Server is running and model is loaded |
| `GET /v1/health` | Same (OpenAI-compatible path) |
Both return HTTP 503 while the model is still loading. Configure your orchestrator’s health check with a generous start period (120-300 seconds) to allow for model loading.
## Security
### Network
- Bind to `0.0.0.0` inside the container (default) but expose only on trusted networks
- Use a reverse proxy (nginx, Caddy) for TLS termination
- Restrict access with firewall rules or security groups
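For TLS termination, a minimal nginx sketch (the hostname and certificate paths are placeholders; adapt them to your environment):

```nginx
server {
    listen 443 ssl;
    server_name layerscale.example.com;

    # Placeholder paths; point at your actual certificate and key.
    ssl_certificate     /etc/nginx/certs/fullchain.pem;
    ssl_certificate_key /etc/nginx/certs/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}
```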
### License Key
Store the license key securely:
```bash
# Docker: use environment variable
-e LAYERSCALE_LICENSE_KEY=your-key

# Docker Compose: use .env file
echo "LAYERSCALE_LICENSE_KEY=your-key" >> .env

# Kubernetes: use a Secret
kubectl create secret generic layerscale-secrets --from-literal=license-key=your-key
```

Never commit the license key to version control.