Configuration
All server parameters and configuration options for LayerScale.
Command-Line Flags
Pass these flags after the Docker image name:

    docker run --gpus all -p 8080:8080 \
      -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
      layerscale/layerscale:latest \
      --model /models/your-model \
      --context 16384

| Flag | Default | Description |
|---|---|---|
| --model FILE | (required) | Path to model file or HuggingFace repo ID |
| --context N | (model’s trained context) | Context window size in tokens; 0 = use the model’s trained context length |
| --host HOST | 0.0.0.0 | Address to bind the HTTP server |
| --port PORT | 8080 | Port to bind the HTTP server |
| --cpu-offload-gb GB | 0 (all on GPU) | GiB of model weights to keep in CPU RAM. |
| --kv-cache-type TYPE | fp16 | KV cache quantization: fp16, int8, or int4 |
| --log-format FORMAT | text | Log output format: text or json |
| --sessions N | 1 | Reserve N persistent sessions. Sessions are a hard memory allocation: the server commits N × --context × per-token-bytes of KV pool at startup. If that doesn’t fit the budget, per-seq context auto-clamps so all N sessions still fit. Capped at the tier limit (1 free, 64 Pro). |
| --help | | Show all available options |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| LAYERSCALE_LICENSE_KEY | (optional) | License key to unlock LayerScale Pro features. This is the only way to pass a license key; there is no --license-key flag. |
Context Window
The --context flag sets the maximum number of tokens the model can process at once. Larger contexts use more GPU memory.
If you omit --context (or pass 0), LayerScale uses the model’s trained context length — the value the model was pretrained on (e.g. 128K for Llama 3.1, 8K for Llama 3.0). This is usually what you want.

    # Use the model's trained context length (default)
    # (no --context flag needed)

    # 4K context (lower memory usage)
    --context 4096

    # 32K context (for long documents or large streaming windows)
    --context 32768

    # 128K context (requires significant VRAM)
    --context 131072

The minimum supported context size is 256 tokens. The maximum is 1,048,576 tokens (model-dependent).
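A launch script can check a --context value against the documented bounds before starting the container. The sketch below is a minimal example; `validate_context` is a hypothetical helper, not part of LayerScale, and it checks only the global limits, since the model-dependent maximum is known only to the server.

```shell
# Hypothetical helper: validate a --context value against the documented bounds.
# Prints the value on success; prints an error to stderr and returns 1 otherwise.
validate_context() {
  ctx="$1"
  case "$ctx" in
    ''|*[!0-9]*)
      echo "context must be a non-negative integer" >&2
      return 1 ;;
  esac
  if [ "$ctx" -eq 0 ]; then
    echo 0    # 0 means "use the model's trained context length"
  elif [ "$ctx" -lt 256 ] || [ "$ctx" -gt 1048576 ]; then
    echo "context must be 0 or between 256 and 1048576" >&2
    return 1
  else
    echo "$ctx"
  fi
}
```

For example, `validate_context 4096` prints `4096`, while `validate_context 128` fails with an error message.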
GPU Configuration
LayerScale supports NVIDIA GPUs (CUDA) and AMD GPUs (ROCm). Apple Silicon (Metal) support is coming soon.
CPU Offload
When the model does not fully fit in VRAM, the --cpu-offload-gb flag tells LayerScale how many GiB of model weights to keep in CPU RAM. The remaining weights stay resident on the GPU.
LayerScale reads the model’s metadata at startup to figure out its layer count and per-layer size, then converts your GB budget into a per-layer split.

    # Default: all layers on GPU
    # (no --cpu-offload-gb flag needed)

    # Spill 8 GiB of weights to CPU RAM (e.g. for a 70B model on a 48 GB GPU)
    --cpu-offload-gb 8

    # Maximum spill: keep almost everything in RAM
    --cpu-offload-gb 60

Layers that remain on CPU run slower than GPU layers but allow larger models to fit on smaller GPUs. The conversion is approximate (per-layer ≈ file_size / block_count); a few percent of slack is normal.
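To get a feel for the conversion, the per-layer approximation can be worked through by hand. The sketch below assumes whole layers are offloaded and that the count rounds down; the server's actual rounding is not documented here. `offloaded_layers` and the model size are hypothetical.

```shell
# Rough estimate of how many layers a --cpu-offload-gb budget covers.
# Args: model_file_bytes block_count offload_gib
# Assumption: whole layers only, rounded down (the server may round differently).
offloaded_layers() {
  per_layer=$(( $1 / $2 ))                  # per-layer ≈ file_size / block_count
  budget=$(( $3 * 1024 * 1024 * 1024 ))     # GiB -> bytes
  n=$(( budget / per_layer ))
  [ "$n" -gt "$2" ] && n=$2                 # cannot offload more layers than exist
  echo "$n"
}

# Hypothetical 40 GiB model file with 80 blocks and an 8 GiB offload budget.
offloaded_layers $(( 40 * 1024 * 1024 * 1024 )) 80 8    # prints 16
```

In this example each layer is about 0.5 GiB, so an 8 GiB budget moves roughly 16 of the 80 layers to CPU RAM.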
KV Cache Quantization
The KV cache stores intermediate attention states for each token in the context. Lowering cache precision reduces VRAM usage at the cost of minor quality degradation.
| Precision | Flag Value | Memory per Cached Value | Reduction | Use Case |
|---|---|---|---|---|
| 16-bit | fp16 | 2 bytes | Baseline | Default, highest quality |
| 8-bit | int8 | ~1 byte | ~50% | Good balance of quality and memory |
| 4-bit | int4 | ~0.5 bytes | ~75% | Maximum memory savings |
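Total cache size scales with model shape and context length as well as precision. The sketch below shows that arithmetic, assuming the standard transformer layout of one key and one value vector per layer, per KV head, per token. `kv_cache_bytes` and the model dimensions are illustrative, not part of LayerScale, and the estimate ignores any per-block overhead that quantized formats may add.

```shell
# Estimate KV cache size in bytes for a single sequence.
# Args: layers kv_heads head_dim bits_per_value context_tokens
# The factor 2 accounts for storing both keys and values.
kv_cache_bytes() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 8 ))
}

# Illustrative dimensions (assumed): 32 layers, 8 KV heads, head size 128,
# with a 32768-token context.
for bits in 16 8 4; do
  bytes=$(kv_cache_bytes 32 8 128 "$bits" 32768)
  echo "${bits}-bit: $(( bytes / 1048576 )) MiB"
done
# 16-bit: 4096 MiB
# 8-bit: 2048 MiB
# 4-bit: 1024 MiB
```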

    # Use 8-bit KV cache
    --kv-cache-type int8

    # Use 4-bit KV cache for maximum context length
    --kv-cache-type int4

Concurrency
LayerScale runs every request — stateless completions and stateful sessions alike — through a single continuous-batching scheduler that owns one shared llama_context. Each in-flight request is a seq_id whose KV slice is committed at server startup; concurrent requests share GPU forward passes (one llama_decode call per scheduler iteration carries prefill chunks and decode tokens for every active slot).
There is no concurrency knob for stateless requests. The server auto-sizes the burst pool from the leftover VRAM after sessions are reserved (32 max on Pro, 2 on Free).
Sessions
Sessions are persistent, named conversations with their own KV slice. Use them when you need per-client state across requests (multi-turn chat, streaming data ingestion, Flash Queries).
The --sessions N flag controls how many sessions the server reserves at startup. Sessions are a hard memory allocation: the server commits N × --context × per-token-bytes of KV pool up front. Burst capacity for stateless requests is elastic — it gets whatever memory is left after sessions are reserved, capped at max_concurrent.

    # Default: 1 session, full --context, max burst from leftover.
    --sessions 1   # implicit, can omit

    # Reserve 8 sessions; per-seq context auto-clamps if needed.
    --sessions 8

    # Hit the 65th /v1/sessions/init and you get HTTP 503: sessions
    # are bounded by the operator-set ceiling.
    --sessions 64

The trade-off: more sessions means each one gets less context (or burst capacity shrinks). Concrete examples on a 28 GB KV budget (NVIDIA L40S):

| --context | --sessions | Per-seq context | Burst slots |
|---|---|---|---|
| 32768 | 1 (default) | 32768 (full) | 7 |
| 32768 | 8 | ~24576 (clamped) | 0 |
| 16384 | 8 | 16384 (full) | ~3 |
| 8192 | 8 | 8192 (full) | ~19 |
Pick based on workload: long-context single-stream → low --sessions; many concurrent persistent agents → high --sessions + lower --context.
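The reservation rule (N × --context × per-token-bytes, clamped to fit, with burst taking the remainder) can be sketched as a small planning function. This is a simplified model only: it assumes burst slots use the same per-seq context as sessions, and it omits whatever additional accounting the server performs, so it will not reproduce the table's figures exactly. `plan_sessions` and all of its inputs are hypothetical.

```shell
# Simplified model of session reservation; not the server's actual algorithm.
# Args: kv_budget_bytes per_token_bytes context sessions max_burst
# Prints: "<per-seq context> <burst slots>"
plan_sessions() {
  budget=$1; per_tok=$2; ctx=$3; n=$4; max_burst=$5
  need=$(( n * ctx * per_tok ))
  if [ "$need" -gt "$budget" ]; then
    ctx=$(( budget / (n * per_tok) ))       # clamp so all N sessions still fit
    need=$(( n * ctx * per_tok ))
  fi
  # Assumption: each burst slot costs the same as one session.
  burst=$(( (budget - need) / (ctx * per_tok) ))
  [ "$burst" -gt "$max_burst" ] && burst=$max_burst
  echo "$ctx $burst"
}

# Assumed inputs: 8 GiB KV budget, 128 KiB per token, burst cap of 32.
plan_sessions 8589934592 131072 32768 1 32    # prints "32768 1"
plan_sessions 8589934592 131072 32768 4 32    # prints "16384 0"
```

The second call shows the clamp: four 32K sessions need 16 GiB, so the per-seq context drops to 16384 and no memory is left for burst.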
The --sessions ceiling is enforced at session-create time. The (N+1)th /v1/sessions/init returns HTTP 503 with code: "max_sessions_reached".
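A client script can tell the session-ceiling error apart from other failures by checking both the status and the error code. The sketch below uses `is_session_limit_error`, a hypothetical helper; it matches the code as a substring because the exact response body shape is not specified here.

```shell
# Hypothetical helper: detect the documented session-ceiling error.
# Args: http_status response_body
# Returns 0 only for HTTP 503 with the max_sessions_reached code in the body.
is_session_limit_error() {
  [ "$1" = "503" ] || return 1
  case "$2" in
    *max_sessions_reached*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (assumption: the request method and payload shown are guesses):
#   status=$(curl -s -o body.json -w '%{http_code}' -X POST \
#     http://localhost:8080/v1/sessions/init)
#   if is_session_limit_error "$status" "$(cat body.json)"; then
#     echo "all sessions in use; retry later or raise --sessions" >&2
#   fi
```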
Logging

    # Structured JSON logs (recommended for production)
    --log-format json

    # Human-readable text logs (default)
    --log-format text

Session Limits
| Parameter | Value | Description |
|---|---|---|
| Max sessions | 10,000 | Maximum number of concurrent sessions |
| Max window size | 10,000 | Maximum entries in the streaming data buffer |
Example: Full Configuration

    docker run --gpus all -p 8080:8080 \
      -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
      -v /path/to/models:/models \
      layerscale/layerscale:latest \
      --model /models/llama-3.1-8b-instruct \
      --context 32768 \
      --kv-cache-type int8 \
      --log-format json