Configuration

All server parameters and configuration options for LayerScale.

Command-Line Flags

Pass these flags after the Docker image name:

Terminal window
docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
layerscale/layerscale:latest \
--model /models/your-model \
--context 16384

| Flag | Default | Description |
| --- | --- | --- |
| --model FILE | (required) | Path to model file or HuggingFace repo ID |
| --context N | (model’s trained context) | Context window size in tokens; 0 = use the model’s trained context length |
| --host HOST | 0.0.0.0 | Address to bind the HTTP server |
| --port PORT | 8080 | Port to bind the HTTP server |
| --cpu-offload-gb GB | 0 (all on GPU) | GiB of model weights to keep in CPU RAM. |
| --kv-cache-type TYPE | fp16 | KV cache quantization: fp16, int8, or int4 |
| --log-format FORMAT | text | Log output format: text or json |
| --sessions N | 1 | Reserve N persistent sessions. Sessions are a hard memory allocation: the server commits N × --context × per-token-bytes of KV pool at startup. If that doesn’t fit the budget, per-seq context auto-clamps so all N sessions still fit. Capped at the tier limit (1 free, 64 Pro). |
| --help | | Show all available options |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| LAYERSCALE_LICENSE_KEY | (optional) | License key to unlock LayerScale Pro features. This is the only way to pass a license key; there is no --license-key flag. |

Context Window

The --context flag sets the maximum number of tokens the model can process at once. Larger contexts use more GPU memory.

If you omit --context (or pass 0), LayerScale uses the model’s trained context length, i.e. the longest context the model was trained to support (e.g. 128K for Llama 3.1, 8K for Llama 3). This is usually what you want.

Terminal window
# Use the model's trained context length (default)
# (no --context flag needed)
# 4K context (lower memory usage)
--context 4096
# 32K context (for long documents or large streaming windows)
--context 32768
# 128K context (requires significant VRAM)
--context 131072

The minimum supported context size is 256 tokens. The maximum is 1,048,576 tokens (model-dependent).
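
The rules above can be summarized in a small sketch. The function name `resolve_context` is hypothetical, and raising on out-of-range values is an assumption: the documentation states the limits but not whether the server rejects or clamps a value outside them.

```python
MIN_CONTEXT = 256          # documented minimum, in tokens
MAX_CONTEXT = 1_048_576    # documented maximum (the real ceiling is model-dependent)


def resolve_context(requested, trained_context):
    """Return the effective context size in tokens.

    requested: value passed to --context, or None if the flag is omitted.
    trained_context: the model's trained context length.
    """
    # Omitting the flag or passing 0 selects the model's trained context.
    if requested is None or requested == 0:
        return trained_context
    # Assumption: out-of-range values are treated as errors in this sketch.
    if not MIN_CONTEXT <= requested <= MAX_CONTEXT:
        raise ValueError(
            f"--context must be between {MIN_CONTEXT} and {MAX_CONTEXT}, got {requested}"
        )
    return requested
```

For example, `resolve_context(0, 131072)` and `resolve_context(None, 131072)` both select the trained length, while `resolve_context(4096, 131072)` selects 4096.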

GPU Configuration

LayerScale supports NVIDIA GPUs (CUDA) and AMD GPUs (ROCm). Apple Silicon (Metal) support is coming soon.

CPU Offload

When the model does not fully fit in VRAM, the --cpu-offload-gb flag tells LayerScale how many GiB of model weights to keep in CPU RAM. The remaining weights stay resident on the GPU.

LayerScale reads the model’s metadata at startup to determine its layer count and per-layer size, then converts your GiB budget into a per-layer split.

Terminal window
# Default: all layers on GPU
# (no --cpu-offload-gb flag needed)
# Spill 8 GiB of weights to CPU RAM (e.g. for a 70B model on a 48 GB GPU)
--cpu-offload-gb 8
# Maximum spill: keep almost everything in RAM
--cpu-offload-gb 60

Layers that remain on CPU run slower than GPU layers but allow larger models to fit on smaller GPUs. The conversion is approximate (per-layer ≈ file_size / block_count); a few percent of slack is normal.
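
The conversion can be approximated as follows. This is a rough sketch, not the server's actual code: `split_layers` is a hypothetical name, and rounding the CPU share down to whole layers is an assumption, since the text only gives the per-layer estimate.

```python
GIB = 1024 ** 3


def split_layers(file_size_bytes, block_count, cpu_offload_gib):
    """Estimate how many layers land on GPU and CPU for a given offload budget.

    Returns a (gpu_layers, cpu_layers) tuple.
    """
    # Per-layer size estimate from the text: file_size / block_count.
    per_layer_bytes = file_size_bytes / block_count
    # Assumption: the budget is rounded down to a whole number of layers.
    cpu_layers = int(cpu_offload_gib * GIB // per_layer_bytes)
    # The budget can never place more layers on CPU than the model has.
    cpu_layers = min(block_count, cpu_layers)
    return block_count - cpu_layers, cpu_layers


# Hypothetical model: a 40 GiB file with 80 blocks, so about 0.5 GiB per layer.
gpu_layers, cpu_layers = split_layers(40 * GIB, 80, 8)
print(gpu_layers, cpu_layers)
```

With these made-up numbers an 8 GiB budget moves 16 of the 80 layers to CPU RAM. Because the per-layer size is an average, the real split can differ by a layer or so.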

KV Cache Quantization

The KV cache stores intermediate attention states for each token in the context. Lowering cache precision reduces VRAM usage at the cost of minor quality degradation.

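| Precision | Flag Value | Memory per cached value | Reduction | Use Case |
| --- | --- | --- | --- | --- |
| 16-bit | fp16 | 2 bytes | Baseline | Default, highest quality |
| 8-bit | int8 | ~1 byte | ~50% | Good balance of quality and memory |
| 4-bit | int4 | ~0.5 bytes | ~75% | Maximum memory savings |
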
Terminal window
# Use 8-bit KV cache
--kv-cache-type int8
# Use 4-bit KV cache for maximum context length
--kv-cache-type int4
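
To see what these precisions mean for a whole context window, here is a back-of-the-envelope estimate. It uses the standard transformer KV sizing (one key and one value vector per layer per token), not a formula taken from LayerScale, and the function names are hypothetical. The int8 and int4 figures ignore quantization scale overhead, which is why the table lists them as approximate.

```python
GIB = 1024 ** 3

# Bytes per cached value for each --kv-cache-type (approximate for int8/int4).
BYTES_PER_VALUE = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, kv_cache_type="fp16"):
    """Estimate KV cache bytes needed for one token."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_VALUE[kv_cache_type]


def kv_cache_gib(context, n_layers, n_kv_heads, head_dim, kv_cache_type="fp16"):
    """Estimate the KV cache size in GiB for one sequence of `context` tokens."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, kv_cache_type)
    return context * per_token / GIB


# Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
for cache_type in ("fp16", "int8", "int4"):
    print(cache_type, kv_cache_gib(32768, 32, 8, 128, cache_type), "GiB")
```

For this model shape a 32K context needs roughly 4 GiB at fp16, 2 GiB at int8, and 1 GiB at int4 per sequence.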

Concurrency

LayerScale runs every request, stateless completions and stateful sessions alike, through a single continuous-batching scheduler that owns one shared llama_context. Each in-flight request is a seq_id whose KV slice is committed at server startup. Concurrent requests share GPU forward passes: one llama_decode call per scheduler iteration carries the prefill chunks and decode tokens for every active slot.

There is no concurrency knob for stateless requests. The server auto-sizes the burst pool from the leftover VRAM after sessions are reserved (32 max on Pro, 2 on Free).

Sessions

Sessions are persistent, named conversations with their own KV slice. Use them when you need per-client state across requests (multi-turn chat, streaming data ingestion, Flash Queries).

The --sessions N flag controls how many sessions the server reserves at startup. Sessions are a hard memory allocation: the server commits N × --context × per-token-bytes of KV pool up front. Burst capacity for stateless requests is elastic — it gets whatever memory is left after sessions are reserved, capped at max_concurrent.

Terminal window
# Default: 1 session, full --context, max burst from leftover.
--sessions 1 # implicit, can omit
# Reserve 8 sessions; per-seq context auto-clamps if needed.
--sessions 8
# Reserve 64 sessions. The 65th /v1/sessions/init returns HTTP 503,
# because sessions are bounded by the operator-set ceiling.
--sessions 64

The trade-off: more sessions means each one gets less context (or burst capacity shrinks). Concrete examples on a 28 GB KV budget (NVIDIA L40S):

| --context | --sessions | Per-seq context | Burst slots |
| --- | --- | --- | --- |
| 32768 | 1 (default) | 32768 (full) | 7 |
| 32768 | 8 | ~24576 (clamped) | 0 |
| 16384 | 8 | 16384 (full) | ~3 |
| 8192 | 8 | 8192 (full) | ~19 |

Pick based on workload: long-context single-stream → low --sessions; many concurrent persistent agents → high --sessions + lower --context.
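
The reservation arithmetic can be sketched as below. This is a simplified model under stated assumptions, not the server's allocator: `plan_kv_pool` is a hypothetical name, each burst slot is assumed to cost one full per-seq context, and no other overhead is counted, so real servers will report somewhat different numbers.

```python
GIB = 1024 ** 3


def plan_kv_pool(budget_bytes, context, sessions, per_token_bytes, max_burst):
    """Return (per_seq_context, burst_slots) for a given KV budget.

    max_burst is the tier cap on stateless burst slots.
    """
    # Sessions are a hard allocation: N x --context x per-token-bytes.
    requested = sessions * context * per_token_bytes
    if requested > budget_bytes:
        # Auto-clamp: shrink per-seq context until all N sessions fit.
        per_seq_context = budget_bytes // (sessions * per_token_bytes)
    else:
        per_seq_context = context
    reserved = sessions * per_seq_context * per_token_bytes
    # Burst capacity is elastic: whatever is left, up to the tier cap.
    leftover = budget_bytes - reserved
    # Assumption: one burst slot costs the same as one session slice.
    burst_slots = min(max_burst, leftover // (per_seq_context * per_token_bytes))
    return per_seq_context, burst_slots


# Hypothetical numbers: 8 GiB KV budget, 128 KiB of KV per token, cap of 32.
print(plan_kv_pool(8 * GIB, 16384, 2, 131072, 32))  # sessions fit at full context
print(plan_kv_pool(8 * GIB, 16384, 8, 131072, 32))  # sessions are clamped
```

In the first call two sessions keep the full 16384-token context and two burst slots remain. In the second, eight sessions no longer fit, so each is clamped to 8192 tokens and no burst capacity is left.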

The --sessions ceiling is enforced at session-create time. The (N+1)th /v1/sessions/init returns HTTP 503 with code: "max_sessions_reached".
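
A client can detect this condition before deciding whether to retry or fall back to a stateless request. The sketch below is an illustration only: `is_session_limit` is a hypothetical helper, and because the exact JSON layout of the error body is not shown here, it checks for the code both at the top level and under an `error` object.

```python
import json


def is_session_limit(status, body_text):
    """Return True if a response signals that the --sessions ceiling was hit."""
    if status != 503:
        return False
    try:
        body = json.loads(body_text)
    except ValueError:
        # Not JSON, so it cannot carry the documented error code.
        return False
    if not isinstance(body, dict):
        return False
    # Assumption: the code may sit at the top level or inside an "error" object.
    error = body.get("error")
    nested_code = error.get("code") if isinstance(error, dict) else None
    return "max_sessions_reached" in (body.get("code"), nested_code)
```

A caller would pass the HTTP status and raw response body from its own HTTP client, for example `is_session_limit(resp.status_code, resp.text)`.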

Logging

Terminal window
# Structured JSON logs (recommended for production)
--log-format json
# Human-readable text logs (default)
--log-format text

Session Limits

| Parameter | Value | Description |
| --- | --- | --- |
| Max sessions | 10,000 | Maximum number of concurrent sessions |
| Max window size | 10,000 | Maximum entries in the streaming data buffer |

Example: Full Configuration

Terminal window
docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
-v /path/to/models:/models \
layerscale/layerscale:latest \
--model /models/llama-3.1-8b-instruct \
--context 32768 \
--kv-cache-type int8 \
--log-format json
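
If you launch the container from a script, the same command can be assembled programmatically. This is a minimal sketch: `build_docker_cmd` is a hypothetical helper, and it only uses the flags and image name shown on this page. Passing `-e LAYERSCALE_LICENSE_KEY` without a value asks Docker to forward the variable from the calling environment.

```python
import shlex


def build_docker_cmd(model, models_dir, context=None, sessions=None,
                     kv_cache_type=None, log_format=None):
    """Build the docker run argument list for a LayerScale container."""
    cmd = [
        "docker", "run", "--gpus", "all", "-p", "8080:8080",
        # No "=value": Docker copies the variable from the host environment.
        "-e", "LAYERSCALE_LICENSE_KEY",
        "-v", f"{models_dir}:/models",
        "layerscale/layerscale:latest",
        "--model", model,
    ]
    optional = {
        "--context": context,
        "--sessions": sessions,
        "--kv-cache-type": kv_cache_type,
        "--log-format": log_format,
    }
    # Omitted options fall back to the server defaults.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_docker_cmd(
    "/models/llama-3.1-8b-instruct", "/path/to/models",
    context=32768, kv_cache_type="int8", log_format="json",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues.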