Configuration

All server parameters and configuration options for LayerScale.

Command-Line Flags

Pass these flags after the Docker image name:

docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  layerscale/layerscale:latest \
  --model /models/your-model \
  --context 16384

Flag	Default	Description
`--model FILE`	(required)	Path to model file or HuggingFace repo ID
`--context N`	(model’s trained context)	Context window size in tokens; `0` = use the model’s trained context length
`--host HOST`	`0.0.0.0`	Address to bind the HTTP server
`--port PORT`	`8080`	Port to bind the HTTP server
`--cpu-offload-gb GB`	`0` (all on GPU)	GiB of model weights to keep in CPU RAM.
`--kv-cache-type TYPE`	`fp16`	KV cache quantization: `fp16`, `int8`, or `int4`
`--log-format FORMAT`	`text`	Log output format: `text` or `json`
`--sessions N`	`0` (disabled)	Enable up to N persistent sessions. With the flag omitted (or `0`), `/v1/sessions/*` returns HTTP 503. Sessions are a hard memory allocation: the server commits `N × --context × per-token-bytes` of KV pool at startup. If that doesn’t fit the budget, per-seq context auto-clamps so all N sessions still fit. Must be `>= 1` when set; the free tier allows a single session.
`--tensor-parallel N`	`1` (single GPU)	Split one model across N GPUs with tensor parallelism. Requires N visible GPUs and an attention head count divisible by N.
`--help`		Show all available options

Environment Variables

Variable	Default	Description
`LAYERSCALE_LICENSE_KEY`	(required)	Your license key (free or Pro). The server refuses to start without one; get a free key here. Equivalent to passing `--license-key <KEY>` after the image name; use whichever suits your deployment. A free key runs the free tier, a Pro key unlocks LayerScale Pro features.

Context Window

The --context flag sets the maximum number of tokens the model can process at once. Larger contexts use more GPU memory.

If you omit --context (or pass 0), LayerScale uses the model’s trained context length: the value the model was pretrained on (e.g. 128K for Llama 3.1, 8K for Llama 3.0). This is usually what you want.

# Use the model's trained context length (default)
# (no --context flag needed)

# 4K context (lower memory usage)
--context 4096

# 32K context (for long documents or large streaming windows)
--context 32768

# 128K context (requires significant VRAM)
--context 131072

The minimum supported context size is 256 tokens. The maximum is 1,048,576 tokens (model-dependent). On the free tier, --context is capped at 32,768; LayerScale Pro unlocks the full range.

GPU Configuration

LayerScale supports NVIDIA GPUs (CUDA) and AMD GPUs (ROCm). Apple Silicon (Metal) support is coming soon.

Multi-GPU (Tensor Parallelism)

--tensor-parallel N splits a single model across N GPUs, sharding each layer’s weights so all GPUs cooperate on every forward pass. Use it when one model does not fit (or leaves too little KV headroom) on a single card. It requires N visible GPUs and a model whose attention head count is divisible by N. The KV budget is probed per device, so each request slot gets a larger context than a single card could hold.

CPU Offload

When the model does not fully fit in VRAM, the --cpu-offload-gb flag tells LayerScale how many GiB of model weights to keep in CPU RAM. The remaining weights stay resident on the GPU.

LayerScale reads the model’s metadata at startup to figure out its layer count and per-layer size, then converts your GB budget into a per-layer split.

# Default: all layers on GPU
# (no --cpu-offload-gb flag needed)

# Spill 8 GiB of weights to CPU RAM (e.g. for a 70B model on a 48 GB GPU)
--cpu-offload-gb 8

# Maximum spill: keep almost everything in RAM
--cpu-offload-gb 60

Layers that remain on CPU run slower than GPU layers but allow larger models to fit on smaller GPUs. The conversion is approximate (per-layer ≈ file_size / block_count); a few percent of slack is normal.

KV Cache Quantization

The KV cache stores intermediate attention states for each token in the context. Lowering cache precision reduces VRAM usage at the cost of minor quality degradation.

Precision	Flag Value	Memory per Token	Reduction	Use Case
16-bit	`fp16`	2 bytes	Baseline	Default, highest quality
8-bit	`int8`	~1 byte	~50%	Good balance of quality and memory
4-bit	`int4`	~0.5 bytes	~75%	Maximum memory savings

# Use 8-bit KV cache
--kv-cache-type int8

# Use 4-bit KV cache for maximum context length
--kv-cache-type int4

Sessions

Sessions are persistent, named conversations with their own KV slice. Use them when you need per-client state across requests (multi-turn chat, streaming data ingestion, Flash Queries).

The --sessions N flag controls how many sessions the server reserves at startup. Sessions are a hard memory allocation: the server commits N × --context × per-token-bytes of KV pool up front. Burst capacity for stateless requests is elastic: it gets whatever memory is left after sessions are reserved, capped at max_concurrent.

# Sessions are OFF by default; pass --sessions to enable them.
--sessions 1   # one persistent session at full --context

# Reserve 8 sessions; per-seq context auto-clamps if needed.
--sessions 8

# Requests past your --sessions count return HTTP 503; the
# ceiling is whatever you set here.
--sessions 16

The trade-off: more sessions means each one gets less context (or burst capacity shrinks). Concrete examples on a 28 GB KV budget (NVIDIA L40S):

`--context`	`--sessions`	Per-seq context	Burst slots
32768	1	32768 (full)	7
32768	8	~24576 (clamped)	0
16384	8	16384 (full)	~3
8192	8	8192 (full)	~19

Pick based on workload: long-context single-stream → low --sessions; many concurrent persistent agents → high --sessions + lower --context.

The --sessions ceiling is enforced at session-create time. The (N+1)th /v1/sessions/init returns HTTP 503 with the message "Maximum sessions reached".

Logging

# Structured JSON logs (recommended for production)
--log-format json

# Human-readable text logs (default)
--log-format text

Example: Full Configuration

docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=$LAYERSCALE_LICENSE_KEY \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct \
  --context 32768 \
  --kv-cache-type int8 \
  --log-format json