
Configuration

All server parameters and configuration options for LayerScale.

Command-Line Flags

Pass these flags after the Docker image name:

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=your-license-key \
layerscale/layerscale:latest \
--model /models/your-model \
--ctx-size 16384 \
--gpu-layers 99
Flag                   Default     Description
--model FILE           (required)  Path to model file or HuggingFace repo ID
--ctx-size N           8192        Context window size in tokens
--host HOST            0.0.0.0     Address to bind the HTTP server
--port PORT            8080        Port to bind the HTTP server
--gpu-layers N         99          Number of model layers to offload to GPU
--flash-attn on|off    on          Enable or disable flash attention
--kv-cache-type TYPE   f16         KV cache quantization: f16, q8_0, or q4_0
--pool-size N          4           Number of pre-allocated GPU contexts
--log-format FORMAT    text        Log output format: text or json
--help                             Show all available options
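
The full, up-to-date flag list can always be printed from the image itself via the documented --help flag:

```shell
# Print all available options and exit (no license key or model needed)
docker run --rm layerscale/layerscale:latest --help
```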

Environment Variables

Variable                 Default     Description
LAYERSCALE_LICENSE_KEY   (required)  Your LayerScale license key
LAYERSCALE_HOST          0.0.0.0     Server bind address (overridden by --host)
LAYERSCALE_PORT          8080        Server bind port (overridden by --port)
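
For settings that exist as both a flag and a variable, the environment variable can keep the docker run line shorter; the flag still wins if both are set. A minimal sketch using only the variables documented above (port 9090 is an arbitrary example):

```shell
docker run --gpus all -p 9090:9090 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -e LAYERSCALE_PORT=9090 \
  layerscale/layerscale:latest \
  --model /models/your-model
```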

Context Window

The --ctx-size flag sets the maximum number of tokens the model can process at once. Larger contexts use more GPU memory.

# 4K context (lower memory usage)
--ctx-size 4096
# 32K context (for long documents or large streaming windows)
--ctx-size 32768
# 128K context (maximum, requires significant VRAM)
--ctx-size 131072

The minimum supported context size is 256 tokens. The maximum is 1,048,576 tokens (model-dependent).

GPU Configuration

LayerScale supports NVIDIA GPUs (CUDA) and AMD GPUs (ROCm). Apple Silicon (Metal) support is coming soon.

Layer Offloading

The --gpu-layers flag controls how many transformer layers run on the GPU. Set it to a high number (like 99) to offload all layers. Set it to 0 for CPU-only inference.

# Full GPU offload (default)
--gpu-layers 99
# CPU only
--gpu-layers 0
# Partial offload (first 30 layers on GPU, rest on CPU)
--gpu-layers 30

Partial offloading is useful when the model does not fully fit in VRAM. Layers that remain on the CPU run more slowly, but partial offload lets you load models larger than your GPU memory.
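
When tuning --gpu-layers, it helps to watch VRAM consumption while the model loads. On NVIDIA hardware one way to do this is with nvidia-smi (a standard driver utility, not part of LayerScale; exact output formatting varies by driver version):

```shell
# Report used vs. total VRAM once per second while the server starts
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```

If memory.used approaches memory.total before loading finishes, lower --gpu-layers until the model fits.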

Flash Attention

Flash attention is enabled by default and reduces memory usage for long contexts. Disable it if you encounter compatibility issues:

--flash-attn off

KV Cache Quantization

The KV cache stores intermediate attention states for each token in the context. Quantizing the cache reduces VRAM usage at the cost of minor quality degradation.

Type    Memory per Token   Reduction   Use Case
f16     2 bytes            Baseline    Default, highest quality
q8_0    ~1 byte            ~50%        Good balance of quality and memory
q4_0    ~0.5 bytes         ~75%        Maximum memory savings
# Use 8-bit KV cache quantization
--kv-cache-type q8_0
# Use 4-bit KV cache quantization for maximum context length
--kv-cache-type q4_0

Context Pool

GPU memory allocation is expensive. Creating a new context means allocating a KV cache, initializing compute buffers, and warming up GPU kernels. On most hardware this takes 200-500ms per context. If that happened on every session creation, the first request to a new session would carry a significant latency penalty.

LayerScale avoids this by pre-allocating a pool of GPU contexts at startup. All the allocation cost is paid once, before the server accepts traffic. When a session is created, it borrows a context from the pool instantly. When a session is deleted, its context is returned to the pool (not destroyed), so the GPU memory is reused without reallocation.

This also serves stateless workloads. OpenAI-compatible and Anthropic-compatible requests (chat completions, messages) don’t create sessions. They borrow a context from the pool for the duration of the request, use it, and return it. The pool size determines how many of these stateless requests can run concurrently.
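
As an illustration of a stateless request that borrows a pool context for its duration, here is a hypothetical chat-completion call. The /v1/chat/completions path is assumed from the OpenAI-compatible surface mentioned above and is not confirmed by this page; check the API reference for the exact route.

```shell
# Hypothetical OpenAI-compatible request; borrows one pool context while it runs
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```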

# Allocate 8 contexts (supports 8 concurrent sessions or stateless requests)
--pool-size 8

The default pool size is 4. Increasing it allows more concurrent sessions but uses more GPU memory. Each context allocates a full KV cache, so the memory cost is approximately pool_size * ctx_size * kv_cache_bytes_per_token.
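
As a back-of-the-envelope version of that formula, a quick shell calculation. The bytes_per_token value here is a hypothetical stand-in: the true per-token cost depends on the model's layer count and attention dimensions, not only on the cache type.

```shell
# Rough VRAM estimate for the context pool's KV caches
pool_size=8
ctx_size=32768
bytes_per_token=1024   # hypothetical; depends on model architecture and cache type
total=$((pool_size * ctx_size * bytes_per_token))
echo "KV cache pool: $((total / 1024 / 1024)) MiB"
```

Doubling either the pool size or the context size doubles this figure, so size the two together against your available VRAM.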

Logging

# Structured JSON logs (recommended for production)
--log-format json
# Human-readable text logs (default)
--log-format text
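
JSON logs are convenient to filter with jq. A sketch, assuming the container is named layerscale and that entries carry a "level" field; both are assumptions, so check your actual log output for the real field names:

```shell
# Follow the container's logs and keep only error-level entries
docker logs -f layerscale 2>&1 | jq -c 'select(.level == "error")'
```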

Session Limits

Parameter             Value    Description
Max sessions          10,000   Maximum number of concurrent sessions
Max window size       10,000   Maximum entries in the streaming data buffer
Max generate tokens   8,192    Maximum tokens per generation request
Min context size      256      Minimum allowed context window

Example: Full Configuration

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=your-license-key \
-v /path/to/models:/models \
layerscale/layerscale:latest \
--model /models/llama-3.1-8b-instruct \
--ctx-size 32768 \
--gpu-layers 99 \
--flash-attn on \
--kv-cache-type q8_0 \
--pool-size 8 \
--log-format json