
Configuration

All server parameters and configuration options for LayerScale.

Command-Line Flags

Pass these flags after the Docker image name:

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=your-license-key \
layerscale/layerscale:latest \
--model /models/your-model \
--ctx-size 16384 \
--gpu-layers 99
Flag                   Default     Description
--model FILE           (required)  Path to model file or HuggingFace repo ID
--ctx-size N           8192        Context window size in tokens
--host HOST            0.0.0.0     Address to bind the HTTP server
--port PORT            8080        Port to bind the HTTP server
--gpu-layers N         99          Number of model layers to offload to GPU
--flash-attn on|off    on          Enable or disable flash attention
--kv-cache-type TYPE   f16         KV cache quantization: f16, q8_0, or q4_0
--pool-size N          4           Number of pre-allocated GPU contexts
--log-format FORMAT    text        Log output format: text or json
--help                             Show all available options
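
The full, up-to-date flag list can always be printed from the image itself via the documented --help flag:

```shell
# Print all available options and exit (no license key or model needed)
docker run --rm layerscale/layerscale:latest --help
```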

Environment Variables

Variable                 Default     Description
LAYERSCALE_LICENSE_KEY   (required)  Your LayerScale license key
LAYERSCALE_HOST          0.0.0.0     Server bind address (overridden by --host)
LAYERSCALE_PORT          8080        Server bind port (overridden by --port)
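
For settings that exist as both a flag and a variable, the environment variable can keep the docker run line shorter; the flag still wins if both are set. A minimal sketch using only the variables documented above (port 9090 is an arbitrary example):

```shell
docker run --gpus all -p 9090:9090 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -e LAYERSCALE_PORT=9090 \
  layerscale/layerscale:latest \
  --model /models/your-model
```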

Context Window

The --ctx-size flag sets the maximum number of tokens the model can process at once. Larger contexts use more GPU memory.

# 4K context (lower memory usage)
--ctx-size 4096
# 32K context (for long documents or large streaming windows)
--ctx-size 32768
# 128K context (maximum, requires significant VRAM)
--ctx-size 131072

The minimum supported context size is 256 tokens. The maximum is 1,048,576 tokens (model-dependent).

GPU Configuration

LayerScale supports NVIDIA GPUs (CUDA) and AMD GPUs (ROCm). Apple Silicon (Metal) support is coming soon.

Layer Offloading

The --gpu-layers flag controls how many transformer layers run on the GPU. Set it to a high number (like 99) to offload all layers. Set it to 0 for CPU-only inference.

# Full GPU offload (default)
--gpu-layers 99
# CPU only
--gpu-layers 0
# Partial offload (first 30 layers on GPU, rest on CPU)
--gpu-layers 30

Partial offloading is useful when the model does not fully fit in VRAM. Layers that remain on the CPU run more slowly, but partial offload lets you load models larger than your GPU memory.
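
When tuning --gpu-layers, it helps to watch VRAM consumption while the model loads. On NVIDIA hardware one way to do this is with nvidia-smi (a standard driver utility, not part of LayerScale; exact output formatting varies by driver version):

```shell
# Report used vs. total VRAM once per second while the server starts
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```

If memory.used approaches memory.total before loading finishes, lower --gpu-layers until the model fits.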

Flash Attention

Flash attention is enabled by default and reduces memory usage for long contexts. Disable it if you encounter compatibility issues:

--flash-attn off

KV Cache Quantization

The KV cache stores intermediate attention states for each token in the context. Quantizing the cache reduces VRAM usage at the cost of minor quality degradation.

Type    Memory per Token   Reduction   Use Case
f16     2 bytes            Baseline    Default, highest quality
q8_0    ~1 byte            ~50%        Good balance of quality and memory
q4_0    ~0.5 bytes         ~75%        Maximum memory savings
# Use 8-bit KV cache quantization
--kv-cache-type q8_0
# Use 4-bit KV cache quantization for maximum context length
--kv-cache-type q4_0

Context Pool

GPU memory allocation is expensive. Creating a new context means allocating a KV cache, initializing compute buffers, and warming up GPU kernels. On most hardware this takes 200-500ms per context. If that happened on every session creation, the first request to a new session would carry a significant latency penalty.

LayerScale avoids this by pre-allocating a pool of GPU contexts at startup. All the allocation cost is paid once, before the server accepts traffic. When a session is created, it borrows a context from the pool instantly. When a session is deleted, its context is returned to the pool (not destroyed), so the GPU memory is reused without reallocation.

This also serves stateless workloads. OpenAI-compatible and Anthropic-compatible requests (chat completions, messages) don’t create sessions. They borrow a context from the pool for the duration of the request, use it, and return it. The pool size determines how many of these stateless requests can run concurrently.
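
As an illustration of a stateless request that borrows a pool context for its duration, here is a hypothetical chat-completion call. The /v1/chat/completions path is assumed from the OpenAI-compatible surface mentioned above and is not confirmed by this page; check the API reference for the exact route.

```shell
# Hypothetical OpenAI-compatible request; borrows one pool context while it runs
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-model",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```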

# Allocate 8 contexts (supports 8 concurrent sessions or stateless requests)
--pool-size 8

The default pool size is 4. Increasing it allows more concurrent sessions but uses more GPU memory. Each context allocates a full KV cache, so the memory cost is approximately pool_size * ctx_size * kv_cache_bytes_per_token.
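
As a back-of-the-envelope version of that formula, a quick shell calculation. The bytes_per_token value here is a hypothetical stand-in: the true per-token cost depends on the model's layer count and attention dimensions, not only on the cache type.

```shell
# Rough VRAM estimate for the context pool's KV caches
pool_size=8
ctx_size=32768
bytes_per_token=1024   # hypothetical; depends on model architecture and cache type
total=$((pool_size * ctx_size * bytes_per_token))
echo "KV cache pool: $((total / 1024 / 1024)) MiB"
```

Doubling either the pool size or the context size doubles this figure, so size the two together against your available VRAM.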

Logging

# Structured JSON logs (recommended for production)
--log-format json
# Human-readable text logs (default)
--log-format text
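
JSON logs are convenient to filter with jq. A sketch, assuming the container is named layerscale and that entries carry a "level" field; both are assumptions, so check your actual log output for the real field names:

```shell
# Follow the container's logs and keep only error-level entries
docker logs -f layerscale 2>&1 | jq -c 'select(.level == "error")'
```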

Session Limits

Parameter             Value    Description
Max sessions          10,000   Maximum number of concurrent sessions
Max window size       10,000   Maximum entries in the streaming data buffer
Max generate tokens   8,192    Maximum tokens per generation request
Min context size      256      Minimum allowed context window

Example: Full Configuration

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=your-license-key \
-v /path/to/models:/models \
layerscale/layerscale:latest \
--model /models/llama-3.1-8b-instruct \
--ctx-size 32768 \
--gpu-layers 99 \
--flash-attn on \
--kv-cache-type q8_0 \
--pool-size 8 \
--log-format json