Configuration
All server parameters and configuration options for LayerScale.
Command-Line Flags
Pass these flags after the Docker image name:
```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  layerscale/layerscale:latest \
  --model /models/your-model \
  --ctx-size 16384 \
  --gpu-layers 99
```

| Flag | Default | Description |
|---|---|---|
| --model FILE | (required) | Path to model file or HuggingFace repo ID |
| --ctx-size N | 8192 | Context window size in tokens |
| --host HOST | 0.0.0.0 | Address to bind the HTTP server |
| --port PORT | 8080 | Port to bind the HTTP server |
| --gpu-layers N | 99 | Number of model layers to offload to GPU |
| --flash-attn on|off | on | Enable or disable flash attention |
| --kv-cache-type TYPE | f16 | KV cache quantization: f16, q8_0, or q4_0 |
| --pool-size N | 4 | Number of pre-allocated GPU contexts |
| --log-format FORMAT | text | Log output format: text or json |
| --help | | Show all available options |
Environment Variables
| Variable | Default | Description |
|---|---|---|
| LAYERSCALE_LICENSE_KEY | (required) | Your LayerScale license key |
| LAYERSCALE_HOST | 0.0.0.0 | Server bind address (overridden by --host) |
| LAYERSCALE_PORT | 8080 | Server bind port (overridden by --port) |
Context Window
The --ctx-size flag sets the maximum number of tokens the model can process at once. Larger contexts use more GPU memory.
```bash
# 4K context (lower memory usage)
--ctx-size 4096

# 32K context (for long documents or large streaming windows)
--ctx-size 32768

# 128K context (maximum, requires significant VRAM)
--ctx-size 131072
```

The minimum supported context size is 256 tokens. The maximum is 1,048,576 tokens (model-dependent).
GPU Configuration
LayerScale supports NVIDIA GPUs (CUDA) and AMD GPUs (ROCm). Apple Silicon (Metal) support is coming soon.
Layer Offloading
The --gpu-layers flag controls how many transformer layers run on the GPU. Set it to a high number (like 99) to offload all layers. Set it to 0 for CPU-only inference.
```bash
# Full GPU offload (default)
--gpu-layers 99

# CPU only
--gpu-layers 0

# Partial offload (first 30 layers on GPU, rest on CPU)
--gpu-layers 30
```

Partial offloading is useful when the model does not fully fit in VRAM. Layers that remain on the CPU run more slowly but allow larger models to load.
Flash Attention
Flash attention is enabled by default and reduces memory usage for long contexts. Disable it if you encounter compatibility issues:
```bash
--flash-attn off
```

KV Cache Quantization
The KV cache stores intermediate attention states for each token in the context. Quantizing the cache reduces VRAM usage at the cost of minor quality degradation.
| Type | Memory per Token | Reduction | Use Case |
|---|---|---|---|
| f16 | 2 bytes | Baseline | Default, highest quality |
| q8_0 | ~1 byte | ~50% | Good balance of quality and memory |
| q4_0 | ~0.5 bytes | ~75% | Maximum memory savings |
```bash
# Use 8-bit KV cache quantization
--kv-cache-type q8_0

# Use 4-bit KV cache quantization for maximum context length
--kv-cache-type q4_0
```

Context Pool
GPU memory allocation is expensive. Creating a new context means allocating a KV cache, initializing compute buffers, and warming up GPU kernels. On most hardware this takes 200-500ms per context. If that happened on every session creation, the first request to a new session would carry a significant latency penalty.
LayerScale avoids this by pre-allocating a pool of GPU contexts at startup. All the allocation cost is paid once, before the server accepts traffic. When a session is created, it borrows a context from the pool instantly. When a session is deleted, its context is returned to the pool (not destroyed), so the GPU memory is reused without reallocation.
This also serves stateless workloads. OpenAI-compatible and Anthropic-compatible requests (chat completions, messages) don’t create sessions. They borrow a context from the pool for the duration of the request, use it, and return it. The pool size determines how many of these stateless requests can run concurrently.
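The borrow/return lifecycle described above can be sketched with a blocking queue. This is an illustrative sketch only, not LayerScale's implementation; the `Context` class is a hypothetical placeholder for a pre-allocated GPU context.

```python
import queue

class Context:
    """Placeholder for a pre-allocated GPU context (KV cache, buffers)."""

class ContextPool:
    def __init__(self, size):
        # Pay the full allocation cost once, at startup,
        # before the server accepts traffic.
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(Context())

    def borrow(self):
        # Instant when a context is free; blocks when all are in use.
        return self._free.get()

    def release(self, ctx):
        # Returned to the pool, not destroyed: GPU memory is reused.
        self._free.put(ctx)

pool = ContextPool(4)
ctx = pool.borrow()   # session created, or stateless request starts
pool.release(ctx)     # session deleted, or request finishes
```

Because `borrow` blocks when the pool is exhausted, the pool size directly caps how many sessions or stateless requests can hold a context at once.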
```bash
# Allocate 8 contexts (supports 8 concurrent sessions or stateless requests)
--pool-size 8
```

The default pool size is 4. Increasing it allows more concurrent sessions but uses more GPU memory. Each context allocates a full KV cache, so the memory cost is approximately pool_size * ctx_size * kv_cache_bytes_per_token.
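As a worked example of the memory formula, the sketch below plugs in the f16 per-token figure from the KV cache table (2 bytes). The helper name is ours, and the per-token cost in practice depends on the model, so treat the result as an order-of-magnitude estimate under the table's figures.

```python
def pool_vram_bytes(pool_size, ctx_size, kv_cache_bytes_per_token):
    # The approximation from the docs:
    # pool_size * ctx_size * kv_cache_bytes_per_token
    return pool_size * ctx_size * kv_cache_bytes_per_token

# 8 contexts, 32K tokens each, f16 cache (2 bytes per token per the table)
print(pool_vram_bytes(8, 32768, 2))  # 524288 bytes (512 KiB)
```

Switching to q8_0 (~1 byte per token) halves this figure, and q4_0 (~0.5 bytes) quarters it, matching the reduction column in the table above.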
Logging
```bash
# Structured JSON logs (recommended for production)
--log-format json

# Human-readable text logs (default)
--log-format text
```

Session Limits
| Parameter | Value | Description |
|---|---|---|
| Max sessions | 10,000 | Maximum number of concurrent sessions |
| Max window size | 10,000 | Maximum entries in the streaming data buffer |
| Max generate tokens | 8,192 | Maximum tokens per generation request |
| Min context size | 256 | Minimum allowed context window |
Example: Full Configuration
```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct \
  --ctx-size 32768 \
  --gpu-layers 99 \
  --flash-attn on \
  --kv-cache-type q8_0 \
  --pool-size 8 \
  --log-format json
```