Models
How to load and run models with LayerScale.
Loading a Local Model
Mount a directory containing your model file and pass the path:
```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct
```

Loading from HuggingFace
Pass a HuggingFace repository ID and LayerScale will download and convert the model automatically:
```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

The model is cached after the first download. To persist the cache across container restarts, mount a volume:
```bash
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -v /path/to/cache:/home/layerscale/.cache \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

Supported Architectures
LayerScale supports all major transformer-based architectures, along with several alternative architectures such as linear-attention and state-space models:
| Architecture | Example Models | Notes |
|---|---|---|
| Llama | Llama 3.1 8B/70B, Llama 3.2 1B/3B, Llama 4 Maverick 17Bx128E | Full support including chat templates |
| Mistral | Mistral 7B, Mixtral 8x7B | MoE architectures supported |
| Gemma | Gemma 2 2B/9B/27B | |
| Phi | Phi-3 Mini/Medium | |
| Command-R | Command-R 35B/104B | |
| StarCoder | StarCoder2 3B/7B/15B | Code-specialized |
| RWKV | RWKV-6 | Linear attention |
| Mamba | Mamba 1.4B/2.8B | State-space model |
Quantization
Models come in various quantization levels that trade quality for smaller file size and lower memory usage.
| Quantization | Bits | Relative Size | Quality | Recommended For |
|---|---|---|---|---|
| F16 | 16 | 100% | Best | Research, quality-critical tasks |
| Q8_0 | 8 | ~50% | Near-lossless | Production (best balance) |
| Q6_K | 6 | ~40% | Very good | Production with tighter memory |
| Q5_K_M | 5 | ~35% | Good | General use |
| Q4_K_M | 4 | ~28% | Acceptable | Memory-constrained environments |
| Q4_0 | 4 | ~25% | Lower | Maximum memory savings |
| Q2_K | 2 | ~15% | Degraded | Experimentation only |
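As a rough rule of thumb (this is back-of-the-envelope arithmetic derived from the Relative Size column above, not an official formula), you can estimate a quantized model's on-disk size by taking F16 at 2 bytes per parameter and scaling down:

```python
# Relative sizes from the quantization table above (F16 = 100%).
RELATIVE_SIZE = {
    "F16": 1.00, "Q8_0": 0.50, "Q6_K": 0.40,
    "Q5_K_M": 0.35, "Q4_K_M": 0.28, "Q4_0": 0.25, "Q2_K": 0.15,
}

def quantized_size_gb(params_billions: float, quant: str) -> float:
    """Rough on-disk size estimate: F16 stores 2 bytes per parameter."""
    f16_gib = params_billions * 1e9 * 2 / 1024**3
    return f16_gib * RELATIVE_SIZE[quant]
```

For a 7B model at Q4_K_M this gives roughly 3.7 GB, in the right neighborhood of real quantized checkpoints; actual files vary slightly because quantization formats store per-block scaling metadata.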
VRAM Requirements (Approximate)
| Model Size | Q4_K_M | Q8_0 | F16 |
|---|---|---|---|
| 1B | ~1 GB | ~2 GB | ~3 GB |
| 7B | ~5 GB | ~8 GB | ~15 GB |
| 13B | ~8 GB | ~14 GB | ~27 GB |
| 34B | ~20 GB | ~36 GB | ~68 GB |
| 70B | ~40 GB | ~72 GB | ~140 GB |
These estimates are for model weights only. Add KV cache memory on top (see Configuration).
Recommended Models
General Purpose
```bash
# Llama 3.1 8B - good all-around model
--model meta-llama/Llama-3.1-8B-Instruct

# Gemma 2 9B - strong reasoning
--model google/gemma-2-9b-it

# Mistral 7B - fast and capable
--model mistralai/Mistral-7B-Instruct-v0.3
```

Tool Calling

```bash
# Llama 3.1 8B - reliable structured output
--model meta-llama/Llama-3.1-8B-Instruct

# Llama 3.1 70B - best tool calling accuracy (requires 40+ GB VRAM with Q4)
--model meta-llama/Llama-3.1-70B-Instruct
```

Code Generation

```bash
# StarCoder2 15B
--model bigcode/starcoder2-15b

# Llama 3.1 8B - also strong at code tasks
--model meta-llama/Llama-3.1-8B-Instruct
```

Small/Edge Models

```bash
# Llama 3.2 3B - fits on most GPUs
--model meta-llama/Llama-3.2-3B-Instruct

# Gemma 2 2B - minimal footprint
--model google/gemma-2-2b-it

# Phi-3 Mini - 3.8B parameters
--model microsoft/Phi-3-mini-4k-instruct
```

A Note on Reasoning Models
LayerScale is not recommended for use with reasoning models (any model that uses chain-of-thought token generation before producing an answer).
These models generate hundreds or thousands of “thinking” tokens before the actual response. This fundamentally conflicts with LayerScale’s architecture and the workloads it is designed for:
- Latency. LayerScale is built for near-instant query responses. A reasoning model spends 500 to 5,000+ tokens thinking before emitting a single useful token, pushing latency into seconds regardless of how fast the engine is. The entire benefit of stateful inference and Flash Queries is negated.
- Context budget. Reasoning tokens consume the same context space as data tokens. A model that generates 2,000 thinking tokens per query burns through context that would otherwise hold streaming data entries. In an 8K context, a single reasoning query can displace most of your accumulated data.
- Flash Query incompatibility. Flash Queries work by generating short, cached answers that update after each data batch. Reasoning models cannot produce short answers: they must complete their full chain-of-thought before any output is usable, making background evaluation impractical and cache hits impossible.
- Token cost scaling. In streaming workloads, queries fire frequently (every data batch for Flash Queries, every few seconds for interactive use). Reasoning overhead multiplies across every query: a workload that generates 100 queries over a session, at 2,000 thinking tokens each, would produce 200,000+ wasted thinking tokens.
For LayerScale workloads, use instruction-tuned models that produce direct answers. Models like Llama 3.1 Instruct, Mistral Instruct, and Gemma-IT are designed to respond concisely when prompted to, which is exactly what streaming and Flash Query evaluation requires.
Chat Templates
LayerScale automatically detects the chat template from the model metadata. The correct prompt formatting (special tokens, role markers, etc.) is applied automatically for each model family.
If you are using the OpenAI or Anthropic API endpoints, messages are formatted using the model’s chat template. You do not need to handle prompt formatting yourself.
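For illustration, a minimal chat request might be built like this (a sketch assuming the standard OpenAI chat-completions convention of `POST /v1/chat/completions` with role-tagged messages; adjust the host and model name to your deployment). Note that the messages carry no model-specific tokens: the server applies the chat template.

```python
import json

# Role-tagged messages; LayerScale applies the loaded model's chat
# template (special tokens, role markers) server-side before inference.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "Answer concisely."},
        {"role": "user", "content": "What model are you?"},
    ],
}
body = json.dumps(payload).encode()
# Send with your HTTP client of choice, e.g.:
# urllib.request.Request("http://localhost:8080/v1/chat/completions",
#                        data=body,
#                        headers={"Content-Type": "application/json"})
```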
Listing Loaded Models
Query the models endpoint to see what model is currently loaded:
```bash
curl http://localhost:8080/v1/models
```

```json
{
  "object": "list",
  "data": [
    {
      "id": "Llama-4-Maverick-17B-128E-Original",
      "object": "model",
      "owned_by": "meta-llama"
    }
  ]
}
```
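From a client, the same response can be parsed to discover the loaded model id, for example in Python (the response body below is the example shown above):

```python
import json

# Example /v1/models response body, as returned by the curl call above.
response_text = """
{
  "object": "list",
  "data": [
    {
      "id": "Llama-4-Maverick-17B-128E-Original",
      "object": "model",
      "owned_by": "meta-llama"
    }
  ]
}
"""

# The "data" array lists loaded models; collect their ids.
model_ids = [m["id"] for m in json.loads(response_text)["data"]]
# model_ids == ["Llama-4-Maverick-17B-128E-Original"]
```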