Getting Started
Get LayerScale running and send your first query in under 5 minutes.
Prerequisites
- Docker with GPU container support (NVIDIA Container Toolkit or AMD ROCm)
- An NVIDIA or AMD GPU with at least 8 GB VRAM (for 7B parameter models)
- A LayerScale license key
Apple Silicon (Metal) support is coming soon.
1. Pull the Docker Image
```shell
# NVIDIA (CUDA)
docker pull layerscale/layerscale:latest

# AMD (ROCm)
docker pull layerscale/layerscale:latest-rocm
```

2. Start the Server
Run LayerScale with a model from Hugging Face:
```shell
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

The server downloads the model on first run (this takes a few minutes depending on your connection). Subsequent starts use the cached model.
To use a local model file instead, mount a volume:
```shell
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct
```

3. Verify the Server Is Running
```shell
curl http://localhost:8080/v1/health
```

A 200 OK with an empty body means the model is loaded and the server is ready. A 503 means the model is still loading.
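When scripting startup (for example in CI), you may want to block until the server reports ready. A minimal sketch using only the Python standard library and the /v1/health endpoint described above; the function name, timeout, and polling interval are illustrative choices, not part of LayerScale itself:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8080/v1/health",
                     timeout=600, interval=5):
    """Poll the health endpoint until it returns 200 OK or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:  # model loaded, server ready
                    return True
        except urllib.error.URLError:
            pass  # 503 (model still loading) or connection refused
        time.sleep(interval)
    return False
```

The loop treats a 503 the same as a refused connection, so the same check works whether the container is still booting or the model is still loading.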
4. Send Your First Query
Using the OpenAI-compatible chat completions endpoint:
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
```

5. Try Streaming
Add "stream": true to the request body to receive tokens as they are generated:
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain how transformers work in three sentences."}
    ],
    "max_tokens": 200,
    "stream": true
  }'
```

Tokens arrive as Server-Sent Events (SSE). Each event contains a JSON chunk with the next token in choices[0].delta.content.
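The SSE stream can also be consumed programmatically. A minimal sketch using only the Python standard library; it assumes the OpenAI-style framing described above (each event is a "data: " line carrying a JSON chunk, with a final "data: [DONE]" sentinel), which is the usual convention for OpenAI-compatible servers. The function names here are illustrative:

```python
import json
import urllib.request

def parse_sse_line(line):
    """Extract the token from one SSE 'data:' line, or None for non-content lines."""
    line = line.strip()
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":  # end-of-stream sentinel
        return None
    delta = json.loads(payload)["choices"][0]["delta"]
    return delta.get("content")

def stream_chat(prompt, base_url="http://localhost:8080/v1",
                model="meta-llama/Llama-3.1-8B-Instruct"):
    """Yield content tokens from a streaming chat completion as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # iterate the response line by line
            token = parse_sse_line(raw.decode())
            if token is not None:
                yield token

# Example usage (requires a running server):
# for token in stream_chat("Explain how transformers work in three sentences."):
#     print(token, end="", flush=True)
```

Iterating over the response object line by line keeps memory use flat regardless of how long the completion runs.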
Next Steps
- Configuration for all server parameters and tuning options
- Models for loading different model architectures
- API Reference for the full endpoint reference
- OpenAI Compatible to use the OpenAI Python SDK with LayerScale
- Streaming for details on token streaming and the SSE event format