Getting Started
Get LayerScale running and send your first query in under 5 minutes.
Prerequisites
- Docker with GPU container support (NVIDIA Container Toolkit or AMD ROCm)
- An NVIDIA or AMD GPU with at least 8 GB VRAM (for 7B parameter models)
- A LayerScale license key (free, takes 10 seconds)
- A HuggingFace access token if using gated models (e.g. Llama)
Apple Silicon (Metal) support is coming soon.
1. Pull the Docker Image
NVIDIA (CUDA):

    docker pull layerscale/layerscale:latest

AMD (ROCm):

    docker pull layerscale/layerscale:latest-rocm

2. Start the Server
Run LayerScale with a model from HuggingFace:
NVIDIA (CUDA):

    docker run --rm \
      --gpus all \
      -p 8080:8080 \
      -e HF_TOKEN="$HF_TOKEN" \
      -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      layerscale/layerscale:latest \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8080

AMD (ROCm):

    docker run --rm \
      --device=/dev/kfd \
      --device=/dev/dri \
      --group-add=video \
      --group-add=render \
      --security-opt seccomp=unconfined \
      --ipc=host \
      --shm-size=16G \
      -p 8080:8080 \
      -e HF_TOKEN="$HF_TOKEN" \
      -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      layerscale/layerscale:latest-rocm \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8080

Don’t have a license key yet? Getting one is free and takes 10 seconds.
The server downloads the model on first run (this takes a few minutes depending on your connection). The -v flag mounts your local HuggingFace cache so subsequent starts skip the download.
To use a local model file instead, mount a volume:
NVIDIA (CUDA):

    docker run --rm \
      --gpus all \
      -p 8080:8080 \
      -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
      -v /path/to/models:/models \
      layerscale/layerscale:latest \
      --model /models/llama-3.1-8b-instruct \
      --host 0.0.0.0 \
      --port 8080

AMD (ROCm):

    docker run --rm \
      --device=/dev/kfd \
      --device=/dev/dri \
      --group-add=video \
      --group-add=render \
      --security-opt seccomp=unconfined \
      --ipc=host \
      --shm-size=16G \
      -p 8080:8080 \
      -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
      -v /path/to/models:/models \
      layerscale/layerscale:latest-rocm \
      --model /models/llama-3.1-8b-instruct \
      --host 0.0.0.0 \
      --port 8080

3. Verify the Server Is Running

    curl http://localhost:8080/v1/health

A 200 OK with an empty body means the model is loaded and the server is ready. A 503 means the model is still loading.
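
When scripting against the server, it can be convenient to poll the health endpoint until it reports ready instead of checking by hand. The following is a minimal sketch using only Python's standard library. The helper names `probe_health` and `wait_until_ready`, the 600-second timeout, and the 2-second polling interval are illustrative assumptions, not part of LayerScale.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/v1/health"  # health endpoint from this step


def probe_health(url=HEALTH_URL):
    """Return the HTTP status code of the health endpoint, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return None  # the server is not accepting connections yet


def wait_until_ready(probe=probe_health, timeout=600.0, interval=2.0, sleep=time.sleep):
    """Poll until the probe returns 200. Return True if ready, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if probe() == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_until_ready()` with no arguments polls the local server and returns `True` once the health check answers 200. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.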
4. Send Your First Query
Using the OpenAI-compatible chat completions endpoint:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 100
      }'

5. Try Streaming
Add "stream": true to get tokens as they are generated:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
          {"role": "user", "content": "Explain how transformers work in three sentences."}
        ],
        "max_tokens": 200,
        "stream": true
      }'

Tokens arrive as Server-Sent Events (SSE). Each event contains a JSON chunk with the next token in choices[0].delta.content.
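
To consume the stream from a program, each SSE line can be decoded and the token text read from `choices[0].delta.content`. The sketch below uses only Python's standard library. It assumes the server follows the OpenAI streaming convention of `data:`-prefixed JSON lines ending with a `data: [DONE]` sentinel. That framing is an assumption here, and the helper names `delta_from_sse_line` and `stream_chat` are hypothetical.

```python
import json
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"


def delta_from_sse_line(line):
    """Return the token text carried by one SSE line, or None if it has none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
        return None
    chunk = json.loads(data)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def stream_chat(prompt, model="meta-llama/Llama-3.1-8B-Instruct", max_tokens=200, url=URL):
    """Yield token strings from a streaming chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the HTTP response object iterates line by line
            text = delta_from_sse_line(raw.decode("utf-8"))
            if text:
                yield text
```

With the server running, `for token in stream_chat("Explain how transformers work in three sentences."): print(token, end="", flush=True)` would print the reply as it is generated.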
Next Steps
- Configuration for all server parameters and tuning options
- Models for loading different model architectures
- API Reference for the full endpoint reference
- OpenAI Compatible to use the OpenAI Python SDK with LayerScale
- Streaming for continuous data ingestion workflows