Getting Started

Get LayerScale running and send your first query in under 5 minutes.

Prerequisites

  • Docker with GPU container support (NVIDIA Container Toolkit or AMD ROCm)
  • An NVIDIA or AMD GPU with at least 8 GB VRAM (for 7B parameter models)
  • A LayerScale license key

Apple Silicon (Metal) support is coming soon.

1. Pull the Docker Image

# NVIDIA (CUDA)
docker pull layerscale/layerscale:latest
# AMD (ROCm)
docker pull layerscale/layerscale:latest-rocm

2. Start the Server

Run LayerScale with a model from HuggingFace:

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=your-license-key \
layerscale/layerscale:latest \
--model meta-llama/Llama-3.1-8B-Instruct

The server downloads the model on first run (this takes a few minutes depending on your connection). Subsequent starts use the cached model.

To use a local model file instead, mount a volume:

docker run --gpus all -p 8080:8080 \
-e LAYERSCALE_LICENSE_KEY=your-license-key \
-v /path/to/models:/models \
layerscale/layerscale:latest \
--model /models/llama-3.1-8b-instruct

3. Verify the Server Is Running

curl http://localhost:8080/v1/health

A 200 OK with an empty body means the model is loaded and the server is ready. A 503 means the model is still loading.
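If you are scripting the setup, you can poll the health endpoint until the model finishes loading. Here is a minimal Python sketch of that pattern using only the standard library; the stub server at the bottom stands in for LayerScale so the example runs on its own — in practice you would call wait_until_ready("http://localhost:8080") against the real server.

```python
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def wait_until_ready(base_url: str, timeout: float = 300.0, interval: float = 1.0) -> bool:
    """Poll GET {base_url}/v1/health until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/v1/health") as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.HTTPError, urllib.error.URLError):
            pass  # 503 (model still loading) or connection refused: retry
        time.sleep(interval)
    return False


# --- Stub server so this sketch is self-contained; not part of LayerScale ---
class _StubHealth(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200 if self.path == "/v1/health" else 404)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass


server = HTTPServer(("127.0.0.1", 0), _StubHealth)
threading.Thread(target=server.serve_forever, daemon=True).start()

ready = wait_until_ready(f"http://127.0.0.1:{server.server_port}", timeout=5, interval=0.1)
print(ready)  # True once /v1/health returns 200
server.shutdown()
```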

4. Send Your First Query

Using the OpenAI-compatible chat completions endpoint:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 100
}'
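The same request can be sent from Python with nothing beyond the standard library. This is a minimal sketch; the build_payload and chat helpers are illustrative names, not part of LayerScale, and any OpenAI-compatible client library would work equally well.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # adjust if you published a different port


def build_payload(messages, model="meta-llama/Llama-3.1-8B-Instruct", max_tokens=100):
    """Assemble the chat-completions request body."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def chat(messages, **kwargs):
    """POST to /v1/chat/completions and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_payload(messages, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the reply text here:
    return body["choices"][0]["message"]["content"]


print(json.dumps(build_payload(
    [{"role": "user", "content": "What is the capital of France?"}]), indent=2))
# With the server from step 2 running:
# print(chat([{"role": "user", "content": "What is the capital of France?"}]))
```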

5. Try Streaming

Add "stream": true to get tokens as they are generated:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"messages": [
{"role": "user", "content": "Explain how transformers work in three sentences."}
],
"max_tokens": 200,
"stream": true
}'

Tokens arrive as Server-Sent Events (SSE). Each event contains a JSON chunk with the next piece of generated text in choices[0].delta.content, and the stream ends with a final data: [DONE] event.
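A sketch of consuming that stream in Python: each SSE line starts with "data: " followed by a JSON chunk, and the stream ends with "data: [DONE]". The sample lines below are canned so the example runs offline; against a live server you would iterate over the HTTP response lines instead.

```python
import json


def extract_deltas(sse_lines):
    """Yield the text fragments found in choices[0].delta.content, if present."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]


# Canned sample of a short stream (shape follows the OpenAI-compatible format):
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Paris"}}]}',
    'data: {"choices": [{"delta": {"content": " is the capital."}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))  # Paris is the capital.
```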

Next Steps