
Getting Started

Get LayerScale running and send your first query in under 5 minutes.

Prerequisites

  • Docker with GPU container support (NVIDIA Container Toolkit or AMD ROCm)
  • An NVIDIA or AMD GPU with at least 8 GB of VRAM (enough for 7B-parameter models)
  • A LayerScale license key (free, takes 10 seconds)
  • A HuggingFace access token if using gated models (e.g. Llama)

Apple Silicon (Metal) support is coming soon.

1. Pull the Docker Image

NVIDIA (CUDA):

docker pull layerscale/layerscale:latest

AMD (ROCm):

docker pull layerscale/layerscale:latest-rocm

2. Start the Server

Run LayerScale with a model from HuggingFace:

docker run --rm \
  --gpus all \
  -p 8080:8080 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8080

Don’t have a license key yet? Get one here. It’s free and takes 10 seconds.
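
The docker run command above expands HF_TOKEN and LAYERSCALE_LICENSE_KEY from your shell. If either is unset, Docker passes an empty string into the container. A small preflight check can catch that before the container starts. This is a minimal sketch: require_vars is a hypothetical helper name, not part of LayerScale.

```shell
# Hypothetical helper, not part of LayerScale: fail early when a variable
# that the docker run command expands is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, before docker run (drop HF_TOKEN if the model is not gated):
# require_vars LAYERSCALE_LICENSE_KEY HF_TOKEN || exit 1
```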

The server downloads the model on first run, which can take a few minutes depending on your connection. The -v flag mounts your local HuggingFace cache into the container so subsequent starts skip the download.

To use a local model instead, mount the directory that contains it and pass the in-container path to --model:

docker run --rm \
  --gpus all \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct \
  --host 0.0.0.0 \
  --port 8080
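
Both commands above use --gpus all, which is the NVIDIA Container Toolkit path. On AMD, containers are usually given GPU access by passing the ROCm device nodes instead. The sketch below wraps that variant in a hypothetical function, run_layerscale_rocm. The --device and --group-add flags are standard Docker options commonly used with ROCm containers, but whether the latest-rocm image needs exactly this set is an assumption, so treat it as a starting point rather than the documented command.

```shell
# Sketch only: --device and --group-add are the usual way to expose AMD GPUs
# to a container. That the latest-rocm image needs exactly these flags is an
# assumption, not something stated in this guide.
run_layerscale_rocm() {
  docker run --rm \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    -p 8080:8080 \
    -e HF_TOKEN="$HF_TOKEN" \
    -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    layerscale/layerscale:latest-rocm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8080
}

# Usage:
# run_layerscale_rocm
```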

3. Verify the Server Is Running

curl http://localhost:8080/v1/health

A 200 OK with an empty body means the model is loaded and the server is ready. A 503 means the model is still loading.
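
Because the endpoint returns 503 while the model loads, a script that starts the server usually has to poll until it sees 200. A minimal sketch of such a loop follows. The function name wait_for_ready and its defaults (60 attempts, 5 seconds apart) are illustrative choices, not LayerScale settings.

```shell
# Poll the health endpoint until it returns 200 or the attempts run out.
# Usage: wait_for_ready [url] [max_attempts] [delay_seconds]
wait_for_ready() {
  url="${1:-http://localhost:8080/v1/health}"
  max_attempts="${2:-60}"
  delay="${3:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status.
    # curl prints 000 when it cannot connect at all (server not listening yet).
    status="$(curl -s -o /dev/null -w '%{http_code}' "$url")" || true
    if [ "$status" = "200" ]; then
      echo "ready after ${attempt} attempt(s)"
      return 0
    fi
    echo "attempt ${attempt}: HTTP ${status}, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "not ready after ${max_attempts} attempts" >&2
  return 1
}

# Usage:
# wait_for_ready && echo "send your first query"
```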

4. Send Your First Query

Send a request to the OpenAI-compatible chat completions endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
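
The response is a JSON document. Assuming it follows the OpenAI chat completion shape, with the reply text in choices[0].message.content, the answer can be pulled out with jq. This is a sketch under that assumption: first_reply is a hypothetical helper name, and jq must be installed separately.

```shell
# Assumes jq is installed and that the response follows the OpenAI chat
# completion shape, with the reply text in choices[0].message.content.
first_reply() {
  # -r: print the string without JSON quoting.
  # -e: exit non-zero if the field is missing or null.
  jq -er '.choices[0].message.content'
}

# Usage:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
#        "messages": [{"role": "user", "content": "What is the capital of France?"}],
#        "max_tokens": 100}' | first_reply
```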

5. Try Streaming

Add "stream": true to get tokens as they are generated:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain how transformers work in three sentences."}
    ],
    "max_tokens": 200,
    "stream": true
  }'

Tokens arrive as Server-Sent Events (SSE). Each event contains a JSON chunk with the next token in choices[0].delta.content.
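
To consume the stream in a script, strip the SSE framing so that only the JSON chunks remain. The sketch below does that with a hypothetical helper, sse_payloads. It assumes each event is a single data: line and that the stream ends with the OpenAI-style data: [DONE] sentinel, which this guide does not confirm for LayerScale.

```shell
# Read an SSE stream on stdin and print one JSON payload per line.
# Assumes single-line "data:" events and an OpenAI-style "[DONE]" sentinel.
sse_payloads() {
  # Drop carriage returns so CRLF-terminated events parse the same as LF ones.
  tr -d '\r' | while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      data:*)
        payload="${line#data:}"
        payload="${payload# }"   # the space after the colon is optional in SSE
        if [ "$payload" = "[DONE]" ]; then
          break
        fi
        printf '%s\n' "$payload"
        ;;
      # Blank separator lines and ":" comment lines are ignored.
    esac
  done
}

# Usage (assumes jq is installed): print tokens as they arrive.
# curl -sN http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
#        "messages": [{"role": "user", "content": "Hello"}],
#        "max_tokens": 50, "stream": true}' \
#   | sse_payloads | jq -rj '.choices[0].delta.content // empty'
```

The -N flag turns off curl's output buffering, so each chunk reaches the pipeline as soon as it is received.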

Next Steps