Getting Started

Get LayerScale running and send your first query in under 5 minutes.

Prerequisites

Docker with GPU container support (NVIDIA Container Toolkit or AMD ROCm)
An NVIDIA or AMD GPU with at least 8 GB VRAM (for 7B parameter models)
A LayerScale license key (free, takes 10 seconds)
A HuggingFace access token if using gated models (e.g. Llama)

Apple Silicon (Metal) support is coming soon.

1. Pull the Docker Image

NVIDIA (CUDA):

docker pull layerscale/layerscale:latest

AMD (ROCm):

docker pull layerscale/layerscale:latest-rocm
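
If you script the pull across mixed hardware, the tag choice above can be sketched as a small helper. The function names layerscale_image and detect_gpu_vendor are hypothetical and not part of LayerScale, and the detection (nvidia-smi on the PATH, or a /dev/kfd device node) is only a heuristic assumed for this example.

```shell
# Map a GPU vendor name to the image tag used in this guide.
layerscale_image() {
  case "$1" in
    nvidia) echo "layerscale/layerscale:latest" ;;
    amd)    echo "layerscale/layerscale:latest-rocm" ;;
    *)      echo "unsupported GPU vendor: $1" >&2; return 1 ;;
  esac
}

# Guess the vendor from the host (heuristic, assumed for this sketch):
# nvidia-smi on the PATH suggests NVIDIA, a /dev/kfd node suggests AMD ROCm.
detect_gpu_vendor() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo nvidia
  elif [ -e /dev/kfd ]; then
    echo amd
  else
    echo unknown
  fi
}

# Example usage:
#   docker pull "$(layerscale_image "$(detect_gpu_vendor)")"
```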

2. Start the Server

Run LayerScale with a model from HuggingFace:

NVIDIA (CUDA):

docker run --rm \
  --gpus all \
  -p 8080:8080 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8080

AMD (ROCm):

docker run --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --group-add=render \
  --security-opt seccomp=unconfined \
  --ipc=host \
  --shm-size=16G \
  -p 8080:8080 \
  -e HF_TOKEN="$HF_TOKEN" \
  -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  layerscale/layerscale:latest-rocm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8080

Don’t have a license key yet? Get one here; it’s free and takes 10 seconds.

The server downloads the model on first run, which can take a few minutes depending on your connection. The -v flag mounts your local HuggingFace cache into the container, so subsequent starts skip the download.
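
The docker run commands expand HF_TOKEN and LAYERSCALE_LICENSE_KEY from the calling shell, so an unset variable reaches the container as an empty string. A minimal pre-flight sketch follows; check_layerscale_env is a hypothetical helper name, not a LayerScale command. It treats a missing HF_TOKEN as a warning only, since the token is needed just for gated models.

```shell
# Return non-zero if the license key is missing; only warn about HF_TOKEN.
check_layerscale_env() {
  missing=0
  if [ -z "${LAYERSCALE_LICENSE_KEY:-}" ]; then
    echo "LAYERSCALE_LICENSE_KEY is not set" >&2
    missing=1
  fi
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "warning: HF_TOKEN is not set (only needed for gated models)" >&2
  fi
  return "$missing"
}

# Example usage, before starting the container:
#   check_layerscale_env && docker run ...
```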

To use a local model file instead, mount a volume:

NVIDIA (CUDA):

docker run --rm \
  --gpus all \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct \
  --host 0.0.0.0 \
  --port 8080

AMD (ROCm):

docker run --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add=video \
  --group-add=render \
  --security-opt seccomp=unconfined \
  --ipc=host \
  --shm-size=16G \
  -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY="$LAYERSCALE_LICENSE_KEY" \
  -v /path/to/models:/models \
  layerscale/layerscale:latest-rocm \
  --model /models/llama-3.1-8b-instruct \
  --host 0.0.0.0 \
  --port 8080

3. Verify the Server Is Running

curl http://localhost:8080/v1/health

A 200 OK with an empty body means the model is loaded and the server is ready. A 503 means the model is still loading.
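
Because the endpoint returns 503 while the model loads, a startup script usually needs to poll rather than check once. The sketch below shows one way to do that; wait_until_ready and probe_status are hypothetical helper names, and the default of 60 attempts at 5-second intervals is an arbitrary assumption to tune for your model size.

```shell
# Print the HTTP status code from the health endpoint ("000" if the connection fails).
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Poll until the health endpoint returns 200, or give up after MAX_ATTEMPTS tries.
# Usage: wait_until_ready URL [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_until_ready() {
  url=$1
  max_attempts=${2:-60}
  delay=${3:-5}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # "|| true" keeps the loop alive when curl exits non-zero (server not up yet).
    status=$(probe_status "$url") || true
    if [ "$status" = "200" ]; then
      echo "Server is ready."
      return 0
    fi
    echo "Attempt $attempt/$max_attempts: status $status, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  echo "Server did not become ready in time." >&2
  return 1
}

# Example usage:
#   wait_until_ready http://localhost:8080/v1/health
```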

4. Send Your First Query

Using the OpenAI-compatible chat completions endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
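
Hand-written JSON inside single quotes breaks as soon as the prompt itself contains a double quote. As a rough sketch, the request body can be generated by a small helper instead; build_chat_request and json_escape are hypothetical names, and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Limitation of this sketch: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a chat completions request body.
# Usage: build_chat_request MODEL PROMPT MAX_TOKENS
build_chat_request() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %s}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$3"
}

# Example usage (curl reads the request body from stdin with -d @-):
#   build_chat_request "meta-llama/Llama-3.1-8B-Instruct" "What is the capital of France?" 100 \
#     | curl http://localhost:8080/v1/chat/completions \
#         -H "Content-Type: application/json" -d @-
```

Since the endpoint is OpenAI-compatible, the reply text should appear under choices[0].message.content in the response; that field name follows the OpenAI response format and is assumed here rather than confirmed by this guide.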

5. Try Streaming

Add "stream": true to get tokens as they are generated:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain how transformers work in three sentences."}
    ],
    "max_tokens": 200,
    "stream": true
  }'

Tokens arrive as Server-Sent Events (SSE). Each event contains a JSON chunk with the next token in choices[0].delta.content.
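
To consume the stream from a script, strip the SSE framing so that each JSON chunk sits on its own line. The filter below is a minimal sketch: sse_payloads is a hypothetical helper name, and it assumes the stream ends with the OpenAI-style "data: [DONE]" sentinel, which this guide does not confirm for LayerScale.

```shell
# Read an SSE stream on stdin and print each event's JSON payload on its own line.
# Stops at the "data: [DONE]" sentinel (assumed, following the OpenAI convention).
sse_payloads() {
  cr=$(printf '\r')
  while IFS= read -r line; do
    line=${line%"$cr"}                # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Example usage (assumes jq is installed; -N disables curl's output buffering):
#   curl -N http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "stream": true}' \
#     | sse_payloads | jq -j '.choices[0].delta.content // empty'
```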

Next Steps

Configuration for all server parameters and tuning options
Models for loading different model architectures
API Reference for the full endpoint reference
OpenAI Compatible to use the OpenAI Python SDK with LayerScale
Streaming for continuous data ingestion workflows