Getting Started
Get LayerScale running and send your first query in under 5 minutes.
Prerequisites
- Docker with GPU container support (NVIDIA Container Toolkit or AMD ROCm)
- An NVIDIA or AMD GPU with at least 8 GB VRAM (for 7B parameter models)
- A LayerScale license key
Apple Silicon (Metal) support is coming soon.
1. Pull the Docker Image
```shell
# NVIDIA (CUDA)
docker pull layerscale/layerscale:latest

# AMD (ROCm)
docker pull layerscale/layerscale:latest-rocm
```

2. Start the Server
Run LayerScale with a model from Hugging Face:
```shell
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  layerscale/layerscale:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```

The server downloads the model on first run (this takes a few minutes depending on your connection). Subsequent starts use the cached model.
To use a local model file instead, mount a volume:
```shell
docker run --gpus all -p 8080:8080 \
  -e LAYERSCALE_LICENSE_KEY=your-license-key \
  -v /path/to/models:/models \
  layerscale/layerscale:latest \
  --model /models/llama-3.1-8b-instruct
```

3. Verify the Server Is Running
```shell
curl http://localhost:8080/v1/health
```

A 200 OK with an empty body means the model is loaded and the server is ready. A 503 means the model is still loading.
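When scripting startup (for example in CI), you may want to block until the server reports ready. A minimal sketch using only the Python standard library and the /v1/health endpoint described above; the function name, timeout, and polling interval are illustrative choices, not part of LayerScale itself:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8080/v1/health",
                     timeout=600, interval=5):
    """Poll the health endpoint until it returns 200 OK or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:  # model loaded, server ready
                    return True
        except urllib.error.URLError:
            pass  # 503 (model still loading) or connection refused
        time.sleep(interval)
    return False
```

The loop treats a 503 the same as a refused connection, so the same check works whether the container is still booting or the model is still loading.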
4. Send Your First Query
Using the OpenAI-compatible chat completions endpoint:
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100
  }'
```

5. Try Streaming
Add "stream": true to the request body to receive tokens as they are generated:
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain how transformers work in three sentences."}
    ],
    "max_tokens": 200,
    "stream": true
  }'
```

Tokens arrive as Server-Sent Events (SSE). Each event contains a JSON chunk with the next token in choices[0].delta.content.
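The SSE stream can also be consumed programmatically. A minimal sketch using only the Python standard library; it assumes the OpenAI-style framing described above (each event is a "data: " line carrying a JSON chunk, with a final "data: [DONE]" sentinel), which is the usual convention for OpenAI-compatible servers. The function names here are illustrative:

```python
import json
import urllib.request

def parse_sse_line(line):
    """Extract the token from one SSE 'data:' line, or None for non-content lines."""
    line = line.strip()
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":  # end-of-stream sentinel
        return None
    delta = json.loads(payload)["choices"][0]["delta"]
    return delta.get("content")

def stream_chat(prompt, base_url="http://localhost:8080/v1",
                model="meta-llama/Llama-3.1-8B-Instruct"):
    """Yield content tokens from a streaming chat completion as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # iterate the response line by line
            token = parse_sse_line(raw.decode())
            if token is not None:
                yield token

# Example usage (requires a running server):
# for token in stream_chat("Explain how transformers work in three sentences."):
#     print(token, end="", flush=True)
```

Iterating over the response object line by line keeps memory use flat regardless of how long the completion runs.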
Next Steps
- Configuration for all server parameters and tuning options
- Models for loading different model architectures
- API Reference for the full endpoint reference
- OpenAI Compatible to use the OpenAI Python SDK with LayerScale
- Streaming for details on token streaming and the SSE event format