# Troubleshooting

Common issues and solutions when running LayerScale.
## GPU Memory Errors

### "CUDA out of memory"
The model and KV cache don’t fit in VRAM.
Solutions:
- Reduce context size. Halving the context size roughly halves KV cache memory.

  ```
  --ctx-size 4096   # instead of 8192
  ```

- Use KV cache quantization. Reduces cache memory by 50-75%.

  ```
  --kv-cache-type q8_0   # 50% reduction
  --kv-cache-type q4_0   # 75% reduction
  ```

- Reduce pool size. Each context allocates its own KV cache.

  ```
  --pool-size 2   # instead of 4
  ```

- Use a smaller quantized model. A Q4_K_M model uses roughly half the VRAM of a Q8_0 model.

- Use partial GPU offloading. Keep some layers on CPU.

  ```
  --gpu-layers 20   # instead of 99
  ```

### "Failed to allocate context"
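To see why these knobs matter, the KV cache footprint can be estimated by hand. A minimal sketch, assuming an illustrative model shape (32 layers, 8 KV heads of dimension 128, fp16 cache); real values come from your model's metadata:

```shell
# Back-of-the-envelope KV cache sizing. The model shape below is illustrative
# (32 layers, 8 KV heads, head dim 128, fp16 cache = 2 bytes per element).
layers=32 kv_heads=8 head_dim=128 ctx=8192 bytes_per_elem=2 pool=4
per_ctx=$(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem ))  # K and V
total=$(( per_ctx * pool ))
echo "$(( per_ctx / 1048576 )) MiB per context, $(( total / 1048576 )) MiB for the pool"
# → 1024 MiB per context, 4096 MiB for the pool
```

Note how every term is linear except attention compute: halving `--ctx-size` or `--pool-size` halves the cache, and `q4_0` quantization shrinks `bytes_per_elem` from 2 to 0.5.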
The context pool is exhausted (all contexts are in use by active sessions).
Solutions:
- Increase pool size:

  ```
  --pool-size 8
  ```

- Delete unused sessions:

  ```
  DELETE /v1/sessions/{id}
  ```

- Check for leaked sessions (sessions created but never deleted).
### Monitoring GPU Memory
Check GPU memory usage on the host:
```
nvidia-smi
```

Inside Docker:

```
docker exec layerscale nvidia-smi
```

## Model Loading
### "Model file not found"
The model path is incorrect or the volume mount is missing.
```
# Verify the file exists on the host
ls -la /path/to/models/your-model

# Verify the mount is correct
docker run --rm -v /path/to/models:/models alpine ls -la /models/
```

### "Failed to load model"
Common causes:
- Corrupted model file. Re-download the model.
- Unsupported architecture. Check supported architectures.
- Insufficient memory. The model doesn’t fit in available RAM + VRAM. Check `nvidia-smi` and `free -h`.
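A quick way to check both memory budgets at once (the `--query-gpu` flags are standard `nvidia-smi` options; the guard keeps the snippet usable on hosts without a visible GPU driver):

```shell
# Check available system RAM and GPU memory side by side.
free -h | head -n 2
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
else
  echo "nvidia-smi not found (no GPU driver visible to this shell)"
fi
```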
### HuggingFace Download Failures
If model download fails:
- Check network access. The container needs internet access to reach huggingface.co.
- Disk space. Downloads go to `/home/layerscale/.cache`. Make sure there’s enough space.
- Rate limits. HuggingFace may rate-limit downloads. Mount a cache volume to avoid re-downloading:

  ```
  -v /path/to/cache:/home/layerscale/.cache
  ```

### Slow Model Loading
First load is always slower because the model needs to be loaded into GPU memory. Subsequent loads from cached model files are faster.
If loading is consistently slow:
- Check disk I/O. NFS or network-mounted storage is slower than local SSD.
- Ensure the model file is on a fast local disk.
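A rough way to sanity-check sequential read speed is to time `dd` on the disk that holds your models (run it from the model directory; the file just written may still sit in the page cache, so treat the number as an upper bound):

```shell
# Rough sequential-read check for the current disk. Writes a 64 MiB scratch
# file, reads it back, and prints dd's throughput line.
scratch=./.disk-check.tmp
dd if=/dev/zero of="$scratch" bs=1M count=64 2>/dev/null
sync
dd if="$scratch" of=/dev/null bs=1M 2>&1 | tail -n 1
rm -f "$scratch"
```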
## Connection Issues

### "Connection refused" on localhost:8080
The server hasn’t finished starting yet, or it crashed during startup.
- Check the logs:

  ```
  docker logs layerscale
  ```

- Wait for the health check. Model loading takes time, especially on first run with HuggingFace models.

- Check the port mapping:

  ```
  docker ps   # verify port 8080 is mapped
  ```

### SSE Connection Drops
SSE connections to `/events` may drop due to:

- Proxy timeouts. Increase `proxy_read_timeout` in your reverse proxy. SSE connections should stay open for hours.
- Client-side timeouts. Use `EventSource` or an equivalent client with automatic reconnection.
- Container restarts. Sessions are not persisted across container restarts. Reconnect and create a new session.
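If nginx sits in front of the server, an SSE-friendly location block looks roughly like this (a sketch; the upstream address and path are assumptions to adapt to your deployment):

```nginx
location /events {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;       # flush events to the client immediately
    proxy_read_timeout 24h;    # keep long-lived SSE streams open
}
```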
### WebSocket Connection Failures

- Reverse proxy not configured for WebSocket. You need the `Upgrade` and `Connection` headers. See Deployment.
- Maximum connections reached. Each session supports up to 100 WebSocket connections.
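For WebSockets behind nginx, the upgrade headers can be passed along these lines (a sketch; the `/ws/` location path is an assumption, substitute your actual WebSocket endpoint):

```nginx
location /ws/ {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 1h;
}
```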
## Performance

### Slow Token Generation
- Check GPU utilization:

  ```
  nvidia-smi -l 1   # update every second
  ```

  If GPU utilization is low, the bottleneck may be CPU tokenization or I/O.

- Enable flash attention (enabled by default):

  ```
  --flash-attn on
  ```

- Reduce context size. Attention scales quadratically with context length. If you don’t need 32K context, use 8K or 16K.

- Use full GPU offloading. CPU layers are significantly slower:

  ```
  --gpu-layers 99
  ```

### High Latency on First Request
The first request after server start is slower because:
- Context pool is being warmed up
- GPU compute kernels are being optimized (if applicable)
Subsequent requests will be faster. This is normal.
### Slow Streaming Data Processing

If the stream processor falls behind (`queue_size` keeps growing):
- Push less frequently. Batch multiple data points per push instead of pushing one at a time.
- Reduce Flash Query count. Each query adds evaluation time after every batch.
- Lower Flash Query max_tokens. Shorter answers evaluate faster.
- Increase context size. More room for the sliding window reduces reprocessing frequency.
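The batching advice above can be sketched as follows; the `{"data": [...]}` body shape and the push endpoint in the comment are assumptions for illustration, so check the streaming API reference for the real schema:

```shell
# Build one push body carrying 20 points instead of 20 one-point pushes.
# The {"data": [...]} schema is an assumption, shown only for shape.
batch=$(python3 - <<'EOF'
import json
points = [{"t": i, "value": i * 0.5} for i in range(20)]
print(json.dumps({"data": points}))
EOF
)

# One request with the whole batch (hypothetical endpoint, for illustration):
# curl -s -X POST http://localhost:8080/v1/sessions/{id}/stream -d "$batch"
echo "$batch" | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d['data']), 'points in one push')"
# → 20 points in one push
```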
### Comparing Performance
To measure baseline token generation speed:
```
# Non-streaming, measure total time
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
    "messages": [{"role": "user", "content": "Count from 1 to 100"}],
    "max_tokens": 500
  }' | python3 -c "
import sys, json
r = json.load(sys.stdin)
tokens = r['usage']['completion_tokens']
print(f'{tokens} tokens generated')
"
```

## Common Error Codes
| Code | Meaning | Solution |
|---|---|---|
| 400 | Bad request | Check request body format. Ensure required fields are present. |
| 401 | Unauthorized | Check your bearer token or API key. |
| 404 | Not found | Session ID is invalid or session was deleted. |
| 409 | Conflict | Session is busy with another request. Wait and retry. |
| 503 | Service unavailable | Server is starting or model is loading. Wait for health check to pass. |
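For 409 and 503, retrying with exponential backoff is usually enough. A minimal sketch (substitute your real curl invocation for the placeholder command in the example comment):

```shell
# Retry a command with exponential backoff; useful for 409 (session busy)
# and 503 (model still loading) responses.
retry() {
  attempt=1
  max_attempts=5
  delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example (hypothetical call):
# retry curl -sf http://localhost:8080/v1/health
```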
## Docker Issues

### Container Exits Immediately
Check the logs:

```
docker logs layerscale
```

Common causes:

- Missing `--model` flag
- Invalid license key
- GPU not accessible (missing `--gpus all`)
### "nvidia-container-cli: initialization error"
The NVIDIA Container Toolkit is not installed or the driver is not loaded.
```
# Check driver
nvidia-smi

# Reinstall toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### Permission Denied on Model File
The LayerScale container runs as a non-root user. Ensure the model file is readable:
```
chmod 644 /path/to/models/your-model
```

## Getting Help
If you’re stuck:
- Check the server logs (`docker logs layerscale`)
- Verify GPU access (`docker run --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi`)
- Test the health endpoint (`curl http://localhost:8080/v1/health`)
- Try a smaller model first to rule out memory issues