
Troubleshooting

Common issues and solutions when running LayerScale.

GPU Memory Errors

“CUDA out of memory”

The model and KV cache don’t fit in VRAM.

Solutions:

  1. Reduce context size. Halving the context size roughly halves KV cache memory.
Terminal window
--ctx-size 4096 # instead of 8192
  2. Use KV cache quantization. Reduces cache memory by 50-75%.
Terminal window
--kv-cache-type q8_0 # 50% reduction
--kv-cache-type q4_0 # 75% reduction
  3. Reduce pool size. Each context allocates its own KV cache.
Terminal window
--pool-size 2 # instead of 4
  4. Use a smaller quantized model. A Q4_K_M model uses roughly half the VRAM of a Q8_0 model.

  5. Use partial GPU offloading. Keep some layers on CPU.

Terminal window
--gpu-layers 20 # instead of 99
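
To sanity-check whether a given context size fits, KV cache memory can be estimated from the model's layer count and KV-head geometry. A rough sketch using the standard grouped-query-attention formula; the model dimensions below are illustrative, not read from LayerScale:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#               * context_length * bytes_per_element * pool_size contexts.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx,
                   bytes_per_elem=2.0, pool_size=1):
    return int(2 * n_layers * n_kv_heads * head_dim * ctx
               * bytes_per_elem * pool_size)

# Example: a hypothetical 32-layer model, 8 KV heads, head_dim 128,
# 8192-token context, f16 cache (2 bytes/element), pool of 4 contexts.
full = kv_cache_bytes(32, 8, 128, 8192, bytes_per_elem=2.0, pool_size=4)
print(f"f16 cache:  {full / 2**30:.2f} GiB")   # f16 cache:  4.00 GiB

# q4_0 stores roughly 0.5 bytes/element (the ~75% reduction above).
quant = kv_cache_bytes(32, 8, 128, 8192, bytes_per_elem=0.5, pool_size=4)
print(f"q4_0 cache: {quant / 2**30:.2f} GiB")  # q4_0 cache: 1.00 GiB
```

Halving `ctx` or `pool_size`, or dropping `bytes_per_elem`, scales the total linearly, which is why the options above compound.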

“Failed to allocate context”

The context pool is exhausted (all contexts are in use by active sessions).

Solutions:

  • Increase pool size: --pool-size 8
  • Delete unused sessions: DELETE /v1/sessions/{id}
  • Check for leaked sessions (sessions created but never deleted)
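
Hunting for leaked sessions amounts to listing active sessions and flagging the ones that have been idle too long. The filtering logic can be sketched as below; the `last_used_at` field (and any listing endpoint that returns it) is an assumption about the API shape, while `DELETE /v1/sessions/{id}` is the documented delete call:

```python
import time

def stale_sessions(sessions, max_idle_s=3600, now=None):
    """Return IDs of sessions idle longer than max_idle_s.

    Each session is a dict with 'id' and 'last_used_at' (Unix seconds);
    the field name is hypothetical -- adapt it to the actual response.
    """
    now = time.time() if now is None else now
    return [s["id"] for s in sessions if now - s["last_used_at"] > max_idle_s]

sessions = [
    {"id": "a1", "last_used_at": 1_000},   # idle for a long time
    {"id": "b2", "last_used_at": 9_000},   # recently used
]
print(stale_sessions(sessions, max_idle_s=3600, now=10_000))  # ['a1']
```

Each returned ID can then be passed to DELETE /v1/sessions/{id} to free its context.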

Monitoring GPU Memory

Check GPU memory usage on the host:

Terminal window
nvidia-smi

Inside Docker:

Terminal window
docker exec layerscale nvidia-smi

Model Loading

“Model file not found”

The model path is incorrect or the volume mount is missing.

Terminal window
# Verify the file exists on the host
ls -la /path/to/models/your-model
# Verify the mount is correct
docker run --rm -v /path/to/models:/models alpine ls -la /models/

“Failed to load model”

Common causes:

  • Corrupted model file. Re-download the model.
  • Unsupported architecture. Check supported architectures.
  • Insufficient memory. The model doesn’t fit in available RAM + VRAM. Check nvidia-smi and free -h.

HuggingFace Download Failures

If model download fails:

  1. Check network access. The container needs internet access to reach huggingface.co.
  2. Check disk space. Downloads go to /home/layerscale/.cache; make sure there’s enough space.
  3. Rate limits. HuggingFace may rate-limit downloads. Mount a cache volume to avoid re-downloading:
Terminal window
-v /path/to/cache:/home/layerscale/.cache

Slow Model Loading

The first load is always slower because the model file must be read from disk and copied into GPU memory. Subsequent loads are faster once the file is in the OS cache.

If loading is consistently slow:

  • Check disk I/O. NFS or network-mounted storage is slower than local SSD.
  • Ensure the model file is on a fast local disk.

Connection Issues

“Connection refused” on localhost:8080

The server hasn’t finished starting yet, or it crashed during startup.

  1. Check logs:
Terminal window
docker logs layerscale
  2. Wait for the health check. Model loading takes time, especially on first run with HuggingFace models.

  3. Check the port mapping:

Terminal window
docker ps # verify port 8080 is mapped
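
The wait-for-health step can be automated with a small polling loop. A sketch using only the standard library, against the /v1/health endpoint mentioned later on this page; the attempt count and delay are arbitrary defaults:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url="http://localhost:8080/v1/health",
                    attempts=30, delay_s=2.0, probe=None):
    """Poll the health endpoint until it returns 200 or attempts run out."""
    if probe is None:
        def probe():
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False
```

Injecting a `probe` callable keeps the loop testable; in scripts, just call `wait_for_health()` before sending the first request.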

SSE Connection Drops

SSE connections to /events may drop due to:

  • Proxy timeouts. Increase proxy_read_timeout in your reverse proxy. SSE connections should stay open for hours.
  • Client-side timeouts. Use EventSource or equivalent with automatic reconnection.
  • Container restarts. Sessions are not persisted across container restarts. Reconnect and create a new session.
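
For nginx, the relevant timeout lives on the proxied location. An illustrative fragment (the location path and upstream address are placeholders for your deployment):

```nginx
location /events {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;         # deliver SSE events immediately
    proxy_read_timeout 24h;      # keep long-lived streams open
}
```

Disabling proxy buffering matters as much as the timeout: buffered SSE events can appear to stall even when the connection is healthy.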

WebSocket Connection Failures

  • Reverse proxy not configured for WebSocket. The proxy must forward the Upgrade and Connection headers. See Deployment.
  • Maximum connections reached. Each session supports up to 100 WebSocket connections.
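
For nginx specifically, the upgrade headers look like this (the location path is a placeholder; see Deployment for the full configuration):

```nginx
location /v1/sessions/ {
    proxy_pass http://localhost:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```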

Performance

Slow Token Generation

  1. Check GPU utilization:
Terminal window
nvidia-smi -l 1 # update every second

If GPU utilization is low, the bottleneck may be CPU tokenization or I/O.

  2. Verify flash attention is enabled (it is on by default):
Terminal window
--flash-attn on
  3. Reduce context size. Attention scales quadratically with context length. If you don’t need 32K context, use 8K or 16K.

  4. Use full GPU offloading. CPU layers are significantly slower:

Terminal window
--gpu-layers 99

High Latency on First Request

The first request after server start is slower because:

  • Context pool is being warmed up
  • GPU compute kernels are being optimized (if applicable)

Subsequent requests will be faster. This is normal.

Slow Streaming Data Processing

If the stream processor falls behind (queue_size keeps growing):

  1. Push less frequently. Batch multiple data points per push instead of pushing one at a time.
  2. Reduce Flash Query count. Each query adds evaluation time after every batch.
  3. Lower Flash Query max_tokens. Shorter answers evaluate faster.
  4. Increase context size. More room for the sliding window reduces reprocessing frequency.
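
The “push less frequently” advice amounts to accumulating data points and flushing them in groups. A minimal sketch; the `push` callable stands in for whatever API call delivers data to the stream processor:

```python
class Batcher:
    """Accumulate data points and flush them in groups of batch_size."""

    def __init__(self, push, batch_size=50):
        self.push = push          # callable taking a list of points
        self.batch_size = batch_size
        self.buf = []

    def add(self, point):
        self.buf.append(point)
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.push(self.buf)
            self.buf = []

batches = []
b = Batcher(batches.append, batch_size=4)
for i in range(10):
    b.add(i)
b.flush()  # send the final partial batch
print([len(x) for x in batches])  # [4, 4, 2]
```

Ten individual pushes become three, so each Flash Query evaluation covers more data per invocation.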

Comparing Performance

To measure baseline token generation speed:

Terminal window
# Non-streaming, measure total time
time curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"messages": [{"role": "user", "content": "Count from 1 to 100"}],
"max_tokens": 500
}' | python3 -c "
import sys, json
r = json.load(sys.stdin)
tokens = r['usage']['completion_tokens']
print(f'{tokens} tokens generated')
"

Common Error Codes

Code  Meaning              Solution
400   Bad request          Check request body format. Ensure required fields are present.
401   Unauthorized         Check your bearer token or API key.
404   Not found            Session ID is invalid or session was deleted.
409   Conflict             Session is busy with another request. Wait and retry.
503   Service unavailable  Server is starting or model is loading. Wait for health check to pass.
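
Both 409 and 503 are transient, so clients typically retry with backoff rather than fail. A sketch; `send` stands in for the actual HTTP call and returns a status code:

```python
import time

def retry_request(send, attempts=5, base_delay_s=0.5):
    """Retry while the server reports 409 (busy) or 503 (starting up).

    Exponential backoff doubles the delay after each transient failure;
    any other status code is returned to the caller immediately.
    """
    delay = base_delay_s
    status = None
    for i in range(attempts):
        status = send()
        if status not in (409, 503):
            return status
        if i < attempts - 1:
            time.sleep(delay)
            delay *= 2
    return status

statuses = iter([409, 503, 200])
print(retry_request(lambda: next(statuses), base_delay_s=0))  # 200
```

4xx codes other than 409 are deliberately not retried: a 400 or 401 will fail the same way every time until the request itself is fixed.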

Docker Issues

Container Exits Immediately

Check logs:

Terminal window
docker logs layerscale

Common causes:

  • Missing --model flag
  • Invalid license key
  • GPU not accessible (missing --gpus all)

“nvidia-container-cli: initialization error”

The NVIDIA Container Toolkit is not installed or the driver is not loaded.

Terminal window
# Check driver
nvidia-smi
# Reinstall toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Permission Denied on Model File

The LayerScale container runs as a non-root user. Ensure the model file is readable:

Terminal window
chmod 644 /path/to/models/your-model

Getting Help

If you’re stuck:

  1. Check the server logs (docker logs layerscale)
  2. Verify GPU access (docker run --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi)
  3. Test the health endpoint (curl http://localhost:8080/v1/health)
  4. Try a smaller model first to rule out memory issues