# Troubleshooting

Common issues and solutions when running LayerScale.
## GPU Memory Errors

### "CUDA out of memory"
The model and KV cache don’t fit in VRAM.
Solutions:
- Reduce context size. Halving the context size roughly halves KV cache memory.

  ```
  --ctx-size 4096   # instead of 8192
  ```

- Use KV cache quantization. Reduces cache memory by 50-75%.

  ```
  --kv-cache-type q8_0   # 50% reduction
  --kv-cache-type q4_0   # 75% reduction
  ```

- Reduce pool size. Each context allocates its own KV cache.

  ```
  --pool-size 2   # instead of 4
  ```

- Use a smaller quantized model. A Q4_K_M model uses roughly half the VRAM of a Q8_0 model.

- Use partial GPU offloading. Keep some layers on CPU.

  ```
  --gpu-layers 20   # instead of 99
  ```

### "Failed to allocate context"
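To see why these knobs matter, the KV cache footprint can be estimated by hand. A minimal sketch, assuming an illustrative model shape (32 layers, 8 KV heads of dimension 128, fp16 cache); real values come from your model's metadata:

```shell
# Back-of-the-envelope KV cache sizing. The model shape below is illustrative
# (32 layers, 8 KV heads, head dim 128, fp16 cache = 2 bytes per element).
layers=32 kv_heads=8 head_dim=128 ctx=8192 bytes_per_elem=2 pool=4
per_ctx=$(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem ))  # K and V
total=$(( per_ctx * pool ))
echo "$(( per_ctx / 1048576 )) MiB per context, $(( total / 1048576 )) MiB for the pool"
# → 1024 MiB per context, 4096 MiB for the pool
```

Note how every term is linear except attention compute: halving `--ctx-size` or `--pool-size` halves the cache, and `q4_0` quantization shrinks `bytes_per_elem` from 2 to 0.5.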
The context pool is exhausted (all contexts are in use by active sessions).
Solutions:
- Increase pool size:

  ```
  --pool-size 8
  ```

- Delete unused sessions:

  ```
  DELETE /v1/sessions/{id}
  ```

- Check for leaked sessions (sessions created but never deleted).
### Monitoring GPU Memory
Check GPU memory usage on the host:
```
nvidia-smi
```

Inside Docker:

```
docker exec layerscale nvidia-smi
```

## Model Loading
### "Model file not found"
The model path is incorrect or the volume mount is missing.
```
# Verify the file exists on the host
ls -la /path/to/models/your-model

# Verify the mount is correct
docker run --rm -v /path/to/models:/models alpine ls -la /models/
```

### "Failed to load model"
Common causes:
- Corrupted model file. Re-download the model.
- Unsupported architecture. Check supported architectures.
- Insufficient memory. The model doesn’t fit in available RAM + VRAM. Check `nvidia-smi` and `free -h`.
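A quick way to check both memory budgets at once (the `--query-gpu` flags are standard `nvidia-smi` options; the guard keeps the snippet usable on hosts without a visible GPU driver):

```shell
# Check available system RAM and GPU memory side by side.
free -h | head -n 2
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
else
  echo "nvidia-smi not found (no GPU driver visible to this shell)"
fi
```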
### HuggingFace Download Failures
If model download fails:
- Check network access. The container needs internet access to reach huggingface.co.
- Disk space. Downloads go to `/home/layerscale/.cache`. Make sure there’s enough space.
- Rate limits. HuggingFace may rate-limit downloads. Mount a cache volume to avoid re-downloading:

  ```
  -v /path/to/cache:/home/layerscale/.cache
  ```

### Slow Model Loading
First load is always slower because the model needs to be loaded into GPU memory. Subsequent loads from cached model files are faster.
If loading is consistently slow:
- Check disk I/O. NFS or network-mounted storage is slower than local SSD.
- Ensure the model file is on a fast local disk.
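A rough way to sanity-check sequential read speed is to time `dd` on the disk that holds your models (run it from the model directory; the file just written may still sit in the page cache, so treat the number as an upper bound):

```shell
# Rough sequential-read check for the current disk. Writes a 64 MiB scratch
# file, reads it back, and prints dd's throughput line.
scratch=./.disk-check.tmp
dd if=/dev/zero of="$scratch" bs=1M count=64 2>/dev/null
sync
dd if="$scratch" of=/dev/null bs=1M 2>&1 | tail -n 1
rm -f "$scratch"
```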
## Connection Issues

### "Connection refused" on localhost:8080
The server hasn’t finished starting yet, or it crashed during startup.
- Check the logs:

  ```
  docker logs layerscale
  ```

- Wait for the health check. Model loading takes time, especially on first run with HuggingFace models.

- Check the port mapping:

  ```
  docker ps   # verify port 8080 is mapped
  ```

### SSE Connection Drops
SSE connections to `/events` may drop due to:

- Proxy timeouts. Increase `proxy_read_timeout` in your reverse proxy. SSE connections should stay open for hours.
- Client-side timeouts. Use `EventSource` or an equivalent client with automatic reconnection.
- Container restarts. Sessions are not persisted across container restarts. Reconnect and create a new session.
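If nginx sits in front of the server, an SSE-friendly location block looks roughly like this (a sketch; the upstream address and path are assumptions to adapt to your deployment):

```nginx
location /events {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;       # flush events to the client immediately
    proxy_read_timeout 24h;    # keep long-lived SSE streams open
}
```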
### WebSocket Connection Failures

- Reverse proxy not configured for WebSocket. You need the `Upgrade` and `Connection` headers. See Deployment.
- Maximum connections reached. Each session supports up to 100 WebSocket connections.
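For WebSockets behind nginx, the upgrade headers can be passed along these lines (a sketch; the `/ws/` location path is an assumption, substitute your actual WebSocket endpoint):

```nginx
location /ws/ {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 1h;
}
```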
## Performance

### Slow Token Generation
- Check GPU utilization:

  ```
  nvidia-smi -l 1   # update every second
  ```

  If GPU utilization is low, the bottleneck may be CPU tokenization or I/O.

- Enable flash attention (enabled by default):

  ```
  --flash-attn on
  ```

- Reduce context size. Attention scales quadratically with context length. If you don’t need 32K context, use 8K or 16K.

- Use full GPU offloading. CPU layers are significantly slower:

  ```
  --gpu-layers 99
  ```

### High Latency on First Request
The first request after server start is slower because:
- Context pool is being warmed up
- GPU compute kernels are being optimized (if applicable)
Subsequent requests will be faster. This is normal.
### Slow Streaming Data Processing

If the stream processor falls behind (`queue_size` keeps growing):
- Push less frequently. Batch multiple data points per push instead of pushing one at a time.
- Reduce Flash Query count. Each query adds evaluation time after every batch.
- Lower Flash Query max_tokens. Shorter answers evaluate faster.
- Increase context size. More room for the sliding window reduces reprocessing frequency.
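The batching advice above can be sketched as follows; the `{"data": [...]}` body shape and the push endpoint in the comment are assumptions for illustration, so check the streaming API reference for the real schema:

```shell
# Build one push body carrying 20 points instead of 20 one-point pushes.
# The {"data": [...]} schema is an assumption, shown only for shape.
batch=$(python3 - <<'EOF'
import json
points = [{"t": i, "value": i * 0.5} for i in range(20)]
print(json.dumps({"data": points}))
EOF
)

# One request with the whole batch (hypothetical endpoint, for illustration):
# curl -s -X POST http://localhost:8080/v1/sessions/{id}/stream -d "$batch"
echo "$batch" | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d['data']), 'points in one push')"
# → 20 points in one push
```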
### Comparing Performance
To measure baseline token generation speed:
```
# Non-streaming, measure total time
time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
    "messages": [{"role": "user", "content": "Count from 1 to 100"}],
    "max_tokens": 500
  }' | python3 -c "
import sys, json
r = json.load(sys.stdin)
tokens = r['usage']['completion_tokens']
print(f'{tokens} tokens generated')
"
```

## Common Error Codes
| Code | Meaning | Solution |
|---|---|---|
| 400 | Bad request | Check request body format. Ensure required fields are present. |
| 401 | Unauthorized | Check your bearer token or API key. |
| 404 | Not found | Session ID is invalid or session was deleted. |
| 409 | Conflict | Session is busy with another request. Wait and retry. |
| 503 | Service unavailable | Server is starting or model is loading. Wait for health check to pass. |
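For 409 and 503, retrying with exponential backoff is usually enough. A minimal sketch (substitute your real curl invocation for the placeholder command in the example comment):

```shell
# Retry a command with exponential backoff; useful for 409 (session busy)
# and 503 (model still loading) responses.
retry() {
  attempt=1
  max_attempts=5
  delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example (hypothetical call):
# retry curl -sf http://localhost:8080/v1/health
```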
## Docker Issues

### Container Exits Immediately
Check the logs:

```
docker logs layerscale
```

Common causes:

- Missing `--model` flag
- Invalid license key
- GPU not accessible (missing `--gpus all`)
### "nvidia-container-cli: initialization error"
The NVIDIA Container Toolkit is not installed or the driver is not loaded.
```
# Check driver
nvidia-smi

# Reinstall toolkit
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

### Permission Denied on Model File
The LayerScale container runs as a non-root user. Ensure the model file is readable:
```
chmod 644 /path/to/models/your-model
```

## Getting Help
If you’re stuck:
- Check the server logs (`docker logs layerscale`)
- Verify GPU access (`docker run --gpus all nvidia/cuda:12.9.1-base-ubuntu22.04 nvidia-smi`)
- Test the health endpoint (`curl http://localhost:8080/v1/health`)
- Try a smaller model first to rule out memory issues