API Reference
Complete reference for all LayerScale API endpoints.
Base URL
    http://localhost:8080

Authentication
Authentication is enforced when LayerScale runs with a Pro license key. The free tier accepts requests without authentication. The /health, /v1/health, and /metrics endpoints are always accessible without authentication.
When authentication is required, pass your license key as a bearer token:
    Authorization: Bearer your-license-key

Or, using the Anthropic-style header:

    x-api-key: your-license-key

Health
GET /v1/health
Check if the server is ready to accept requests.
    curl http://localhost:8080/v1/health

Response 200: Empty body. The model is loaded and the server is ready.
Response 503: Server is still loading the model.
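A client that starts alongside the server may want to wait for the 200 before sending work. The following is a minimal sketch of that polling loop, not part of LayerScale itself: the function names, the retry count, and the delay are illustrative assumptions, and a connection failure is treated the same as "not ready yet".

```python
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080"  # base URL from this reference


def fetch_health_status(base_url: str = BASE_URL) -> int:
    """Return the HTTP status of GET /v1/health, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 503; the status code is still what we want
    except OSError:
        return 0  # connection refused or timed out: treat as not ready (assumption)


def wait_until_ready(fetch=fetch_health_status, attempts=30, delay_s=1.0, sleep=time.sleep) -> bool:
    """Poll until the health check returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch() == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running server.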
Chat Completions (OpenAI-Compatible)
POST /v1/chat/completions
Generate a chat completion. Fully compatible with the OpenAI API format.
Request:
    {
      "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"}
      ],
      "max_tokens": 100,
      "temperature": 0.7,
      "top_p": 0.9,
      "stream": false,
      "stop": ["\n\n"],
      "tools": []
    }

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier (accepted but not used for routing) |
| messages | array | Yes | | Array of message objects with role and content |
| max_tokens | integer | No | 256 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature (0.0 to 2.0) |
| top_p | float | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop | array or string | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions for function calling |
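The parameter table can be turned into a small request builder. This is a sketch under stated assumptions: `build_chat_request` is a hypothetical helper, the client-side checks mirror the table but are not confirmed server behavior, and the bearer header is only attached when a key is supplied (the free tier accepts unauthenticated requests).

```python
import json

BASE_URL = "http://localhost:8080"  # base URL from this reference

# Optional parameters listed in the table above.
ALLOWED_OPTIONS = {"model", "max_tokens", "temperature", "top_p", "stream", "stop", "tools"}


def build_chat_request(messages, api_key=None, **options):
    """Return (url, headers, body_bytes) for POST /v1/chat/completions."""
    if not messages:
        raise ValueError("messages is required and must not be empty")
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    temperature = options.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the server runs with a Pro license key
        headers["Authorization"] = f"Bearer {api_key}"
    body = {"messages": messages, **options}  # omitted options fall back to server defaults
    return f"{BASE_URL}/v1/chat/completions", headers, json.dumps(body).encode("utf-8")
```

The returned tuple can be handed to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers, method="POST")`.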
Response (non-streaming):
    {
      "id": "chatcmpl-abc123",
      "object": "chat.completion",
      "created": 1711000000,
      "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "2 + 2 equals 4."
          },
          "finish_reason": "stop"
        }
      ],
      "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 8,
        "total_tokens": 33
      }
    }

Response (streaming):
Each SSE event contains a chunk:
    data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"2"},"finish_reason":null}]}

    data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}

    data: [DONE]

Tool calling response:
When the model invokes a tool, the response includes tool_calls instead of text content:
    {
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": null,
            "tool_calls": [
              {
                "id": "call_abc123",
                "type": "function",
                "function": {
                  "name": "get_weather",
                  "arguments": "{\"location\": \"San Francisco\"}"
                }
              }
            ]
          },
          "finish_reason": "tool_calls"
        }
      ]
    }

Messages (Anthropic-Compatible)
POST /v1/messages
Generate a message response. Fully compatible with the Anthropic API format.
Request:
    {
      "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
      "max_tokens": 1024,
      "system": "You are a helpful assistant.",
      "messages": [
        {"role": "user", "content": "What is 2 + 2?"}
      ],
      "temperature": 0.7,
      "stream": false,
      "tools": []
    }

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier |
| messages | array | Yes | | Array of message objects |
| max_tokens | integer | Yes | | Maximum tokens to generate |
| system | string | No | | System prompt |
| temperature | float | No | 0.7 | Sampling temperature |
| top_p | float | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop_sequences | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions |
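Compared with the chat completions request, this format moves the system prompt into a top-level `system` field, requires `max_tokens`, and names the stop parameter `stop_sequences`. The sketch below converts an OpenAI-style message list into a `/v1/messages` body on the client side. `to_messages_request` is a hypothetical helper, and joining several system messages with blank lines is an assumption of this example.

```python
def to_messages_request(chat_messages, max_tokens, **options):
    """Convert OpenAI-style messages into a POST /v1/messages request body."""
    if not isinstance(max_tokens, int) or max_tokens < 1:
        raise ValueError("max_tokens is required and must be a positive integer")
    # System messages become the top-level "system" field.
    system_parts = [m["content"] for m in chat_messages if m["role"] == "system"]
    body = {
        "max_tokens": max_tokens,
        "messages": [m for m in chat_messages if m["role"] != "system"],
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    # "stop" (string or array) maps to the array-only "stop_sequences".
    if "stop" in options:
        stop = options.pop("stop")
        body["stop_sequences"] = [stop] if isinstance(stop, str) else list(stop)
    body.update(options)  # temperature, top_p, stream, tools pass through unchanged
    return body
```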
Response (non-streaming):
    {
      "id": "msg_abc123",
      "type": "message",
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "2 + 2 equals 4."
        }
      ],
      "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
      "stop_reason": "end_turn",
      "usage": {
        "input_tokens": 25,
        "output_tokens": 8
      }
    }

Response (streaming):
SSE events follow the Anthropic streaming format:
    event: message_start
    data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"meta-llama/Llama-4-Maverick-17B-128E-Original","usage":{"input_tokens":25}}}

    event: content_block_start
    data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

    event: content_block_delta
    data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"2 + 2"}}

    event: content_block_delta
    data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" equals 4."}}

    event: content_block_stop
    data: {"type":"content_block_stop","index":0}

    event: message_delta
    data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}

    event: message_stop
    data: {"type":"message_stop"}

Tool use response:
    {
      "content": [
        {
          "type": "tool_use",
          "id": "toolu_abc123",
          "name": "get_weather",
          "input": {"location": "San Francisco"}
        }
      ],
      "stop_reason": "tool_use"
    }

Models
GET /v1/models
List available models.
    curl http://localhost:8080/v1/models

Response:

    {
      "object": "list",
      "data": [
        {
          "id": "Llama-4-Maverick-17B-128E-Original",
          "object": "model",
          "created": 0,
          "owned_by": "layerscale"
        }
      ]
    }

The id is derived from the loaded model file name (without extension).
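As an illustration of that naming rule, the sketch below predicts the id from a model path and reads the ids out of a list response. Both function names are hypothetical, the `.gguf` extension in the usage note is only an example, and stripping a single trailing extension with `Path.stem` is an assumption about how the server derives the name.

```python
from pathlib import Path


def model_id_from_path(model_path: str) -> str:
    """Predict the reported id: the file name with its last extension removed."""
    return Path(model_path).stem


def listed_model_ids(models_response: dict) -> list:
    """Extract the ids from a GET /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]
```

For example, `model_id_from_path("/models/Llama-4-Maverick-17B-128E-Original.gguf")` yields the id shown in the response above.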
Sessions
POST /v1/sessions/init
Create a new session. LayerScale’s stateful inference engine maintains context across multiple requests without re-processing the prompt.
Request:
    {
      "type": "ohlcv",
      "prompt": "You are a financial analyst. Analyze the following market data as it arrives.",
      "flash": [
        {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
        {"query": "What is the current support level?", "max_tokens": 16}
      ]
    }

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Stream data type: ohlcv, iot, spatial, event, or vitals |
| prompt | string | No | Initial system prompt to process and cache |
| context | integer | No | Context window size in tokens (default: server default) |
| window_size | integer | No | Maximum entries in the data buffer (auto-computed from context size and data type if omitted) |
| flash | array | No | Flash queries to register (see Flash Queries) |
| flash[].query | string | Yes | The question text |
| flash[].max_tokens | integer | No | Maximum answer tokens (default: 32) |
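A minimal sketch of assembling this request body on the client side follows. `build_session_init` is a hypothetical helper: it checks `type` against the five documented values and requires `query` on each flash entry, and it leaves optional fields out entirely so that the server applies its own defaults.

```python
SESSION_TYPES = {"ohlcv", "iot", "spatial", "event", "vitals"}  # values from the table above


def build_session_init(session_type, prompt=None, context=None, window_size=None, flash=None):
    """Build the JSON body for POST /v1/sessions/init."""
    if session_type not in SESSION_TYPES:
        raise ValueError(f"type must be one of {sorted(SESSION_TYPES)}")
    body = {"type": session_type}
    if prompt is not None:
        body["prompt"] = prompt
    if context is not None:
        body["context"] = context
    if window_size is not None:
        body["window_size"] = window_size
    if flash:
        entries = []
        for item in flash:
            if not item.get("query"):
                raise ValueError("each flash entry needs a query")
            entry = {"query": item["query"]}
            if "max_tokens" in item:  # omitted entries use the documented default of 32
                entry["max_tokens"] = item["max_tokens"]
            entries.append(entry)
        body["flash"] = entries
    return body
```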
Response:
    {
      "session_id": "sess_a1b2c3d4e5f6",
      "type": "ohlcv",
      "n_tokens": 42,
      "context": 8192,
      "window_size": 370,
      "flash_queries": 2,
      "pos_max": 41
    }

GET /v1/sessions
List all active sessions.
    curl http://localhost:8080/v1/sessions

Response:

    {
      "sessions": [
        {
          "session_id": "sess_a1b2c3d4e5f6",
          "n_tokens": 1024,
          "context": 8192,
          "prefix_end": 42,
          "cache_usage": 0.125,
          "created_at": 1711000000000,
          "last_used_at": 1711000060000,
          "in_use": false
        }
      ],
      "count": 1,
      "max_sessions": 10000
    }

GET /v1/sessions/{session_id}/state
Get detailed state of a session including token position, context usage, and configuration.
    curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/state

Response:

    {
      "session_id": "sess_a1b2c3d4e5f6",
      "type": "ohlcv",
      "n_tokens": 2048,
      "context": 8192,
      "pos_min": 0,
      "pos_max": 2047,
      "pos_next": 2048,
      "prefix_end": 42,
      "data_region": {
        "start": 42,
        "end": 2048,
        "window_size": 370
      },
      "data_count": 85,
      "cache_usage": 0.25,
      "created_at": 1711000000000,
      "last_used_at": 1711000060000,
      "in_use": false,
      "context_text": "..."
    }

context_text is the tokens decoded back to text, intended for debugging. Expect it to be large on long-running sessions.
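In both example responses, cache_usage equals n_tokens divided by context (1024 / 8192 = 0.125 and 2048 / 8192 = 0.25). Assuming that relationship holds in general, which this reference does not state explicitly, a client can estimate how much room a session has left. `context_headroom` is a hypothetical helper name.

```python
def context_headroom(state: dict) -> dict:
    """Estimate context usage from a session state response.

    Assumes usage = n_tokens / context, as the example responses suggest.
    """
    used = state["n_tokens"]
    total = state["context"]
    return {
        "usage": used / total,
        "remaining_tokens": total - used,
    }
```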
DELETE /v1/sessions/{session_id}
Delete a session and free its GPU memory.
    curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6

Response:

    {
      "success": true
    }

POST /v1/sessions/{session_id}/append
Append text to the session context. The text is tokenized and processed into the session state.
Request:
    {
      "text": "Additional context to append to this session."
    }

Response:

    {
      "n_tokens_added": 12,
      "total_tokens": 2112,
      "pos_max": 2111
    }

POST /v1/sessions/{session_id}/generate
Generate text from the current session state. Supports streaming via SSE.
Request:
    {
      "prompt": "Based on the data so far, what is the trend?",
      "max_tokens": 128,
      "stream": true
    }

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | No | "" | Optional question to append before generating |
| max_tokens | integer | No | 128 | Maximum tokens to generate |
| stream | boolean | No | false | Stream response as SSE |
| fast_answer | boolean | No | false | Attempt sub-10ms speculative exit for single-word answers when a cached ready-position is available |
| clear_after | integer | No | | If set, truncate the session at this position before generating (combines clear + generate in one call) |
| stop | array | No | [] | Stop sequences (strings) |
If the prompt matches a cached Flash Query, the cached answer is returned immediately with near-zero latency.
Response (non-streaming):
    {
      "text": "The trend is bullish with strong support at 185.50.",
      "n_tokens": 12,
      "total_tokens": 2113,
      "pos_max": 2112
    }

Response (streaming):
Per-token events arrive as SSE chunks, followed by a final done event:
    data: {"token": "The", "pos": 2101}

    data: {"token": " trend", "pos": 2102}

    data: {"token": " is", "pos": 2103}

    data: {"done": true, "text": "The trend is ...", "n_tokens": 12}

If the answer comes from the Flash Query cache:

    {
      "text": "Bullish",
      "n_tokens": 1,
      "total_tokens": 2101,
      "pos_max": 2100,
      "flash": true,
      "flash_id": 1,
      "data_version": 15,
      "confidence": 0.92
    }

If a speculative exit short-circuits generation (with fast_answer: true):

    {
      "text": "Bullish",
      "n_tokens": 1,
      "total_tokens": 2101,
      "pos_max": 2100,
      "speculative": true,
      "logit_gap": 3.7
    }

Streaming Data
POST /v1/sessions/{session_id}/stream/push
Push data into a session’s streaming buffer. This endpoint is non-blocking. Data is queued and processed in the background. The data format is determined by the session’s type (set at creation). Only send the data array.
OHLCV (Financial Market Data):
    {
      "data": [
        {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
      ]
    }

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| o | float | Open price |
| h | float | High price |
| l | float | Low price |
| c | float | Close price |
| v | float | Volume |
IoT (Industrial Sensor Telemetry):
    {
      "data": [
        {"sid": "temp-01", "val": 72.5, "lo": 60.0, "hi": 85.0}
      ]
    }

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| sid | string | Sensor identifier |
| val | float | Current reading |
| lo | float | Low threshold or observed minimum |
| hi | float | High threshold or observed maximum |
Spatial (Autonomous Systems):
    {
      "data": [
        {"x": 37.7749, "y": -122.4194, "z": 150.0, "spd": 12.5, "hdg": 270.0}
      ]
    }

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| x | float | X position or latitude |
| y | float | Y position or longitude |
| z | float | Z position or altitude |
| spd | float | Speed |
| hdg | float | Heading in degrees (0-360) |
Event (Security and Monitoring):
    {
      "data": [
        {"src": "10.0.1.5", "sev": 3, "cat": "auth_failure", "cnt": 12}
      ]
    }

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| src | string | Source identifier (IP, hostname, service) |
| sev | integer | Severity level (0 = info, 1 = low, 2 = medium, 3 = high, 4 = critical) |
| cat | string | Event category |
| cnt | integer | Event count in window |
Vitals (Healthcare Monitoring):
    {
      "data": [
        {"hr": 72.0, "bp_s": 120.0, "bp_d": 80.0, "spo2": 98.5, "temp": 36.8}
      ]
    }

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| hr | float | Heart rate (bpm) |
| bp_s | float | Systolic blood pressure (mmHg) |
| bp_d | float | Diastolic blood pressure (mmHg) |
| spo2 | float | Oxygen saturation (%) |
| temp | float | Body temperature (Celsius) |
Response:
    {
      "pushed": 2,
      "dropped": 0,
      "queue_size": 42,
      "queue_capacity": 4096,
      "total_enqueued": 1500,
      "total_dropped": 0
    }

| Field | Type | Description |
|---|---|---|
| pushed | integer | Number of entries accepted |
| dropped | integer | Entries dropped due to full buffer (overwritten) |
| queue_size | integer | Current entries waiting to be processed |
| queue_capacity | integer | Maximum ring buffer capacity |
| total_enqueued | integer | Lifetime total entries enqueued |
| total_dropped | integer | Lifetime total entries dropped |
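Because the push endpoint is non-blocking and a full buffer overwrites entries, a producer may want to check its payload before sending and watch the queue afterwards. The sketch below does both. The helper names are hypothetical, and it treats every field other than timestamp as required, which the tables imply but do not state.

```python
# Non-timestamp fields per session type, taken from the tables above.
REQUIRED_FIELDS = {
    "ohlcv": ("o", "h", "l", "c", "v"),
    "iot": ("sid", "val", "lo", "hi"),
    "spatial": ("x", "y", "z", "spd", "hdg"),
    "event": ("src", "sev", "cat", "cnt"),
    "vitals": ("hr", "bp_s", "bp_d", "spo2", "temp"),
}


def build_push_body(session_type, entries):
    """Validate entries for the session type and wrap them in the data array."""
    required = REQUIRED_FIELDS[session_type]
    for index, entry in enumerate(entries):
        missing = [field for field in required if field not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing {missing}")
    return {"data": list(entries)}


def queue_fill(push_response):
    """Fraction of the ring buffer currently waiting to be processed."""
    return push_response["queue_size"] / push_response["queue_capacity"]
```

A producer could slow down when `queue_fill` approaches 1.0 or when `dropped` is non-zero. The threshold to use is a design choice this reference does not prescribe.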
GET /v1/sessions/{session_id}/stream/status
Get the current streaming status for a session.
    curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stream/status

Response:

    {
      "type": "ohlcv",
      "streaming": {
        "initialized": true,
        "running": true,
        "error": false,
        "queue_size": 42,
        "queue_capacity": 4096,
        "window_size": 370,
        "process_interval_us": 10000
      },
      "statistics": {
        "total_enqueued": 1500,
        "total_dropped": 0,
        "batches_processed": 30,
        "items_processed": 1500,
        "total_process_time_us": 450000,
        "avg_process_time_us": 15000
      },
      "data": {
        "initialized": true,
        "data_count": 85,
        "total_data_tokens": 3400,
        "pos_next": 3442
      }
    }

Flash Queries
POST /v1/sessions/{session_id}/flash
Register a new Flash Query on an existing session. A session may hold up to 20 concurrent Flash Queries.
Request:
    {
      "query": "Is volatility increasing?",
      "max_tokens": 8
    }

max_tokens is clamped to the range [1, 256] and defaults to 32.
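The clamping rule can be mirrored on the client to predict the value the server will report back. This is a sketch of the documented rule only. `effective_max_tokens` is a hypothetical name, and the server's handling of non-integer input is not covered by this reference.

```python
def effective_max_tokens(requested=None):
    """Apply the documented Flash Query rule: default 32, clamped to [1, 256]."""
    if requested is None:
        return 32
    return max(1, min(256, int(requested)))
```

For instance, a request with `max_tokens` of 1000 would be expected to register with 256.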
Response:
    {
      "id": 3,
      "query": "Is volatility increasing?",
      "max_tokens": 8,
      "n_tokens": 4,
      "total_flash_queries": 3
    }

GET /v1/sessions/{session_id}/flash
List all registered Flash Queries for a session.
    curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

Response:

    {
      "flash_queries": [
        {
          "id": 1,
          "query": "Is the trend bullish or bearish?",
          "max_tokens": 4,
          "n_tokens": 9,
          "has_answer": true,
          "value": "Bullish",
          "data_version": 15,
          "confidence": 0.92,
          "evaluated_at": 1711000000000
        },
        {
          "id": 2,
          "query": "What is the current support level?",
          "max_tokens": 16,
          "n_tokens": 12,
          "has_answer": true,
          "value": "Support at 185.10",
          "data_version": 15,
          "confidence": 0.87,
          "evaluated_at": 1711000000000
        }
      ],
      "count": 2,
      "current_data_version": 15
    }

DELETE /v1/sessions/{session_id}/flash/{query_id}
Remove a Flash Query.
    curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1

GET /v1/sessions/{session_id}/events
Subscribe to a Server-Sent Events stream for real-time Flash Query results.
    curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

On connect, the server emits a single connected event, replays the current cached answer for every query that has one, and then streams flash_ready events as answers update.
Events:
    event: connected
    data: {"data_version":15,"flash_queries":3}

    event: flash_ready
    data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","data_version":15,"confidence":0.92,"evaluated_at":1711000000000}

    event: flash_ready
    data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","data_version":15,"confidence":0.87,"evaluated_at":1711000000000}

WebSocket
WebSocket /v1/sessions/{session_id}/ws
Bidirectional WebSocket connection that combines data push and event delivery in a single connection.
Connect:
    ws://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/ws

Inbound messages (client to server):
Push data:
    {
      "type": "push",
      "data": [
        {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
      ]
    }

Ping:

    {
      "type": "ping"
    }

Outbound messages (server to client):
Connection established (followed immediately by flash_ready events for every query with a cached answer):
    {
      "type": "connected",
      "data": {
        "session_id": "sess_a1b2c3d4e5f6",
        "data_version": 15,
        "streaming": true,
        "flash_queries": 3
      }
    }

Flash Query result:

    {
      "type": "flash_ready",
      "data": {
        "id": 1,
        "query": "Is the trend bullish or bearish?",
        "value": "Bullish",
        "data_version": 15,
        "confidence": 0.92,
        "evaluated_at": 1711000000000
      }
    }

Error:

    {
      "type": "error",
      "data": {
        "message": "Session not found",
        "code": 404
      }
    }

Metrics (LayerScale Pro)
GET /metrics
Prometheus-format metrics for scraping. Available only with LayerScale Pro.

    curl http://localhost:8080/metrics

Returns text/plain Prometheus exposition output. The endpoint exports request counters, latency histograms, context pool occupancy, Flash Query cache hit rates, and streaming queue depths. On a free license, this endpoint returns 402 license_required.
Error Responses
Most endpoints return errors in the form:
    {
      "error": {
        "message": "Session not found",
        "code": 404
      }
    }

License-gated errors (free-tier limits, missing or invalid Pro key) use a string slug in code:

    {
      "error": {
        "code": "license_required",
        "message": "LayerScale allows 1 free session. Upgrade to LayerScale Pro for unlimited sessions. Visit https://layerscale.ai for details."
      }
    }

The Anthropic-compatible endpoint POST /v1/messages returns Anthropic-shaped errors:

    {
      "type": "error",
      "error": {
        "type": "invalid_request_error",
        "message": "max_tokens is required"
      }
    }

| HTTP Status | Meaning |
|---|---|
| 400 | Bad request (missing or invalid parameters) |
| 401 | Unauthorized (invalid or missing token) |
| 402 | License required (free-tier limit hit, e.g. second session or context > 32K) |
| 404 | Resource not found (invalid session ID) |
| 409 | Conflict (session is busy with another request) |
| 503 | Service unavailable (model not loaded, server starting) |
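Since the three HTTP error shapes differ in where the code lives, a client may find it convenient to normalize them into one structure. The sketch below is one way to do that. `normalize_error` and `RETRYABLE_STATUSES` are hypothetical names, and treating 409 and 503 as retryable is an assumption based on their meanings in the table above, not documented guidance.

```python
# Assumption: a busy session (409) and a still-loading server (503) are transient.
RETRYABLE_STATUSES = {409, 503}


def normalize_error(status: int, body: dict) -> dict:
    """Flatten any of the three documented error shapes into one dict."""
    err = body.get("error") or {}
    if body.get("type") == "error":
        code = err.get("type")  # Anthropic-shaped: the slug lives in error.type
    else:
        code = err.get("code", status)  # numeric code or license slug
    return {
        "status": status,
        "code": code,
        "message": err.get("message", ""),
        "retryable": status in RETRYABLE_STATUSES,
    }
```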