API Reference

Complete reference for all LayerScale API endpoints.

Base URL

http://localhost:8080

Authentication

Authentication is enforced when LayerScale runs with a Pro license key. The free tier accepts requests without authentication. The /health, /v1/health, and /metrics endpoints are always accessible without authentication.

When authentication is required, pass your license key as a bearer token:

Authorization: Bearer your-license-key

Alternatively, pass it in the Anthropic-style x-api-key header:

x-api-key: your-license-key
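
As a minimal sketch, the two header styles above could be built in Python like this. The helper name `auth_headers` and its `style` argument are illustrative assumptions, not part of LayerScale.

```python
from typing import Dict, Optional


def auth_headers(license_key: Optional[str], style: str = "bearer") -> Dict[str, str]:
    """Build request headers for either documented authentication style.

    `auth_headers` and `style` are hypothetical names used only in this sketch.
    """
    if not license_key:
        # Free tier: requests are accepted without authentication.
        return {}
    if style == "bearer":
        return {"Authorization": f"Bearer {license_key}"}
    if style == "x-api-key":
        return {"x-api-key": license_key}
    raise ValueError(f"unknown auth style: {style!r}")
```

The returned dictionary can be passed as the `headers` argument of whichever HTTP client you use.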

Health

GET /v1/health

Check if the server is ready to accept requests.

curl http://localhost:8080/v1/health

Response 200: Empty body. The model is loaded and the server is ready.

Response 503: Server is still loading the model.
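
Because the endpoint returns 503 while the model loads, a client may want to poll it before sending work. The following is a sketch using only the Python standard library. The function names, retry count, and delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(base_url: str = "http://localhost:8080") -> int:
    """Return the HTTP status of GET /v1/health, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def wait_until_ready(probe: Callable[[], int] = probe_health,
                     attempts: int = 30, delay_s: float = 1.0) -> bool:
    """Poll until the probe reports 200, giving up after `attempts` tries."""
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Passing the probe as a parameter keeps the retry loop testable without a running server.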


Chat Completions (OpenAI-Compatible)

POST /v1/chat/completions

Generate a chat completion. Fully compatible with the OpenAI API format.

Request:

{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"}
],
"max_tokens": 100,
"temperature": 0.7,
"top_p": 0.9,
"stream": false,
"stop": ["\n\n"],
"tools": []
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier (accepted but not used for routing) |
| messages | array | Yes | | Array of message objects with role and content |
| max_tokens | integer | No | 256 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature (0.0 to 2.0) |
| top_p | float | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop | array or string | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions for function calling |

Response (non-streaming):

{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1711000000,
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "2 + 2 equals 4."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
}
}

Response (streaming):

Each SSE event contains a chunk:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"2"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}
data: [DONE]
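
A client reassembles the reply by concatenating `delta.content` from each chunk until the `[DONE]` sentinel. A minimal sketch follows, assuming each `data:` line carries one complete JSON object as in the sample above. `collect_chat_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable


def collect_chat_stream(lines: Iterable[str]) -> str:
    """Concatenate delta.content from chat.completion.chunk events until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

The function accepts any iterable of text lines, so it works equally on a list of strings or on lines decoded from a streaming HTTP response.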

Tool calling response:

When the model invokes a tool, the response includes tool_calls instead of text content:

{
"choices": [
{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
]
}
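
Note that `function.arguments` in the response above is a JSON-encoded string rather than an object, so it needs a second decode step. One way to unpack it, sketched with a hypothetical helper `extract_tool_calls`:

```python
import json
from typing import Any, Dict, List, Tuple


def extract_tool_calls(response: Dict[str, Any]) -> List[Tuple[str, str, Dict[str, Any]]]:
    """Return (call_id, function_name, parsed_arguments) for each tool call."""
    calls = []
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        # `tool_calls` is absent (or None) on plain text replies.
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            # `arguments` arrives as a JSON string and must be decoded separately.
            calls.append((call["id"], fn["name"], json.loads(fn["arguments"])))
    return calls
```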

Messages (Anthropic-Compatible)

POST /v1/messages

Generate a message response. Fully compatible with the Anthropic API format.

Request:

{
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"max_tokens": 1024,
"system": "You are a helpful assistant.",
"messages": [
{"role": "user", "content": "What is 2 + 2?"}
],
"temperature": 0.7,
"stream": false,
"tools": []
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier |
| messages | array | Yes | | Array of message objects |
| max_tokens | integer | Yes | | Maximum tokens to generate |
| system | string | No | | System prompt |
| temperature | float | No | 0.7 | Sampling temperature |
| top_p | float | No | 1.0 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop_sequences | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions |

Response (non-streaming):

{
"id": "msg_abc123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "2 + 2 equals 4."
}
],
"model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 25,
"output_tokens": 8
}
}

Response (streaming):

SSE events follow the Anthropic streaming format:

event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"meta-llama/Llama-4-Maverick-17B-128E-Original","usage":{"input_tokens":25}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"2 + 2"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" equals 4."}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}
event: message_stop
data: {"type":"message_stop"}
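
To rebuild the reply from this event sequence, a client appends every `text_delta` and reads the stop reason from `message_delta`. The sketch below assumes one JSON object per `data:` line, as in the sample. `collect_message_stream` is a hypothetical name.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_message_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (text, stop_reason) assembled from Anthropic-style SSE lines."""
    text_parts = []
    stop_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # the event type is repeated inside each data payload
        event = json.loads(line[len("data:"):])
        kind = event.get("type")
        if kind == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "text_delta":
                text_parts.append(delta.get("text", ""))
        elif kind == "message_delta":
            stop_reason = event.get("delta", {}).get("stop_reason", stop_reason)
        elif kind == "message_stop":
            break
    return "".join(text_parts), stop_reason
```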

Tool use response:

{
"content": [
{
"type": "tool_use",
"id": "toolu_abc123",
"name": "get_weather",
"input": {"location": "San Francisco"}
}
],
"stop_reason": "tool_use"
}

Models

GET /v1/models

List available models.

curl http://localhost:8080/v1/models

Response:

{
"object": "list",
"data": [
{
"id": "Llama-4-Maverick-17B-128E-Original",
"object": "model",
"created": 0,
"owned_by": "layerscale"
}
]
}

The id is derived from the loaded model file name (without extension).


Sessions

POST /v1/sessions/init

Create a new session. LayerScale’s stateful inference engine maintains context across multiple requests without re-processing the prompt.

Request:

{
"type": "ohlcv",
"prompt": "You are a financial analyst. Analyze the following market data as it arrives.",
"flash": [
{"query": "Is the trend bullish or bearish?", "max_tokens": 4},
{"query": "What is the current support level?", "max_tokens": 16}
]
}

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Stream data type: ohlcv, iot, spatial, event, or vitals |
| prompt | string | No | Initial system prompt to process and cache |
| context | integer | No | Context window size in tokens (default: server default) |
| window_size | integer | No | Maximum entries in the data buffer (auto-computed from context size and data type if omitted) |
| flash | array | No | Flash queries to register (see Flash Queries) |
| flash[].query | string | Yes | The question text |
| flash[].max_tokens | integer | No | Maximum answer tokens (default: 32) |

Response:

{
"session_id": "sess_a1b2c3d4e5f6",
"type": "ohlcv",
"n_tokens": 42,
"context": 8192,
"window_size": 370,
"flash_queries": 2,
"pos_max": 41
}
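
A session-creation request could be assembled as follows. This is a sketch using the Python standard library. `build_session_init` is a hypothetical helper, and sending a `Content-Type: application/json` header is an assumption rather than a documented requirement.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

# Stream data types listed in the parameter table.
VALID_TYPES = {"ohlcv", "iot", "spatial", "event", "vitals"}


def build_session_init(session_type: str,
                       prompt: Optional[str] = None,
                       flash: Optional[List[Dict[str, Any]]] = None,
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST /v1/sessions/init request."""
    if session_type not in VALID_TYPES:
        raise ValueError(f"unsupported session type: {session_type!r}")
    body: Dict[str, Any] = {"type": session_type}
    if prompt is not None:
        body["prompt"] = prompt
    if flash:
        body["flash"] = flash
    return urllib.request.Request(
        f"{base_url}/v1/sessions/init",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed, not documented
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` and decoding the JSON body should yield the `session_id` used by the endpoints below.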

GET /v1/sessions

List all active sessions.

curl http://localhost:8080/v1/sessions

Response:

{
"sessions": [
{
"session_id": "sess_a1b2c3d4e5f6",
"n_tokens": 1024,
"context": 8192,
"prefix_end": 42,
"cache_usage": 0.125,
"created_at": 1711000000000,
"last_used_at": 1711000060000,
"in_use": false
}
],
"count": 1,
"max_sessions": 10000
}

GET /v1/sessions/{session_id}/state

Get detailed state of a session including token position, context usage, and configuration.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/state

Response:

{
"session_id": "sess_a1b2c3d4e5f6",
"type": "ohlcv",
"n_tokens": 2048,
"context": 8192,
"pos_min": 0,
"pos_max": 2047,
"pos_next": 2048,
"prefix_end": 42,
"data_region": {
"start": 42,
"end": 2048,
"window_size": 370
},
"data_count": 85,
"cache_usage": 0.25,
"created_at": 1711000000000,
"last_used_at": 1711000060000,
"in_use": false,
"context_text": "..."
}

context_text is the tokens decoded back to text, intended for debugging. Expect it to be large on long-running sessions.

DELETE /v1/sessions/{session_id}

Delete a session and free its GPU memory.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6

Response:

{
"success": true
}

POST /v1/sessions/{session_id}/append

Append text to the session context. The text is tokenized and processed into the session state.

Request:

{
"text": "Additional context to append to this session."
}

Response:

{
"n_tokens_added": 12,
"total_tokens": 2112,
"pos_max": 2111
}

POST /v1/sessions/{session_id}/generate

Generate text from the current session state. Supports streaming via SSE.

Request:

{
"prompt": "Based on the data so far, what is the trend?",
"max_tokens": 128,
"stream": true
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | No | "" | Optional question to append before generating |
| max_tokens | integer | No | 128 | Maximum tokens to generate |
| stream | boolean | No | false | Stream response as SSE |
| fast_answer | boolean | No | false | Attempt sub-10ms speculative exit for single-word answers when a cached ready-position is available |
| clear_after | integer | No | | If set, truncate the session at this position before generating (combines clear + generate in one call) |
| stop | array | No | [] | Stop sequences (strings) |

If the prompt matches a cached Flash Query, the cached answer is returned immediately with near-zero latency.

Response (non-streaming):

{
"text": "The trend is bullish with strong support at 185.50.",
"n_tokens": 12,
"total_tokens": 2113,
"pos_max": 2112
}

Response (streaming):

Per-token events arrive as SSE chunks, followed by a final done event:

data: {"token": "The", "pos": 2101}
data: {"token": " trend", "pos": 2102}
data: {"token": " is", "pos": 2103}
data: {"done": true, "text": "The trend is ...", "n_tokens": 12}

If the answer comes from the Flash Query cache:

{
"text": "Bullish",
"n_tokens": 1,
"total_tokens": 2101,
"pos_max": 2100,
"flash": true,
"flash_id": 1,
"data_version": 15,
"confidence": 0.92
}

If a speculative exit short-circuits generation (with fast_answer: true):

{
"text": "Bullish",
"n_tokens": 1,
"total_tokens": 2101,
"pos_max": 2100,
"speculative": true,
"logit_gap": 3.7
}
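
The three non-streaming response shapes above differ only in their marker fields, so a client can tell them apart by checking `flash` and `speculative`. A small sketch follows. The function name and the returned labels are illustrative.

```python
from typing import Any, Dict


def answer_source(response: Dict[str, Any]) -> str:
    """Classify a generate response by how the answer was produced."""
    if response.get("flash"):
        return "flash_cache"  # served from a cached Flash Query
    if response.get("speculative"):
        return "speculative_exit"  # fast_answer short-circuited generation
    return "generated"  # ordinary token-by-token generation
```

A caller might, for example, log `confidence` only when the source is `flash_cache` and `logit_gap` only when it is `speculative_exit`.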

Streaming Data

POST /v1/sessions/{session_id}/stream/push

Push data into a session’s streaming buffer. This endpoint is non-blocking: data is queued and processed in the background. The data format is determined by the session’s type (set at creation), so the request body carries only the data array.

OHLCV (Financial Market Data):

{
"data": [
{"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| o | float | Open price |
| h | float | High price |
| l | float | Low price |
| c | float | Close price |
| v | float | Volume |

IoT (Industrial Sensor Telemetry):

{
"data": [
{"sid": "temp-01", "val": 72.5, "lo": 60.0, "hi": 85.0}
]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| sid | string | Sensor identifier |
| val | float | Current reading |
| lo | float | Low threshold or observed minimum |
| hi | float | High threshold or observed maximum |

Spatial (Autonomous Systems):

{
"data": [
{"x": 37.7749, "y": -122.4194, "z": 150.0, "spd": 12.5, "hdg": 270.0}
]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| x | float | X position or latitude |
| y | float | Y position or longitude |
| z | float | Z position or altitude |
| spd | float | Speed |
| hdg | float | Heading in degrees (0-360) |

Event (Security and Monitoring):

{
"data": [
{"src": "10.0.1.5", "sev": 3, "cat": "auth_failure", "cnt": 12}
]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| src | string | Source identifier (IP, hostname, service) |
| sev | integer | Severity level (0 = info, 1 = low, 2 = medium, 3 = high, 4 = critical) |
| cat | string | Event category |
| cnt | integer | Event count in window |

Vitals (Healthcare Monitoring):

{
"data": [
{"hr": 72.0, "bp_s": 120.0, "bp_d": 80.0, "spo2": 98.5, "temp": 36.8}
]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| hr | float | Heart rate (bpm) |
| bp_s | float | Systolic blood pressure (mmHg) |
| bp_d | float | Diastolic blood pressure (mmHg) |
| spo2 | float | Oxygen saturation (%) |
| temp | float | Body temperature (Celsius) |

Response:

{
"pushed": 2,
"dropped": 0,
"queue_size": 42,
"queue_capacity": 4096,
"total_enqueued": 1500,
"total_dropped": 0
}

| Field | Type | Description |
|---|---|---|
| pushed | integer | Number of entries accepted |
| dropped | integer | Entries dropped due to full buffer (overwritten) |
| queue_size | integer | Current entries waiting to be processed |
| queue_capacity | integer | Maximum ring buffer capacity |
| total_enqueued | integer | Lifetime total entries enqueued |
| total_dropped | integer | Lifetime total entries dropped |
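
Since the push endpoint is non-blocking and drops entries when the buffer is full, a producer may want to watch these fields and slow down before data is lost. The sketch below shows one way to do that. The 0.8 high-water mark and the function name are assumptions, not LayerScale defaults.

```python
from typing import Any, Dict


def queue_pressure(push_response: Dict[str, Any], high_water: float = 0.8) -> Dict[str, Any]:
    """Summarize buffer pressure from a stream/push response."""
    capacity = push_response["queue_capacity"]
    fill = push_response["queue_size"] / capacity if capacity else 0.0
    dropped_now = push_response["dropped"] > 0
    return {
        "fill_ratio": fill,
        "dropped_now": dropped_now,
        # Back off if entries were just dropped or the queue is nearly full.
        "should_slow_down": dropped_now or fill >= high_water,
    }
```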

GET /v1/sessions/{session_id}/stream/status

Get the current streaming status for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stream/status

Response:

{
"type": "ohlcv",
"streaming": {
"initialized": true,
"running": true,
"error": false,
"queue_size": 42,
"queue_capacity": 4096,
"window_size": 370,
"process_interval_us": 10000
},
"statistics": {
"total_enqueued": 1500,
"total_dropped": 0,
"batches_processed": 30,
"items_processed": 1500,
"total_process_time_us": 450000,
"avg_process_time_us": 15000
},
"data": {
"initialized": true,
"data_count": 85,
"total_data_tokens": 3400,
"pos_next": 3442
}
}

Flash Queries

POST /v1/sessions/{session_id}/flash

Register a new Flash Query on an existing session. A session may hold up to 20 concurrent Flash Queries.

Request:

{
"query": "Is volatility increasing?",
"max_tokens": 8
}

max_tokens is clamped to the range [1, 256] and defaults to 32.

Response:

{
"id": 3,
"query": "Is volatility increasing?",
"max_tokens": 8,
"n_tokens": 4,
"total_flash_queries": 3
}

GET /v1/sessions/{session_id}/flash

List all registered Flash Queries for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

Response:

{
"flash_queries": [
{
"id": 1,
"query": "Is the trend bullish or bearish?",
"max_tokens": 4,
"n_tokens": 9,
"has_answer": true,
"value": "Bullish",
"data_version": 15,
"confidence": 0.92,
"evaluated_at": 1711000000000
},
{
"id": 2,
"query": "What is the current support level?",
"max_tokens": 16,
"n_tokens": 12,
"has_answer": true,
"value": "Support at 185.10",
"data_version": 15,
"confidence": 0.87,
"evaluated_at": 1711000000000
}
],
"count": 2,
"current_data_version": 15
}

DELETE /v1/sessions/{session_id}/flash/{query_id}

Remove a Flash Query.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1

GET /v1/sessions/{session_id}/events

Subscribe to a Server-Sent Events stream for real-time Flash Query results.

curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

On connect the server emits a single connected event, replays the current cached answer for every query that has one, and then streams flash_ready events as answers update.

Events:

event: connected
data: {"data_version":15,"flash_queries":3}
event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","data_version":15,"confidence":0.92,"evaluated_at":1711000000000}
event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","data_version":15,"confidence":0.87,"evaluated_at":1711000000000}
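
A subscriber can pair each `event:` line with the `data:` line that follows it. The sketch below is a simplified parser that assumes every payload is single-line JSON, as in the sample above. It is not a full Server-Sent Events implementation, and `iter_sse_events` is a hypothetical name.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (event_name, data) pairs from `event:` / `data:` line pairs."""
    event_name = "message"  # SSE default when no event field precedes the data
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            yield event_name, json.loads(line[len("data:"):])
            event_name = "message"  # reset for the next event
```

Keeping the most recent `value` per query `id` from the `flash_ready` events gives a local view of the current answers.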

WebSocket

WebSocket /v1/sessions/{session_id}/ws

Bidirectional WebSocket connection that combines data push and event delivery in a single connection.

Connect:

ws://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/ws

Inbound messages (client to server):

Push data:

{
"type": "push",
"data": [
{"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
]
}

Ping:

{
"type": "ping"
}

Outbound messages (server to client):

Connection established (followed immediately by flash_ready events for every query with a cached answer):

{
"type": "connected",
"data": {
"session_id": "sess_a1b2c3d4e5f6",
"data_version": 15,
"streaming": true,
"flash_queries": 3
}
}

Flash Query result:

{
"type": "flash_ready",
"data": {
"id": 1,
"query": "Is the trend bullish or bearish?",
"value": "Bullish",
"data_version": 15,
"confidence": 0.92,
"evaluated_at": 1711000000000
}
}

Error:

{
"type": "error",
"data": {
"message": "Session not found",
"code": 404
}
}
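
The message envelopes above can be encoded and dispatched independently of any particular WebSocket library. The sketch below covers only the JSON layer, and both helper names are hypothetical. Raising on `error` messages is a design choice made for this example.

```python
import json
from typing import Any, Dict, List


def push_message(entries: List[Dict[str, Any]]) -> str:
    """Encode a client-to-server push message."""
    return json.dumps({"type": "push", "data": entries})


def handle_server_message(raw: str, answers: Dict[int, str]) -> str:
    """Update `answers` from one server message and return the message type."""
    msg = json.loads(raw)
    kind = msg.get("type", "")
    data = msg.get("data", {})
    if kind == "flash_ready":
        # Keep only the latest value for each Flash Query id.
        answers[data["id"]] = data["value"]
    elif kind == "error":
        raise RuntimeError(f"{data.get('code')}: {data.get('message')}")
    return kind
```

The strings returned by `push_message` would be sent as text frames, and each received text frame passed to `handle_server_message`.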

Metrics (LayerScale Pro)

GET /metrics

Prometheus-format metrics for scraping. Available only with LayerScale Pro.

curl http://localhost:8080/metrics

Returns text/plain Prometheus exposition output. Request counters, latency histograms, context pool occupancy, Flash Query cache hit rates, and streaming queue depths are exported here. On a free license this endpoint returns 402 license_required.


Error Responses

Most endpoints return errors in the form:

{
"error": {
"message": "Session not found",
"code": 404
}
}

License-gated errors (free-tier limits, missing or invalid Pro key) use a string slug in code:

{
"error": {
"code": "license_required",
"message": "LayerScale allows 1 free session. Upgrade to LayerScale Pro for unlimited sessions. Visit https://layerscale.ai for details."
}
}

The Anthropic-compatible endpoint POST /v1/messages returns Anthropic-shaped errors:

{
"type": "error",
"error": {
"type": "invalid_request_error",
"message": "max_tokens is required"
}
}

| HTTP Status | Meaning |
|---|---|
| 400 | Bad request (missing or invalid parameters) |
| 401 | Unauthorized (invalid or missing token) |
| 402 | License required (free-tier limit hit, e.g. second session or context > 32K) |
| 404 | Resource not found (invalid session ID) |
| 409 | Conflict (session is busy with another request) |
| 503 | Service unavailable (model not loaded, server starting) |
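
Because three error shapes are in use, client code may find it convenient to normalize them before handling. The sketch below maps each shape to a `(status, slug, message)` tuple. The function name and the tuple layout are assumptions made for illustration.

```python
from typing import Any, Dict, Optional, Tuple


def normalize_error(http_status: int, body: Dict[str, Any]) -> Tuple[int, Optional[str], str]:
    """Return (http_status, slug, message) for any of the documented error shapes."""
    err = body.get("error", {})
    message = err.get("message", "")
    if body.get("type") == "error":
        # Anthropic-shaped error: the slug lives in error.type.
        return http_status, err.get("type"), message
    code = err.get("code")
    # License errors carry a string slug in `code`; other errors carry a number.
    slug = code if isinstance(code, str) else None
    return http_status, slug, message
```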