API Reference
Complete reference for all LayerScale API endpoints.
Base URL
```
http://localhost:8080
```

Authentication
All endpoints require authentication. Pass your API key as a bearer token:
```
Authorization: Bearer your-api-key
```

Or using the Anthropic-style header:
```
x-api-key: your-api-key
```

Health
GET /v1/health
Check if the server is ready to accept requests.
```
curl http://localhost:8080/v1/health
```

Response 200: Empty body. The model is loaded and the server is ready.
Response 503: Server is still loading the model.
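Startup scripts typically poll this endpoint before sending traffic. A minimal Python sketch, assuming nothing beyond the two status codes above (`wait_until_ready` and its parameters are illustrative helpers, not part of the API; the status fetcher is injected so the logic works with any HTTP client):

```python
import time

READY, LOADING = 200, 503

def wait_until_ready(fetch_status, timeout_s=60.0, interval_s=1.0, sleep=time.sleep):
    """Poll /v1/health until it returns 200 or the timeout elapses.

    fetch_status is any callable returning the HTTP status code of
    GET /v1/health (e.g. a requests-based closure); it is injected so
    the logic can be exercised without a running server.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if fetch_status() == READY:
            return True
        sleep(interval_s)  # 503: model still loading, try again
    return False
```

With `requests` installed, a caller might pass `lambda: requests.get("http://localhost:8080/v1/health").status_code` as the fetcher.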
Chat Completions (OpenAI-Compatible)
POST /v1/chat/completions
Generate a chat completion. Fully compatible with the OpenAI API format.
Request:
```json
{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false,
  "stop": ["\n\n"],
  "tools": []
}
```

Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | - | Model identifier (accepted but not used for routing) |
| messages | array | Yes | - | Array of message objects with role and content |
| max_tokens | integer | No | 512 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature (0.0 to 2.0) |
| top_p | float | No | 0.9 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions for function calling |
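Since every field except `messages` has a documented default, a client can fill them in before serializing the body. A hedged sketch (`build_chat_request` is an illustrative helper, not part of any SDK):

```python
def build_chat_request(messages, **overrides):
    """Compose a /v1/chat/completions body, filling the documented defaults.

    Only `messages` is required; every other field falls back to the
    default listed in the parameters table above, and keyword overrides
    replace those defaults.
    """
    body = {
        "messages": messages,
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.9,
        "stream": False,
        "stop": [],
        "tools": [],
    }
    body.update(overrides)
    return body
```

For example, `build_chat_request([{"role": "user", "content": "What is 2 + 2?"}], max_tokens=100)` yields a body matching the request shown above.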
Response (non-streaming):
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "2 + 2 equals 4."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}
```

Response (streaming):
Each SSE event contains a chunk:
```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"2"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}

data: [DONE]
```

Tool calling response:
When the model invokes a tool, the response includes tool_calls instead of text content:
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```

Messages (Anthropic-Compatible)
POST /v1/messages
Generate a message response. Fully compatible with the Anthropic API format.
Request:
```json
{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "max_tokens": 1024,
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "temperature": 0.7,
  "stream": false,
  "tools": []
}
```

Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | - | Model identifier |
| messages | array | Yes | - | Array of message objects |
| max_tokens | integer | Yes | - | Maximum tokens to generate |
| system | string | No | - | System prompt |
| temperature | float | No | 0.7 | Sampling temperature |
| top_p | float | No | 0.9 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop_sequences | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions |
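The main shape differences from the OpenAI format are that `max_tokens` is required and the system prompt is a top-level `system` field rather than a message. A hedged conversion sketch (`to_messages_request` is an illustrative helper, not part of any SDK):

```python
def to_messages_request(messages, max_tokens):
    """Convert OpenAI-style messages into a /v1/messages body.

    Pulls any system messages out of the list into the top-level
    `system` field and attaches the required max_tokens.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    body = {
        "max_tokens": max_tokens,
        "messages": [m for m in messages if m["role"] != "system"],
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    return body
```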
Response (non-streaming):
```json
{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "2 + 2 equals 4."}
  ],
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "stop_reason": "end_turn",
  "usage": {"input_tokens": 25, "output_tokens": 8}
}
```

Response (streaming):
SSE events follow the Anthropic streaming format:
```
event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"meta-llama/Llama-4-Maverick-17B-128E-Original","usage":{"input_tokens":25}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"2 + 2"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" equals 4."}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}

event: message_stop
data: {"type":"message_stop"}
```

Tool use response:
```json
{
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_abc123",
      "name": "get_weather",
      "input": {"location": "San Francisco"}
    }
  ],
  "stop_reason": "tool_use"
}
```

Models
GET /v1/models
List available models.
```
curl http://localhost:8080/v1/models
```

Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "Llama-4-Maverick-17B-128E-Original",
      "object": "model",
      "owned_by": "meta-llama"
    }
  ]
}
```

Sessions
POST /v1/sessions/init
Create a new session. LayerScale’s stateful inference engine maintains context across multiple requests without re-processing the prompt.
Request:
```json
{
  "type": "ohlcv",
  "prompt": "You are a financial analyst. Analyze the following market data as it arrives.",
  "flash": [
    {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
    {"query": "What is the current support level?", "max_tokens": 16}
  ]
}
```

Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Stream data type: ohlcv, iot, spatial, event, or vitals |
| prompt | string | No | Initial system prompt to process and cache |
| context_size | integer | No | Context window size in tokens (default: server default) |
| window_size | integer | No | Maximum entries in the data buffer (auto-computed from context size and data type if omitted) |
| flash | array | No | Flash queries to register (see Flash Queries) |
| flash[].query | string | Yes | The question text |
| flash[].max_tokens | integer | No | Maximum answer tokens (default: 32) |
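A client can validate the `type` field and apply the flash-query default of 32 answer tokens before calling the endpoint. A minimal sketch (`build_session_init` is an illustrative helper, not an SDK call):

```python
def build_session_init(stream_type, prompt=None, flash=None, **extra):
    """Compose a /v1/sessions/init body.

    Applies the documented default of 32 answer tokens to any flash
    query that omits max_tokens; `extra` passes through optional fields
    such as context_size or window_size.
    """
    allowed = {"ohlcv", "iot", "spatial", "event", "vitals"}
    if stream_type not in allowed:
        raise ValueError(f"unknown stream type: {stream_type}")
    body = {"type": stream_type, **extra}
    if prompt is not None:
        body["prompt"] = prompt
    if flash:
        # explicit max_tokens in the query dict wins over the default
        body["flash"] = [{"max_tokens": 32, **q} for q in flash]
    return body
```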
Response:
```json
{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 42,
  "context_size": 8192,
  "window_size": 370,
  "flash_queries": 2,
  "pos_max": 41
}
```

GET /v1/sessions
List all active sessions.
```
curl http://localhost:8080/v1/sessions
```

Response:
```json
{
  "sessions": [
    {
      "session_id": "sess_a1b2c3d4e5f6",
      "n_tokens": 1024,
      "context_size": 8192,
      "cache_usage": 0.125,
      "created_at": 1711000000000,
      "last_used_at": 1711000060000,
      "in_use": false
    }
  ],
  "count": 1,
  "max_sessions": 10000
}
```

GET /v1/sessions/{session_id}/state
Get detailed state of a session including token position, context usage, and configuration.
```
curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/state
```

Response:
```json
{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 2048,
  "context_size": 8192,
  "pos_min": 0,
  "pos_max": 2047,
  "pos_next": 2048,
  "prefix_end": 42,
  "data_region": {"start": 42, "end": 2048, "window_size": 370},
  "data_count": 85,
  "cache_usage": 0.25,
  "created_at": 1711000000000,
  "last_used_at": 1711000060000,
  "in_use": false
}
```

DELETE /v1/sessions/{session_id}
Delete a session and free its GPU memory.
```
curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6
```

Response:
```json
{"status": "deleted"}
```

POST /v1/sessions/{session_id}/append
Append text to the session context. The text is tokenized and processed into the session state.
Request:
```json
{"text": "Additional context to append to this session."}
```

Response:
```json
{"status": "ok", "pos": 2100}
```

POST /v1/sessions/{session_id}/generate
Generate text from the current session state. Supports streaming via SSE.
Request:
```json
{
  "prompt": "Based on the data so far, what is the trend?",
  "max_tokens": 256,
  "temperature": 0.3,
  "stream": true
}
```

Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | No | - | Optional question to append before generating |
| max_tokens | integer | No | 256 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature |
| stream | boolean | No | false | Stream response as SSE |
If the prompt matches a cached Flash Query, the cached answer is returned immediately with near-zero latency.
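Client code can branch on the `from_cache` flag in the non-streaming response (shown below) to tell an instant cached answer from a fresh generation. A hedged sketch (`handle_generate_response` is illustrative, not part of any SDK):

```python
def handle_generate_response(resp):
    """Normalize a non-streaming /generate response body.

    Returns (text, cached, confidence). The confidence score only
    appears on Flash Query cache hits, so it is None for freshly
    generated answers.
    """
    cached = resp.get("from_cache", False)
    return resp["text"], cached, resp.get("confidence")
```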
Response (non-streaming):
```json
{
  "text": "The trend is bullish with strong support at 185.50.",
  "tokens_generated": 12,
  "from_cache": false
}
```

Response (streaming):
```
data: {"token": "The", "pos": 2101}
data: {"token": " trend", "pos": 2102}
data: {"token": " is", "pos": 2103}
data: [DONE]
```

If the answer comes from the Flash Query cache:
```json
{
  "text": "Bullish",
  "tokens_generated": 1,
  "from_cache": true,
  "confidence": 0.92
}
```

Streaming Data
POST /v1/sessions/{session_id}/stream/push
Push data into a session’s streaming buffer. This endpoint is non-blocking. Data is queued and processed in the background. The data format is determined by the session’s type (set at creation). Only send the data array.
OHLCV (Financial Market Data):
```json
{
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}
```

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| o | float | Open price |
| h | float | High price |
| l | float | Low price |
| c | float | Close price |
| v | float | Volume |
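A producer holding plain bar tuples can map them onto these compact field names before pushing. A minimal sketch (`ohlcv_payload` is an illustrative helper, not an SDK call):

```python
def ohlcv_payload(bars):
    """Build a /stream/push body from (open, high, low, close, volume) tuples.

    Maps positional tuples onto the compact field names the ohlcv
    session type expects; timestamp is optional and omitted here.
    """
    keys = ("o", "h", "l", "c", "v")
    return {"data": [dict(zip(keys, bar)) for bar in bars]}
```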
IoT (Industrial Sensor Telemetry):
```json
{
  "data": [
    {"sid": "temp-01", "val": 72.5, "lo": 60.0, "hi": 85.0}
  ]
}
```

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| sid | string | Sensor identifier |
| val | float | Current reading |
| lo | float | Low threshold or observed minimum |
| hi | float | High threshold or observed maximum |
Spatial (Autonomous Systems):
```json
{
  "data": [
    {"x": 37.7749, "y": -122.4194, "z": 150.0, "spd": 12.5, "hdg": 270.0}
  ]
}
```

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| x | float | X position or latitude |
| y | float | Y position or longitude |
| z | float | Z position or altitude |
| spd | float | Speed |
| hdg | float | Heading in degrees (0-360) |
Event (Security and Monitoring):
```json
{
  "data": [
    {"src": "10.0.1.5", "sev": 3, "cat": "auth_failure", "cnt": 12}
  ]
}
```

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| src | string | Source identifier (IP, hostname, service) |
| sev | integer | Severity level (0 = info, 1 = low, 2 = medium, 3 = high, 4 = critical) |
| cat | string | Event category |
| cnt | integer | Event count in window |
Vitals (Healthcare Monitoring):
```json
{
  "data": [
    {"hr": 72.0, "bp_s": 120.0, "bp_d": 80.0, "spo2": 98.5, "temp": 36.8}
  ]
}
```

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| hr | float | Heart rate (bpm) |
| bp_s | float | Systolic blood pressure (mmHg) |
| bp_d | float | Diastolic blood pressure (mmHg) |
| spo2 | float | Oxygen saturation (%) |
| temp | float | Body temperature (Celsius) |
Response:
```json
{
  "pushed": 2,
  "dropped": 0,
  "queue_size": 42,
  "queue_capacity": 4096,
  "total_enqueued": 1500,
  "total_dropped": 0
}
```

| Field | Type | Description |
|---|---|---|
| pushed | integer | Number of entries accepted |
| dropped | integer | Entries dropped due to full buffer (overwritten) |
| queue_size | integer | Current entries waiting to be processed |
| queue_capacity | integer | Maximum ring buffer capacity |
| total_enqueued | integer | Lifetime total entries enqueued |
| total_dropped | integer | Lifetime total entries dropped |
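Since the endpoint never blocks, a fast producer must implement its own backpressure from these fields. A hedged sketch (the helper and its high-water threshold are illustrative choices, not part of the API):

```python
def should_throttle(push_resp, high_water=0.8):
    """Decide whether a producer should slow down after a push.

    Treats any dropped entry, or a queue filled past `high_water` of
    its capacity, as a signal to back off before the ring buffer
    starts overwriting data.
    """
    if push_resp["dropped"] > 0:
        return True
    return push_resp["queue_size"] / push_resp["queue_capacity"] > high_water
```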
GET /v1/sessions/{session_id}/stream/status
Get the current streaming status for a session.
```
curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stream/status
```

Response:
```json
{
  "type": "ohlcv",
  "streaming": {
    "initialized": true,
    "running": true,
    "error": false,
    "queue_size": 42,
    "queue_capacity": 4096,
    "window_size": 370,
    "process_interval_us": 10000
  },
  "statistics": {
    "total_enqueued": 1500,
    "total_dropped": 0,
    "batches_processed": 30,
    "items_processed": 1500,
    "total_process_time_us": 450000,
    "avg_process_time_us": 15000
  },
  "data": {
    "initialized": true,
    "data_count": 85,
    "total_data_tokens": 3400,
    "pos_next": 3442
  }
}
```

Flash Queries
POST /v1/sessions/{session_id}/flash
Register a new Flash Query on an existing session.
Request:
```json
{
  "query": "Is volatility increasing?",
  "max_tokens": 8
}
```
```json
{
  "id": 3,
  "query": "Is volatility increasing?",
  "status": "registered"
}
```

GET /v1/sessions/{session_id}/flash
List all registered Flash Queries for a session.
```
curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash
```

Response:
```json
{
  "queries": [
    {
      "id": 1,
      "query": "Is the trend bullish or bearish?",
      "max_tokens": 4,
      "value": "Bullish",
      "confidence": 0.92,
      "last_evaluated_ms": 1711000000000
    },
    {
      "id": 2,
      "query": "What is the current support level?",
      "max_tokens": 16,
      "value": "Support at 185.10",
      "confidence": 0.87,
      "last_evaluated_ms": 1711000000000
    }
  ]
}
```

DELETE /v1/sessions/{session_id}/flash/{query_id}
Remove a Flash Query.
```
curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1
```

GET /v1/sessions/{session_id}/events
Subscribe to a Server-Sent Events stream for real-time Flash Query results.
```
curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events
```

Events:
```
event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","confidence":0.92,"data_version":15}

event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","confidence":0.87,"data_version":15}
```

WebSocket
WebSocket /v1/sessions/{session_id}/ws
Bidirectional WebSocket connection that combines data push and event delivery in a single connection.
Connect:
```
ws://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/ws
```

Inbound messages (client to server):
Push data:
```json
{
  "type": "push",
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}
```

Ping:
```json
{"type": "ping"}
```

Outbound messages (server to client):
Connection established:
```json
{"type": "connected", "data": {"session_id": "sess_a1b2c3d4e5f6", "streaming": true}}
```

Flash Query result:
```json
{
  "type": "flash_ready",
  "data": {
    "id": 1,
    "query": "Is the trend bullish or bearish?",
    "value": "Bullish",
    "confidence": 0.92,
    "data_version": 15
  }
}
```

Error:
```json
{"type": "error", "data": {"message": "Session not found", "code": 404}}
```

Error Responses
All endpoints return errors in a consistent format:
```json
{"error": {"message": "Session not found", "type": "not_found", "code": 404}}
```

| HTTP Status | Meaning |
|---|---|
| 400 | Bad request (missing or invalid parameters) |
| 401 | Unauthorized (invalid or missing token) |
| 404 | Resource not found (invalid session ID) |
| 409 | Conflict (session is busy with another request) |
| 503 | Service unavailable (model not loaded, server starting) |
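Because the envelope is uniform across endpoints, a single helper can surface every failure mode. A hedged sketch (the exception class and retry sets are illustrative client-side choices, not part of the API):

```python
class LayerScaleError(Exception):
    """Client-side exception carrying the documented error fields."""

    def __init__(self, message, error_type, code):
        super().__init__(f"{code} {error_type}: {message}")
        self.error_type = error_type
        self.code = code

RETRYABLE = {503}  # server still loading the model: worth retrying
BUSY = {409}       # session busy: retry once the in-flight request finishes

def raise_for_error(body):
    """Raise LayerScaleError if `body` is an error envelope, else return it."""
    err = body.get("error")
    if err is None:
        return body
    raise LayerScaleError(err["message"], err["type"], err["code"])
```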