
API Reference

Complete reference for all LayerScale API endpoints.

Base URL

http://localhost:8080

Authentication

All endpoints require authentication. Pass your API key as a bearer token:

Authorization: Bearer your-api-key

Or using the Anthropic-style header:

x-api-key: your-api-key
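Either header style is accepted; a client only needs to attach one of them. A minimal sketch of building the headers in Python (the key value is a placeholder):

```python
def auth_headers(api_key: str, style: str = "bearer") -> dict:
    """Build auth headers in either of the two accepted styles."""
    if style == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    return {"x-api-key": api_key}
```

Use whichever style matches your existing tooling; both are equivalent server-side.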

Health

GET /v1/health

Check if the server is ready to accept requests.

curl http://localhost:8080/v1/health

Response 200: Empty body. The model is loaded and the server is ready.

Response 503: Server is still loading the model.


Chat Completions (OpenAI-Compatible)

POST /v1/chat/completions

Generate a chat completion. Fully compatible with the OpenAI API format.

Request:

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false,
  "stop": ["\n\n"],
  "tools": []
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier (accepted but not used for routing) |
| messages | array | Yes | | Array of message objects with role and content |
| max_tokens | integer | No | 512 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature (0.0 to 2.0) |
| top_p | float | No | 0.9 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions for function calling |
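A request body can be assembled programmatically from these parameters. A minimal sketch (the defaults mirror the table above; only messages is required):

```python
import json

def build_chat_request(messages, max_tokens=512, temperature=0.7, stream=False):
    """Assemble the JSON body for POST /v1/chat/completions.

    Defaults mirror the parameter table; only "messages" is required.
    """
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }

# Serialize for the HTTP request body.
body = json.dumps(build_chat_request(
    [{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=100,
))
```

POST the serialized body to /v1/chat/completions with a Content-Type: application/json header and one of the auth headers described above.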

Response (non-streaming):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "2 + 2 equals 4."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

Response (streaming):

Each SSE event contains a chunk:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"2"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}
data: [DONE]
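A client reassembles the streamed text by parsing each data: line and concatenating the delta content until the [DONE] sentinel. A minimal sketch:

```python
import json

def collect_stream(lines):
    """Concatenate delta content from chat.completion.chunk SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator / keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content") is not None:
            text.append(delta["content"])
    return "".join(text)
```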

Tool calling response:

When the model invokes a tool, the response includes tool_calls instead of text content:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
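Note that arguments arrives as a JSON-encoded string, not an object, so it needs its own decoding pass. A sketch of extracting the calls:

```python
import json

def extract_tool_calls(response):
    """Return (name, parsed_args) pairs from a tool-calling response.

    The "arguments" field is a JSON string, not an object,
    so it requires a second json.loads pass.
    """
    message = response["choices"][0]["message"]
    return [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in message.get("tool_calls", [])
    ]
```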

Messages (Anthropic-Compatible)

POST /v1/messages

Generate a message response. Fully compatible with the Anthropic API format.

Request:

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "max_tokens": 1024,
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "temperature": 0.7,
  "stream": false,
  "tools": []
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier |
| messages | array | Yes | | Array of message objects |
| max_tokens | integer | Yes | | Maximum tokens to generate |
| system | string | No | | System prompt |
| temperature | float | No | 0.7 | Sampling temperature |
| top_p | float | No | 0.9 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop_sequences | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions |

Response (non-streaming):

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "2 + 2 equals 4."
    }
  ],
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 25,
    "output_tokens": 8
  }
}

Response (streaming):

SSE events follow the Anthropic streaming format:

event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"meta-llama/Llama-4-Maverick-17B-128E-Original","usage":{"input_tokens":25}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"2 + 2"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" equals 4."}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}
event: message_stop
data: {"type":"message_stop"}
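To reconstruct the full message, a client accumulates text from content_block_delta events and reads the stop reason from message_delta. A minimal sketch over the decoded data: lines:

```python
import json

def collect_anthropic_stream(data_lines):
    """Accumulate text deltas; return (text, stop_reason) at message_stop."""
    text, stop_reason = [], None
    for line in data_lines:
        event = json.loads(line[len("data: "):])
        if event["type"] == "content_block_delta":
            text.append(event["delta"]["text"])
        elif event["type"] == "message_delta":
            stop_reason = event["delta"].get("stop_reason")
        elif event["type"] == "message_stop":
            break  # stream complete
    return "".join(text), stop_reason
```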

Tool use response:

{
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_abc123",
      "name": "get_weather",
      "input": {"location": "San Francisco"}
    }
  ],
  "stop_reason": "tool_use"
}

Models

GET /v1/models

List available models.

curl http://localhost:8080/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Llama-4-Maverick-17B-128E-Original",
      "object": "model",
      "owned_by": "meta-llama"
    }
  ]
}

Sessions

POST /v1/sessions/init

Create a new session. LayerScale’s stateful inference engine maintains context across multiple requests without re-processing the prompt.

Request:

{
  "type": "ohlcv",
  "prompt": "You are a financial analyst. Analyze the following market data as it arrives.",
  "flash": [
    {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
    {"query": "What is the current support level?", "max_tokens": 16}
  ]
}

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Stream data type: ohlcv, iot, spatial, event, or vitals |
| prompt | string | No | Initial system prompt to process and cache |
| context_size | integer | No | Context window size in tokens (default: server default) |
| window_size | integer | No | Maximum entries in the data buffer (auto-computed from context size and data type if omitted) |
| flash | array | No | Flash queries to register (see Flash Queries) |
| flash[].query | string | Yes | The question text |
| flash[].max_tokens | integer | No | Maximum answer tokens (default: 32) |
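A session-init body can be built up from these parameters. A minimal sketch (omitting window_size so the server auto-computes it):

```python
def build_session_init(data_type, prompt=None, flash=None):
    """Assemble the body for POST /v1/sessions/init.

    Only "type" is required; leaving out window_size lets the
    server derive it from the context size and data type.
    """
    body = {"type": data_type}
    if prompt is not None:
        body["prompt"] = prompt
    if flash:
        body["flash"] = flash
    return body
```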

Response:

{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 42,
  "context_size": 8192,
  "window_size": 370,
  "flash_queries": 2,
  "pos_max": 41
}

GET /v1/sessions

List all active sessions.

curl http://localhost:8080/v1/sessions

Response:

{
  "sessions": [
    {
      "session_id": "sess_a1b2c3d4e5f6",
      "n_tokens": 1024,
      "context_size": 8192,
      "cache_usage": 0.125,
      "created_at": 1711000000000,
      "last_used_at": 1711000060000,
      "in_use": false
    }
  ],
  "count": 1,
  "max_sessions": 10000
}

GET /v1/sessions/{session_id}/state

Get detailed state of a session including token position, context usage, and configuration.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/state

Response:

{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 2048,
  "context_size": 8192,
  "pos_min": 0,
  "pos_max": 2047,
  "pos_next": 2048,
  "prefix_end": 42,
  "data_region": {
    "start": 42,
    "end": 2048,
    "window_size": 370
  },
  "data_count": 85,
  "cache_usage": 0.25,
  "created_at": 1711000000000,
  "last_used_at": 1711000060000,
  "in_use": false
}

DELETE /v1/sessions/{session_id}

Delete a session and free its GPU memory.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6

Response:

{
  "status": "deleted"
}

POST /v1/sessions/{session_id}/append

Append text to the session context. The text is tokenized and processed into the session state.

Request:

{
  "text": "Additional context to append to this session."
}

Response:

{
  "status": "ok",
  "pos": 2100
}

POST /v1/sessions/{session_id}/generate

Generate text from the current session state. Supports streaming via SSE.

Request:

{
  "prompt": "Based on the data so far, what is the trend?",
  "max_tokens": 256,
  "temperature": 0.3,
  "stream": true
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | No | | Optional question to append before generating |
| max_tokens | integer | No | 256 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature |
| stream | boolean | No | false | Stream response as SSE |

If the prompt matches a cached Flash Query, the cached answer is returned immediately with near-zero latency.

Response (non-streaming):

{
  "text": "The trend is bullish with strong support at 185.50.",
  "tokens_generated": 12,
  "from_cache": false
}

Response (streaming):

data: {"token": "The", "pos": 2101}
data: {"token": " trend", "pos": 2102}
data: {"token": " is", "pos": 2103}
data: [DONE]

If the answer comes from the Flash Query cache:

{
  "text": "Bullish",
  "tokens_generated": 1,
  "from_cache": true,
  "confidence": 0.92
}

Streaming Data

POST /v1/sessions/{session_id}/stream/push

Push data into a session’s streaming buffer. This endpoint is non-blocking: data is queued and processed in the background. The payload format is determined by the session’s type (set at creation); send only the data array.

OHLCV (Financial Market Data):

{
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| o | float | Open price |
| h | float | High price |
| l | float | Low price |
| c | float | Close price |
| v | float | Volume |
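A producer typically batches several candles into one push envelope. A minimal sketch of building that envelope (field values are illustrative; timestamp is optional and omitted here):

```python
def build_ohlcv_push(candles):
    """Wrap (open, high, low, close, volume) tuples in the push
    envelope for an "ohlcv" session, using the short field names
    from the table above."""
    return {"data": [
        {"o": o, "h": h, "l": l, "c": c, "v": v}
        for (o, h, l, c, v) in candles
    ]}
```

POST the resulting object to /v1/sessions/{session_id}/stream/push.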

IoT (Industrial Sensor Telemetry):

{
  "data": [
    {"sid": "temp-01", "val": 72.5, "lo": 60.0, "hi": 85.0}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| sid | string | Sensor identifier |
| val | float | Current reading |
| lo | float | Low threshold or observed minimum |
| hi | float | High threshold or observed maximum |

Spatial (Autonomous Systems):

{
  "data": [
    {"x": 37.7749, "y": -122.4194, "z": 150.0, "spd": 12.5, "hdg": 270.0}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| x | float | X position or latitude |
| y | float | Y position or longitude |
| z | float | Z position or altitude |
| spd | float | Speed |
| hdg | float | Heading in degrees (0-360) |

Event (Security and Monitoring):

{
  "data": [
    {"src": "10.0.1.5", "sev": 3, "cat": "auth_failure", "cnt": 12}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| src | string | Source identifier (IP, hostname, service) |
| sev | integer | Severity level (0 = info, 1 = low, 2 = medium, 3 = high, 4 = critical) |
| cat | string | Event category |
| cnt | integer | Event count in window |

Vitals (Healthcare Monitoring):

{
  "data": [
    {"hr": 72.0, "bp_s": 120.0, "bp_d": 80.0, "spo2": 98.5, "temp": 36.8}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| hr | float | Heart rate (bpm) |
| bp_s | float | Systolic blood pressure (mmHg) |
| bp_d | float | Diastolic blood pressure (mmHg) |
| spo2 | float | Oxygen saturation (%) |
| temp | float | Body temperature (Celsius) |

Response:

{
  "pushed": 2,
  "dropped": 0,
  "queue_size": 42,
  "queue_capacity": 4096,
  "total_enqueued": 1500,
  "total_dropped": 0
}

| Field | Type | Description |
|---|---|---|
| pushed | integer | Number of entries accepted |
| dropped | integer | Entries dropped due to full buffer (overwritten) |
| queue_size | integer | Current entries waiting to be processed |
| queue_capacity | integer | Maximum ring buffer capacity |
| total_enqueued | integer | Lifetime total entries enqueued |
| total_dropped | integer | Lifetime total entries dropped |
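Because pushes are non-blocking, a producer should watch these fields for backpressure. One way to decide when to throttle (a sketch; the 0.8 high-water mark is an arbitrary client-side choice, not part of the API):

```python
def should_throttle(push_response, high_water=0.8):
    """Decide whether a producer should slow down, based on push feedback.

    Throttle if entries were dropped, or if the queue has filled
    past the high-water fraction of its capacity.
    """
    if push_response["dropped"] > 0:
        return True
    fill = push_response["queue_size"] / push_response["queue_capacity"]
    return fill >= high_water
```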

GET /v1/sessions/{session_id}/stream/status

Get the current streaming status for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stream/status

Response:

{
  "type": "ohlcv",
  "streaming": {
    "initialized": true,
    "running": true,
    "error": false,
    "queue_size": 42,
    "queue_capacity": 4096,
    "window_size": 370,
    "process_interval_us": 10000
  },
  "statistics": {
    "total_enqueued": 1500,
    "total_dropped": 0,
    "batches_processed": 30,
    "items_processed": 1500,
    "total_process_time_us": 450000,
    "avg_process_time_us": 15000
  },
  "data": {
    "initialized": true,
    "data_count": 85,
    "total_data_tokens": 3400,
    "pos_next": 3442
  }
}

Flash Queries

POST /v1/sessions/{session_id}/flash

Register a new Flash Query on an existing session.

Request:

{
  "query": "Is volatility increasing?",
  "max_tokens": 8
}

Response:

{
  "id": 3,
  "query": "Is volatility increasing?",
  "status": "registered"
}

GET /v1/sessions/{session_id}/flash

List all registered Flash Queries for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

Response:

{
  "queries": [
    {
      "id": 1,
      "query": "Is the trend bullish or bearish?",
      "max_tokens": 4,
      "value": "Bullish",
      "confidence": 0.92,
      "last_evaluated_ms": 1711000000000
    },
    {
      "id": 2,
      "query": "What is the current support level?",
      "max_tokens": 16,
      "value": "Support at 185.10",
      "confidence": 0.87,
      "last_evaluated_ms": 1711000000000
    }
  ]
}

DELETE /v1/sessions/{session_id}/flash/{query_id}

Remove a Flash Query.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1

GET /v1/sessions/{session_id}/events

Subscribe to a Server-Sent Events stream for real-time Flash Query results.

curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

Events:

event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","confidence":0.92,"data_version":15}
event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","confidence":0.87,"data_version":15}

WebSocket

WebSocket /v1/sessions/{session_id}/ws

Bidirectional WebSocket connection that combines data push and event delivery in a single connection.

Connect:

ws://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/ws

Inbound messages (client to server):

Push data:

{
  "type": "push",
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}

Ping:

{
  "type": "ping"
}

Outbound messages (server to client):

Connection established:

{
  "type": "connected",
  "data": {
    "session_id": "sess_a1b2c3d4e5f6",
    "streaming": true
  }
}

Flash Query result:

{
  "type": "flash_ready",
  "data": {
    "id": 1,
    "query": "Is the trend bullish or bearish?",
    "value": "Bullish",
    "confidence": 0.92,
    "data_version": 15
  }
}

Error:

{
  "type": "error",
  "data": {
    "message": "Session not found",
    "code": 404
  }
}
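Since every outbound frame carries a type field, a client typically routes frames through a small dispatch table. A minimal sketch (connection handling omitted; the handler names are illustrative):

```python
import json

def dispatch(frame_text, handlers):
    """Route a decoded WebSocket frame to a handler keyed by its "type".

    Unknown message types are ignored so that new server-side
    frame types do not break older clients.
    """
    frame = json.loads(frame_text)
    handler = handlers.get(frame["type"])
    if handler is None:
        return None
    return handler(frame.get("data"))
```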

Error Responses

All endpoints return errors in a consistent format:

{
  "error": {
    "message": "Session not found",
    "type": "not_found",
    "code": 404
  }
}

| HTTP Status | Meaning |
|---|---|
| 400 | Bad request (missing or invalid parameters) |
| 401 | Unauthorized (invalid or missing token) |
| 404 | Resource not found (invalid session ID) |
| 409 | Conflict (session is busy with another request) |
| 503 | Service unavailable (model not loaded, server starting) |
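Because every endpoint shares this envelope, a client can centralize error handling in one place. A minimal sketch (the exception class is a client-side convention, not part of the API):

```python
import json

class LayerScaleError(Exception):
    """Raised when a response body carries the shared error envelope."""
    def __init__(self, message, error_type, code):
        super().__init__(f"{code} {error_type}: {message}")
        self.type = error_type
        self.code = code

def raise_for_error(body_text):
    """Parse a response body; raise LayerScaleError on the error envelope."""
    body = json.loads(body_text)
    if "error" in body:
        err = body["error"]
        raise LayerScaleError(err["message"], err["type"], err["code"])
    return body
```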