API Reference

Complete reference for all LayerScale API endpoints.

Base URL

http://localhost:8080

Authentication

Authentication is enforced when LayerScale runs with a Pro license key. The free tier accepts requests without authentication. The /health and /v1/health endpoints are always accessible without authentication. /metrics requires LayerScale Pro; on the free tier it returns HTTP 402 with code license_required.

When authentication is required, pass your license key as a bearer token:

Authorization: Bearer your-license-key

Alternatively, use the Anthropic-style header:

x-api-key: your-license-key
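
As a minimal sketch of the two header styles above, the helper below builds the request headers for either one. The function name `auth_headers` and its `style` argument are illustrative assumptions and are not part of LayerScale.

```python
def auth_headers(license_key: str, style: str = "bearer") -> dict[str, str]:
    """Build auth headers for a LayerScale request.

    `style` is a hypothetical switch: "bearer" for the Authorization
    header, "x-api-key" for the Anthropic-style header.
    """
    if not license_key:
        # Free tier: requests are accepted without authentication.
        return {}
    if style == "bearer":
        return {"Authorization": f"Bearer {license_key}"}
    if style == "x-api-key":
        return {"x-api-key": license_key}
    raise ValueError(f"unknown auth style: {style!r}")
```

The returned dictionary can be passed to any HTTP client that accepts a header mapping.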

Health

GET /v1/health

Check if the server is ready to accept requests.

curl http://localhost:8080/v1/health

Response 200: Empty body. The model is loaded and the server is ready.

Response 503: Server is still loading the model.
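
A client that starts alongside the server may want to wait for the 200 response before sending work. The sketch below polls the health endpoint using only the Python standard library. The function names, the retry count, and the delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def probe_health(base_url: str) -> int:
    """Return the HTTP status of GET /v1/health, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx responses; the status is still available.
        return err.code
    except urllib.error.URLError:
        # Connection refused or similar: the server is not listening yet.
        return 0


def wait_until_ready(base_url, probe=probe_health, attempts=30, delay_s=1.0):
    """Poll until the probe returns 200 or the attempts run out."""
    for _ in range(attempts):
        if probe(base_url) == 200:
            return True
        time.sleep(delay_s)
    return False
```

For example, `wait_until_ready("http://localhost:8080")` returns `True` once the model has loaded. The `probe` argument exists so the loop can be exercised without a running server.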

Chat Completions (OpenAI-Compatible)

POST /v1/chat/completions

Generate a chat completion. Fully compatible with the OpenAI API format.

Request:

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false,
  "stop": ["\n\n"],
  "tools": []
}

Parameters:

Parameter	Type	Required	Default	Description
`model`	string	No		Model identifier (accepted but not used for routing)
`messages`	array	Yes		Array of message objects with `role` and `content`
`max_tokens`	integer	No	`256`	Maximum tokens to generate
`temperature`	float	No	`0.7`	Sampling temperature (0.0 to 2.0)
`top_p`	float	No	`1.0`	Nucleus sampling threshold
`stream`	boolean	No	`false`	Enable SSE streaming
`stop`	array | string	No	`[]`	Stop sequences
`tools`	array	No	`[]`	Tool definitions for function calling
`session_id`	string	No		Route this request through an existing session (from `/v1/sessions/init`) so its warm context is reused instead of reprocessed

Response (non-streaming):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "2 + 2 equals 4."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

Response (streaming):

Each SSE event contains a chunk:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"2"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}

data: [DONE]
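
The chunk format above can be reassembled into the full reply by concatenating each `delta.content` until the `[DONE]` sentinel. The following is a minimal sketch that works on an iterable of already-received SSE lines. The function name is hypothetical, and reading lines off the network is left to the caller.

```python
import json


def collect_chat_stream(lines):
    """Join `delta.content` pieces from chat.completion.chunk SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)
```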

Tool calling response:

When the model invokes a tool, the response includes tool_calls instead of text content:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
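
Note that `function.arguments` in the response above is a JSON-encoded string rather than a nested object, so it needs a second decode before use. A small sketch of that step, with a hypothetical helper name:

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, str, dict]]:
    """Return (call_id, function_name, parsed_arguments) for each tool call."""
    calls = []
    for choice in response.get("choices", []):
        message = choice.get("message", {})
        # `tool_calls` may be absent or null on plain text replies.
        for call in message.get("tool_calls") or []:
            fn = call["function"]
            # `arguments` arrives as a JSON string and is decoded here.
            calls.append((call["id"], fn["name"], json.loads(fn["arguments"])))
    return calls
```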

POST /v1/completions

Legacy OpenAI-compatible text completion. Takes a raw prompt string instead of a messages array and returns OpenAI-style text_completion objects (one text field per choice). Honors max_tokens, temperature, and top_p.

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "prompt": "The capital of France is",
  "max_tokens": 16,
  "temperature": 0.7
}

Messages (Anthropic-Compatible)

POST /v1/messages

Generate a message response. Fully compatible with the Anthropic API format.

Request:

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "max_tokens": 1024,
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "temperature": 0.7,
  "stream": false,
  "tools": []
}

Parameters:

Parameter	Type	Required	Default	Description
`model`	string	No		Model identifier
`messages`	array	Yes		Array of message objects
`max_tokens`	integer	Yes		Maximum tokens to generate
`system`	string	No		System prompt
`temperature`	float	No	`0.7`	Sampling temperature
`top_p`	float	No	`1.0`	Nucleus sampling threshold
`stream`	boolean	No	`false`	Enable SSE streaming
`stop_sequences`	array	No	`[]`	Stop sequences
`tools`	array	No	`[]`	Tool definitions
`session_id`	string	No		Route this request through an existing session (from `/v1/sessions/init`) so its warm context is reused instead of reprocessed

Response (non-streaming):

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "2 + 2 equals 4."
    }
  ],
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 25,
    "output_tokens": 8
  }
}

Response (streaming):

SSE events follow the Anthropic streaming format:

event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"meta-llama/Llama-4-Maverick-17B-128E-Original","usage":{"input_tokens":25}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"2 + 2"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" equals 4."}}

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}

event: message_stop
data: {"type":"message_stop"}
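
The event sequence above can be folded into the final text and stop reason by watching the `type` field inside each `data:` payload. This is a minimal sketch over already-received lines; the function name is an assumption, and only text deltas are handled.

```python
import json


def collect_message_stream(lines):
    """Accumulate text and the stop reason from Anthropic-style SSE lines."""
    text_parts, stop_reason = [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # `event:` lines repeat the type found in the JSON body
        event = json.loads(line[len("data:"):])
        kind = event.get("type")
        if kind == "content_block_delta" and event["delta"].get("type") == "text_delta":
            text_parts.append(event["delta"]["text"])
        elif kind == "message_delta":
            stop_reason = event["delta"].get("stop_reason", stop_reason)
        elif kind == "message_stop":
            break
    return "".join(text_parts), stop_reason
```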

Tool use response:

{
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_abc123",
      "name": "get_weather",
      "input": {"location": "San Francisco"}
    }
  ],
  "stop_reason": "tool_use"
}

Models

GET /v1/models

List available models.

curl http://localhost:8080/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Llama-3.1-8B-Instruct",
      "object": "model",
      "created": 0,
      "owned_by": "layerscale"
    }
  ],
  "server": {
    "version": "0.7.0",
    "build": "2026-07-03T18:12:00Z",
    "licensed": true,
    "accelerators": [
      {
        "name": "CUDA0",
        "description": "NVIDIA L40S",
        "vram_mb": 46068,
        "free_mb": 12040
      }
    ],
    "model": {
      "id": "Llama-3.1-8B-Instruct",
      "size_bytes": 16060530688,
      "params": 8030261248,
      "layers": 32,
      "hidden_size": 4096,
      "attention_heads": 32,
      "kv_heads": 8,
      "trained_context": 131072,
      "kv_cache_type": "fp16"
    },
    "scheduler": {
      "max_concurrent": 8,
      "max_sessions": 0,
      "max_sequences": 289,
      "batch_tokens": 6144,
      "context": 32768,
      "prefix_slots": 256,
      "chunk_size": 4096
    }
  }
}

The id is derived from the loaded model file name (without extension).

The server object describes the running instance: build info and license state, a point-in-time snapshot of each GPU (accelerators[], with free VRAM measured at request time), the loaded model’s shape, and the scheduler limits the server booted with. Clients and benchmark harnesses can read server.scheduler.max_concurrent to size their request concurrency to the server.
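
As a sketch of the concurrency sizing described above, the helper below caps a client's desired worker count at `server.scheduler.max_concurrent`. The function name and the fallback behavior when the field is missing are assumptions.

```python
def pick_concurrency(models_response: dict, requested: int) -> int:
    """Cap a client's worker count at server.scheduler.max_concurrent."""
    scheduler = models_response.get("server", {}).get("scheduler", {})
    limit = scheduler.get("max_concurrent")
    if not isinstance(limit, int) or limit < 1:
        # Field missing or unusable: fall back to the caller's choice.
        return max(1, requested)
    return max(1, min(requested, limit))
```

With the sample response above, a harness asking for 32 workers would be reduced to 8.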

Sessions

POST /v1/sessions/init

Create a new session. LayerScale’s stateful inference engine maintains context across multiple requests without re-processing the prompt.

Request:

{
  "type": "ohlcv",
  "prompt": "You are a financial analyst. Analyze the following market data as it arrives.",
  "flash": [
    {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
    {"query": "What is the current support level?", "max_tokens": 16}
  ]
}

Parameters:

Parameter	Type	Required	Description
`type`	string	Yes	Stream data type: `ohlcv`, `iot`, `spatial`, `event`, or `vitals`
`prompt`	string	No	Initial system prompt to process and cache
`context`	integer	No	Context window size in tokens (default: server default)
`window_size`	integer	No	Maximum entries in the data buffer (auto-computed from context size and data type if omitted)
`flash`	array	No	Flash queries to register (see Flash Queries)
`flash[].query`	string	Yes	The question text
`flash[].max_tokens`	integer	No	Maximum answer tokens (default: 32)

Response:

{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 42,
  "context": 8192,
  "window_size": 370,
  "flash_queries": 2,
  "pos_max": 41
}
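
The request body can be assembled programmatically so that unset optional fields are omitted and the `type` value is checked early. This is a minimal sketch based on the parameter table above; the function name is hypothetical and the validation is client-side only.

```python
STREAM_TYPES = {"ohlcv", "iot", "spatial", "event", "vitals"}


def build_session_init(stream_type, prompt=None, flash=None,
                       context=None, window_size=None):
    """Assemble a /v1/sessions/init body, omitting unset optional fields."""
    if stream_type not in STREAM_TYPES:
        raise ValueError(f"type must be one of {sorted(STREAM_TYPES)}")
    body = {"type": stream_type}
    if prompt is not None:
        body["prompt"] = prompt
    if context is not None:
        body["context"] = context
    if window_size is not None:
        body["window_size"] = window_size
    if flash:
        for item in flash:
            if "query" not in item:
                # flash[].query is the only required field per entry.
                raise ValueError("each flash entry needs a 'query'")
        body["flash"] = list(flash)
    return body
```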

GET /v1/sessions

List all active sessions.

curl http://localhost:8080/v1/sessions

Response:

{
  "sessions": [
    {
      "session_id": "sess_a1b2c3d4e5f6",
      "n_tokens": 1024,
      "context": 8192,
      "prefix_end": 42,
      "cache_usage": 0.125,
      "created_at": 1711000000000,
      "last_used_at": 1711000060000,
      "in_use": false
    }
  ],
  "count": 1,
  "max_sessions": 8
}

GET /v1/sessions/{session_id}/state

Get detailed state of a session including token position, context usage, and configuration.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/state

Response:

{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 2048,
  "context": 8192,
  "pos_min": 0,
  "pos_max": 2047,
  "pos_next": 2048,
  "prefix_end": 42,
  "data_region": {
    "start": 42,
    "end": 2048,
    "window_size": 370
  },
  "data_count": 85,
  "cache_usage": 0.25,
  "created_at": 1711000000000,
  "last_used_at": 1711000060000,
  "in_use": false,
  "context_text": "..."
}

context_text contains the session’s tokens decoded back to text and is intended for debugging. Expect it to be large on long-running sessions.

DELETE /v1/sessions/{session_id}

Delete a session and free its GPU memory.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6

Response:

{
  "success": true
}

POST /v1/sessions/{session_id}/append

Append text to the session context. The text is tokenized and processed into the session state.

Request:

{
  "text": "Additional context to append to this session."
}

Response:

{
  "n_tokens_added": 12,
  "total_tokens": 2112,
  "pos_max": 2111
}

POST /v1/sessions/{session_id}/mark_prefix

Freeze everything processed so far as the session’s immutable prefix. Positions before the mark are never trimmed or recomputed by later window management, so the system prompt and any static context stay cached for the life of the session. Call it once, after the initial prompt has been processed and before streaming data in. This is the endpoint behind the mark_prefix / markPrefix option in the LayerScale client.

curl -X POST http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/mark_prefix

Response:

{
  "prefix_end": 42,
  "message": "Prefix frozen. KV cache for positions 0-41 will not be recomputed."
}
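
The ordering described above (create the session, freeze the prefix, then start pushing data) can be sketched as follows. The `post(path, body)` callable is a hypothetical transport supplied by the caller, and the sketch assumes the prompt has been processed by the time the init call returns, as the `n_tokens` in the sample init response suggests.

```python
def start_streaming_session(post, stream_type, prompt, first_batch):
    """Run init -> mark_prefix -> push in order and return the session id.

    `post(path, body)` is a caller-supplied function (hypothetical here)
    that sends a POST request and returns the decoded JSON response.
    """
    session = post("/v1/sessions/init", {"type": stream_type, "prompt": prompt})
    sid = session["session_id"]
    # Freeze the prompt before any stream data is pushed.
    post(f"/v1/sessions/{sid}/mark_prefix", None)
    post(f"/v1/sessions/{sid}/stream/push", {"data": first_batch})
    return sid
```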

POST /v1/sessions/{session_id}/generate

Generate text from the current session state. Supports streaming via SSE.

Request:

{
  "prompt": "Based on the data so far, what is the trend?",
  "max_tokens": 128,
  "stream": true
}

Parameters:

Parameter	Type	Required	Default	Description
`prompt`	string	No	`""`	Optional question to append before generating
`max_tokens`	integer	No	`128`	Maximum tokens to generate
`stream`	boolean	No	`false`	Stream response as SSE
`fast_answer`	boolean	No	`false`	Attempt sub-10ms speculative exit for single-word answers when a cached ready-position is available
`clear_after`	integer	No		If set, truncate the session at this position before generating (combines clear + generate in one call)
`stop`	array	No	`[]`	Stop sequences (strings)

If the prompt matches a cached Flash Query, the cached answer is returned immediately with near-zero latency.

Response (non-streaming):

{
  "text": "The trend is bullish with strong support at 185.50.",
  "n_tokens": 12,
  "total_tokens": 2113,
  "pos_max": 2112
}

Response (streaming):

Per-token events arrive as SSE chunks, followed by a final done event:

data: {"token": "The", "pos": 2101}

data: {"token": " trend", "pos": 2102}

data: {"token": " is", "pos": 2103}

data: {"done": true, "text": "The trend is ...", "n_tokens": 12}

If the answer comes from the Flash Query cache:

{
  "text": "Bullish",
  "n_tokens": 1,
  "total_tokens": 2101,
  "pos_max": 2100,
  "flash": true,
  "flash_id": 1,
  "data_version": 15,
  "confidence": 0.92
}

If a speculative exit short-circuits generation (with fast_answer: true):

{
  "text": "Bullish",
  "n_tokens": 1,
  "total_tokens": 2101,
  "pos_max": 2100,
  "speculative": true,
  "logit_gap": 3.7
}
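
The three non-streaming response shapes differ only in their marker fields (`flash`, `speculative`, or neither). A caller that wants to log or branch on how an answer was produced could use a sketch like this one; the function name and the returned labels are illustrative.

```python
def answer_source(response: dict) -> str:
    """Classify a non-streaming /generate response by how it was produced."""
    if response.get("flash"):
        return "flash"        # served from the Flash Query cache
    if response.get("speculative"):
        return "speculative"  # fast_answer short-circuit
    return "generated"        # regular decoding
```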

Streaming Data

POST /v1/sessions/{session_id}/stream/push

Push data into a session’s streaming buffer. This endpoint is non-blocking. Data is queued and processed in the background. The data format is determined by the session’s type (set at creation). Only send the data array.

OHLCV (Financial Market Data):

{
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}

Field	Type	Description
`timestamp`	integer	Unix timestamp (optional)
`o`	float	Open price
`h`	float	High price
`l`	float	Low price
`c`	float	Close price
`v`	float	Volume

IoT (Industrial Sensor Telemetry):

{
  "data": [
    {"sid": "temp-01", "val": 72.5, "lo": 60.0, "hi": 85.0}
  ]
}

Field	Type	Description
`timestamp`	integer	Unix timestamp (optional)
`sid`	string	Sensor identifier
`val`	float	Current reading
`lo`	float	Low threshold or observed minimum
`hi`	float	High threshold or observed maximum

Spatial (Autonomous Systems):

{
  "data": [
    {"x": 37.7749, "y": -122.4194, "z": 150.0, "spd": 12.5, "hdg": 270.0}
  ]
}

Field	Type	Description
`timestamp`	integer	Unix timestamp (optional)
`x`	float	X position or latitude
`y`	float	Y position or longitude
`z`	float	Z position or altitude
`spd`	float	Speed
`hdg`	float	Heading in degrees (0-360)

Event (Security and Monitoring):

{
  "data": [
    {"src": "10.0.1.5", "sev": 3, "cat": "auth_failure", "cnt": 12}
  ]
}

Field	Type	Description
`timestamp`	integer	Unix timestamp (optional)
`src`	string	Source identifier (IP, hostname, service)
`sev`	integer	Severity level (0 = info, 1 = low, 2 = medium, 3 = high, 4 = critical)
`cat`	string	Event category
`cnt`	integer	Event count in window

Vitals (Healthcare Monitoring):

{
  "data": [
    {"hr": 72.0, "bp_s": 120.0, "bp_d": 80.0, "spo2": 98.5, "temp": 36.8}
  ]
}

Field	Type	Description
`timestamp`	integer	Unix timestamp (optional)
`hr`	float	Heart rate (bpm)
`bp_s`	float	Systolic blood pressure (mmHg)
`bp_d`	float	Diastolic blood pressure (mmHg)
`spo2`	float	Oxygen saturation (%)
`temp`	float	Body temperature (Celsius)
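
Because the entry shape depends on the session type, a client may want to check entries before pushing them. The sketch below encodes the field tables above. Treating every field other than `timestamp` as required is an assumption, since the tables only mark `timestamp` as optional.

```python
# Field names per stream type, taken from the tables above.
# Assumption: all of these are required; only `timestamp` is optional.
REQUIRED_FIELDS = {
    "ohlcv": ("o", "h", "l", "c", "v"),
    "iot": ("sid", "val", "lo", "hi"),
    "spatial": ("x", "y", "z", "spd", "hdg"),
    "event": ("src", "sev", "cat", "cnt"),
    "vitals": ("hr", "bp_s", "bp_d", "spo2", "temp"),
}


def missing_fields(stream_type: str, entry: dict) -> list[str]:
    """Return the documented fields absent from one data entry."""
    return [name for name in REQUIRED_FIELDS[stream_type] if name not in entry]
```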

Response:

{
  "pushed": 2,
  "dropped": 0,
  "queue_size": 42,
  "queue_capacity": 4096,
  "total_enqueued": 1500,
  "total_dropped": 0
}

Field	Type	Description
`pushed`	integer	Number of entries accepted
`dropped`	integer	Entries dropped due to full buffer (overwritten)
`queue_size`	integer	Current entries waiting to be processed
`queue_capacity`	integer	Maximum ring buffer capacity
`total_enqueued`	integer	Lifetime total entries enqueued
`total_dropped`	integer	Lifetime total entries dropped
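
Since pushes are non-blocking, the response fields are the producer's only signal that the queue is filling up. One way to react is sketched below. The 80% threshold and the three labels are assumptions for illustration, not server behavior.

```python
def push_pressure(push_response: dict, high_water: float = 0.8) -> str:
    """Map a push response to "ok", "slow_down", or "dropping"."""
    if push_response.get("dropped", 0) > 0:
        return "dropping"  # this push already lost entries
    capacity = push_response.get("queue_capacity", 0)
    fill = push_response.get("queue_size", 0) / capacity if capacity else 0.0
    return "slow_down" if fill >= high_water else "ok"
```

A producer could pause or batch less aggressively whenever the result is not `"ok"`.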

GET /v1/sessions/{session_id}/stream/status

Get the current streaming status for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stream/status

Response:

{
  "type": "ohlcv",
  "streaming": {
    "initialized": true,
    "running": true,
    "error": false,
    "queue_size": 42,
    "queue_capacity": 4096,
    "window_size": 370,
    "process_interval_us": 10000
  },
  "statistics": {
    "total_enqueued": 1500,
    "total_dropped": 0,
    "batches_processed": 30,
    "items_processed": 1500,
    "total_process_time_us": 450000,
    "avg_process_time_us": 15000
  },
  "data": {
    "initialized": true,
    "data_count": 85,
    "total_data_tokens": 3400,
    "pos_next": 3442
  }
}

GET /v1/sessions/{session_id}/stats

Computed statistics over the data currently in the session’s stream buffer. The response fields are specific to the session’s stream type: OHLCV returns price and volume aggregates, IoT returns value ranges and threshold breaches, and so on. Exposed by the client SDKs as client.sessions.stats().

Returns HTTP 400 if no data has been ingested yet.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stats

Response (OHLCV session):

{
  "type": "ohlcv",
  "count": 85,
  "max_high": 195.42,
  "max_high_index": 61,
  "min_low": 180.13,
  "min_low_index": 7,
  "max_volume": 1250000,
  "first_open": 182.5,
  "last_close": 194.8,
  "avg_close": 188.6,
  "change": 12.3,
  "change_pct": 6.74
}
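
In the sample response, `change` matches `last_close - first_open` and `change_pct` matches that difference relative to `first_open`. Assuming this is how the server derives them (the reference does not state the formula), the values can be reproduced as:

```python
def price_change(first_open: float, last_close: float) -> tuple[float, float]:
    """Return (change, change_pct), each rounded to two decimals."""
    change = last_close - first_open
    change_pct = change / first_open * 100.0
    return round(change, 2), round(change_pct, 2)
```

For the sample values, `price_change(182.5, 194.8)` gives `(12.3, 6.74)`.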

Flash Queries

POST /v1/sessions/{session_id}/flash

Register a new Flash Query on an existing session.

Request:

{
  "query": "Is volatility increasing?",
  "max_tokens": 8
}

max_tokens is clamped to the range [1, 256] and defaults to 32.

Response:

{
  "id": 3,
  "query": "Is volatility increasing?",
  "max_tokens": 8,
  "n_tokens": 4,
  "total_flash_queries": 3
}
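
The default and clamping rule for `max_tokens` can be mirrored on the client to predict the value the server will echo back. A minimal sketch, with a hypothetical function name:

```python
def effective_max_tokens(requested=None) -> int:
    """Apply the documented default (32) and clamp to the range [1, 256]."""
    if requested is None:
        return 32
    return max(1, min(256, int(requested)))
```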

GET /v1/sessions/{session_id}/flash

List all registered Flash Queries for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

Response:

{
  "flash_queries": [
    {
      "id": 1,
      "query": "Is the trend bullish or bearish?",
      "max_tokens": 4,
      "n_tokens": 9,
      "has_answer": true,
      "value": "Bullish",
      "data_version": 15,
      "confidence": 0.92,
      "evaluated_at": 1711000000000
    },
    {
      "id": 2,
      "query": "What is the current support level?",
      "max_tokens": 16,
      "n_tokens": 12,
      "has_answer": true,
      "value": "Support at 185.10",
      "data_version": 15,
      "confidence": 0.87,
      "evaluated_at": 1711000000000
    }
  ],
  "count": 2,
  "current_data_version": 15
}
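
Each query carries the `data_version` it was evaluated at, and the listing reports `current_data_version`. Assuming an answer is up to date when those two values match, the current answers can be selected like this (the function name is illustrative):

```python
def fresh_answers(listing: dict) -> dict[int, str]:
    """Map query id -> value for answers computed at the current data version."""
    current = listing.get("current_data_version")
    return {
        q["id"]: q["value"]
        for q in listing.get("flash_queries", [])
        if q.get("has_answer") and q.get("data_version") == current
    }
```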

DELETE /v1/sessions/{session_id}/flash/{query_id}

Remove a Flash Query.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1

GET /v1/sessions/{session_id}/events

Subscribe to a Server-Sent Events stream for real-time Flash Query results.

curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

On connect the server emits a single connected event, replays the current cached answer for every query that has one, and then streams flash_ready events as answers update.

Events:

event: connected
data: {"data_version":15,"flash_queries":3}

event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","data_version":15,"confidence":0.92,"evaluated_at":1711000000000}

event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","data_version":15,"confidence":0.87,"evaluated_at":1711000000000}
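
Unlike the chat streams, these events carry their type on a separate `event:` line, so a consumer has to pair each `event:` with the `data:` that follows it. The generator below is a minimal sketch of that pairing over already-received lines. It is not a complete SSE implementation (comments, `id:` and `retry:` fields are ignored), and the function name is an assumption.

```python
import json


def parse_sse(lines):
    """Yield (event_name, decoded_data) for each blank-line-terminated event."""
    name, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line closes the current event.
            if data:
                yield name, json.loads("\n".join(data))
            name, data = "message", []
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:  # tolerate a final event with no trailing blank line
        yield name, json.loads("\n".join(data))
```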

WebSocket

WebSocket /v1/sessions/{session_id}/ws

Bidirectional WebSocket connection that combines data push and event delivery in a single connection.

Connect:

ws://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/ws

Inbound messages (client to server):

Push data:

{
  "type": "push",
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}

Ping:

{
  "type": "ping"
}

Outbound messages (server to client):

Connection established (followed immediately by flash_ready events for every query with a cached answer):

{
  "type": "connected",
  "data": {
    "session_id": "sess_a1b2c3d4e5f6",
    "data_version": 15,
    "streaming": true,
    "flash_queries": 3
  }
}

Flash Query result:

{
  "type": "flash_ready",
  "data": {
    "id": 1,
    "query": "Is the trend bullish or bearish?",
    "value": "Bullish",
    "data_version": 15,
    "confidence": 0.92,
    "evaluated_at": 1711000000000
  }
}

Error:

{
  "type": "error",
  "data": {
    "message": "Session not found",
    "code": 404
  }
}
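
The message shapes above can be handled without committing to a particular WebSocket library. The sketch below only encodes and decodes frames, leaving the socket itself to the caller. Both function names are hypothetical, and raising on an `error` frame is a design choice made for this example.

```python
import json


def push_message(entries: list[dict]) -> str:
    """Encode a client-to-server push frame."""
    return json.dumps({"type": "push", "data": entries})


def handle_frame(frame: str, answers: dict) -> str:
    """Apply one server-to-client frame to `answers` and return its type."""
    message = json.loads(frame)
    kind, data = message.get("type"), message.get("data", {})
    if kind == "flash_ready":
        answers[data["id"]] = data["value"]  # keep the latest value per query
    elif kind == "error":
        raise RuntimeError(f"{data.get('code')}: {data.get('message')}")
    return kind
```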

Metrics (LayerScale Pro)

GET /metrics

Prometheus-format metrics for scraping. Available only with LayerScale Pro.

curl http://localhost:8080/metrics

Returns text/plain Prometheus exposition output. Request counters, latency histograms, context pool occupancy, Flash Query cache hit rates, and streaming queue depths are exported here. On a free license this endpoint returns 402 license_required.

Error Responses

Most endpoints return errors in the form:

{
  "error": {
    "message": "Session not found",
    "code": 404
  }
}

License-gated errors (free-tier limits, missing or invalid Pro key) use a string slug in code:

{
  "error": {
    "code": "license_required",
    "message": "LayerScale allows 1 free session. Upgrade to LayerScale Pro for unlimited sessions. Visit https://layerscale.ai for details."
  }
}

The Anthropic-compatible endpoint POST /v1/messages returns Anthropic-shaped errors:

{
  "type": "error",
  "error": {
    "type": "invalid_request_error",
    "message": "max_tokens is required"
  }
}

HTTP Status	Meaning
400	Bad request (missing or invalid parameters)
401	Unauthorized (invalid or missing token)
402	License required (free-tier limit hit, e.g. second session or context > 32K)
404	Resource not found (invalid session ID)
409	Conflict (session is busy with another request)
503	Service unavailable (model not loaded, server starting)
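
Because the error body comes in three shapes, client code may find it convenient to normalize them into one `(code, message)` pair. The sketch below does that. Treating 409 and 503 as retryable is an assumption based on their descriptions in the table above, and all names are illustrative.

```python
RETRYABLE_STATUS = {409, 503}  # assumption: busy session / server still loading


def normalize_error(body: dict) -> tuple[str, str]:
    """Return (code, message) from any of the three documented error shapes."""
    err = body.get("error", {})
    if body.get("type") == "error" and "type" in err:
        # Anthropic-shaped error from POST /v1/messages.
        return err["type"], err.get("message", "")
    # Default shape: `code` is a number, or a string slug for license errors.
    return str(err.get("code", "")), err.get("message", "")


def should_retry(status: int) -> bool:
    """True for statuses that describe a temporary condition."""
    return status in RETRYABLE_STATUS
```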