
API Reference

Complete reference for all LayerScale API endpoints.

Base URL

http://localhost:8080

Authentication

All endpoints require authentication. Pass your API key as a bearer token:

Authorization: Bearer your-api-key

Or using the Anthropic-style header:

x-api-key: your-api-key
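Either header style is accepted; a client only needs to attach one of them. A minimal sketch of building the headers in Python (the key value is a placeholder):

```python
def auth_headers(api_key: str, style: str = "bearer") -> dict:
    """Build auth headers in either of the two accepted styles."""
    if style == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    return {"x-api-key": api_key}
```

Use whichever style matches your existing tooling; both are equivalent server-side.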

Health

GET /v1/health

Check if the server is ready to accept requests.

curl http://localhost:8080/v1/health

Response 200: Empty body. The model is loaded and the server is ready.

Response 503: Server is still loading the model.


Chat Completions (OpenAI-Compatible)

POST /v1/chat/completions

Generate a chat completion. Fully compatible with the OpenAI API format.

Request:

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "top_p": 0.9,
  "stream": false,
  "stop": ["\n\n"],
  "tools": []
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier (accepted but not used for routing) |
| messages | array | Yes | | Array of message objects with role and content |
| max_tokens | integer | No | 512 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature (0.0 to 2.0) |
| top_p | float | No | 0.9 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions for function calling |
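A request body can be assembled programmatically from these parameters. A minimal sketch (the defaults mirror the table above; only messages is required):

```python
import json

def build_chat_request(messages, max_tokens=512, temperature=0.7, stream=False):
    """Assemble the JSON body for POST /v1/chat/completions.

    Defaults mirror the parameter table; only "messages" is required.
    """
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }

# Serialize for the HTTP request body.
body = json.dumps(build_chat_request(
    [{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=100,
))
```

POST the serialized body to /v1/chat/completions with a Content-Type: application/json header and one of the auth headers described above.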

Response (non-streaming):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1711000000,
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "2 + 2 equals 4."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  }
}

Response (streaming):

Each SSE event contains a chunk:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"2"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" +"},"finish_reason":null}]}
data: [DONE]
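A client reassembles the streamed text by parsing each data: line and concatenating the delta content until the [DONE] sentinel. A minimal sketch:

```python
import json

def collect_stream(lines):
    """Concatenate delta content from chat.completion.chunk SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator / keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content") is not None:
            text.append(delta["content"])
    return "".join(text)
```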

Tool calling response:

When the model invokes a tool, the response includes tool_calls instead of text content:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"San Francisco\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
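Note that arguments arrives as a JSON-encoded string, not an object, so it needs its own decoding pass. A sketch of extracting the calls:

```python
import json

def extract_tool_calls(response):
    """Return (name, parsed_args) pairs from a tool-calling response.

    The "arguments" field is a JSON string, not an object,
    so it requires a second json.loads pass.
    """
    message = response["choices"][0]["message"]
    return [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in message.get("tool_calls", [])
    ]
```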

Messages (Anthropic-Compatible)

POST /v1/messages

Generate a message response. Fully compatible with the Anthropic API format.

Request:

{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "max_tokens": 1024,
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": "What is 2 + 2?"}
  ],
  "temperature": 0.7,
  "stream": false,
  "tools": []
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | No | | Model identifier |
| messages | array | Yes | | Array of message objects |
| max_tokens | integer | Yes | | Maximum tokens to generate |
| system | string | No | | System prompt |
| temperature | float | No | 0.7 | Sampling temperature |
| top_p | float | No | 0.9 | Nucleus sampling threshold |
| stream | boolean | No | false | Enable SSE streaming |
| stop_sequences | array | No | [] | Stop sequences |
| tools | array | No | [] | Tool definitions |

Response (non-streaming):

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "2 + 2 equals 4."
    }
  ],
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Original",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 25,
    "output_tokens": 8
  }
}

Response (streaming):

SSE events follow the Anthropic streaming format:

event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","type":"message","role":"assistant","content":[],"model":"meta-llama/Llama-4-Maverick-17B-128E-Original","usage":{"input_tokens":25}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"2 + 2"}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" equals 4."}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":8}}
event: message_stop
data: {"type":"message_stop"}
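To reconstruct the full message, a client accumulates text from content_block_delta events and reads the stop reason from message_delta. A minimal sketch over the decoded data: lines:

```python
import json

def collect_anthropic_stream(data_lines):
    """Accumulate text deltas; return (text, stop_reason) at message_stop."""
    text, stop_reason = [], None
    for line in data_lines:
        event = json.loads(line[len("data: "):])
        if event["type"] == "content_block_delta":
            text.append(event["delta"]["text"])
        elif event["type"] == "message_delta":
            stop_reason = event["delta"].get("stop_reason")
        elif event["type"] == "message_stop":
            break  # stream complete
    return "".join(text), stop_reason
```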

Tool use response:

{
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_abc123",
      "name": "get_weather",
      "input": {"location": "San Francisco"}
    }
  ],
  "stop_reason": "tool_use"
}

Models

GET /v1/models

List available models.

curl http://localhost:8080/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Llama-4-Maverick-17B-128E-Original",
      "object": "model",
      "owned_by": "meta-llama"
    }
  ]
}

Sessions

POST /v1/sessions/init

Create a new session. LayerScale’s stateful inference engine maintains context across multiple requests without re-processing the prompt.

Request:

{
  "type": "ohlcv",
  "prompt": "You are a financial analyst. Analyze the following market data as it arrives.",
  "flash": [
    {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
    {"query": "What is the current support level?", "max_tokens": 16}
  ]
}

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Stream data type: ohlcv, iot, spatial, event, or vitals |
| prompt | string | No | Initial system prompt to process and cache |
| context_size | integer | No | Context window size in tokens (default: server default) |
| window_size | integer | No | Maximum entries in the data buffer (auto-computed from context size and data type if omitted) |
| flash | array | No | Flash queries to register (see Flash Queries) |
| flash[].query | string | Yes | The question text |
| flash[].max_tokens | integer | No | Maximum answer tokens (default: 32) |
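A session-init body can be built up from these parameters. A minimal sketch (omitting window_size so the server auto-computes it):

```python
def build_session_init(data_type, prompt=None, flash=None):
    """Assemble the body for POST /v1/sessions/init.

    Only "type" is required; leaving out window_size lets the
    server derive it from the context size and data type.
    """
    body = {"type": data_type}
    if prompt is not None:
        body["prompt"] = prompt
    if flash:
        body["flash"] = flash
    return body
```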

Response:

{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 42,
  "context_size": 8192,
  "window_size": 370,
  "flash_queries": 2,
  "pos_max": 41
}

GET /v1/sessions

List all active sessions.

curl http://localhost:8080/v1/sessions

Response:

{
  "sessions": [
    {
      "session_id": "sess_a1b2c3d4e5f6",
      "n_tokens": 1024,
      "context_size": 8192,
      "cache_usage": 0.125,
      "created_at": 1711000000000,
      "last_used_at": 1711000060000,
      "in_use": false
    }
  ],
  "count": 1,
  "max_sessions": 10000
}

GET /v1/sessions/{session_id}/state

Get detailed state of a session including token position, context usage, and configuration.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/state

Response:

{
  "session_id": "sess_a1b2c3d4e5f6",
  "type": "ohlcv",
  "n_tokens": 2048,
  "context_size": 8192,
  "pos_min": 0,
  "pos_max": 2047,
  "pos_next": 2048,
  "prefix_end": 42,
  "data_region": {
    "start": 42,
    "end": 2048,
    "window_size": 370
  },
  "data_count": 85,
  "cache_usage": 0.25,
  "created_at": 1711000000000,
  "last_used_at": 1711000060000,
  "in_use": false
}

DELETE /v1/sessions/{session_id}

Delete a session and free its GPU memory.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6

Response:

{
  "status": "deleted"
}

POST /v1/sessions/{session_id}/append

Append text to the session context. The text is tokenized and processed into the session state.

Request:

{
  "text": "Additional context to append to this session."
}

Response:

{
  "status": "ok",
  "pos": 2100
}

POST /v1/sessions/{session_id}/generate

Generate text from the current session state. Supports streaming via SSE.

Request:

{
  "prompt": "Based on the data so far, what is the trend?",
  "max_tokens": 256,
  "temperature": 0.3,
  "stream": true
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | No | | Optional question to append before generating |
| max_tokens | integer | No | 256 | Maximum tokens to generate |
| temperature | float | No | 0.7 | Sampling temperature |
| stream | boolean | No | false | Stream response as SSE |

If the prompt matches a cached Flash Query, the cached answer is returned immediately with near-zero latency.

Response (non-streaming):

{
  "text": "The trend is bullish with strong support at 185.50.",
  "tokens_generated": 12,
  "from_cache": false
}

Response (streaming):

data: {"token": "The", "pos": 2101}
data: {"token": " trend", "pos": 2102}
data: {"token": " is", "pos": 2103}
data: [DONE]

If the answer comes from the Flash Query cache:

{
  "text": "Bullish",
  "tokens_generated": 1,
  "from_cache": true,
  "confidence": 0.92
}

Streaming Data

POST /v1/sessions/{session_id}/stream/push

Push data into a session’s streaming buffer. This endpoint is non-blocking: data is queued and processed in the background. The payload format is determined by the session’s type (set at creation); send only the data array.

OHLCV (Financial Market Data):

{
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| o | float | Open price |
| h | float | High price |
| l | float | Low price |
| c | float | Close price |
| v | float | Volume |
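A producer typically batches several candles into one push envelope. A minimal sketch of building that envelope (field values are illustrative; timestamp is optional and omitted here):

```python
def build_ohlcv_push(candles):
    """Wrap (open, high, low, close, volume) tuples in the push
    envelope for an "ohlcv" session, using the short field names
    from the table above."""
    return {"data": [
        {"o": o, "h": h, "l": l, "c": c, "v": v}
        for (o, h, l, c, v) in candles
    ]}
```

POST the resulting object to /v1/sessions/{session_id}/stream/push.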

IoT (Industrial Sensor Telemetry):

{
  "data": [
    {"sid": "temp-01", "val": 72.5, "lo": 60.0, "hi": 85.0}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| sid | string | Sensor identifier |
| val | float | Current reading |
| lo | float | Low threshold or observed minimum |
| hi | float | High threshold or observed maximum |

Spatial (Autonomous Systems):

{
  "data": [
    {"x": 37.7749, "y": -122.4194, "z": 150.0, "spd": 12.5, "hdg": 270.0}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| x | float | X position or latitude |
| y | float | Y position or longitude |
| z | float | Z position or altitude |
| spd | float | Speed |
| hdg | float | Heading in degrees (0-360) |

Event (Security and Monitoring):

{
  "data": [
    {"src": "10.0.1.5", "sev": 3, "cat": "auth_failure", "cnt": 12}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| src | string | Source identifier (IP, hostname, service) |
| sev | integer | Severity level (0 = info, 1 = low, 2 = medium, 3 = high, 4 = critical) |
| cat | string | Event category |
| cnt | integer | Event count in window |

Vitals (Healthcare Monitoring):

{
  "data": [
    {"hr": 72.0, "bp_s": 120.0, "bp_d": 80.0, "spo2": 98.5, "temp": 36.8}
  ]
}

| Field | Type | Description |
|---|---|---|
| timestamp | integer | Unix timestamp (optional) |
| hr | float | Heart rate (bpm) |
| bp_s | float | Systolic blood pressure (mmHg) |
| bp_d | float | Diastolic blood pressure (mmHg) |
| spo2 | float | Oxygen saturation (%) |
| temp | float | Body temperature (Celsius) |

Response:

{
  "pushed": 2,
  "dropped": 0,
  "queue_size": 42,
  "queue_capacity": 4096,
  "total_enqueued": 1500,
  "total_dropped": 0
}

| Field | Type | Description |
|---|---|---|
| pushed | integer | Number of entries accepted |
| dropped | integer | Entries dropped due to full buffer (overwritten) |
| queue_size | integer | Current entries waiting to be processed |
| queue_capacity | integer | Maximum ring buffer capacity |
| total_enqueued | integer | Lifetime total entries enqueued |
| total_dropped | integer | Lifetime total entries dropped |
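Because pushes are non-blocking, a producer should watch these fields for backpressure. One way to decide when to throttle (a sketch; the 0.8 high-water mark is an arbitrary client-side choice, not part of the API):

```python
def should_throttle(push_response, high_water=0.8):
    """Decide whether a producer should slow down, based on push feedback.

    Throttle if entries were dropped, or if the queue has filled
    past the high-water fraction of its capacity.
    """
    if push_response["dropped"] > 0:
        return True
    fill = push_response["queue_size"] / push_response["queue_capacity"]
    return fill >= high_water
```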

GET /v1/sessions/{session_id}/stream/status

Get the current streaming status for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/stream/status

Response:

{
  "type": "ohlcv",
  "streaming": {
    "initialized": true,
    "running": true,
    "error": false,
    "queue_size": 42,
    "queue_capacity": 4096,
    "window_size": 370,
    "process_interval_us": 10000
  },
  "statistics": {
    "total_enqueued": 1500,
    "total_dropped": 0,
    "batches_processed": 30,
    "items_processed": 1500,
    "total_process_time_us": 450000,
    "avg_process_time_us": 15000
  },
  "data": {
    "initialized": true,
    "data_count": 85,
    "total_data_tokens": 3400,
    "pos_next": 3442
  }
}

Flash Queries

POST /v1/sessions/{session_id}/flash

Register a new Flash Query on an existing session.

Request:

{
  "query": "Is volatility increasing?",
  "max_tokens": 8
}

Response:

{
  "id": 3,
  "query": "Is volatility increasing?",
  "status": "registered"
}

GET /v1/sessions/{session_id}/flash

List all registered Flash Queries for a session.

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

Response:

{
  "queries": [
    {
      "id": 1,
      "query": "Is the trend bullish or bearish?",
      "max_tokens": 4,
      "value": "Bullish",
      "confidence": 0.92,
      "last_evaluated_ms": 1711000000000
    },
    {
      "id": 2,
      "query": "What is the current support level?",
      "max_tokens": 16,
      "value": "Support at 185.10",
      "confidence": 0.87,
      "last_evaluated_ms": 1711000000000
    }
  ]
}

DELETE /v1/sessions/{session_id}/flash/{query_id}

Remove a Flash Query.

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1

GET /v1/sessions/{session_id}/events

Subscribe to a Server-Sent Events stream for real-time Flash Query results.

curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

Events:

event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","confidence":0.92,"data_version":15}
event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","confidence":0.87,"data_version":15}

WebSocket

WebSocket /v1/sessions/{session_id}/ws

Bidirectional WebSocket connection that combines data push and event delivery in a single connection.

Connect:

ws://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/ws

Inbound messages (client to server):

Push data:

{
  "type": "push",
  "data": [
    {"o": 185.50, "h": 186.20, "l": 185.10, "c": 185.80, "v": 12500}
  ]
}

Ping:

{
  "type": "ping"
}

Outbound messages (server to client):

Connection established:

{
  "type": "connected",
  "data": {
    "session_id": "sess_a1b2c3d4e5f6",
    "streaming": true
  }
}

Flash Query result:

{
  "type": "flash_ready",
  "data": {
    "id": 1,
    "query": "Is the trend bullish or bearish?",
    "value": "Bullish",
    "confidence": 0.92,
    "data_version": 15
  }
}

Error:

{
  "type": "error",
  "data": {
    "message": "Session not found",
    "code": 404
  }
}
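Since every outbound frame carries a type field, a client typically routes frames through a small dispatch table. A minimal sketch (connection handling omitted; the handler names are illustrative):

```python
import json

def dispatch(frame_text, handlers):
    """Route a decoded WebSocket frame to a handler keyed by its "type".

    Unknown message types are ignored so that new server-side
    frame types do not break older clients.
    """
    frame = json.loads(frame_text)
    handler = handlers.get(frame["type"])
    if handler is None:
        return None
    return handler(frame.get("data"))
```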

Error Responses

All endpoints return errors in a consistent format:

{
  "error": {
    "message": "Session not found",
    "type": "not_found",
    "code": 404
  }
}

| HTTP Status | Meaning |
|---|---|
| 400 | Bad request (missing or invalid parameters) |
| 401 | Unauthorized (invalid or missing token) |
| 404 | Resource not found (invalid session ID) |
| 409 | Conflict (session is busy with another request) |
| 503 | Service unavailable (model not loaded, server starting) |
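Because every endpoint shares this envelope, a client can centralize error handling in one place. A minimal sketch (the exception class is a client-side convention, not part of the API):

```python
import json

class LayerScaleError(Exception):
    """Raised when a response body carries the shared error envelope."""
    def __init__(self, message, error_type, code):
        super().__init__(f"{code} {error_type}: {message}")
        self.type = error_type
        self.code = code

def raise_for_error(body_text):
    """Parse a response body; raise LayerScaleError on the error envelope."""
    body = json.loads(body_text)
    if "error" in body:
        err = body["error"]
        raise LayerScaleError(err["message"], err["type"], err["code"])
    return body
```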