
Flash Queries

Flash Queries let you register questions that LayerScale evaluates continuously in the background. After each data update, registered queries are re-evaluated and answers are pushed to your application via SSE or WebSocket. When you later send the same question to the generate endpoint, the cached answer is returned instantly.

How It Works

  1. Register queries when you create a session (or add them later)
  2. Push data into the session via the streaming endpoints
  3. Background evaluation runs after each data batch is processed
  4. Answers push to your application via SSE or WebSocket when results change
  5. Instant responses when you call /generate with a matching query

Each query result includes a confidence score derived from the model’s logit gap. Higher confidence means the model was more decisive in its answer.

Registering Flash Queries

At Session Creation

curl -X POST http://localhost:8080/v1/sessions/init \
  -H "Content-Type: application/json" \
  -d '{
    "type": "ohlcv",
    "prompt": "You are a market analyst monitoring real-time price data.",
    "flash": [
      {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
      {"query": "What is the current support level?", "max_tokens": 16},
      {"query": "Is volatility increasing?", "max_tokens": 8}
    ]
  }'

On an Existing Session

curl -X POST http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Are there any volume anomalies?",
    "max_tokens": 32
  }'

Response:

{
  "id": 4,
  "query": "Are there any volume anomalies?",
  "status": "registered"
}
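In Python, the same registration call can be wrapped in a small helper. A minimal sketch, assuming the `requests` package; `flash_endpoint` and `register_flash_query` are illustrative names, not part of the API:

```python
import requests

BASE_URL = "http://localhost:8080"

def flash_endpoint(session_id: str) -> str:
    """Build the per-session Flash Query endpoint URL."""
    return f"{BASE_URL}/v1/sessions/{session_id}/flash"

def register_flash_query(session_id: str, query: str, max_tokens: int = 32) -> dict:
    """Register a query on an existing session; the response carries its id and status."""
    resp = requests.post(
        flash_endpoint(session_id),
        json={"query": query, "max_tokens": max_tokens},
    )
    resp.raise_for_status()
    return resp.json()
```

For example, `register_flash_query("sess_a1b2c3d4e5f6", "Are there any volume anomalies?")` mirrors the curl call above.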

Query Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | string | (required) | The question to evaluate continuously |
| `max_tokens` | integer | 32 | Maximum tokens in the answer; shorter answers evaluate faster |

Keep max_tokens low for yes/no or categorical questions. Use higher values for questions that need a short explanation.

Receiving Results

SSE (Server-Sent Events)

Connect to the events endpoint to receive results as they update:

curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

Each event arrives when a query answer changes after new data:

event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","confidence":0.92,"data_version":15}

event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","confidence":0.87,"data_version":15}

event: flash_ready
data: {"id":3,"query":"Is volatility increasing?","value":"No","confidence":0.78,"data_version":15}

SSE Event Format

| Field | Type | Description |
| --- | --- | --- |
| `id` | integer | Flash query ID |
| `query` | string | The registered question |
| `value` | string | The model's output |
| `confidence` | float | Confidence score (0.0 to 1.0) based on logit gap |
| `data_version` | integer | Data version this answer was evaluated against |

WebSocket

If you are already connected via WebSocket for data push, Flash Query results arrive on the same connection:

{
  "type": "flash_ready",
  "data": {
    "id": 1,
    "query": "Is the trend bullish or bearish?",
    "value": "Bullish",
    "confidence": 0.92,
    "data_version": 15
  }
}

Python Example

import json
import threading

import requests
import sseclient

SESSION_ID = "sess_a1b2c3d4e5f6"
BASE_URL = "http://localhost:8080"

def listen_for_results():
    response = requests.get(
        f"{BASE_URL}/v1/sessions/{SESSION_ID}/events",
        stream=True,
    )
    client = sseclient.SSEClient(response)
    for event in client.events():
        if event.event == "flash_ready":
            data = json.loads(event.data)
            print(f"[{data['confidence']:.0%}] {data['query']}: {data['value']}")

# Run in background so the main thread stays free
thread = threading.Thread(target=listen_for_results, daemon=True)
thread.start()

TypeScript Example

const eventSource = new EventSource(
  "http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events"
);

eventSource.addEventListener("flash_ready", (event) => {
  const data = JSON.parse(event.data);
  console.log(`[${(data.confidence * 100).toFixed(0)}%] ${data.query}: ${data.value}`);
});

eventSource.onerror = (err) => {
  console.error("SSE connection error:", err);
};

Instant Cache Hits

When you call the generate endpoint with a prompt that matches a registered Flash Query, the cached answer is returned immediately without running inference:

curl -X POST http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Is the trend bullish or bearish?",
    "max_tokens": 4
  }'

If the cache is fresh (evaluated against the latest data version):

{
  "text": "Bullish",
  "tokens_generated": 1,
  "from_cache": true,
  "confidence": 0.92
}

This enables sub-millisecond response times for monitored queries.

Managing Flash Queries

List Queries

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

Response:

{
  "queries": [
    {
      "id": 1,
      "query": "Is the trend bullish or bearish?",
      "max_tokens": 4,
      "value": "Bullish",
      "confidence": 0.92,
      "last_evaluated_ms": 1711000000000
    },
    {
      "id": 2,
      "query": "What is the current support level?",
      "max_tokens": 16,
      "value": "Support at 185.10",
      "confidence": 0.87,
      "last_evaluated_ms": 1711000000000
    }
  ]
}

Delete a Query

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1
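Listing and deleting combine naturally into housekeeping, for example pruning queries whose latest answer is low-confidence. A sketch assuming the `requests` package; the pruning policy and helper names are assumptions, not part of the API:

```python
import requests

BASE_URL = "http://localhost:8080"

def select_stale(queries: list[dict], min_confidence: float) -> list[int]:
    """Pick IDs of queries whose latest answer fell below the confidence floor."""
    return [q["id"] for q in queries if q.get("confidence", 1.0) < min_confidence]

def prune_flash_queries(session_id: str, min_confidence: float = 0.5) -> list[int]:
    """Delete low-confidence queries from a session and return the deleted IDs."""
    base = f"{BASE_URL}/v1/sessions/{session_id}/flash"
    queries = requests.get(base).json()["queries"]
    stale = select_stale(queries, min_confidence)
    for qid in stale:
        requests.delete(f"{base}/{qid}")
    return stale
```

Keeping the selection logic separate from the HTTP calls makes the policy easy to test and adjust.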

Confidence Scores

The confidence score is derived from the logit gap at each generation step: the difference between the top-1 and top-2 token probabilities. A higher score means the model was more certain of its answer.

| Score range | Interpretation |
| --- | --- |
| 0.90+ | High confidence, strong signal |
| 0.70–0.89 | Moderate confidence |
| 0.50–0.69 | Low confidence; consider a longer answer |
| Below 0.50 | Uncertain; the model is hedging |

Use confidence scores to filter out noisy results or trigger alerts only on high-confidence changes.
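As an illustration of the idea (not LayerScale's actual implementation; in particular, the per-step aggregation here is an assumption), a per-step score can be computed as the gap between the top two softmax probabilities:

```python
import math

def step_confidence(logits: list[float]) -> float:
    """Gap between the top-1 and top-2 softmax probabilities for one decode step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    p1, p2 = sorted((e / total for e in exps), reverse=True)[:2]
    return p1 - p2

def answer_confidence(per_step_logits: list[list[float]]) -> float:
    """Aggregate per-step gaps over the whole answer (averaging is assumed here)."""
    steps = [step_confidence(logits) for logits in per_step_logits]
    return sum(steps) / len(steps)
```

A decisive step such as logits `[8.0, 1.0, 0.5]` yields a gap near 1.0, while near-ties between the top two tokens yield a gap near 0.0.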

Practical Examples

Market Trend Monitoring

import requests

session = requests.post("http://localhost:8080/v1/sessions/init", json={
    "type": "ohlcv",
    "prompt": "You are a quantitative analyst monitoring BTC/USD 1-minute candles.",
    "flash": [
        {"query": "Is momentum bullish or bearish?", "max_tokens": 4},
        {"query": "Is a reversal forming?", "max_tokens": 8},
        {"query": "What is the key support level?", "max_tokens": 16},
        {"query": "What is the key resistance level?", "max_tokens": 16},
    ],
}).json()

Anomaly Detection

import requests

session = requests.post("http://localhost:8080/v1/sessions/init", json={
    "type": "iot",
    "prompt": "You are an anomaly detection system for server metrics. "
              "Flag any unusual patterns in CPU, memory, or network metrics.",
    "flash": [
        {"query": "Are there any anomalies in the current data?", "max_tokens": 32},
        {"query": "Is resource usage trending toward capacity?", "max_tokens": 16},
        {"query": "Should ops be alerted?", "max_tokens": 4},
    ],
}).json()

Threshold Alerts

Use Flash Queries with a confidence filter to build alerting:

import json

import requests
import sseclient

ALERT_THRESHOLD = 0.85

# session_id comes from a previous /v1/sessions/init call;
# send_alert is your own notification function
response = requests.get(
    f"http://localhost:8080/v1/sessions/{session_id}/events",
    stream=True,
)
for event in sseclient.SSEClient(response).events():
    if event.event == "flash_ready":
        data = json.loads(event.data)
        # Only act on high-confidence answers
        if data["confidence"] >= ALERT_THRESHOLD:
            if "alert" in data["value"].lower() or "yes" in data["value"].lower():
                send_alert(
                    query=data["query"],
                    answer=data["value"],
                    confidence=data["confidence"],
                )

Performance Considerations

  • Each Flash Query adds evaluation time after every data batch. Keep the number of queries reasonable (under 20 per session).
  • Use low max_tokens for categorical answers. A yes/no question needs max_tokens: 4, not 128.
  • Flash Queries share the session’s GPU context. They are evaluated sequentially after data processing, not in parallel.
  • The decode scheduler prioritizes Flash Query evaluation (FLASH priority) over streaming data processing (STREAM priority), so queries stay responsive even under heavy data load.