Flash Queries

Flash Queries is a fundamentally new interaction pattern for transformer-based inference. The model is no longer waiting for your question. It is already working on the answer.

Register questions upfront, and LayerScale evaluates them continuously in the background. After each data update, registered queries are re-evaluated and the results are pushed to your application via SSE or WebSocket. When you send the same question through the generate endpoint, the cached result is returned instantly.

How It Works

1. Register queries when you create a session (or add them later)
2. Push data into the session via the streaming endpoints
3. Background evaluation runs after each data batch is processed
4. Answers push to your application via SSE or WebSocket when results change
5. Instant responses when you call /generate with a matching query
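
The steps above can be sketched as a small in-memory model. This is an illustration only, not LayerScale's implementation: ToySession, FlashEntry, and toy_model are hypothetical names, a trivial price rule stands in for model inference, and exact string matching between the prompt and the registered query is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FlashEntry:
    query: str
    value: Optional[str] = None   # cached answer, None until the first evaluation
    data_version: int = 0         # data version the cached answer was computed against


@dataclass
class ToySession:
    """Hypothetical stand-in for a session; `evaluate` replaces real inference."""
    evaluate: Callable[[str, list], str]
    on_flash_ready: Callable[[FlashEntry], None]
    data: list = field(default_factory=list)
    data_version: int = 0
    entries: dict = field(default_factory=dict)

    def register(self, query: str) -> None:
        # Step 1: remember the question so it is evaluated after every batch.
        self.entries[query] = FlashEntry(query)

    def push(self, batch: list) -> None:
        # Steps 2-4: ingest a batch, re-evaluate each query, push answers that changed.
        self.data.extend(batch)
        self.data_version += 1
        for entry in self.entries.values():
            new_value = self.evaluate(entry.query, self.data)
            changed = new_value != entry.value
            entry.value, entry.data_version = new_value, self.data_version
            if changed:
                self.on_flash_ready(entry)

    def generate(self, prompt: str) -> tuple:
        # Step 5: a matching, fresh entry is served from cache (assumed exact match).
        entry = self.entries.get(prompt)
        if entry and entry.value is not None and entry.data_version == self.data_version:
            return entry.value, True
        return self.evaluate(prompt, self.data), False


def toy_model(query: str, prices: list) -> str:
    # Trivial rule in place of a model: compare the last price with the first.
    if len(prices) < 2:
        return "Unknown"
    return "Bullish" if prices[-1] > prices[0] else "Bearish"


pushed = []
session = ToySession(
    evaluate=toy_model,
    on_flash_ready=lambda e: pushed.append((e.data_version, e.value)),
)
session.register("Is the trend bullish or bearish?")
session.push([100.0, 101.5])   # first answer, pushed
session.push([102.0])          # same answer, nothing pushed
session.push([99.0])           # answer flips, pushed
print(pushed)                  # [(1, 'Bullish'), (3, 'Bearish')]
print(session.generate("Is the trend bullish or bearish?"))  # ('Bearish', True)
```

The second element of the tuple returned by generate plays the role of the flash flag in the real response: it is True only when the answer came from the cache.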

Each query result includes a confidence score derived from the model’s logit gap. Higher confidence means the model was more decisive in its answer.

Registering Flash Queries

At Session Creation

import { LayerScale } from "@layerscale/layerscale";

const client = new LayerScale("http://localhost:8080", { apiKey: "LS-..." });

const session = await client.sessions.create({
    type: "ohlcv",
    prompt: "You are a market analyst monitoring real-time price data.",
    flash: [
        { query: "Is the trend bullish or bearish?", max_tokens: 4 },
        { query: "What is the current support level?", max_tokens: 16 },
        { query: "Is volatility increasing?", max_tokens: 8 },
    ],
    markPrefix: true,
});

from layerscale import LayerScale

client = LayerScale("http://localhost:8080", api_key="LS-...")

session = client.sessions.create(
    type="ohlcv",
    prompt="You are a market analyst monitoring real-time price data.",
    flash=[
        {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
        {"query": "What is the current support level?", "max_tokens": 16},
        {"query": "Is volatility increasing?", "max_tokens": 8},
    ],
    mark_prefix=True,
)

curl -X POST http://localhost:8080/v1/sessions/init \
  -H "Content-Type: application/json" \
  -d '{
    "type": "ohlcv",
    "prompt": "You are a market analyst monitoring real-time price data.",
    "flash": [
      {"query": "Is the trend bullish or bearish?", "max_tokens": 4},
      {"query": "What is the current support level?", "max_tokens": 16},
      {"query": "Is volatility increasing?", "max_tokens": 8}
    ]
  }'

On an Existing Session

const registered = await client.sessions.flash(
    session.session_id,
    "Are there any volume anomalies?",
    32,
);

registered = client.sessions.flash(
    session.session_id,
    "Are there any volume anomalies?",
    max_tokens=32,
)

curl -X POST http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Are there any volume anomalies?",
    "max_tokens": 32
  }'

Response:

{
  "id": 4,
  "query": "Are there any volume anomalies?",
  "max_tokens": 32,
  "n_tokens": 6,
  "total_flash_queries": 4
}

Query Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | string | (required) | The question to evaluate continuously |
| max_tokens | integer | 32 | Maximum tokens in the answer (clamped to [1, 256]). Shorter answers evaluate faster. |

Keep max_tokens low for yes/no or categorical questions. Use higher values for questions that need a short explanation. A session can hold up to 20 Flash Queries concurrently.
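
If you build the flash list dynamically, it can help to apply the documented limits on the client before sending the request. The helper below is a sketch: normalize_flash is a hypothetical name, not part of the SDK, and it only mirrors the default, clamp range, and per-session cap stated above.

```python
MAX_FLASH_QUERIES = 20            # per-session cap stated above
MIN_TOKENS, MAX_TOKENS = 1, 256   # clamp range stated above
DEFAULT_MAX_TOKENS = 32           # default stated above


def normalize_flash(specs: list) -> list:
    """Return flash specs with max_tokens defaulted and clamped, or raise on bad input."""
    if len(specs) > MAX_FLASH_QUERIES:
        raise ValueError(
            f"{len(specs)} flash queries exceeds the cap of {MAX_FLASH_QUERIES}"
        )
    normalized = []
    for spec in specs:
        query = spec.get("query", "").strip()
        if not query:
            raise ValueError("each flash query needs a non-empty 'query' string")
        requested = int(spec.get("max_tokens", DEFAULT_MAX_TOKENS))
        clamped = max(MIN_TOKENS, min(MAX_TOKENS, requested))
        normalized.append({"query": query, "max_tokens": clamped})
    return normalized


print(normalize_flash([
    {"query": "Is volatility increasing?", "max_tokens": 1000},
    {"query": "Is a reversal forming?"},
]))
```

The server clamps max_tokens on its own, so the main benefit of a client-side check is catching an empty query or an oversized list before the request is made.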

Receiving Results

SSE (Server-Sent Events)

Connect to the events endpoint to receive results as they update:

for await (const event of client.sessions.events(session.session_id)) {
    if (event.type === "flash_ready") {
        console.log(
            `[${(event.confidence * 100).toFixed(0)}%] ${event.query}: ${event.value}`,
        );
    }
}

with client.sessions.events(session.session_id) as events:
    for event in events:
        if event.type == "flash_ready":
            print(f"[{event.confidence:.0%}] {event.query}: {event.value}")

curl -N http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/events

The server first emits a single connected event with the current data version and flash query count, replays the cached answer (if any) for every registered query, and then streams flash_ready events as answers update:

event: connected
data: {"data_version":15,"flash_queries":3}

event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","data_version":15,"confidence":0.92,"evaluated_at":1711000000000}

event: flash_ready
data: {"id":2,"query":"What is the current support level?","value":"Support at 185.10","data_version":15,"confidence":0.87,"evaluated_at":1711000000000}

event: flash_ready
data: {"id":3,"query":"Is volatility increasing?","value":"No","data_version":15,"confidence":0.78,"evaluated_at":1711000000000}

SSE Event Format

| Field | Type | Description |
|-------|------|-------------|
| id | integer | Flash query ID |
| query | string | The registered question |
| value | string | The model’s output |
| data_version | integer | Data version this answer was evaluated against |
| confidence | float | Confidence score (0.0 to 1.0) based on logit gap |
| evaluated_at | integer | Unix millisecond timestamp when this answer was produced |
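
If you consume the events endpoint without the SDK, the stream shown above can be parsed with a few lines of standard-library code. This is a minimal sketch: parse_sse is a hypothetical helper, it handles only the event: and data: fields used in the example, and unlike a strict SSE client it also flushes a final event that lacks a trailing blank line, which is convenient for captured transcripts.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (event_name, payload_dict) for each blank-line-terminated SSE event."""
    event_name, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield event_name, json.loads("\n".join(data_lines))
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip(" "))
    if data_lines:
        # Lenient: emit a trailing event even without a closing blank line.
        yield event_name, json.loads("\n".join(data_lines))


sample = """event: connected
data: {"data_version":15,"flash_queries":3}

event: flash_ready
data: {"id":1,"query":"Is the trend bullish or bearish?","value":"Bullish","data_version":15,"confidence":0.92,"evaluated_at":1711000000000}
"""

events = list(parse_sse(sample.splitlines()))
for name, payload in events:
    if name == "flash_ready":
        print(f"[{payload['confidence']:.0%}] {payload['query']}: {payload['value']}")
```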

WebSocket

If you are already connected via WebSocket for data push, Flash Query results arrive on the same connection:

const socket = client.sessions.stream(session.session_id);

socket.on("flash_ready", (data) => {
    console.log(`[${(data.confidence * 100).toFixed(0)}%] ${data.query}: ${data.value}`);
});

from layerscale import WsFlashReady

with client.sessions.stream(session.session_id) as socket:
    for event in socket:
        if isinstance(event, WsFlashReady):
            d = event.data
            print(f"[{d.confidence:.0%}] {d.query}: {d.value}")

{
  "type": "flash_ready",
  "data": {
    "id": 1,
    "query": "Is the trend bullish or bearish?",
    "value": "Bullish",
    "confidence": 0.92,
    "data_version": 15
  }
}
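
When reading raw WebSocket frames without the SDK, the message above can be routed on its type field. The class below is a sketch under that assumption: MessageRouter is a hypothetical name, and it relies only on the type and data keys shown in the example message.

```python
import json
from typing import Callable, Dict


class MessageRouter:
    """Route JSON frames shaped like {"type": ..., "data": {...}} to handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}

    def on(self, message_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[message_type] = handler

    def dispatch(self, frame: str) -> bool:
        """Return True if a handler ran, False for unknown or malformed frames."""
        try:
            message = json.loads(frame)
        except json.JSONDecodeError:
            return False
        if not isinstance(message, dict):
            return False
        handler = self._handlers.get(message.get("type"))
        if handler is None:
            return False
        handler(message.get("data", {}))
        return True


received = []
router = MessageRouter()
router.on("flash_ready", lambda data: received.append((data["id"], data["value"])))

frame = (
    '{"type": "flash_ready", "data": {"id": 1, '
    '"query": "Is the trend bullish or bearish?", "value": "Bullish", '
    '"confidence": 0.92, "data_version": 15}}'
)
handled = router.dispatch(frame)
print(handled, received)
```

Returning a boolean instead of raising keeps a long-lived connection alive when an unrecognized or malformed frame arrives.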

Instant Cache Hits

When you call the generate endpoint with a prompt that matches a registered Flash Query, the cached answer is returned immediately without running inference:

const result = await client.sessions.query(session.session_id, {
    prompt: "Is the trend bullish or bearish?",
    max_tokens: 4,
});
// result.flash === true when a cached flash answer was returned

result = client.sessions.query(
    session.session_id,
    prompt="Is the trend bullish or bearish?",
    max_tokens=4,
)
# result.flash == True when a cached flash answer was returned

curl -X POST http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Is the trend bullish or bearish?",
    "max_tokens": 4
  }'

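If the cache is fresh (evaluated against the latest data version), the response is marked with flash: true and includes the query ID, data version, and confidence of the cached answer: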

{
  "text": "Bullish",
  "n_tokens": 1,
  "total_tokens": 2101,
  "pos_max": 2100,
  "flash": true,
  "flash_id": 1,
  "data_version": 15,
  "confidence": 0.92
}

This enables response times on the order of a few milliseconds (the cost is a JSON lookup and serialization, no GPU work).
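
Client code often needs to tell a cached answer from a freshly generated one. The sketch below reads the fields shown in the response above; GenerateResult, parse_generate, and accept are hypothetical names, and treating a missing flash field as "not cached" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerateResult:
    text: str
    cached: bool                  # True when the server set "flash": true
    flash_id: Optional[int]
    data_version: Optional[int]
    confidence: Optional[float]


def parse_generate(payload: dict) -> GenerateResult:
    """Extract the cache-related fields from a /generate response body."""
    return GenerateResult(
        text=payload["text"],
        cached=bool(payload.get("flash", False)),
        flash_id=payload.get("flash_id"),
        data_version=payload.get("data_version"),
        confidence=payload.get("confidence"),
    )


def accept(result: GenerateResult, min_confidence: float = 0.7) -> bool:
    """Accept uncached answers as-is; require a confidence floor for cached ones."""
    if not result.cached or result.confidence is None:
        return True
    return result.confidence >= min_confidence


result = parse_generate({
    "text": "Bullish", "n_tokens": 1, "total_tokens": 2101, "pos_max": 2100,
    "flash": True, "flash_id": 1, "data_version": 15, "confidence": 0.92,
})
print(result.cached, result.flash_id, accept(result))
```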

Managing Flash Queries

List Queries

const list = await client.sessions.listFlash(session.session_id);
for (const q of list.flash_queries) {
    console.log(`#${q.id} ${q.query} -> ${q.value}`);
}

listing = client.sessions.list_flash(session.session_id)
for q in listing.flash_queries:
    print(f"#{q.id} {q.query} -> {q.value}")

curl http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash

{
  "flash_queries": [
    {
      "id": 1,
      "query": "Is the trend bullish or bearish?",
      "max_tokens": 4,
      "n_tokens": 9,
      "has_answer": true,
      "value": "Bullish",
      "data_version": 15,
      "confidence": 0.92,
      "evaluated_at": 1711000000000
    },
    {
      "id": 2,
      "query": "What is the current support level?",
      "max_tokens": 16,
      "n_tokens": 12,
      "has_answer": true,
      "value": "Support at 185.10",
      "data_version": 15,
      "confidence": 0.87,
      "evaluated_at": 1711000000000
    }
  ],
  "count": 2,
  "current_data_version": 15
}
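
The listing carries enough information to spot answers that lag behind the data. As a sketch, the hypothetical helper stale_queries below compares each entry's data_version with current_data_version and also flags entries with no answer yet; it uses only the fields shown in the response above.

```python
def stale_queries(listing: dict) -> list:
    """Return IDs of flash queries with no answer yet or an answer older than the data."""
    current = listing["current_data_version"]
    return [
        q["id"]
        for q in listing["flash_queries"]
        if not q.get("has_answer") or q.get("data_version", -1) < current
    ]


listing = {
    "flash_queries": [
        {"id": 1, "has_answer": True, "data_version": 15},
        {"id": 2, "has_answer": True, "data_version": 14},
        {"id": 3, "has_answer": False},
    ],
    "count": 3,
    "current_data_version": 15,
}
print(stale_queries(listing))  # [2, 3]
```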

Delete a Query

await client.sessions.unflash(session.session_id, 1);

client.sessions.unflash(session.session_id, 1)

curl -X DELETE http://localhost:8080/v1/sessions/sess_a1b2c3d4e5f6/flash/1

Confidence Scores

The confidence score is computed from the gap between the model’s top-1 and top-2 token candidates (the logit gap) at each generation step. A higher score means the model was more certain.

| Score Range | Interpretation |
|-------------|----------------|
| 0.90+ | High confidence, strong signal |
| 0.70 - 0.89 | Moderate confidence |
| 0.50 - 0.69 | Low confidence, consider a longer answer |
| Below 0.50 | Uncertain, the model is hedging |
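
The exact formula is not specified here. As an illustration only, one plausible way to turn a top-1 versus top-2 gap into a score between 0 and 1 is to take the difference of the two largest softmax probabilities at each step and average over the generated tokens. The functions step_gap and confidence below are hypothetical and are not LayerScale's implementation.

```python
import math


def step_gap(logits: list) -> float:
    """Top-1 minus top-2 softmax probability for one generation step (needs >= 2 logits)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    top1, top2 = sorted((e / total for e in exps), reverse=True)[:2]
    return top1 - top2


def confidence(steps: list) -> float:
    """Average the per-step gap over all generated tokens (assumed aggregation)."""
    return sum(step_gap(logits) for logits in steps) / len(steps)


decisive = [[10.0, 0.0, 0.0]]    # one candidate dominates
torn = [[1.0, 1.0, 0.0]]         # two candidates tie
print(round(confidence(decisive), 3), round(confidence(torn), 3))
```

Under this definition a tie between the two best candidates scores 0.0 and a single dominant candidate approaches 1.0, which matches the direction of the table above.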

Use confidence scores to filter out noisy results or trigger alerts only on high-confidence changes.

Practical Examples

const session = await client.sessions.create({
    type: "ohlcv",
    prompt: "You are a quantitative analyst monitoring BTC/USD 1-minute candles.",
    flash: [
        { query: "Is momentum bullish or bearish?", max_tokens: 4 },
        { query: "Is a reversal forming?", max_tokens: 8 },
        { query: "What is the key support level?", max_tokens: 16 },
        { query: "What is the key resistance level?", max_tokens: 16 },
    ],
    markPrefix: true,
});

session = client.sessions.create(
    type="ohlcv",
    prompt="You are a quantitative analyst monitoring BTC/USD 1-minute candles.",
    flash=[
        {"query": "Is momentum bullish or bearish?", "max_tokens": 4},
        {"query": "Is a reversal forming?", "max_tokens": 8},
        {"query": "What is the key support level?", "max_tokens": 16},
        {"query": "What is the key resistance level?", "max_tokens": 16},
    ],
    mark_prefix=True,
)

Anomaly Detection

const session = await client.sessions.create({
    type: "iot",
    prompt:
        "You are an anomaly detection system for server metrics. " +
        "Flag any unusual patterns in CPU, memory, or network metrics.",
    flash: [
        { query: "Are there any anomalies in the current data?", max_tokens: 32 },
        { query: "Is resource usage trending toward capacity?", max_tokens: 16 },
        { query: "Should ops be alerted?", max_tokens: 4 },
    ],
    markPrefix: true,
});

session = client.sessions.create(
    type="iot",
    prompt=(
        "You are an anomaly detection system for server metrics. "
        "Flag any unusual patterns in CPU, memory, or network metrics."
    ),
    flash=[
        {"query": "Are there any anomalies in the current data?", "max_tokens": 32},
        {"query": "Is resource usage trending toward capacity?", "max_tokens": 16},
        {"query": "Should ops be alerted?", "max_tokens": 4},
    ],
    mark_prefix=True,
)

Threshold Alerts

Use Flash Queries with a confidence filter to build alerting:

const ALERT_THRESHOLD = 0.85;

for await (const event of client.sessions.events(session.session_id)) {
    if (event.type !== "flash_ready") continue;
    if (event.confidence < ALERT_THRESHOLD) continue;

    // Naive substring match on the answer; sendAlert is your own notification function.
    const answer = event.value.toLowerCase();
    if (answer.includes("alert") || answer.includes("yes")) {
        sendAlert({
            query: event.query,
            answer: event.value,
            confidence: event.confidence,
        });
    }
}

ALERT_THRESHOLD = 0.85

with client.sessions.events(session.session_id) as events:
    for event in events:
        if event.type != "flash_ready":
            continue
        if event.confidence < ALERT_THRESHOLD:
            continue

        # Naive substring match on the answer; send_alert is your own notification function.
        answer = event.value.lower()
        if "alert" in answer or "yes" in answer:
            send_alert(
                query=event.query,
                answer=event.value,
                confidence=event.confidence,
            )
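
The loops above fire on every matching event. To alert only when a high-confidence answer changes, you can remember the last confident value per query. The class below is a sketch: ChangeAlerter is a hypothetical name, events are modeled as plain dicts with the id, value, and confidence fields from the event format, and treating the first confident answer as a baseline rather than an alert is a design assumption.

```python
class ChangeAlerter:
    """Signal only when a query's high-confidence answer differs from the previous one."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self._last: dict = {}   # flash query id -> last value seen at or above the threshold

    def should_alert(self, event: dict) -> bool:
        if event["confidence"] < self.threshold:
            return False        # ignore low-confidence answers entirely
        previous = self._last.get(event["id"])
        self._last[event["id"]] = event["value"]
        # The first confident answer sets the baseline and does not alert.
        return previous is not None and previous != event["value"]


alerter = ChangeAlerter(threshold=0.85)
stream = [
    {"id": 3, "value": "No", "confidence": 0.90},
    {"id": 3, "value": "No", "confidence": 0.95},
    {"id": 3, "value": "Yes", "confidence": 0.60},
    {"id": 3, "value": "Yes", "confidence": 0.91},
    {"id": 3, "value": "Yes", "confidence": 0.93},
]
decisions = [alerter.should_alert(e) for e in stream]
print(decisions)  # [False, False, False, True, False]
```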

Performance Considerations

- Each Flash Query adds evaluation time after every data batch. Each session is capped at 20 registered queries.
- Use low max_tokens for categorical answers. A yes/no question needs max_tokens: 4, not 128.
- Flash Queries share the session’s GPU context. They are evaluated sequentially after data processing, not in parallel.
- The decode scheduler runs with FLASH > SESSION > POOL > STREAM priority, so Flash Query evaluation preempts background streaming work and pool-backed chat completions, keeping cached answers fresh under heavy data load.
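
The priority order above can be pictured as a priority queue. The sketch below models only the ordering of pending work, not preemption of a job that is already running, and it is not the scheduler's actual code: DecodeQueue and the job labels are hypothetical.

```python
import heapq
from itertools import count

# Lower number runs first, mirroring FLASH > SESSION > POOL > STREAM.
PRIORITY = {"FLASH": 0, "SESSION": 1, "POOL": 2, "STREAM": 3}


class DecodeQueue:
    """Order pending decode jobs by class priority, FIFO within a class."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = count()   # tie-breaker that preserves submission order

    def submit(self, kind: str, job: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), job))

    def drain(self) -> list:
        order = []
        while self._heap:
            _, _, job = heapq.heappop(self._heap)
            order.append(job)
        return order


queue = DecodeQueue()
queue.submit("STREAM", "stream-1")
queue.submit("POOL", "chat-1")
queue.submit("FLASH", "flash-1")
queue.submit("SESSION", "generate-1")
queue.submit("FLASH", "flash-2")
order = queue.drain()
print(order)  # ['flash-1', 'flash-2', 'generate-1', 'chat-1', 'stream-1']
```

Because Flash Queries are evaluated sequentially, every registered query sits in the highest-priority class after each batch, which is one more reason to keep the list short and max_tokens low.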