LayerScale - The Stateful Inference Engine
LayerScale is a real-time inference engine for LLMs. Deploy any open-weight model on your own GPU, process data as it arrives, and get sub-50ms query responses. Drop-in compatible with the OpenAI and Anthropic SDKs - change two lines and you’re running on your own hardware.
Three things that make LayerScale different
Streaming Inference
Data flows into the model continuously via HTTP or WebSocket. No polling. No batching on your side. Push a market tick, a sensor reading, or a log line - the model incorporates it into its live context in the background. By the time you query, it has already reasoned over your latest data. Learn more →
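A minimal sketch of pushing one data point over HTTP, using only the standard library. The ingest URL and the payload fields are illustrative assumptions, not a documented API:

```python
import json
import urllib.request

# Assumption: a local LayerScale server with an HTTP ingest endpoint.
INGEST_URL = "http://localhost:8080/v1/stream/ingest"

def make_ingest_request(payload: dict) -> urllib.request.Request:
    """Build a POST that pushes one data point into the live context."""
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# A hypothetical market tick; any JSON-serializable event works the same way.
req = make_ingest_request({"type": "market_tick", "symbol": "AAPL", "price": 189.44})
# urllib.request.urlopen(req)  # uncomment to send against a live server
```

The same pattern applies to sensor readings or log lines: fire and forget, then query whenever you need an answer.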
Flash Queries
Register questions that the engine evaluates after every data update. Answers are cached and pushed to your application in real time. When you ask a question that matches a Flash Query, the answer comes back in sub-millisecond time - no inference, just a cache hit. Think of it as a standing order for intelligence. Learn more →
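A sketch of what registering a Flash Query might look like over HTTP. The endpoint path and field names (`id`, `question`) are assumptions for illustration, not the documented schema:

```python
import json
import urllib.request

# Assumption: a registration endpoint on a local LayerScale server.
FLASH_URL = "http://localhost:8080/v1/flash-queries"

# The standing question the engine re-evaluates after every data update.
registration = {
    "id": "portfolio-risk",
    "question": "Is current portfolio exposure above the risk limit?",
}

req = urllib.request.Request(
    FLASH_URL,
    data=json.dumps(registration).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to register against a live server
```

Once registered, asking the same question later hits the pre-computed cache instead of triggering fresh inference.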
Agentic Tool Calling
In a typical multi-step agent workflow, every tool call forces the engine to re-process the entire conversation from scratch. LayerScale retains the full conversation state in GPU memory across turns, so the 50th tool call is as fast as the first. Constant-time turns, not linear. This is what makes agents actually usable in production. Learn more →
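The practical consequence is that a turn's request can stay small: with server-side state, a client sends a session handle plus the newest message rather than the full transcript. A sketch, with the `session` field name and endpoint as illustrative assumptions:

```python
import json
import urllib.request

BASE = "http://localhost:8080/v1"  # assumption: local LayerScale server

def turn_request(session_id: str, new_message: dict) -> urllib.request.Request:
    """Build one agent turn.

    With conversation state held in GPU memory on the server, each turn
    carries only a session id and the newest message, not the whole
    transcript. Field names here are illustrative, not documented API.
    """
    body = json.dumps({"session": session_id, "messages": [new_message]}).encode()
    return urllib.request.Request(
        f"{BASE}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Turn 50 builds the same small request as turn 1: constant payload size.
req = turn_request("agent-session-42", {"role": "tool", "content": "lookup result"})
```

Contrast this with a stateless server, where the `messages` list (and therefore prefill cost) grows with every turn.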
Where to start
Run a model and send your first query - start with the Quick Start. You'll be up and running in under 5 minutes with Docker.
Use LayerScale from your existing code - see OpenAI Compatible or Anthropic Compatible. It’s a two-line change.
Build streaming data pipelines - see Streaming Data and Flash Queries.
Deploy to production - see Deployment for Docker Compose, Kubernetes, AWS, and GCP configurations.