# Tool Calling
LayerScale supports OpenAI-compatible and Anthropic-compatible tool calling (function calling). As a stateful inference engine, LayerScale is particularly effective for agentic workloads where models call tools repeatedly across multi-turn conversations.
## Why LayerScale for Agents
In a typical agentic loop, the model generates a tool call, your code executes it, and you send the result back. Each round trip re-sends the full conversation history. With cloud APIs, every turn re-processes all previous tokens.
LayerScale’s stateful engine retains the full conversation state in GPU memory across turns. Previous context is never reprocessed. This means:
- Constant-time turns. The 50th tool call in a chain is as fast as the first.
- Lower latency. No prompt re-processing overhead between turns.
- Lower cost. GPU compute scales with new tokens only, not total conversation length.
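To make the cost claim concrete, here is a back-of-the-envelope sketch (not a benchmark; the token counts are hypothetical) comparing the total tokens processed by a stateless API, which re-reads the full history every turn, against a stateful engine that only processes new tokens:

```python
# Back-of-the-envelope: tokens processed over an agentic session.
# Assumed (hypothetical) sizes: a 2,000-token system prompt + tool
# definitions, and each turn adding ~500 tokens of call + result.
SYSTEM_TOKENS = 2_000
TOKENS_PER_TURN = 500

def stateless_tokens(turns: int) -> int:
    """Stateless API: every turn re-processes the entire history so far."""
    return sum(SYSTEM_TOKENS + TOKENS_PER_TURN * t for t in range(1, turns + 1))

def stateful_tokens(turns: int) -> int:
    """Stateful engine: prior context stays in GPU memory, so each turn
    only processes its new tokens."""
    return SYSTEM_TOKENS + TOKENS_PER_TURN * turns

for turns in (10, 50):
    ratio = stateless_tokens(turns) / stateful_tokens(turns)
    print(f"{turns} turns: stateless processes {ratio:.0f}x more tokens")
```

The stateless total grows quadratically with the number of turns while the stateful total grows linearly, which is why the gap widens as tool-call chains get longer.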
## Basic Tool Calling
### OpenAI Format
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search a product database by query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_product_details",
            "description": "Get full details for a product by ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {"type": "string"},
                },
                "required": ["product_id"],
            },
        },
    },
]

messages = [
    {"role": "system", "content": "You are a shopping assistant. Use the available tools to help users find products."},
    {"role": "user", "content": "Find me a good mechanical keyboard under $100"},
]

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    messages=messages,
    tools=tools,
    max_tokens=300,
)

print(response.choices[0].message)
```

### Anthropic Format
```python
import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8080", api_key="your-api-key")

tools = [
    {
        "name": "search_database",
        "description": "Search a product database by query",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"},
            },
            "required": ["query"],
        },
    },
]

response = client.messages.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    max_tokens=300,
    tools=tools,
    messages=[
        {"role": "user", "content": "Find me a good mechanical keyboard under $100"},
    ],
)
```

## Agentic Loop
A complete tool-calling loop that runs until the model stops requesting tools:
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

# Define your tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                },
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["path", "content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                },
                "required": ["command"],
            },
        },
    },
]

# Your tool implementations
def execute_tool(name, args):
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    elif name == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["content"])
        return "File written successfully"
    elif name == "run_command":
        # Warning: shell=True executes arbitrary model-generated commands.
        # Sandbox or allowlist commands in production.
        import subprocess
        result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return result.stdout or result.stderr
    return f"Unknown tool: {name}"

messages = [
    {"role": "system", "content": "You are a coding assistant. Use tools to read, write, and test code."},
    {"role": "user", "content": "Create a Python script that prints the first 20 Fibonacci numbers, save it as fib.py, and run it."},
]

# Agentic loop
while True:
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Maverick-17B-128E-Original",
        messages=messages,
        tools=tools,
        max_tokens=500,
    )

    choice = response.choices[0]
    messages.append(choice.message)

    if choice.finish_reason == "stop":
        print(choice.message.content)
        break

    if choice.message.tool_calls:
        for tool_call in choice.message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            print(f" -> {tool_call.function.name}({args})")

            result = execute_tool(tool_call.function.name, args)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })
```

## LangChain Integration
```python
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate

# Point LangChain at LayerScale
llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    temperature=0,
)

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression. Example: '2 + 3 * 4'"""
    # Warning: eval on model-generated input is unsafe;
    # use a proper expression parser in production
    return str(eval(expression))

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    # Your search implementation
    return f"Results for: {query}"

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, [calculator, search_web], prompt)
executor = AgentExecutor(agent=agent, tools=[calculator, search_web], verbose=True)

result = executor.invoke({"input": "What is 15% of 847?"})
print(result["output"])
```

## CrewAI Integration
```python
from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    temperature=0.3,
)

researcher = Agent(
    role="Research Analyst",
    goal="Find and analyze relevant data",
    backstory="You are a senior research analyst with expertise in data analysis.",
    llm=llm,
    verbose=True,
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear and concise reports",
    backstory="You are a technical writer who turns complex analysis into readable reports.",
    llm=llm,
    verbose=True,
)

research_task = Task(
    description="Research the current state of edge computing and its impact on AI inference.",
    agent=researcher,
    expected_output="A structured analysis with key findings",
)

writing_task = Task(
    description="Write a one-page executive summary based on the research findings.",
    agent=writer,
    expected_output="A polished executive summary",
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    verbose=True,
)

result = crew.kickoff()
print(result)
```

## OpenAI Agents SDK
```python
from agents import Agent, Runner, function_tool, set_default_openai_client, set_default_openai_api
from openai import AsyncOpenAI

# Point the Agents SDK at LayerScale. The SDK defaults to the Responses API,
# so also select the Chat Completions API.
set_default_openai_client(AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key"))
set_default_openai_api("chat_completions")

@function_tool
def get_inventory(product_id: str) -> str:
    """Check inventory for a product."""
    # Your inventory check
    return f"Product {product_id}: 42 units in stock"

@function_tool
def place_order(product_id: str, quantity: int) -> str:
    """Place an order for a product."""
    return f"Order placed: {quantity}x {product_id}"

agent = Agent(
    name="inventory-agent",
    instructions="You manage product inventory. Check stock before placing orders.",
    tools=[get_inventory, place_order],
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
)

result = Runner.run_sync(agent, "Order 5 units of WIDGET-A if we have enough in stock.")
print(result.final_output)
```

## Interleaved Multi-Agent Patterns
Run multiple agents on the same LayerScale instance. Each uses the same model but maintains independent conversation state:
```python
from openai import OpenAI
import json
import threading
import concurrent.futures

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

tools = [
    {
        "type": "function",
        "function": {
            "name": "submit_finding",
            "description": "Submit a finding to the shared workspace",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {"type": "string"},
                    "finding": {"type": "string"},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                },
                "required": ["category", "finding", "severity"],
            },
        },
    },
]

findings = []
findings_lock = threading.Lock()  # guard the shared list across agent threads

def run_agent(role, task):
    messages = [
        {"role": "system", "content": f"You are a {role}. Use submit_finding to report results."},
        {"role": "user", "content": task},
    ]

    while True:
        response = client.chat.completions.create(
            model="meta-llama/Llama-4-Maverick-17B-128E-Original",
            messages=messages,
            tools=tools,
            max_tokens=400,
        )
        choice = response.choices[0]
        messages.append(choice.message)

        if choice.finish_reason == "stop":
            break

        if choice.message.tool_calls:
            for tc in choice.message.tool_calls:
                args = json.loads(tc.function.arguments)
                with findings_lock:
                    findings.append({"agent": role, **args})
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": "Finding recorded.",
                })

# Run agents concurrently; leaving the with-block waits for all three to finish
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    executor.submit(run_agent, "security auditor", "Review auth.py for vulnerabilities")
    executor.submit(run_agent, "performance analyst", "Review db_queries.py for N+1 queries")
    executor.submit(run_agent, "code reviewer", "Review utils.py for maintainability issues")

for f in findings:
    print(f"[{f['agent']}] [{f['severity']}] {f['category']}: {f['finding']}")
```

## Recommended Models for Tool Calling
Models with strong instruction following produce more reliable tool calls:
| Model | Size | Tool Calling Quality | Notes |
|---|---|---|---|
| Mistral Small 3.2 | 24B | Excellent | Best all-around for tool calling at this size. Fits on a single 24 GB GPU at Q4. |
| Llama 3.3 | 70B | Excellent | The gold standard for open-weight tool use if you have the VRAM. |
| Llama 4 Maverick | 17B active (MoE) | Very good | Built for agentic workloads. MoE architecture requires more RAM than active parameter count suggests. |
| Mistral Nemo | 12B | Very good | Fast and lightweight. Strong replacement for older 7B models. |
| Mistral 4 | Various | Excellent | Latest generation Mistral with native tool calling support. |
| Functionary v3.2 | Various | Excellent | Fine-tuned specifically for OpenAI-format tool use. Best option if tool calling is your primary workload. |
For complex multi-tool chains, use a 24B+ model. Mistral Small 3.2 is the sweet spot for most use cases. If quality is paramount and you have the VRAM, Llama 3.3 70B remains the most reliable. Smaller models (12B and under) work well for single-tool patterns but can struggle with multi-step orchestration.
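Whichever model you run, smaller models in particular sometimes emit arguments that don't match the declared schema. Validating parsed arguments before executing a tool catches these early and lets you return a corrective error to the model instead of crashing. Below is a minimal standard-library sketch; `validate_args` is a hypothetical helper (not part of LayerScale), and a full validator such as the `jsonschema` package covers nested schemas, enums, and other keywords this sketch ignores:

```python
import json

# The parameter schema declared for search_database earlier on this page
SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["query"],
}

# Map JSON Schema type names to Python types
_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "object": dict, "array": list}

def validate_args(raw_arguments: str, schema: dict):
    """Parse and validate tool-call arguments. Returns (args, None) on
    success, or (None, error) to send back to the model as the tool result."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as e:
        return None, f"Arguments were not valid JSON: {e}"
    for key in schema.get("required", []):
        if key not in args:
            return None, f"Missing required argument: {key}"
    for key, value in args.items():
        expected = schema["properties"].get(key, {}).get("type")
        # bool is a subclass of int in Python, so reject it for integers explicitly
        if expected == "integer" and isinstance(value, bool):
            return None, f"Argument {key} should be of type integer"
        if expected and not isinstance(value, _TYPES[expected]):
            return None, f"Argument {key} should be of type {expected}"
    return args, None

args, err = validate_args('{"query": "keyboard", "limit": "ten"}', SCHEMA)
print(err)  # "limit" arrived as a string, so validation reports a type error
```

When validation fails, append the error string as the `tool` message instead of executing: most models will retry the call with corrected arguments on the next turn.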