
Tool Calling

LayerScale supports OpenAI-compatible and Anthropic-compatible tool calling (function calling). As a stateful inference engine, LayerScale is particularly effective for agentic workloads where models call tools repeatedly across multi-turn conversations.

Why LayerScale for Agents

In a typical agentic loop, the model generates a tool call, your code executes it, and you send the result back. Each round trip re-sends the full conversation history. With cloud APIs, every turn re-processes all previous tokens.

LayerScale’s stateful engine retains the full conversation state in GPU memory across turns. Previous context is never reprocessed. This means:

  • Constant-time turns. The 50th tool call in a chain is as fast as the first.
  • Lower latency. No prompt re-processing overhead between turns.
  • Lower cost. GPU compute scales with new tokens only, not total conversation length.
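
You can observe this directly by timing successive turns of a growing conversation. A minimal sketch, assuming a LayerScale instance at localhost:8080 and the model used throughout this page; with stateful caching, per-turn latency should stay roughly flat as the history grows:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

messages = [{"role": "user", "content": "Let's play 20 questions. Ask me the first question."}]
for turn in range(5):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Maverick-17B-128E-Original",
        messages=messages,
        max_tokens=100,
    )
    elapsed = time.perf_counter() - start
    # Only the new tokens should cost compute, not the accumulated history
    print(f"turn {turn}: {elapsed:.2f}s with {len(messages)} messages of history")
    messages.append(response.choices[0].message)
    messages.append({"role": "user", "content": "No. Next question."})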

Basic Tool Calling

OpenAI Format

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search a product database by query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "limit": {"type": "integer", "description": "Max results"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_product_details",
            "description": "Get full details for a product by ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {"type": "string"},
                },
                "required": ["product_id"],
            },
        },
    },
]

messages = [
    {"role": "system", "content": "You are a shopping assistant. Use the available tools to help users find products."},
    {"role": "user", "content": "Find me a good mechanical keyboard under $100"},
]

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    messages=messages,
    tools=tools,
    max_tokens=300,
)

print(response.choices[0].message)
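
When the model decides to call a tool, the response carries structured tool calls rather than text. A quick sketch of reading them back, following the standard OpenAI response shape (note that arguments arrive as a JSON string):

import json

message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)  # parse the JSON-encoded arguments
        print(tool_call.function.name, args)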

Anthropic Format

import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8080", api_key="your-api-key")

tools = [
    {
        "name": "search_database",
        "description": "Search a product database by query",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"},
            },
            "required": ["query"],
        },
    },
]

response = client.messages.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    max_tokens=300,
    tools=tools,
    messages=[
        {"role": "user", "content": "Find me a good mechanical keyboard under $100"},
    ],
)
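
In the Anthropic format, tool calls arrive as tool_use content blocks, and results go back as tool_result blocks inside a user message. A minimal sketch of one round trip; run_search is a hypothetical stand-in for your own tool implementation:

for block in response.content:
    if block.type == "tool_use":
        result = run_search(block.input)  # hypothetical helper: your tool implementation
        followup = client.messages.create(
            model="meta-llama/Llama-4-Maverick-17B-128E-Original",
            max_tokens=300,
            tools=tools,
            messages=[
                {"role": "user", "content": "Find me a good mechanical keyboard under $100"},
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": [
                    {"type": "tool_result", "tool_use_id": block.id, "content": result},
                ]},
            ],
        )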

Agentic Loop

A complete tool-calling loop that runs until the model stops requesting tools:

from openai import OpenAI
import json
import subprocess

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

# Define your tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read the contents of a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"},
                },
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Write content to a file",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["path", "content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                },
                "required": ["command"],
            },
        },
    },
]

# Your tool implementations
def execute_tool(name, args):
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    elif name == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["content"])
        return "File written successfully"
    elif name == "run_command":
        result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return result.stdout or result.stderr
    return f"Unknown tool: {name}"

messages = [
    {"role": "system", "content": "You are a coding assistant. Use tools to read, write, and test code."},
    {"role": "user", "content": "Create a Python script that prints the first 20 Fibonacci numbers, save it as fib.py, and run it."},
]

# Agentic loop: keep calling the model until it stops requesting tools
while True:
    response = client.chat.completions.create(
        model="meta-llama/Llama-4-Maverick-17B-128E-Original",
        messages=messages,
        tools=tools,
        max_tokens=500,
    )
    choice = response.choices[0]
    messages.append(choice.message)
    if choice.finish_reason == "stop":
        print(choice.message.content)
        break
    if choice.message.tool_calls:
        for tool_call in choice.message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            print(f" -> {tool_call.function.name}({args})")
            result = execute_tool(tool_call.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(result),
            })
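
In production, two small changes make this loop more robust: bound the number of turns, and return tool failures to the model as text so it can recover instead of crashing the loop. A sketch under those assumptions; safe_execute_tool is a hypothetical wrapper around the execute_tool above, and MAX_TURNS is a cap you tune for your workload:

MAX_TURNS = 25  # hypothetical cap; tune for your workload

def safe_execute_tool(name, args):
    try:
        return execute_tool(name, args)
    except Exception as e:
        # Feed the failure back as a tool result so the model can retry or adjust
        return f"Error executing {name}: {e}"

Then replace the unbounded while True: with for _ in range(MAX_TURNS): so a confused model cannot spin forever.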

LangChain Integration

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate

# Point LangChain at LayerScale
llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    temperature=0,
)

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression. Example: '2 + 3 * 4'"""
    # Note: eval is unsafe on untrusted input; fine for a local demo only
    return str(eval(expression))

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    # Your search implementation
    return f"Results for: {query}"

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful research assistant."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, [calculator, search_web], prompt)
executor = AgentExecutor(agent=agent, tools=[calculator, search_web], verbose=True)

result = executor.invoke({"input": "What is 15% of 847?"})
print(result["output"])

CrewAI Integration

from crewai import Agent, Task, Crew
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key",
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
    temperature=0.3,
)

researcher = Agent(
    role="Research Analyst",
    goal="Find and analyze relevant data",
    backstory="You are a senior research analyst with expertise in data analysis.",
    llm=llm,
    verbose=True,
)

writer = Agent(
    role="Technical Writer",
    goal="Write clear and concise reports",
    backstory="You are a technical writer who turns complex analysis into readable reports.",
    llm=llm,
    verbose=True,
)

research_task = Task(
    description="Research the current state of edge computing and its impact on AI inference.",
    agent=researcher,
    expected_output="A structured analysis with key findings",
)

writing_task = Task(
    description="Write a one-page executive summary based on the research findings.",
    agent=writer,
    expected_output="A polished executive summary",
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, writing_task],
    verbose=True,
)

result = crew.kickoff()
print(result)

OpenAI Agents SDK

from agents import Agent, Runner, function_tool, set_default_openai_api, set_default_openai_client
from openai import AsyncOpenAI

# Point the Agents SDK at LayerScale and use the Chat Completions API
client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")
set_default_openai_client(client)
set_default_openai_api("chat_completions")

@function_tool
def get_inventory(product_id: str) -> str:
    """Check inventory for a product."""
    # Your inventory check
    return f"Product {product_id}: 42 units in stock"

@function_tool
def place_order(product_id: str, quantity: int) -> str:
    """Place an order for a product."""
    return f"Order placed: {quantity}x {product_id}"

agent = Agent(
    name="inventory-agent",
    instructions="You manage product inventory. Check stock before placing orders.",
    tools=[get_inventory, place_order],
    model="meta-llama/Llama-4-Maverick-17B-128E-Original",
)

result = Runner.run_sync(agent, "Order 5 units of WIDGET-A if we have enough in stock.")
print(result.final_output)

Interleaved Multi-Agent Patterns

Run multiple agents on the same LayerScale instance. Each uses the same model but maintains independent conversation state:

from openai import OpenAI
import json
import concurrent.futures

client = OpenAI(base_url="http://localhost:8080/v1", api_key="your-api-key")

tools = [
    {
        "type": "function",
        "function": {
            "name": "submit_finding",
            "description": "Submit a finding to the shared workspace",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {"type": "string"},
                    "finding": {"type": "string"},
                    "severity": {"type": "string", "enum": ["low", "medium", "high"]},
                },
                "required": ["category", "finding", "severity"],
            },
        },
    },
]

findings = []

def run_agent(role, task):
    messages = [
        {"role": "system", "content": f"You are a {role}. Use submit_finding to report results."},
        {"role": "user", "content": task},
    ]
    while True:
        response = client.chat.completions.create(
            model="meta-llama/Llama-4-Maverick-17B-128E-Original",
            messages=messages,
            tools=tools,
            max_tokens=400,
        )
        choice = response.choices[0]
        messages.append(choice.message)
        if choice.finish_reason == "stop":
            break
        if choice.message.tool_calls:
            for tc in choice.message.tool_calls:
                args = json.loads(tc.function.arguments)
                findings.append({"agent": role, **args})
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": "Finding recorded.",
                })

# Run agents concurrently; collect the futures so exceptions surface
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [
        executor.submit(run_agent, "security auditor", "Review auth.py for vulnerabilities"),
        executor.submit(run_agent, "performance analyst", "Review db_queries.py for N+1 queries"),
        executor.submit(run_agent, "code reviewer", "Review utils.py for maintainability issues"),
    ]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # re-raise any exception from the agent threads

for f in findings:
    print(f"[{f['agent']}] [{f['severity']}] {f['category']}: {f['finding']}")

Model Recommendations

Models with strong instruction following produce more reliable tool calls:

| Model | Size | Tool Calling Quality | Notes |
|---|---|---|---|
| Mistral Small 3.2 | 24B | Excellent | Best all-around for tool calling at this size. Fits on a single 24 GB GPU at Q4. |
| Llama 3.3 | 70B | Excellent | The gold standard for open-weight tool use if you have the VRAM. |
| Llama 4 Maverick | 17B active (MoE) | Very good | Built for agentic workloads. MoE architecture requires more RAM than the active parameter count suggests. |
| Mistral Nemo | 12B | Very good | Fast and lightweight. Strong replacement for older 7B models. |
| Mistral 4 | Various | Excellent | Latest generation Mistral with native tool calling support. |
| Functionary v3.2 | Various | Excellent | Fine-tuned specifically for OpenAI-format tool use. Best option if tool calling is your primary workload. |

For complex multi-tool chains, use a 24B+ model. Mistral Small 3.2 is the sweet spot for most use cases. If quality is paramount and you have the VRAM, Llama 3.3 70B remains the most reliable. Smaller models (12B and under) work well for single-tool patterns but can struggle with multi-step orchestration.