EvalKit
EvalKit is an open-source SDK for adding LLM tracing and evaluation to your AI applications. Instrument your code in minutes; traces appear in the Syntropylabs dashboard automatically.
Distributed Tracing
Every LLM call, tool use, and HTTP request, as a span.
LLM Evaluation
Run custom prompt-based judges on any trace or dataset.
Auto-Instrument
Zero-config patching for OpenAI, Anthropic, Axios, and more.
Scenario Simulation
Generate synthetic users and run them against your real agent.
Python SDK
Installation
Install from PyPI:
pip install syntropylabs-evalkit
The distribution installs as syntropylabs-evalkit, but the import name stays evalkit.
Quick Start
import evalkit
client = evalkit.init(
    subscription_key="tk_live_...",  # from Settings → Tracing
    service_name="my-app",
    environment="production",
    debug=True,
)
# All OpenAI / Anthropic calls are now traced automatically
from openai import OpenAI
openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Manual spans
end, ctx = evalkit.start_span("my-operation", {"custom.key": "value"})
try:
    result = do_something()
    end("ok")
except Exception:
    end("error")
    raise
Tracing your own functions (APM): automatic
Function tracing is on by default. init() traces every function in your app's own source tree (the directory of the file that called init()) as each module is imported, producing one function_call span per call with input, output, and latency. Third-party libraries are never wrapped. Only module-level functions are wrapped (class methods are left alone), and signatures are preserved so framework introspection (FastAPI Depends, etc.) keeps working.
# Nothing to wire up, just call init():
evalkit.init(subscription_key="tk_live_...", service_name="my-api")
# every function your app defines is now traced as it imports.
# Disable it:
# EVALKIT_FUNCTION_TRACE=false (env)
# evalkit.init(..., function_tracing=False)
# Trace sibling packages outside the caller's dir:
# evalkit.init(..., trace_packages=["support_bot", "workers"])
For finer control you can still opt in explicitly: a function, a tool, a whole class, or a module/package:
# One function -> function_call span (input / output / latency)
@evalkit.trace_function()
def do_work(x):
    return x * 2
# One tool -> tool_call span (renders in Input/Output panels + tool metrics)
@evalkit.trace_tool()
def search_web(query: str):
    return run_search(query)
# Every method of a class, APM-style
@evalkit.traced
class OrderService:
    def place(self, order): ...
    def cancel(self, id): ...
# Every function defined in a module / whole package
import myapp  # binds the package name used by trace_package below
import myapp.services as svc
evalkit.trace_module(svc)
evalkit.trace_package(myapp)
Client-side tools you run yourself (your own functions the model calls) only show their output if you wrap them with trace_tool, because the SDK sees the model's request but never your function's return value. Server-side tools (e.g. OpenAI web_search) and LangChain tools are captured automatically.
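To make the idea of a wrapped tool concrete, here is a minimal, standard-library-only sketch of a decorator that records input, output, status, and latency while keeping the wrapped function's signature visible to introspection. It is not EvalKit's implementation: trace_tool_sketch and RECORDED_SPANS are hypothetical names, and a real SDK would export spans rather than append them to a list.

```python
import functools
import inspect
import time

# Hypothetical in-memory sink; a real SDK would export spans to a backend.
RECORDED_SPANS = []


def trace_tool_sketch(func):
    """Wrap func so each call records its input, output, status, and latency."""

    @functools.wraps(func)  # copies __name__/__doc__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        span = {"name": func.__name__, "input": {"args": args, "kwargs": kwargs}}
        try:
            result = func(*args, **kwargs)
            span["output"] = result
            span["status"] = "ok"
            return result
        except Exception as exc:
            span["status"] = "error"
            span["error"] = repr(exc)
            raise  # the caller still sees the original exception
        finally:
            span["latency_ms"] = (time.perf_counter() - started) * 1000
            RECORDED_SPANS.append(span)

    return wrapper


@trace_tool_sketch
def search_web(query: str) -> list:
    # Stand-in for a real client-side tool.
    return [f"result for {query}"]


search_web("evalkit")
print(RECORDED_SPANS[0]["name"], RECORDED_SPANS[0]["status"])
# inspect.signature follows __wrapped__, so the original signature is reported.
print(inspect.signature(search_web))
```

Because the return value only exists inside your own process, a wrapper of this kind is the only place where it can be observed, which is why unwrapped client-side tools show a request but no output.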
Configuration options
evalkit.init(
    subscription_key="tk_live_...",
    base_url="https://api.syntropylabs.ai",  # default
    service_name="my-service",
    environment="production",  # production | staging | development
    debug=False,  # log exports to stdout
    scheduled_delay_millis=5000,  # batch export delay (ms)
    max_body_bytes=10 * 1024 * 1024,  # max captured HTTP body size (default 10 MB)
    function_tracing=True,  # auto-trace your app's functions (default True)
    trace_packages=None,  # extra sibling packages to auto-trace
)
TypeScript / Node.js SDK
Installation
Install from npm:
npm install syntropylabs-evalkit # or yarn add syntropylabs-evalkit
Quick Start
import * as evalkit from 'syntropylabs-evalkit';
const client = evalkit.init({
  subscriptionKey: 'tk_live_...', // from Settings → Tracing
  serviceName: 'my-app',
  environment: 'production',
  debug: true,
});
// OpenAI is auto-patched; just use it normally
import OpenAI from 'openai';
const openai = new OpenAI();
const res = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(res.choices[0].message.content);
Manual spans
import { startSpan } from 'syntropylabs-evalkit';
const { end } = startSpan('my-operation', { 'custom.key': 'value' });
try {
  const data = await fetchData();
  end('OK', { 'result.count': data.length });
} catch (e) {
  end('ERROR', { 'error.message': String(e) });
  throw e;
}
Tracing your own functions & tools (APM)
Auto-instrumentation covers libraries (LLM / HTTP / DB). For your own code, opt in explicitly, whether for a function, a tool, a class, or a whole service object:
import * as evalkit from 'syntropylabs-evalkit';
import { Traced } from 'syntropylabs-evalkit';
// One function -> function_call span (input / output / latency)
const doWork = evalkit.traceFunction('doWork', (x: number) => x * 2);
// One tool -> tool_call span (Input/Output panels + tool metrics)
const searchWeb = evalkit.traceTool('search_web', (q: string) => runSearch(q));
// Every method of a class, APM-style
@Traced()
class OrderService {
  place(order: Order) { /* ... */ }
  cancel(id: string) { /* ... */ }
}
// Every function of a service object (parity with Python's trace_module)
export const orders = evalkit.traceObject({ place, cancel }, { prefix: 'orders' });
// NestJS: trace every provider/controller method via the DI registry.
// One line in main.ts after create(): pass the app, and the SDK resolves
// DiscoveryService itself (no @nestjs/core import needed):
await evalkit.enableNestjsAutoTrace(app);
Client-side tools you run yourself only show their output if you wrap them with traceTool, because the SDK sees the model's request but never your function's return value. Server-side tools (e.g. OpenAI web_search) and LangChain tools are captured automatically.
Auto-discovery of all functions is only possible where the framework has a registry (NestJS DI). Route metadata (@Get, guards) is preserved, and the call never throws. For Express / Fastify / Koa / Hono / Hapi, use traceObject / traceFunction on your modules.
NestJS / Express
The SDK auto-instruments all incoming HTTP requests, so no manual middleware is needed. Call evalkit.init() before your app bootstraps; then one line after create() turns on full APM (every provider/controller method):
// main.ts (NestJS)
import * as evalkit from 'syntropylabs-evalkit';
evalkit.init({
  subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!,
  serviceName: 'my-nestjs-app',
  environment: process.env.NODE_ENV ?? 'development',
});
// Then bootstrap normally
const app = await NestFactory.create(AppModule);
await evalkit.enableNestjsAutoTrace(app); // APM: trace every provider/controller
await app.listen(3000);
Configuration options
evalkit.init({
  subscriptionKey: 'tk_live_...',
  baseUrl: 'https://api.syntropylabs.ai', // default
  serviceName: 'my-service',
  environment: 'production',
  debug: false,
  scheduledDelayMillis: 5000, // batch export delay (ms)
  maxBodyBytes: 10 * 1024 * 1024, // max captured HTTP body size (default 10 MB)
});
Tracing
What gets traced automatically
W3C traceparent propagation
Pass the traceparent header from your frontend to backend to stitch spans into a single trace across services. EvalKit reads the header automatically and creates a child span.
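For reference, a version-00 traceparent value in the W3C Trace Context format has four dash-separated lowercase hex fields: version, trace-id (32 hex characters), parent-id (16), and trace-flags (2). The sketch below shows, using only the standard library, how a backend might validate an incoming header and derive a child header that keeps the trace-id. The function names are hypothetical and this is not EvalKit's internal code; it handles version 00 only.

```python
import re
import secrets

# version - trace-id - parent-id - trace-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is not valid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # This sketch accepts version 00 only; all-zero ids are invalid per the spec.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build a header for a child span: same trace-id, fresh span id."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return None
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming))
print(child_traceparent(incoming))
```

Keeping the trace-id constant while replacing the parent-id is what lets spans from different services appear in one trace.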
// Frontend: propagate traceparent to your API
fetch('/api/chat', {
  method: 'POST', // a request with a body cannot use the default GET
  headers: {
    'Content-Type': 'application/json',
    traceparent: evalkit.getTraceparent(), // from syntropylabs-evalkit
  },
  body: JSON.stringify({ message }),
});
Viewing traces
Go to Dashboard → Tracing. Select a Trace Project to see all traces. Click any trace to open the waterfall view, where spans are colour-coded by type (LLM, tool, HTTP, DB).
Evaluation
Offline evaluation (manual)
Select one or more traces in the Tracing dashboard, click Evaluate, choose your evaluation rules and a judge model, then click Run. Results are saved and shown in the Evaluation tab of each trace.
Online evaluation (automatic)
Enable online evaluation per Trace Project to automatically evaluate every new trace as it arrives. Click Online Eval in the tracing toolbar, choose rules, a judge model, and a polling interval.
Evaluation rules
Rules are prompt templates that a judge LLM uses to score a trace. Go to Dashboard → Evaluators to create rules. Group related rules into Collections so you can apply them all at once.
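As a rough sketch of what a rule involves at run time, the Python below fills a {{trace}} placeholder in a template and validates a judge reply with score, passed, and reasoning fields. The helper names (render_rule, parse_verdict) and the validation choices are illustrative assumptions, not part of EvalKit's API; the judge reply is a hard-coded string so the example runs without calling a model.

```python
import json

# Illustrative template using the {{trace}} placeholder convention.
RULE_TEMPLATE = (
    "Given the conversation below, score from 0.0 to 1.0 how well the "
    "assistant stayed on topic.\n\n"
    "Conversation:\n{{trace}}\n\n"
    'Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }'
)


def render_rule(template, trace_text):
    """Substitute the serialized trace into the rule template."""
    return template.replace("{{trace}}", trace_text)


def parse_verdict(raw_reply):
    """Parse a judge reply and reject values outside the expected shape."""
    data = json.loads(raw_reply)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    passed = data["passed"]
    if not isinstance(passed, bool):
        raise ValueError("passed must be a JSON boolean")
    return {
        "score": score,
        "passed": passed,
        "reasoning": str(data.get("reasoning", "")),
    }


prompt = render_rule(RULE_TEMPLATE, "user: Where is my order?\nassistant: It ships today.")
# Hard-coded stand-in for what a judge model might return.
verdict = parse_verdict('{"score": 0.9, "passed": true, "reasoning": "Stayed on topic."}')
print(prompt)
print(verdict)
```

Validating the reply before storing it guards against a judge that returns a malformed or out-of-range score.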
Example rule prompt:
"Given the conversation below, score from 0.0 to 1.0 how well the
assistant stayed on topic and did not hallucinate.
Conversation:
{{trace}}
Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }"
Scenario Simulation
EvalKit can generate realistic synthetic user scenarios from your agent's system prompt and tool list, then run each scenario against your real agent, turn by turn, and score the results. This lets you stress-test your agent's behavior without writing test cases by hand.
How it works
Python: generate_scenarios + simulate_user
import evalkit
evalkit.init(subscription_key="tk_live_...", service_name="support-bot")
SYSTEM_PROMPT = "You are a customer support agent for Acme Store..."
TOOLS = ["search_knowledge_base", "lookup_order", "create_support_ticket"]
# Step 1: generate scenarios from your agent's own prompt and tools
scenarios = evalkit.generate_scenarios(
    SYSTEM_PROMPT,
    tools=TOOLS,
    count=5,
    provider="anthropic",  # or "openai" / "google"
    api_key="sk-ant-...",  # your BYOK key for generation
    model="claude-haiku-4-5-20251001",
)
# Step 2: run each scenario against the real agent and score it
def entrypoint(ctx: evalkit.SimContext) -> evalkit.AgentTurnResult:
    # ctx.message: the synthetic user's message for this turn
    # ctx.session_id: keep multi-turn context across calls for the same session
    reply, tool_calls = run_agent(ctx.session_id, ctx.message)
    return evalkit.AgentTurnResult(
        text=reply,
        tool_calls=[{"name": t} for t in tool_calls],
    )
report = evalkit.simulate_user(entrypoint, scenarios, tags=["ci"])
print("Simulation ID:", report["simulation_id"])
TypeScript: generateScenarios + simulateUser
import evalkit from 'syntropylabs-evalkit';
evalkit.init({ subscriptionKey: 'tk_live_...', serviceName: 'support-bot' });
const scenarios = await evalkit.generateScenarios({
  agentInstructions: SYSTEM_PROMPT,
  tools: ['search_knowledge_base', 'lookup_order', 'create_support_ticket'],
  count: 5,
  provider: 'anthropic',
  apiKey: process.env.ANTHROPIC_API_KEY,
});
const { simulationId, results } = await evalkit.simulateUser({
  scenarios,
  entrypoint: async (ctx) => {
    const { text, toolCalls } = await runAgent(ctx.sessionId, ctx.message);
    return { text, toolCalls };
  },
  tags: ['ci'],
});
console.log('Simulation ID:', simulationId);
Evaluate a run against a collection: evaluate_simulation
simulate_user records each scenario as a trace and computes lightweight offline scores. To grade a finished run against your project's evaluator collection (LLM-as-judge rules), call evaluate_simulation with the simulation_id and a collection id. The judge is bring-your-own-key. Results come back per scenario and per criterion (with reasons) and also appear in the Tracing dashboard.
report = evalkit.simulate_user(entrypoint, scenarios, tags=["ci"])
result = evalkit.evaluate_simulation(
    report["simulation_id"],
    collection_id="665f0c...",  # Dashboard → Evaluators → Collections
    provider="openai",  # BYOK judge
    model="gpt-4o",
    api_key="sk-...",
    max_tokens=1024,  # optional judge output cap
    # run_id="run_..."  # optional; defaults to the latest run
)
print(result["aggregate"])  # { averageScore, passRate, ... }
for scn in result["scenarios"]:
    print(scn["name"], scn["overallScore"], scn["passed"])
    for m in scn["metrics"]:
        print(" -", m["ruleName"], m["score"], m["reason"])
// TypeScript
const { simulationId } = await evalkit.simulateUser({ scenarios, entrypoint });
const result = await evalkit.evaluateSimulation({
  simulationId,
  collectionId: '665f0c...', // Dashboard → Evaluators → Collections
  provider: 'openai', // BYOK judge
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY!,
  maxTokens: 1024, // optional judge output cap
});
console.log(result.aggregate); // { averageScore, passRate, ... }
for (const scn of result.scenarios) {
  console.log(scn.name, scn.overallScore, scn.passed);
  for (const m of scn.metrics) console.log(' -', m.ruleName, m.score, m.reason);
}
Multi-turn scenarios & count
Generated scenarios are multi-turn by default: each carries a conversation_plan (≥3 user turns; ≥2 for pure-refusal cases) and a turns array of verbatim user messages the simulator replays in order, so a single run drives a real back-and-forth, not a one-shot prompt.
count is honored exactly. When you ask for more scenarios than there are coverage categories, the categories cycle (round-robin) and repeated-category scenarios are made materially distinct, so count=20 returns 20 scenarios, not 9.
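A minimal sketch of round-robin category assignment follows. The nine category names are invented for illustration (the actual coverage categories are not listed here), and the variant counter is only one possible way to mark repeated-category scenarios as distinct.

```python
from collections import Counter
from itertools import cycle, islice

# Hypothetical coverage categories; the real list is not documented here.
CATEGORIES = [
    "happy_path", "ambiguous_request", "missing_information",
    "tool_failure", "policy_refusal", "off_topic",
    "multi_intent", "angry_user", "edge_case_input",
]


def plan_scenarios(count, categories):
    """Assign a category to each of `count` scenarios by cycling the list."""
    plan = []
    for index, category in enumerate(islice(cycle(categories), count)):
        # variant 0 is the first pass through the list, 1 the second, and so on,
        # so two scenarios in the same category never share a (category, variant).
        variant = index // len(categories)
        plan.append({"category": category, "variant": variant})
    return plan


plan = plan_scenarios(20, CATEGORIES)
print(len(plan))
print(Counter(item["category"] for item in plan))
```

With nine categories and count=20, the first two categories are used three times and the remaining seven twice, which adds up to exactly 20.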
BYOK: bring your own key for generation
Scenario generation uses a hosted model by default. Pass provider + api_key (Python) or apiKey (TypeScript) to use your own Anthropic, OpenAI, or Google key instead. The agent that runs during simulation is always your own; EvalKit only orchestrates the synthetic-user messages.
Viewing results
Go to Dashboard → Simulations. Each run shows scenario scores, per-turn traces, tool trajectory, and LLM-judge ratings. Runs are tagged so you can filter by environment or CI branch.
Key Concepts
EvalKit is built by Syntropylabs. Published on PyPI and npm.