Evaluate and optimize
your agent harness.
Full trace waterfall from orchestrator to leaf tool. LLM-as-judge scoring. Offline and online evaluation, with eval callbacks on every live turn.
Works with any LLM, any framework. One call, everything tracked.
One trace per agent run
Every step — tool calls, LLM calls, DB reads — stitched into a single waterfall. No fragmented logs.
Tool Call: stripe.refund
{
  "order_id": "ORD-8429",
  "amount": 49.99,
  "reason": "customer_requested"
}
{
  "status": "refund_processed",
  "receipt_url": "https://stripe.com/..."
}
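To make the "single waterfall" idea concrete, here is a minimal sketch of how spans that carry a parent ID can be stitched into one indented view. The `Span` fields and `render_waterfall` are hypothetical names chosen for illustration; they are not the SDK's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root (orchestrator) span
    name: str
    start_ms: int
    end_ms: int


def render_waterfall(spans: list[Span]) -> list[str]:
    """Return one indented line per span, parents before their children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Siblings are ordered by start time, as in a waterfall view.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans may arrive out of order; the parent links restore the structure.
spans = [
    Span("c", "a", "tool: stripe.refund", 420, 900),
    Span("a", None, "agent run", 0, 1000),
    Span("b", "a", "llm call", 10, 400),
    Span("d", "c", "db read: orders", 430, 480),
]
waterfall = render_waterfall(spans)
print("\n".join(waterfall))
```

The point of the sketch is that arrival order does not matter: as long as every span records its parent, the run can be rebuilt as one tree.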
AI fails differently
than normal software.
Agents don't throw exceptions. They hallucinate, drift, and burn tokens — and you only find out when a user complains.
Traces shatter across threads.
Multi-step agents spawn spans across services. You get 14 disconnected logs instead of one waterfall. Debugging means grep, not insight.
You have no quality signal.
The agent returns a string. You don't know if it hallucinated until a user files a ticket. Correctness is invisible at the API boundary.
Manual QA doesn't scale.
Spot-checking 10 outputs feels thorough. At 1,000 it's theater. At 10,000 it's negligence. You need automated evaluation, not more reviewers.
Cost is invisible until it isn't.
Token usage and latency compound silently across sessions and devices. The monthly invoice is not a monitoring strategy.
The complete evaluation stack
Trace every span, score every output, catch every regression — from prototype to production.
Agent Tracing
Every LLM call, tool invocation, HTTP hop, and DB query lands as a typed span — one waterfall per run.
Zero-Config SDK
One init() call. All providers patched automatically.
Online Evaluation
Every production trace scored automatically as it arrives. No cron jobs.
Voice Agent Eval
Generate audio, score naturalness and instruction following, and replay individual rows. Connect live in the browser.
Batch Evaluation
Run a dataset through your agent, score per row, regenerate failures individually.
Image & Video Eval
Generate, then LLM-as-judge score across visual quality dimensions.
Session & Device
session_id groups the turns of a conversation. device_id tracks the originating client, APM-style.
Model Catalog
Central registry for all providers. Set access grants, test models inline.
Evaluator Collections
Group LLM-as-judge prompts into reusable collections. Apply to any project or set globally. Three rule types: model-graded, custom prompt, statistical.
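As a rough illustration of the collection idea, the sketch below groups three rules, one per rule type, and applies them to a single agent output. Everything here is an assumption made for the example: the `Rule` and `EvaluatorCollection` classes, the prompt templates, and in particular the reading of what each rule type does. The judge is a stub that stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable

Judge = Callable[[str], float]  # prompt -> score in [0, 1]; stands in for an LLM call


@dataclass
class Rule:
    name: str
    kind: str  # "model-graded", "custom prompt", or "statistical"
    score: Callable[[str], float]  # agent output -> score


@dataclass
class EvaluatorCollection:
    name: str
    rules: list[Rule]

    def evaluate(self, output: str) -> dict[str, float]:
        # Apply every rule in the collection to the same output.
        return {rule.name: rule.score(output) for rule in self.rules}


def fake_judge(prompt: str) -> float:
    # Stub judge: a real one would send the prompt to a model and parse a score.
    return 1.0 if "refund" in prompt.lower() else 0.0


def judged(template: str, judge: Judge) -> Callable[[str], float]:
    # Build a scorer that fills the output into a judge prompt.
    return lambda output: judge(template.format(output=output))


collection = EvaluatorCollection(
    name="support-quality",
    rules=[
        Rule("helpfulness", "model-graded", judged("Rate the helpfulness of: {output}", fake_judge)),
        Rule("confirms_action", "custom prompt", judged("Does this reply confirm the action taken? {output}", fake_judge)),
        Rule("length_ok", "statistical", lambda output: 1.0 if len(output.split()) <= 50 else 0.0),
    ],
)
scores = collection.evaluate("Your refund was processed.")
print(scores)
```

Because the collection is just a named list of rules, the same object could be attached to several projects without copying prompts around.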
SDK, harness evaluation, dashboard — one stack
Everything your agent needs from dev to production, without stitching together five tools.
SDK
One init() call instruments your entire agent stack: LLM clients, HTTP, SQL, and logging. Python and TypeScript, zero dependencies.
- Auto-patches OpenAI, Anthropic, Gemini, Bedrock
- W3C traceparent propagation
- Session + device context
Agent Harness Evaluation
The evaluation core. Run batch jobs, score traces online, compare models pairwise — text, voice, image, and video.
- Online + batch evaluation
- Voice agent eval via LiveKit
- Image / video quality scoring
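The batch workflow (run a dataset through the agent, score each row, regenerate only the failures) can be sketched in a few lines. This is an illustrative outline, not the product's implementation: `run_batch`, `regenerate_failures`, the exact-match scorer, and the toy agent are all hypothetical.

```python
from typing import Callable

Agent = Callable[[str], str]
Scorer = Callable[[str, str], float]  # (output, expected) -> score in [0, 1]


def score_row(row: dict, agent: Agent, scorer: Scorer, threshold: float) -> dict:
    output = agent(row["input"])
    score = scorer(output, row["expected"])
    return {**row, "output": output, "score": score, "passed": score >= threshold}


def run_batch(dataset: list[dict], agent: Agent, scorer: Scorer, threshold: float = 0.5) -> list[dict]:
    """Run every row through the agent and score it."""
    return [score_row(row, agent, scorer, threshold) for row in dataset]


def regenerate_failures(results: list[dict], agent: Agent, scorer: Scorer, threshold: float = 0.5) -> list[dict]:
    """Re-run only the failed rows; passing rows are returned unchanged."""
    return [
        row if row["passed"] else score_row(row, agent, scorer, threshold)
        for row in results
    ]


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


calls = {"n": 0}


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent: it answers wrongly on its first call only.
    calls["n"] += 1
    answers = {"2+2": "4", "capital of France": "Paris"}
    return "unsure" if calls["n"] == 1 else answers[prompt]


dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
first = run_batch(dataset, toy_agent, exact_match)
second = regenerate_failures(first, toy_agent, exact_match)
print([r["passed"] for r in first], [r["passed"] for r in second])
```

Regenerating per row keeps the cost of a retry proportional to the number of failures rather than the size of the dataset.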
Dashboard
Trace waterfall, session explorer, eval results, cost breakdown, and model catalog — everything in one place.
- Trace waterfall per agent run
- Per-session cost analytics
- Regression comparison across deploys
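Per-session cost analytics comes down to attributing token usage on each span to its session and summing. The sketch below shows that arithmetic under stated assumptions: the span fields, the model names, and the prices are invented for the example and do not reflect any provider's real pricing.

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens; real prices depend on provider and model.
PRICE_PER_MILLION = {
    "small-model": {"input": 1.0, "output": 2.0},
    "large-model": {"input": 10.0, "output": 30.0},
}


def span_cost(span: dict) -> float:
    """Cost of one LLM span: tokens times the per-token price for its model."""
    price = PRICE_PER_MILLION[span["model"]]
    return (
        span["input_tokens"] * price["input"]
        + span["output_tokens"] * price["output"]
    ) / 1_000_000


def cost_by_session(spans: list[dict]) -> dict[str, float]:
    """Sum span costs under each session_id."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span["session_id"]] += span_cost(span)
    return dict(totals)


spans = [
    {"session_id": "s1", "model": "small-model", "input_tokens": 1000, "output_tokens": 500},
    {"session_id": "s1", "model": "large-model", "input_tokens": 2000, "output_tokens": 1000},
    {"session_id": "s2", "model": "small-model", "input_tokens": 500, "output_tokens": 500},
]
costs = cost_by_session(spans)
print(costs)
```

Grouping by a different key, such as a device ID or a deploy tag, would use the same aggregation with a different field.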
One call.
Everything Tracked.
init() patches every LLM client, HTTP layer, and database adapter your agent touches. Traces start flowing in seconds — no manual spans, no config files.
- pip install syntropylabs-evalkit / npm install syntropylabs-evalkit
- Auto-instruments OpenAI, Anthropic, Gemini, Bedrock, HTTP, SQL, Redis, Mongoose
- Online eval: attach evaluators to a project, every trace gets scored automatically
- W3C traceparent — frontend and backend stitched into one trace, no extra work
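For intuition about what "patching" a client involves, here is a toy sketch of the general technique: wrap a client method once so every call is recorded, then score each record as soon as it lands. It is not the SDK's code. `FakeLLMClient`, `toy_init`, `TRACES`, and `EVALUATORS` are hypothetical stand-ins, and a real instrumentation layer would cover far more (async calls, streaming, errors).

```python
import functools
import time
from typing import Callable


class FakeLLMClient:
    """Stand-in for a provider client; not a real SDK class."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


TRACES: list[dict] = []
EVALUATORS: list[Callable[[dict], float]] = []


def toy_init(client_cls: type = FakeLLMClient) -> None:
    """Wrap client_cls.complete so every call is recorded and scored."""
    original = client_cls.complete
    if getattr(original, "_patched", False):
        return  # idempotent: never wrap the same method twice

    @functools.wraps(original)
    def traced(self, prompt: str) -> str:
        start = time.perf_counter()
        output = original(self, prompt)
        trace = {
            "prompt": prompt,
            "output": output,
            "latency_s": time.perf_counter() - start,
        }
        # Online evaluation: score the trace as soon as it is recorded.
        trace["scores"] = {fn.__name__: fn(trace) for fn in EVALUATORS}
        TRACES.append(trace)
        return output

    traced._patched = True
    client_cls.complete = traced


def non_empty(trace: dict) -> float:
    # Trivial evaluator for the demo: did the model return any text?
    return 1.0 if trace["output"].strip() else 0.0


EVALUATORS.append(non_empty)
toy_init()
toy_init()  # second call is a no-op
reply = FakeLLMClient().complete("refund order ORD-8429")
print(reply, TRACES[0]["scores"])
```

The calling code never changes: the agent still calls `complete()`, and tracing plus scoring happen inside the wrapper.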
Works with your entire AI stack
Simple, transparent pricing
Start free. Scale up when you need it.