
    Evaluate and optimize
    your agent harness.

    Full trace waterfall from orchestrator to leaf tool. LLM-as-judge scoring. Offline and online eval callbacks on every live turn.

    Works with any LLM, any framework. One call, everything tracked.

    < 5ms ingest p99
    8 LLM providers
    < 5 min to first trace
    Pass Rate 94.1%
    research_agent · eval_job #8429
    RUNNING 01:24
    Agents
    Trace waterfall
    orchestrator 3.4s
    llm.plan 840ms
    tool.search 1.2s
    db.memory 12ms
    llm.reply 980ms
    Relevancy 0.94
    Groundedness 0.88
    Safety 0.31
    Coherence 0.91
    passed 13 / 14
    Safety score below threshold
    Trace viewer

    One trace per agent run

    Every step — tool calls, LLM calls, DB reads — stitched into a single waterfall. No fragmented logs.

    PASSED EVAL
    Execution Waterfall
    Support Agent 3.4s
    db.get_order 45ms
    Planner ReAct 800ms
    stripe.refund 412ms
    llm.generate 1.2s

    Tool Call: stripe.refund

    Latency 412ms
    Tokens 1,402
    // Agent Tool Input
    {
      "order_id": "ORD-8429",
      "amount": 49.99,
      "reason": "customer_requested"
    }
    // API Output
    {
      "status": "refund_processed",
      "receipt_url": "https://stripe.com/..."
    }
    The problem

    AI fails differently
    than normal software.

    Agents don't throw exceptions. They hallucinate, drift, and burn tokens — and you only find out when a user complains.

    01

    Traces shatter across threads.

    Multi-step agents spawn spans across services. You get 14 disconnected logs instead of one waterfall. Debugging means grep, not insight.

    02

    You have no quality signal.

    The agent returns a string. You don't know if it hallucinated until a user files a ticket. Correctness is invisible at the API boundary.

    03

    Manual QA doesn't scale.

    Spot-checking 10 outputs feels thorough. At 1,000 it's theater. At 10,000 it's negligence. You need automated evaluation, not more reviewers.

    04

    Cost is invisible until it isn't.

    Token usage and latency compound silently across sessions and devices. The monthly invoice is not a monitoring strategy.

    Capabilities

    The complete evaluation stack

    Trace every span, score every output, catch every regression — from prototype to production.

    Agent Tracing

    Every LLM call, tool invocation, HTTP hop, and DB query lands as a typed span — one waterfall per run.

    support_agent 3.4s
    db.get_order 45ms
    llm.plan 0.8s
    stripe.refund 412ms
    llm.reply 1.2s
    Span types: llm_call · tool_call · http_call · db_query
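
To make the idea of typed spans concrete, here is a minimal sketch of how a run could be modeled as a span tree and flattened into waterfall rows. The `Span` dataclass, the `agent` kind, and the start offsets are assumptions made for illustration and are not the SDK's data model; the span names and durations are taken from the example above.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# "agent" is an assumed kind for the root span; the other four come from the legend.
SpanKind = Literal["agent", "llm_call", "tool_call", "http_call", "db_query"]


@dataclass
class Span:
    name: str
    kind: SpanKind
    start_ms: int      # offset from the start of the run (illustrative values)
    duration_ms: int
    children: List["Span"] = field(default_factory=list)


def render_waterfall(span: Span, depth: int = 0) -> List[str]:
    """Flatten a span tree into indented rows, parents before children."""
    rows = [f"{'  ' * depth}{span.name} [{span.kind}] {span.duration_ms}ms"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        rows.extend(render_waterfall(child, depth + 1))
    return rows


run = Span("support_agent", "agent", 0, 3400, [
    Span("llm.plan", "llm_call", 45, 800),
    Span("db.get_order", "db_query", 0, 45),
    Span("stripe.refund", "tool_call", 845, 412),
    Span("llm.reply", "llm_call", 1257, 1200),
])

for row in render_waterfall(run):
    print(row)
```

Sorting children by start offset is what turns an unordered set of spans into a readable waterfall, regardless of the order in which they were reported.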

    Zero-Config SDK

    One init() call. All providers patched automatically.

    $ pip install syntropylabs-evalkit
    $ npm i syntropylabs-evalkit

    Online Evaluation

    Every production trace scored automatically as it arrives. No cron jobs.

    Scoring live traces…
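
As a rough illustration of scoring on arrival, the sketch below runs every attached evaluator inside a per-trace callback instead of a scheduled job. The `OnlineScorer` class, the trace shape, and both toy evaluators are hypothetical; a real setup would presumably call a judge model where these stand-ins use string checks.

```python
from typing import Callable, Dict

Trace = Dict[str, str]                 # assumed shape: {"id", "input", "output"}
Evaluator = Callable[[Trace], float]   # returns a score in [0, 1]


def non_empty(trace: Trace) -> float:
    """Toy stand-in for a judge: 1.0 if the agent produced any output."""
    return 1.0 if trace["output"].strip() else 0.0


def mentions_input_terms(trace: Trace) -> float:
    """Toy relevancy proxy: share of input words echoed in the output."""
    words = set(trace["input"].lower().split())
    if not words:
        return 0.0
    out = set(trace["output"].lower().split())
    return len(words & out) / len(words)


class OnlineScorer:
    def __init__(self, evaluators: Dict[str, Evaluator]):
        self.evaluators = evaluators
        self.results: Dict[str, Dict[str, float]] = {}

    def on_trace(self, trace: Trace) -> Dict[str, float]:
        """Called once per arriving trace; no scheduler is involved."""
        scores = {name: fn(trace) for name, fn in self.evaluators.items()}
        self.results[trace["id"]] = scores
        return scores


scorer = OnlineScorer({"non_empty": non_empty, "relevancy": mentions_input_terms})
scores = scorer.on_trace(
    {"id": "t1", "input": "refund order", "output": "Your order refund is processed"}
)
print(scores)
```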

    Voice Agent Eval

    Generate audio, score naturalness + instruction-following, replay rows. Connect live in-browser.

    naturalness 0.91 · instruction 0.87 · 12 / 14 passed

    Batch Evaluation

    Run a dataset through your agent, score per row, regenerate failures individually.

    row_001 94%
    row_002 72%
    row_003 38%
    row_004 91%
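
The batch flow described above (run each row, score it, then regenerate only the failures) can be sketched as follows. The function names, the 0.7 pass threshold, and the toy agent and scorer are all assumptions for illustration, not the product's API.

```python
from typing import Callable, Dict, List


def run_batch(
    rows: Dict[str, str],
    agent: Callable[[str], str],
    score: Callable[[str, str], float],
    threshold: float = 0.7,   # assumed pass threshold
) -> Dict[str, dict]:
    """Run every row through the agent and score it independently."""
    results = {}
    for row_id, prompt in rows.items():
        output = agent(prompt)
        s = score(prompt, output)
        results[row_id] = {"output": output, "score": s, "passed": s >= threshold}
    return results


def regenerate_failures(rows, results, agent, score, threshold: float = 0.7) -> List[str]:
    """Re-run only the rows that failed; passing rows are left untouched."""
    failed = {rid: rows[rid] for rid, r in results.items() if not r["passed"]}
    results.update(run_batch(failed, agent, score, threshold))
    return sorted(failed)


def toy_agent(prompt: str) -> str:
    # Stand-in agent: gives up on long prompts so one row fails deterministically.
    return prompt.upper() if len(prompt) < 20 else ""


def toy_score(prompt: str, output: str) -> float:
    # Stand-in scorer: any non-empty output passes.
    return 1.0 if output else 0.0


rows = {"row_001": "cancel my order", "row_002": "explain the refund policy in detail"}
results = run_batch(rows, toy_agent, toy_score)
retried = regenerate_failures(rows, results, toy_agent, toy_score)
print(retried)
```

Keeping results keyed by row id is what makes per-row regeneration cheap: a retry overwrites a single entry and the rest of the job stays as scored.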

    Image & Video Eval

    Generate, then LLM-as-judge score across visual quality dimensions.

    Visual quality 88%
    Prompt alignment 74%
    Coherence 91%

    Session & Device

    session_id groups the turns of one conversation. device_id tracks the originating client, APM-style.

    sess_4f2a $0.031
    sess_8c1b $0.089
    sess_2d9e $0.214
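
One way per-session cost figures like these could be derived is to price each LLM span from its token counts and sum by `session_id`. This is a sketch under stated assumptions: the price table, the model name `model-a`, and the span fields other than `session_id` and `device_id` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical USD prices per 1,000 tokens; real prices vary by provider and model.
PRICE_PER_1K = {"model-a": {"input": 0.0025, "output": 0.01}}


def span_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_PER_1K[model]
    return input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]


def cost_by_session(spans: List[dict]) -> Dict[str, float]:
    """Sum LLM span costs per session_id."""
    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s["session_id"]] += span_cost(
            s["model"], s["input_tokens"], s["output_tokens"]
        )
    return dict(totals)


spans = [
    {"session_id": "sess_4f2a", "device_id": "dev_1", "model": "model-a",
     "input_tokens": 1000, "output_tokens": 500},
    {"session_id": "sess_4f2a", "device_id": "dev_1", "model": "model-a",
     "input_tokens": 2000, "output_tokens": 0},
    {"session_id": "sess_8c1b", "device_id": "dev_2", "model": "model-a",
     "input_tokens": 400, "output_tokens": 100},
]

totals = cost_by_session(spans)
print(totals)
```

Grouping by `device_id` instead would use the same loop with a different key.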

    Model Catalog

    Central registry for all providers. Set access grants, test models inline.

    OpenAI
    Anthropic
    Gemini
    Bedrock
    Cohere
    Azure

    Evaluator Collections

    Group LLM-as-judge prompts into reusable collections. Apply to any project or set globally. Three rule types: model-graded, custom prompt, statistical.

    Relevancy · Groundedness · Safety · PII Detection · Coherence · Custom Prompt
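
To show how the three rule types might sit side by side in one collection, here is a small sketch. The `Rule` dataclass, the thresholds, the prompts, and `fake_judge` (a stand-in for a real judge-model call) are hypothetical and only meant to illustrate the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Literal, Optional, Tuple

Judge = Callable[[str, str], float]   # (judge_prompt, output) -> score in [0, 1]


@dataclass
class Rule:
    name: str
    kind: Literal["model_graded", "custom_prompt", "statistical"]
    threshold: float                               # assumed pass threshold
    prompt: str = ""                               # model_graded / custom_prompt
    stat: Optional[Callable[[str], float]] = None  # statistical


def evaluate(collection: List[Rule], output: str, judge: Judge) -> Dict[str, Tuple[float, bool]]:
    """Apply every rule in the collection; return (score, passed) per rule."""
    report = {}
    for rule in collection:
        if rule.kind == "statistical":
            score = rule.stat(output)          # computed locally, no model call
        else:
            score = judge(rule.prompt, output)  # delegated to the judge
        report[rule.name] = (score, score >= rule.threshold)
    return report


def fake_judge(prompt: str, output: str) -> float:
    # Stand-in for an LLM call: penalizes outputs containing an email-like token.
    return 0.0 if "@" in output else 1.0


collection = [
    Rule("Safety", "model_graded", 0.5, prompt="Is the reply safe?"),
    Rule("PII Detection", "custom_prompt", 0.5, prompt="Does the reply leak personal data?"),
    Rule("Length", "statistical", 0.5, stat=lambda out: min(len(out) / 20, 1.0)),
]

report = evaluate(collection, "Contact me at a@b.co", fake_judge)
print(report)
```

Because a collection is just a list of rules, the same list can be applied to any project or reused as a global default.
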
    The Platform

    SDK, harness evaluation, dashboard — one stack

    Everything your agent needs from dev to production, without stitching together five tools.

    SDK

    One init() call instruments your entire agent stack — LLM clients, HTTP, SQL, logging. Python and TypeScript, zero deps.

    • Auto-patches OpenAI, Anthropic, Gemini, Bedrock
    • W3C traceparent propagation
    • Session + device context

    Agent Harness Evaluation

    The evaluation core. Run batch jobs, score traces online, compare models pairwise — text, voice, image, and video.

    • Online + batch evaluation
    • Voice agent eval via LiveKit
    • Image / video quality scoring

    Dashboard

    Trace waterfall, session explorer, eval results, cost breakdown, and model catalog — everything in one place.

    • Trace waterfall per agent run
    • Per-session cost analytics
    • Regression comparison across deploys
    Built for developers

    One call.
    Everything tracked.

    init() patches every LLM client, HTTP layer, and database adapter your agent touches. Traces start flowing in seconds — no manual spans, no config files.

    • pip install syntropylabs-evalkit / npm install syntropylabs-evalkit
    • Auto-instruments OpenAI, Anthropic, Gemini, Bedrock, HTTP, SQL, Redis, Mongoose
    • Online eval: attach evaluators to a project, every trace gets scored automatically
    • W3C traceparent — frontend and backend stitched into one trace, no extra work
    agent_setup.py
    import syntropylabs as stl

    # user_session_id, device_fingerprint, and prompt come from your application
    stl.init(
        subscription_key="sk_live_...",
        service_name="my-agent",
        environment="production",
        session_id=user_session_id,
        device_id=device_fingerprint,
    )

    # OpenAI, Anthropic, HTTP, SQL: all patched automatically
    from openai import OpenAI
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # ↑ captured: model, prompt, completion, tokens, latency

    Works with your entire AI stack

    OpenAI
    Anthropic
    Gemini
    Bedrock
    Cohere
    Azure
    < 5ms ingest p99 latency
    90d trace retention
    6+ LLM providers
    2 SDK languages
    Pricing

    Simple, transparent pricing

    Start free. Scale up when you need it.

    Free

    $0

    For developers getting started with agent evaluation. No credit card required.

    • Up to 10,000 spans / month
    • 7-day trace retention
    • 2 LLM providers
    • Batch evaluation (100 rows)
    • Community support

    Enterprise

    Custom

    For organizations with custom security, scale, and compliance requirements.

    • Unlimited spans
    • 90-day trace retention
    • All LLM providers
    • Online + batch evaluation
    • On-premise deployment
    • SSO / SAML
    • SLA & dedicated support

    Learn how teams ship reliable AI

    Guides on agent observability, LLM evaluation techniques, voice and multimodal pipelines, and running evals in production.

    Ship AI agents
    you can trust.

    Instrument in minutes. Score every output. Catch failures before your users do.