EvalKit Docs v0.1.30
    syntropylabs.ai

    EvalKit

    EvalKit is an open-source SDK for adding LLM tracing and evaluation to your AI applications. Instrument your code in minutes; traces appear in the Syntropylabs dashboard automatically.

    🔭

    Distributed Tracing

    Every LLM call, tool use, and HTTP request, as a span.

    ⚗️

    LLM Evaluation

    Run custom prompt-based judges on any trace or dataset.

    ⚡

    Auto-Instrument

    Zero-config patching for OpenAI, Anthropic, Axios, and more.

    🧪

    Scenario Simulation

    Generate synthetic users and run them against your real agent.

    Available now: EvalKit is published on PyPI (pip install syntropylabs-evalkit) and npm (npm i syntropylabs-evalkit). The Python import name stays evalkit.

    Python SDK

    Installation

    Install from PyPI:

    pip install syntropylabs-evalkit

    The distribution installs as syntropylabs-evalkit but the import name stays evalkit.

    Quick Start

    import evalkit
    
    client = evalkit.init(
        subscription_key="tk_live_...",   # from Settings → Tracing
        service_name="my-app",
        environment="production",
        debug=True,
    )
    
    # All OpenAI / Anthropic calls are now traced automatically
    from openai import OpenAI
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    Manual spans

    end, ctx = evalkit.start_span("my-operation", {"custom.key": "value"})
    try:
        result = do_something()
        end("ok")
    except Exception:
        end("error")
        raise
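
The try/except pattern above can be folded into a context manager so each call site stays short. This is a minimal sketch, not part of the SDK: it assumes only the shape shown above, where start_span(name, attrs) returns an (end, ctx) pair and end accepts a status string. The span helper and fake_start_span are hypothetical names; the fake stands in for evalkit.start_span so the example runs without the package installed.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, Optional, Tuple

# Shape assumed from the snippet above: start_span(name, attrs) -> (end, ctx)
StartSpan = Callable[[str, Dict[str, str]], Tuple[Callable[[str], None], object]]


@contextmanager
def span(start_span: StartSpan, name: str,
         attrs: Optional[Dict[str, str]] = None) -> Iterator[object]:
    """Open a span, close it with "ok" or "error", and re-raise any exception."""
    end, ctx = start_span(name, attrs or {})
    try:
        yield ctx
    except Exception:
        end("error")
        raise
    else:
        end("ok")


# Stand-in for evalkit.start_span so the sketch runs without the SDK.
closed = []


def fake_start_span(name, attrs):
    return (lambda status: closed.append((name, status))), {"name": name}


with span(fake_start_span, "my-operation", {"custom.key": "value"}):
    pass  # the traced work goes here

try:
    with span(fake_start_span, "failing-operation"):
        raise ValueError("boom")
except ValueError:
    pass  # the error still propagates after the span is closed
```

With the real SDK, the first argument would presumably be evalkit.start_span.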

    Tracing your own functions (APM): automatic

    Function tracing is on by default. init() traces every function in your app's own source tree (the directory of the file that called init()) as each module imports, emitting one function_call span per call with input, output, and latency. Third-party libraries are never wrapped. Only module-level functions are wrapped (class methods are left alone), and signatures are preserved so framework introspection (FastAPI Depends, etc.) keeps working.

    # Nothing to wire, just init():
    evalkit.init(subscription_key="tk_live_...", service_name="my-api")
    # every function your app defines is now traced as it imports.
    
    # Disable it:
    #   EVALKIT_FUNCTION_TRACE=false            (env)
    #   evalkit.init(..., function_tracing=False)
    # Trace sibling packages outside the caller's dir:
    #   evalkit.init(..., trace_packages=["support_bot", "workers"])

    For finer control you can still opt in explicitly for a function, a tool, a whole class, or a module/package:

    # One function -> function_call span (input / output / latency)
    @evalkit.trace_function()
    def do_work(x):
        return x * 2
    
    # One tool -> tool_call span (renders in Input/Output panels + tool metrics)
    @evalkit.trace_tool()
    def search_web(query: str):
        return run_search(query)
    
    # Every method of a class, APM-style
    @evalkit.traced
    class OrderService:
        def place(self, order): ...
        def cancel(self, id): ...
    
    # Every function defined in a module / whole package
    import myapp                      # binds the package name used by trace_package
    import myapp.services as svc
    evalkit.trace_module(svc)
    evalkit.trace_package(myapp)

    Client-side tools you run yourself (your own functions the model calls) only show their output if you wrap them with trace_tool, because the SDK sees the model's request but never your function's return value. Server-side tools (e.g. OpenAI web_search) and LangChain tools are captured automatically.

    Configuration options

    evalkit.init(
        subscription_key="tk_live_...",
        base_url="https://api.syntropylabs.ai",  # default
        service_name="my-service",
        environment="production",               # production | staging | development
        debug=False,                            # log exports to stdout
        scheduled_delay_millis=5000,            # batch export delay (ms)
        max_body_bytes=10 * 1024 * 1024,        # max captured HTTP body size (default 10 MB)
        function_tracing=True,                  # auto-trace your app's functions (default True)
        trace_packages=None,                    # extra sibling packages to auto-trace
    )
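
One way to keep secrets out of source is to assemble these keyword arguments from the environment. The sketch below is an assumption-laden helper, not an SDK feature: build_init_kwargs is a hypothetical name, EVALKIT_FUNCTION_TRACE is the variable mentioned above, and EVALKIT_SUBSCRIPTION_KEY is borrowed from the NestJS example (its use on the Python side is assumed). The accepted "off" spellings and the debug rule are choices made for this sketch.

```python
from typing import Any, Dict, Mapping

ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def build_init_kwargs(env: Mapping[str, str], service_name: str,
                      environment: str) -> Dict[str, Any]:
    """Collect keyword arguments for evalkit.init() from an environment mapping."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    key = env.get("EVALKIT_SUBSCRIPTION_KEY")
    if not key:
        raise ValueError("EVALKIT_SUBSCRIPTION_KEY is not set")
    # Treating "false"/"0"/"no" as off is this sketch's choice,
    # not documented SDK behaviour.
    tracing_flag = env.get("EVALKIT_FUNCTION_TRACE", "true").strip().lower()
    return {
        "subscription_key": key,
        "service_name": service_name,
        "environment": environment,
        "debug": environment == "development",  # log exports only while developing
        "function_tracing": tracing_flag not in {"false", "0", "no"},
    }


kwargs = build_init_kwargs(
    {"EVALKIT_SUBSCRIPTION_KEY": "tk_live_example", "EVALKIT_FUNCTION_TRACE": "false"},
    service_name="my-service",
    environment="staging",
)
```

In an application this would be called as build_init_kwargs(os.environ, ...) and the result passed on with evalkit.init(**kwargs).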

    TypeScript / Node.js SDK

    Installation

    Install from npm:

    npm install syntropylabs-evalkit # or yarn add syntropylabs-evalkit

    Quick Start

    import * as evalkit from 'syntropylabs-evalkit';
    
    const client = evalkit.init({
      subscriptionKey: 'tk_live_...',   // from Settings → Tracing
      serviceName: 'my-app',
      environment: 'production',
      debug: true,
    });
    
    // OpenAI is auto-patched, so just use it normally
    import OpenAI from 'openai';
    const openai = new OpenAI();
    
    const res = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Hello!' }],
    });
    console.log(res.choices[0].message.content);

    Manual spans

    import { startSpan } from 'syntropylabs-evalkit';
    
    const { end } = startSpan('my-operation', { 'custom.key': 'value' });
    try {
      const data = await fetchData();
      end('OK', { 'result.count': data.length });
    } catch (e) {
      end('ERROR', { 'error.message': String(e) });
      throw e;
    }

    Tracing your own functions & tools (APM)

    Auto-instrumentation covers libraries (LLM / HTTP / DB). For your own code, opt in for a function, a tool, a class, or a whole service object:

    import * as evalkit from 'syntropylabs-evalkit';
    import { Traced } from 'syntropylabs-evalkit';
    
    // One function -> function_call span (input / output / latency)
    const doWork = evalkit.traceFunction('doWork', (x: number) => x * 2);
    
    // One tool -> tool_call span (Input/Output panels + tool metrics)
    const searchWeb = evalkit.traceTool('search_web', (q: string) => runSearch(q));
    
    // Every method of a class, APM-style
    @Traced()
    class OrderService {
      place(order: Order) { /* ... */ }
      cancel(id: string) { /* ... */ }
    }
    
    // Every function of a service object (parity with Python's trace_module)
    export const orders = evalkit.traceObject({ place, cancel }, { prefix: 'orders' });
    
    // NestJS: trace EVERY provider/controller method via the DI registry.
    // One line in main.ts after create. Pass the app and the SDK resolves
    // DiscoveryService itself (no @nestjs/core import needed):
    await evalkit.enableNestjsAutoTrace(app);

    Client-side tools you run yourself only show their output if you wrap them with traceTool, because the SDK sees the model's request but never your function's return value. Server-side tools (e.g. OpenAI web_search) and LangChain tools are captured automatically.

    Auto-discovery of all functions is only possible where the framework has a registry (NestJS DI). Route metadata (@Get, guards) is preserved, and the call never throws. For Express / Fastify / Koa / Hono / Hapi, use traceObject / traceFunction on your modules.

    NestJS / Express

    The SDK auto-instruments all incoming HTTP requests, so no manual middleware is needed. Call evalkit.init() before your app bootstraps, then one line after create() turns on full APM (every provider/controller method):

    // main.ts (NestJS)
    import * as evalkit from 'syntropylabs-evalkit';
    
    evalkit.init({
      subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!,
      serviceName: 'my-nestjs-app',
      environment: process.env.NODE_ENV ?? 'development',
    });
    
    // Then bootstrap normally
    const app = await NestFactory.create(AppModule);
    await evalkit.enableNestjsAutoTrace(app);   // APM: trace every provider/controller
    await app.listen(3000);

    Configuration options

    evalkit.init({
      subscriptionKey: 'tk_live_...',
      baseUrl: 'https://api.syntropylabs.ai',  // default
      serviceName: 'my-service',
      environment: 'production',
      debug: false,
      scheduledDelayMillis: 5000,              // batch export delay (ms)
      maxBodyBytes: 10 * 1024 * 1024,          // max captured HTTP body size (default 10 MB)
    });

    Tracing

    What gets traced automatically

    OpenAI: chat.completions, embeddings, images (model, tokens, latency)
    Anthropic: messages.create (model, tokens, stop reason)
    HTTP (fetch / axios / http): method, URL, status, latency
    Incoming requests: All inbound HTTP/HTTPS requests get a root span
    Console.error: Captured as span events with stack traces
    Mongoose / PostgreSQL / Redis / MySQL: Query text and latency

    W3C traceparent propagation

    Pass the traceparent header from your frontend to backend to stitch spans into a single trace across services. EvalKit reads the header automatically and creates a child span.

    // Frontend: propagate traceparent to your API
    fetch('/api/chat', {
      method: 'POST', // a request with a body cannot use the default GET
      headers: {
        'Content-Type': 'application/json',
        traceparent: evalkit.getTraceparent(), // from syntropylabs-evalkit
      },
      body: JSON.stringify({ message }),
    });
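
For readers unfamiliar with the header itself, the sketch below parses a W3C traceparent value (version 00 layout: version, 16-byte trace id, 8-byte parent span id, flags, all lowercase hex) and derives a child that keeps the trace id. It illustrates the standard format only; the function names are hypothetical and this is not how EvalKit necessarily handles the header internally.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    def header(self) -> str:
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags}"


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # The spec forbids version ff and all-zero trace or parent ids.
    if (parsed.version == "ff"
            or set(parsed.trace_id) == {"0"}
            or set(parsed.parent_id) == {"0"}):
        return None
    return parsed


def child_of(parent: TraceParent) -> TraceParent:
    """Keep the trace id and mint a fresh 8-byte span id for the child."""
    return parent._replace(parent_id=secrets.token_hex(8))


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
child = child_of(incoming)
outgoing_header = child.header()  # what a service would forward downstream
```

Every span that shares the same trace_id ends up in one trace, which is what lets the frontend and backend spans be stitched together.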

    Viewing traces

    Go to Dashboard → Tracing. Select a Trace Project to see all traces. Click any trace to open the waterfall view, where spans are colour-coded by type (LLM, tool, HTTP, DB).

    Evaluation

    Offline evaluation (manual)

    Select one or more traces in the Tracing dashboard, click Evaluate, choose your evaluation rules and a judge model, then click Run. Results are saved and shown in the Evaluation tab of each trace.

    Online evaluation (automatic)

    Enable online evaluation per Trace Project to automatically evaluate every new trace as it arrives. Click Online Eval in the tracing toolbar, choose rules, a judge model, and a polling interval.

    auto badge: Traces evaluated automatically show a green "auto" pill alongside their score.
    interval: The backend polls for new traces at your configured interval (1 to 60 min).

    Evaluation rules

    Rules are prompt templates that a judge LLM uses to score a trace. Go to Dashboard → Evaluators to create rules. Group related rules into Collections so you can apply them all at once.

    Example rule prompt:
    
    "Given the conversation below, score from 0.0 to 1.0 how well the
    assistant stayed on topic and did not hallucinate.
    
    Conversation:
    {{trace}}
    
    Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }"
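
If you prototype a rule locally before saving it, it helps to render the template and validate the judge's reply against the JSON shape the prompt requests. The following is a rough sketch under assumptions: it treats {{trace}} as a plain string substitution (the platform's actual templating is not documented here), and render_rule and parse_judge_reply are hypothetical helpers. The sample conversation and reply are made up.

```python
import json
from typing import Any, Dict

RULE_TEMPLATE = (
    "Given the conversation below, score from 0.0 to 1.0 how well the\n"
    "assistant stayed on topic and did not hallucinate.\n\n"
    "Conversation:\n{{trace}}\n\n"
    'Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }'
)


def render_rule(template: str, trace_text: str) -> str:
    """Fill the {{trace}} placeholder (plain substitution is assumed here)."""
    return template.replace("{{trace}}", trace_text)


def parse_judge_reply(raw: str) -> Dict[str, Any]:
    """Validate a judge reply against the JSON shape the rule asks for."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if not isinstance(data["passed"], bool):
        raise ValueError("passed must be a boolean")
    return {"score": score, "passed": data["passed"], "reasoning": str(data["reasoning"])}


prompt = render_rule(RULE_TEMPLATE, "user: Where is my order?\nassistant: It ships tomorrow.")
verdict = parse_judge_reply('{"score": 0.9, "passed": true, "reasoning": "Stayed on topic."}')
```

Rejecting out-of-range scores early makes a badly worded rule show up as an error instead of as a misleading number.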

    Scenario Simulation

    EvalKit can generate realistic synthetic user scenarios from your agent's system prompt and tool list, then run each scenario against your real agent, turn by turn, and score the results. This lets you stress-test your agent's behavior without writing test cases by hand.

    How it works

    1. Generate: EvalKit calls the control-plane API to invent realistic customer scenarios from your system prompt and tool descriptions.
    2. Simulate: EvalKit plays each synthetic user message against your real agent, one turn at a time, using your entrypoint function.
    3. Score: Each scenario is scored (tool trajectory, required terms, LLM-as-judge) and results appear in the Simulations tab.
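
To give a feel for the two deterministic checks named in the scoring step, here is one plausible way to compute a tool-trajectory score and a required-terms score. The exact formulas EvalKit uses are not stated in this page, so everything below is an assumption for illustration: the function names, the in-order matching rule, and the equal weighting are all hypothetical, and the LLM-as-judge part is left out.

```python
from typing import Dict, Sequence


def tool_trajectory_score(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of expected tools found in the actual calls, respecting order."""
    if not expected:
        return 1.0
    remaining = iter(actual)
    # `tool in remaining` advances the iterator, so matches must occur in order.
    matched = sum(1 for tool in expected if tool in remaining)
    return matched / len(expected)


def required_terms_score(terms: Sequence[str], reply: str) -> float:
    """Fraction of required terms that appear in the reply (case-insensitive)."""
    if not terms:
        return 1.0
    text = reply.lower()
    return sum(term.lower() in text for term in terms) / len(terms)


def score_scenario(expected_tools: Sequence[str], actual_tools: Sequence[str],
                   terms: Sequence[str], reply: str) -> Dict[str, float]:
    trajectory = tool_trajectory_score(expected_tools, actual_tools)
    term_score = required_terms_score(terms, reply)
    # Equal weighting is an arbitrary choice for this sketch.
    return {
        "tool_trajectory": trajectory,
        "required_terms": term_score,
        "overall": (trajectory + term_score) / 2,
    }


scores = score_scenario(
    expected_tools=["lookup_order", "create_support_ticket"],
    actual_tools=["search_knowledge_base", "lookup_order"],
    terms=["refund", "ticket"],
    reply="I opened a ticket for your refund.",
)
```

In this made-up case the agent called only one of the two expected tools but mentioned both required terms.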

    Python: generate_scenarios + simulate_user

    import evalkit
    
    evalkit.init(subscription_key="tk_live_...", service_name="support-bot")
    
    SYSTEM_PROMPT = "You are a customer support agent for Acme Store..."
    TOOLS = ["search_knowledge_base", "lookup_order", "create_support_ticket"]
    
    # Step 1: generate scenarios from your agent's own prompt & tools
    scenarios = evalkit.generate_scenarios(
        SYSTEM_PROMPT,
        tools=TOOLS,
        count=5,
        provider="anthropic",        # or "openai" / "google"
        api_key="sk-ant-...",        # your BYOK key for generation
        model="claude-haiku-4-5-20251001",
    )
    
    # Step 2: run each scenario against the real agent and score it
    def entrypoint(ctx: evalkit.SimContext) -> evalkit.AgentTurnResult:
        # ctx.message: the synthetic user's message for this turn
        # ctx.session_id: keep multi-turn context across calls for the same session
        reply, tool_calls = run_agent(ctx.session_id, ctx.message)
        return evalkit.AgentTurnResult(
            text=reply,
            tool_calls=[{"name": t} for t in tool_calls],
        )
    
    report = evalkit.simulate_user(entrypoint, scenarios, tags=["ci"])
    print("Simulation ID:", report["simulation_id"])

    TypeScript: generateScenarios + simulateUser

    import * as evalkit from 'syntropylabs-evalkit';
    
    const SYSTEM_PROMPT = 'You are a customer support agent for Acme Store...';
    
    evalkit.init({ subscriptionKey: 'tk_live_...', serviceName: 'support-bot' });
    
    const scenarios = await evalkit.generateScenarios({
      agentInstructions: SYSTEM_PROMPT,
      tools: ['search_knowledge_base', 'lookup_order', 'create_support_ticket'],
      count: 5,
      provider: 'anthropic',
      apiKey: process.env.ANTHROPIC_API_KEY,
    });
    
    const { simulationId, results } = await evalkit.simulateUser({
      scenarios,
      entrypoint: async (ctx) => {
        const { text, toolCalls } = await runAgent(ctx.sessionId, ctx.message);
        return { text, toolCalls };
      },
      tags: ['ci'],
    });
    console.log('Simulation ID:', simulationId);

    Evaluate a run against a collection: evaluate_simulation

    simulate_user records each scenario as a trace and computes lightweight offline scores. To grade a finished run against your project's evaluator collection (LLM-as-judge rules), call evaluate_simulation with the simulation_id and a collection id. The judge is bring-your-own-key. Results come back per scenario and per criterion (with reasons) and also appear in the Tracing dashboard.

    report = evalkit.simulate_user(entrypoint, scenarios, tags=["ci"])
    
    result = evalkit.evaluate_simulation(
        report["simulation_id"],
        collection_id="665f0c...",       # Dashboard → Evaluators → Collections
        provider="openai",               # BYOK judge
        model="gpt-4o",
        api_key="sk-...",
        max_tokens=1024,                 # optional judge output cap
        # run_id="run_..."               # optional; defaults to the latest run
    )
    
    print(result["aggregate"])           # { averageScore, passRate, ... }
    for scn in result["scenarios"]:
        print(scn["name"], scn["overallScore"], scn["passed"])
        for m in scn["metrics"]:
            print("  -", m["ruleName"], m["score"], m["reason"])

    The TypeScript equivalent uses evaluateSimulation:

    const { simulationId } = await evalkit.simulateUser({ scenarios, entrypoint });
    
    const result = await evalkit.evaluateSimulation({
      simulationId,
      collectionId: '665f0c...',         // Dashboard → Evaluators → Collections
      provider: 'openai',                // BYOK judge
      model: 'gpt-4o',
      apiKey: process.env.OPENAI_API_KEY!,
      maxTokens: 1024,                   // optional judge output cap
    });
    
    console.log(result.aggregate);       // { averageScore, passRate, ... }
    for (const scn of result.scenarios) {
      console.log(scn.name, scn.overallScore, scn.passed);
      for (const m of scn.metrics) console.log('  -', m.ruleName, m.score, m.reason);
    }
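
Since runs can be tagged for CI, a natural follow-up is to fail a build when too many scenarios fail. The sketch below works on a dict shaped like the result printed above (scenarios with name, overallScore, passed, and metrics carrying ruleName, score, reason). The helper names and thresholds are hypothetical and the sample data is invented. The pass rate is recomputed from the per-scenario passed flags because this page does not say whether aggregate passRate is a fraction or a percentage.

```python
from typing import Any, Dict, List


def failing_criteria(result: Dict[str, Any], min_score: float) -> List[str]:
    """List "scenario / rule" pairs whose score falls below min_score."""
    return [
        f'{scn["name"]} / {m["ruleName"]}'
        for scn in result["scenarios"]
        for m in scn["metrics"]
        if m["score"] < min_score
    ]


def gate(result: Dict[str, Any], min_pass_rate: float) -> bool:
    """True when the share of passed scenarios meets the threshold."""
    scenarios = result["scenarios"]
    if not scenarios:
        return False  # an empty run should not count as a pass
    pass_rate = sum(1 for s in scenarios if s["passed"]) / len(scenarios)
    return pass_rate >= min_pass_rate


# Invented sample in the shape shown above.
sample = {
    "scenarios": [
        {"name": "refund request", "overallScore": 0.9, "passed": True,
         "metrics": [{"ruleName": "on_topic", "score": 0.9, "reason": "Stayed on topic."}]},
        {"name": "angry customer", "overallScore": 0.4, "passed": False,
         "metrics": [{"ruleName": "tone", "score": 0.4, "reason": "Curt reply."}]},
    ]
}

ok_at_half = gate(sample, 0.5)
ok_at_strict = gate(sample, 0.8)
problems = failing_criteria(sample, 0.5)
```

In a CI job, a False result from gate would typically be turned into a non-zero exit code, with the failing criteria printed for triage.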

    Multi-turn scenarios & count

    Generated scenarios are multi-turn by default: each carries a conversation_plan (≥3 user turns; ≥2 for pure-refusal cases) and a turns array of verbatim user messages the simulator replays in order, so a single run drives a real back-and-forth, not a one-shot prompt.

    count is honored exactly. When you ask for more scenarios than there are coverage categories, the categories cycle (round-robin) and repeated-category scenarios are made materially distinct, so count=20 returns 20 scenarios, not 9.
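
The round-robin behaviour can be pictured with a few lines of Python. This is only an illustration of the cycling idea: assign_categories is a hypothetical helper, the category names are placeholders (the real coverage categories are not listed here), and the variant number merely marks where a repeated category would need a materially different scenario.

```python
from itertools import cycle, islice
from typing import Dict, List, Tuple


def assign_categories(categories: List[str], count: int) -> List[Tuple[str, int]]:
    """Pair each of `count` scenarios with a category and a per-category variant number."""
    seen: Dict[str, int] = {}
    plan: List[Tuple[str, int]] = []
    for category in islice(cycle(categories), count):
        seen[category] = seen.get(category, 0) + 1
        plan.append((category, seen[category]))
    return plan


categories = [f"category_{i}" for i in range(1, 10)]  # nine placeholder names
plan = assign_categories(categories, 20)
```

With nine categories and count=20, the first two categories are used three times and the remaining seven twice, giving 20 distinct (category, variant) pairs.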

    BYOK: bring your own key for generation

    Scenario generation uses a hosted model by default. Pass provider + api_key (Python) or apiKey (TypeScript) to use your own Anthropic, OpenAI, or Google key instead. The agent that runs during simulation is always your own; EvalKit only orchestrates the synthetic-user messages.

    Viewing results

    Go to Dashboard → Simulations. Each run shows scenario scores, per-turn traces, tool trajectory, and LLM-judge ratings. Runs are tagged so you can filter by environment or CI branch.

    Key Concepts

    Trace: A tree of spans representing a single end-to-end request or job. Identified by a trace ID (W3C format).
    Span: A single timed operation within a trace, such as an LLM call, a tool invocation, or an HTTP request.
    Trace Project: A named group of traces with its own subscription key and tenant ID. Maps to one deployment/service.
    Subscription Key: A secret token (tk_live_...) that authenticates your SDK to the trace ingestion endpoint.
    Evaluation Rule: A prompt template used by a judge model to score a trace on a specific dimension (safety, relevance, tone, etc.).
    Collection: A named bundle of evaluation rules. Pick a collection to run all its rules at once.
    Online Eval: Automatic evaluation triggered by the backend whenever a new trace arrives for a configured project.
    Offline Eval: Manual evaluation triggered by selecting traces in the dashboard.
    Judge Model: The LLM used to score traces against evaluation rules. Configure in Settings → Models.
    Simulation: A run of synthetic user scenarios against your real agent, scored automatically. Appears in the Simulations tab.
    Scenario: A generated multi-turn conversation with a synthetic user persona, including expected tools and scoring constraints.
    generate_scenarios: EvalKit API that invents realistic test scenarios from your agent's system prompt and tool list.
    simulate_user: EvalKit API that drives each scenario through your entrypoint function and scores the results.
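
The Trace and Span definitions describe a tree, which the following sketch makes concrete by linking a flat list of spans through parent ids and printing the result as an indented outline, roughly what a waterfall view conveys. The Span dataclass here is a hypothetical minimal model for illustration, not EvalKit's span schema, and the span names are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> List[Span]:
    """Link spans to their parents and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots: List[Span] = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is None:
            roots.append(s)  # no parent, or the parent is not in this batch
        else:
            parent.children.append(s)
    return roots


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the subtree as indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


flat = [
    Span("a", None, "POST /api/chat"),
    Span("b", "a", "chat.completions"),
    Span("c", "a", "search_web"),
    Span("d", "c", "GET https://example.com"),
]
roots = build_tree(flat)
outline = render(roots[0])
```

A trace is then simply the tree under one root span, and every span in it shares the same trace ID.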

    EvalKit is built by Syntropylabs. Published on PyPI and npm.