EvalKit Docs v0.1.x (Coming Soon)
    syntropylabs.ai

    EvalKit

    EvalKit is an open-source SDK for adding LLM tracing and evaluation to your AI applications. Instrument your code in minutes — traces appear in the Syntropylabs dashboard automatically.

    Distributed Tracing

    Every LLM call, tool use, and HTTP request is recorded as a span.

    LLM Evaluation

    Run custom prompt-based judges on any trace or dataset.

    Auto-Instrument

    Zero-config patching for OpenAI, Anthropic, Axios, and more.

    Note: The EvalKit packages are not yet published to PyPI or npm. Once published, they will be installable with pip install evalkit and npm install @evalkit/sdk. Until then, use the local SDK from this repository.

    Python SDK

    Installation

    Once published (coming soon):

    pip install evalkit

    Or install from the repository directly:

    pip install git+https://github.com/syntropylabs/evalkit-py.git

    Quick Start

    import evalkit
    
    client = evalkit.init(
        subscription_key="tk_live_...",   # from Settings → Tracing
        service_name="my-app",
        environment="production",
        debug=True,
    )
    
    # All OpenAI / Anthropic calls are now traced automatically
    from openai import OpenAI
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

    Manual spans

    with evalkit.start_span("my-operation") as span:
        span.set_attribute("custom.key", "value")
        result = do_something()
        span.set_attribute("result.length", len(result))
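
    When the same kind of operation is traced in many places, the context manager can be wrapped in a decorator. The sketch below is self-contained, so it uses a stand-in start_span that only records spans in memory. It is not EvalKit's implementation, and traced, _FakeSpan, and RECORDED are hypothetical names. With the SDK installed, you would presumably call evalkit.start_span in its place.

```python
import functools
import time
from contextlib import contextmanager

RECORDED = []  # finished stand-in spans, for illustration only


class _FakeSpan:
    """Stand-in for a span object; mirrors only set_attribute()."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


@contextmanager
def start_span(name):
    """Stand-in for evalkit.start_span (assumption: same call shape)."""
    span = _FakeSpan(name)
    started = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Note the failure on the span, then let the error propagate.
        span.set_attribute("error.type", type(exc).__name__)
        raise
    finally:
        span.set_attribute("duration_ms", (time.perf_counter() - started) * 1000)
        RECORDED.append(span)


def traced(name=None):
    """Decorator that runs the wrapped function inside a span."""
    def decorator(func):
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with start_span(span_name) as span:
                result = func(*args, **kwargs)
                # Record the result size when the result has a length.
                if hasattr(result, "__len__"):
                    span.set_attribute("result.length", len(result))
                return result
        return wrapper
    return decorator


@traced("summarize")
def summarize(text):
    return text[:20]


summarize("A long document that needs a summary.")
```

    The decorator keeps the span name and the result.length attribute in one place, so individual call sites stay free of tracing code.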

    Configuration options

    evalkit.init(
        subscription_key="tk_live_...",
        base_url="https://api.syntropylabs.ai",  # default
        service_name="my-service",
        environment="production",               # production | staging | development
        debug=False,                            # log exports to stdout
        scheduled_delay_millis=5000,            # batch export delay (ms)
    )
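
    In a deployed service these options usually come from the environment rather than from literals in code. The helper below is a minimal sketch of that pattern, not part of the SDK: init_kwargs_from_env is a hypothetical function, and apart from EVALKIT_SUBSCRIPTION_KEY the variable names are assumptions chosen for illustration. It also assumes the three environments listed above are the only accepted values.

```python
import os

VALID_ENVIRONMENTS = {"production", "staging", "development"}


def init_kwargs_from_env(env=None):
    """Build keyword arguments for evalkit.init() from environment variables."""
    env = os.environ if env is None else env

    key = env.get("EVALKIT_SUBSCRIPTION_KEY")
    if not key:
        raise ValueError("EVALKIT_SUBSCRIPTION_KEY is not set")

    # Hypothetical variable name; defaults to the least risky environment.
    environment = env.get("EVALKIT_ENVIRONMENT", "development")
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")

    return {
        "subscription_key": key,
        "service_name": env.get("EVALKIT_SERVICE_NAME", "my-service"),
        "environment": environment,
        # Log exports to stdout only outside production.
        "debug": environment != "production",
        "scheduled_delay_millis": int(env.get("EVALKIT_DELAY_MILLIS", "5000")),
    }


kwargs = init_kwargs_from_env({
    "EVALKIT_SUBSCRIPTION_KEY": "tk_live_example",  # placeholder, not a real key
    "EVALKIT_ENVIRONMENT": "staging",
})
print(kwargs)
# With the SDK installed: evalkit.init(**kwargs)
```

    Keeping the subscription key out of source code also avoids committing a secret to version control.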

    TypeScript / Node.js SDK

    Installation

    Once published (coming soon):

    npm install @evalkit/sdk
    # or
    yarn add @evalkit/sdk

    Quick Start

    import * as evalkit from '@evalkit/sdk';
    
    const client = evalkit.init({
      subscriptionKey: 'tk_live_...',   // from Settings → Tracing
      serviceName: 'my-app',
      environment: 'production',
      debug: true,
    });
    
    // OpenAI is auto-patched — just use it normally
    import OpenAI from 'openai';
    const openai = new OpenAI();
    
    const res = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: 'Hello!' }],
    });
    console.log(res.choices[0].message.content);

    Manual spans

    import { startSpan } from '@evalkit/sdk';
    
    const result = await startSpan('my-operation', async (span) => {
      span.setAttribute('custom.key', 'value');
      const data = await fetchData();
      span.setAttribute('result.count', data.length);
      return data;
    });

    NestJS / Express

    The SDK auto-instruments all incoming HTTP requests — no manual middleware needed. Just call evalkit.init() before your app bootstraps:

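    // main.ts (NestJS)
    import * as evalkit from '@evalkit/sdk';
    import { NestFactory } from '@nestjs/core';
    import { AppModule } from './app.module'; // default NestJS path; adjust to your project
    
    // Initialize tracing before the app is created
    evalkit.init({
      subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!,
      serviceName: 'my-nestjs-app',
      environment: process.env.NODE_ENV ?? 'development',
    });
    
    // Then bootstrap normally
    async function bootstrap() {
      const app = await NestFactory.create(AppModule);
      await app.listen(3000);
    }
    bootstrap();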

    Configuration options

    evalkit.init({
      subscriptionKey: 'tk_live_...',
      baseUrl: 'https://api.syntropylabs.ai',  // default
      serviceName: 'my-service',
      environment: 'production',
      debug: false,
      scheduledDelayMillis: 5000,              // batch export delay (ms)
    });

    Tracing

    What gets traced automatically

    OpenAI: chat.completions, embeddings, images (model, tokens, latency)
    Anthropic: messages.create (model, tokens, stop reason)
    HTTP (fetch / axios / http): method, URL, status, latency
    Incoming requests: All inbound HTTP/HTTPS requests get a root span
    console.error: Captured as span events with stack traces
    Mongoose / PostgreSQL / Redis / MySQL: Query text and latency

    W3C traceparent propagation

    Pass the traceparent header from your frontend to backend to stitch spans into a single trace across services. EvalKit reads the header automatically and creates a child span.

    // Frontend: propagate traceparent to your API
    fetch('/api/chat', {
      method: 'POST', // a request with a body cannot use the default GET
      headers: {
        'Content-Type': 'application/json',
        traceparent: evalkit.getTraceparent(), // from @evalkit/sdk
      },
      body: JSON.stringify({ message }),
    });
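
    For reference, a version 00 traceparent value has four hyphen-separated lowercase hex fields: version (2 characters), trace ID (32), parent span ID (16), and flags (2). The sketch below shows how a receiving service could validate the header and derive the value it forwards downstream. It illustrates the W3C format only and is not EvalKit's code; parse_traceparent and child_traceparent are hypothetical helpers.

```python
import re
import secrets

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The spec forbids version ff and all-zero trace and parent IDs.
    if (fields["version"] == "ff"
            or fields["trace_id"] == "0" * 32
            or fields["parent_id"] == "0" * 16):
        return None
    return fields


def child_traceparent(header):
    """Build the downstream header: same trace ID, freshly generated span ID."""
    fields = parse_traceparent(header)
    if fields is None:
        raise ValueError("malformed traceparent header")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{fields['trace_id']}-{new_span_id}-{fields['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(parse_traceparent(incoming)["trace_id"])
print(child_traceparent(incoming))
```

    Because the trace ID is carried through unchanged, every span created along the request path lands in the same trace.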

    Viewing traces

    Go to Dashboard → Tracing. Select a Trace Project to see all traces. Click any trace to open the waterfall view — spans are colour-coded by type (LLM, tool, HTTP, DB).

    Evaluation

    Offline evaluation (manual)

    Select one or more traces in the Tracing dashboard, click Evaluate, choose your evaluation rules and a judge model, then click Run. Results are saved and shown in the Evaluation tab of each trace.

    Online evaluation (automatic)

    Enable online evaluation per Trace Project to automatically evaluate every new trace as it arrives. Click Online Eval in the tracing toolbar, choose rules, a judge model, and a polling interval.

    auto badge: Traces evaluated automatically show a green "auto" pill alongside their score.
    interval: The backend polls for new traces at your configured interval (1–60 min).
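
    Polling means a trace is picked up at the next poll after it arrives, so its evaluation can lag by up to one interval. The sketch below is a rough illustration of that behaviour, not the backend's actual code; validate_interval and traces_due are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone


def validate_interval(minutes):
    """Accept only intervals in the documented 1-60 minute range."""
    if not 1 <= minutes <= 60:
        raise ValueError("polling interval must be between 1 and 60 minutes")
    return timedelta(minutes=minutes)


def traces_due(arrivals, last_poll, now):
    """Return IDs of traces that arrived after the previous poll, up to now."""
    return [trace_id for trace_id, arrived in arrivals if last_poll < arrived <= now]


interval = validate_interval(5)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_poll = now - interval
arrivals = [
    ("trace-a", now - timedelta(minutes=7)),  # already seen by an earlier poll
    ("trace-b", now - timedelta(minutes=3)),  # arrived since the last poll
]
print(traces_due(arrivals, last_poll, now))
```

    A shorter interval surfaces scores sooner at the cost of more frequent judge runs.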

    Evaluation rules

    Rules are prompt templates that a judge LLM uses to score a trace. Go to Dashboard → Evaluators to create rules. Group related rules into Collections so you can apply them all at once.

    Example rule prompt:
    
    "Given the conversation below, score from 0.0 to 1.0 how well the
    assistant stayed on topic and did not hallucinate.
    
    Conversation:
    {{trace}}
    
    Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }"
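
    To make the mechanics concrete, here is a minimal sketch of how a rule like this could be filled in and how a judge's reply could be checked against the requested JSON shape. It assumes that {{trace}} is replaced by plain string substitution and that the judge returns bare JSON; neither detail is confirmed here, and render_rule and parse_verdict are hypothetical helpers rather than EvalKit functions.

```python
import json

RULE_PROMPT = """Given the conversation below, score from 0.0 to 1.0 how well the
assistant stayed on topic and did not hallucinate.

Conversation:
{{trace}}

Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }"""


def render_rule(template, trace_text):
    """Substitute the conversation text for the {{trace}} placeholder."""
    return template.replace("{{trace}}", trace_text)


def parse_verdict(raw):
    """Validate a judge reply against the JSON shape the rule asks for."""
    data = json.loads(raw)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if not isinstance(data["passed"], bool):
        raise ValueError("passed must be a boolean")
    return {
        "score": score,
        "passed": data["passed"],
        "reasoning": str(data["reasoning"]),
    }


prompt = render_rule(RULE_PROMPT, "user: Hello!\nassistant: Hi, how can I help?")
verdict = parse_verdict('{"score": 0.9, "passed": true, "reasoning": "On topic."}')
print(verdict["score"], verdict["passed"])
```

    Stating the output shape in the rule, as the example prompt does, is what makes this kind of validation possible.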

    Key Concepts

    Trace: A tree of spans representing a single end-to-end request or job. Identified by a trace ID (W3C format).
    Span: A single timed operation within a trace, such as an LLM call, a tool invocation, or an HTTP request.
    Trace Project: A named group of traces with its own subscription key and tenant ID. Maps to one deployment/service.
    Subscription Key: A secret token (tk_live_...) that authenticates your SDK to the trace ingestion endpoint.
    Evaluation Rule: A prompt template used by a judge model to score a trace on a specific dimension (safety, relevance, tone, etc.).
    Collection: A named bundle of evaluation rules. Pick a collection to run all its rules at once.
    Online Eval: Automatic evaluation that the backend runs on new traces for a configured project, picked up at the configured polling interval.
    Offline Eval: Manual evaluation triggered by selecting traces in the dashboard.
    Judge Model: The LLM used to score traces against evaluation rules. Configure in Settings → Models.
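
    The relationship between traces and spans can be pictured with a small data model. This is an illustrative sketch, not EvalKit's internal schema: the Span fields, build_trace_tree, and render are hypothetical, and the span names are made up. Each span points at its parent, and the single span without a parent is the root of the trace.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str] = None
    children: list = field(default_factory=list)


def build_trace_tree(spans):
    """Link spans by parent_id and return the root span."""
    by_id = {span.span_id: span for span in spans}
    roots = []
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent is None:
            roots.append(span)  # no known parent: this is a root
        else:
            parent.children.append(span)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root span, found {len(roots)}")
    return roots[0]


def render(span, depth=0):
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


root = build_trace_tree([
    Span("a", "POST /api/chat"),
    Span("b", "openai.chat.completions", parent_id="a"),
    Span("c", "tool:search", parent_id="a"),
    Span("d", "HTTP GET", parent_id="c"),
])
print("\n".join(render(root)))
```

    The indented output mirrors what the waterfall view shows: the inbound request at the top, with LLM, tool, and HTTP spans nested beneath it.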

    EvalKit is built by Syntropylabs. SDK packages will be published soon.