EvalKit

EvalKit is an open-source SDK for adding LLM tracing and evaluation to your AI applications. Instrument your code in minutes — traces appear in the Syntropylabs dashboard automatically.

🔭

Distributed Tracing

Every LLM call, tool use, and HTTP request is recorded as a span.

⚗️

LLM Evaluation

Run custom prompt-based judges on any trace or dataset.

⚡

Auto-Instrument

Zero-config patching for OpenAI, Anthropic, Axios, and more.

Note: The EvalKit packages are not yet published to PyPI or npm. Once published, they will be installable with pip install evalkit and npm install @evalkit/sdk. Until then, use the local SDK from this repository.

Python SDK

Installation

Once published (coming soon):

pip install evalkit

Or install from the repository directly:

pip install git+https://github.com/syntropylabs/evalkit-py.git

Quick Start

import evalkit

client = evalkit.init(
    subscription_key="tk_live_...",   # from Settings → Tracing
    service_name="my-app",
    environment="production",
    debug=True,
)

# All OpenAI / Anthropic calls are now traced automatically
from openai import OpenAI
openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Manual spans

import evalkit

with evalkit.start_span("my-operation") as span:
    span.set_attribute("custom.key", "value")
    result = do_something()  # placeholder for your own function
    span.set_attribute("result.length", len(result))
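When many functions need the same span boilerplate, the context manager above can be wrapped in a decorator. The following is a minimal sketch and not part of EvalKit: `traced` is a hypothetical helper, and it takes the span factory as an argument (pass `evalkit.start_span`), so it relies only on the context-manager behaviour and the `set_attribute` method shown above. The `error.type` attribute name is also an assumption.

```python
import functools
from typing import Any, Callable, ContextManager, Optional


def traced(start_span: Callable[[str], ContextManager[Any]], name: Optional[str] = None):
    """Hypothetical helper: run the decorated function inside a span.

    `start_span` is expected to behave like `evalkit.start_span`: called with
    a span name, it returns a context manager that yields a span object with
    a `set_attribute(key, value)` method.
    """
    def decorator(func):
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with start_span(span_name) as span:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Record the failure type, then let the error propagate.
                    span.set_attribute("error.type", type(exc).__name__)
                    raise
        return wrapper
    return decorator
```

With this in place, a function decorated with `@traced(evalkit.start_span, "load-user")` would run inside a span named `load-user` on every call.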

Configuration options

evalkit.init(
    subscription_key="tk_live_...",
    base_url="https://api.syntropylabs.ai",  # default
    service_name="my-service",
    environment="production",               # production | staging | development
    debug=False,                            # log exports to stdout
    scheduled_delay_millis=5000,            # batch export delay (ms)
)
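In deployed services these options usually come from the environment rather than from literals. The sketch below builds the keyword arguments for `evalkit.init()` from environment variables. The `EVALKIT_*` variable names and the `load_evalkit_config` helper are assumptions made for illustration (only `EVALKIT_SUBSCRIPTION_KEY` appears elsewhere in this document); the SDK is not documented here as reading them itself.

```python
import os
from typing import Any, Dict, Mapping

ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def load_evalkit_config(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build keyword arguments for evalkit.init() from environment variables.

    All EVALKIT_* names are hypothetical and chosen for this sketch.
    """
    key = env.get("EVALKIT_SUBSCRIPTION_KEY")
    if not key:
        raise ValueError("EVALKIT_SUBSCRIPTION_KEY is not set")

    environment = env.get("EVALKIT_ENVIRONMENT", "development")
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")

    config: Dict[str, Any] = {
        "subscription_key": key,
        "service_name": env.get("EVALKIT_SERVICE_NAME", "my-service"),
        "environment": environment,
        "debug": env.get("EVALKIT_DEBUG", "").lower() in {"1", "true", "yes"},
        "scheduled_delay_millis": int(env.get("EVALKIT_EXPORT_DELAY_MS", "5000")),
    }
    # Only override base_url when explicitly set, so the SDK default applies.
    if "EVALKIT_BASE_URL" in env:
        config["base_url"] = env["EVALKIT_BASE_URL"]
    return config
```

The result can then be passed straight through with `evalkit.init(**load_evalkit_config())`, which keeps the subscription key out of source control.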

TypeScript / Node.js SDK

Installation

Once published (coming soon):

npm install @evalkit/sdk
# or
yarn add @evalkit/sdk

Quick Start

import * as evalkit from '@evalkit/sdk';

const client = evalkit.init({
  subscriptionKey: 'tk_live_...',   // from Settings → Tracing
  serviceName: 'my-app',
  environment: 'production',
  debug: true,
});

// OpenAI is auto-patched — just use it normally
import OpenAI from 'openai';
const openai = new OpenAI();

const res = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(res.choices[0].message.content);

Manual spans

import { startSpan } from '@evalkit/sdk';

const result = await startSpan('my-operation', async (span) => {
  span.setAttribute('custom.key', 'value');
  const data = await fetchData();
  span.setAttribute('result.count', data.length);
  return data;
});

NestJS / Express

The SDK auto-instruments all incoming HTTP requests — no manual middleware needed. Just call evalkit.init() before your app bootstraps:

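// main.ts (NestJS)
import * as evalkit from '@evalkit/sdk';

evalkit.init({
  subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!,
  serviceName: 'my-nestjs-app',
  environment: process.env.NODE_ENV ?? 'development',
});

// Then bootstrap normally
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module'; // adjust to your project layout

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);
}
bootstrap();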

Configuration options

evalkit.init({
  subscriptionKey: 'tk_live_...',
  baseUrl: 'https://api.syntropylabs.ai',  // default
  serviceName: 'my-service',
  environment: 'production',
  debug: false,
  scheduledDelayMillis: 5000,              // batch export delay (ms)
});

Tracing

What gets traced automatically

OpenAI: chat.completions, embeddings, images (model, tokens, latency)

Anthropic: messages.create (model, tokens, stop reason)

HTTP (fetch / axios / http): method, URL, status, latency

Incoming requests: All inbound HTTP/HTTPS requests get a root span

console.error: Captured as span events with stack traces

Mongoose / PostgreSQL / Redis / MySQL: Query text and latency

W3C traceparent propagation

Pass the traceparent header from your frontend to backend to stitch spans into a single trace across services. EvalKit reads the header automatically and creates a child span.

// Frontend: propagate traceparent to your API
import * as evalkit from '@evalkit/sdk';

fetch('/api/chat', {
  method: 'POST', // fetch rejects a body on the default GET method
  headers: {
    'Content-Type': 'application/json',
    traceparent: evalkit.getTraceparent(),
  },
  body: JSON.stringify({ message }),
});
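For reference, a W3C traceparent value has four dash-separated lowercase hex fields: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2-character trace flags. The sketch below validates and splits such a value for version-00 style headers. It illustrates the header format only and is not EvalKit's own parsing code; `TraceParent` and `parse_traceparent` are names chosen for this example.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a W3C traceparent header value; return None if it is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.group(
        "version", "trace_id", "parent_id", "flags"
    )
    # Version "ff" is forbidden, and all-zero IDs are invalid per the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A backend span created from this header would reuse `trace_id` and record `parent_id` as its parent, which is what stitches frontend and backend spans into one trace.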

Viewing traces

Go to Dashboard → Tracing. Select a Trace Project to see all traces. Click any trace to open the waterfall view — spans are colour-coded by type (LLM, tool, HTTP, DB).

Evaluation

Offline evaluation (manual)

Select one or more traces in the Tracing dashboard, click Evaluate, choose your evaluation rules and a judge model, then click Run. Results are saved and shown in the Evaluation tab of each trace.

Online evaluation (automatic)

Enable online evaluation per Trace Project to automatically evaluate every new trace as it arrives. Click Online Eval in the tracing toolbar, choose rules, a judge model, and a polling interval.

auto badge: Traces evaluated automatically show a green "auto" pill alongside their score.

interval: The backend polls for new traces at your configured interval (1–60 min).

Evaluation rules

Rules are prompt templates that a judge LLM uses to score a trace. Go to Dashboard → Evaluators to create rules. Group related rules into Collections so you can apply them all at once.

Example rule prompt:

"Given the conversation below, score from 0.0 to 1.0 how well the
assistant stayed on topic and did not hallucinate.

Conversation:
{{trace}}

Return JSON: { "score": <float>, "passed": <bool>, "reasoning": "<str>" }"
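To make the rule format concrete, the sketch below fills the `{{trace}}` placeholder and validates a judge reply against the JSON shape requested in the example prompt. It is an illustration under assumptions, not the platform's implementation: `render_rule`, `parse_judge_response`, and `JudgeResult` are hypothetical names, and how EvalKit serializes a trace into text is not specified here.

```python
import json
from dataclasses import dataclass


@dataclass
class JudgeResult:
    score: float
    passed: bool
    reasoning: str


def render_rule(template: str, trace_text: str) -> str:
    """Substitute the {{trace}} placeholder with the serialized trace."""
    return template.replace("{{trace}}", trace_text)


def parse_judge_response(raw: str) -> JudgeResult:
    """Validate the judge's JSON reply against the shape in the rule prompt."""
    data = json.loads(raw)
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if not isinstance(data["passed"], bool):
        raise ValueError("passed must be a boolean")
    return JudgeResult(float(score), data["passed"], str(data["reasoning"]))
```

Validating the reply this way guards against a judge model returning a score outside 0.0 to 1.0 or a malformed object, both of which are worth handling when writing your own rules.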

Key Concepts

Trace: A tree of spans representing a single end-to-end request or job. Identified by a trace ID (W3C format).

Span: A single timed operation within a trace, such as an LLM call, a tool invocation, or an HTTP request.

Trace Project: A named group of traces with its own subscription key and tenant ID. Maps to one deployment/service.

Subscription Key: A secret token (tk_live_...) that authenticates your SDK to the trace ingestion endpoint.

Evaluation Rule: A prompt template used by a judge model to score a trace on a specific dimension (safety, relevance, tone, etc.).

Collection: A named bundle of evaluation rules. Pick a collection to run all its rules at once.

Online Eval: Automatic evaluation triggered by the backend whenever a new trace arrives for a configured project.

Offline Eval: Manual evaluation triggered by selecting traces in the dashboard.

Judge Model: The LLM used to score traces against evaluation rules. Configure in Settings → Models.
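The relationship between a trace and its spans can be pictured as a small tree structure. The sketch below is a mental model only: the class and field names (`Span`, `Trace`, `start_ms`, `children`, and so on) are assumptions for illustration and do not describe EvalKit's wire format or SDK types.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    attributes: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Span"]]:
        """Yield (depth, span) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


@dataclass
class Trace:
    trace_id: str  # 32 hex characters in W3C format
    root: Span


# Example: an incoming request whose handler makes one LLM call.
llm_call = Span("openai.chat.completions", start_ms=10, end_ms=90)
request = Span("POST /api/chat", start_ms=0, end_ms=100, children=[llm_call])
trace = Trace("4bf92f3577b34da6a3ce929d0e0e4736", request)

# Print an indented outline, similar in spirit to a waterfall view.
for depth, span in trace.root.walk():
    print("  " * depth + f"{span.name} ({span.duration_ms} ms)")
```

Walking the tree depth-first gives the parent-before-child ordering that a waterfall view typically displays.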

EvalKit is built by Syntropylabs. SDK packages will be published soon.