npm - syntropylabs-evalkit - Versions diffs - 0.1.27 → 0.1.29 - Mend

syntropylabs-evalkit 0.1.27 → 0.1.29

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/README.md CHANGED

@@ -1,22 +1,25 @@
-# EvalKit TypeScript SDK
+# EvalKit — TypeScript SDK
-OpenTelemetry-based LLM observability, tracing, and evaluation for Node.js — by [Syntropy Labs](https://syntropylabs.ai).
-One `init()` call auto-instruments your LLM providers (OpenAI, Anthropic, Bedrock, Cohere, Google, Vertex, LangChain), databases (Postgres, MySQL, MongoDB, Redis), and HTTP clients — then ships traces to the EvalKit platform.
-> The Python SDK is published as [`syntropylabs-evalkit`](https://pypi.org/project/syntropylabs-evalkit/) on PyPI.
-## Install
+OpenTelemetry-based tracing and evaluation for Node.js LLM apps, by
+[Syntropy Labs](https://syntropylabs.ai). A single `init()` call auto-instruments
+your LLM providers, databases, and HTTP clients, then ships traces to the platform.
 ```bash
 npm install syntropylabs-evalkit
 ```
-Provider SDKs are **optional peer dependencies** — install only what you use:
+> The Python SDK ships as [`syntropylabs-evalkit`](https://pypi.org/project/syntropylabs-evalkit/) on PyPI.
-```bash
-npm install openai @anthropic-ai/sdk
-```
+## Contents
+- [Quick start](#quick-start)
+- [What gets traced](#what-gets-traced)
+- [Framework middleware](#framework-middleware)
+- [Trace your own code](#trace-your-own-code)
+- [Manual spans](#manual-spans)
+- [Offline evaluation](#offline-evaluation)
+- [Scenario simulation](#scenario-simulation)
+- [Configuration](#configuration)
 ## Quick start
@@ -24,18 +27,36 @@ npm install openai @anthropic-ai/sdk
 import evalkit from "syntropylabs-evalkit";
 evalkit.init({
-  subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!,
+  subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!,   // Dashboard → Settings → Tracing
   serviceName: "my-service",
 });
-// That's it — any OpenAI / Anthropic / DB / HTTP call from here is traced.
+// Every OpenAI / Anthropic / DB / HTTP call from here on is traced automatically.
 ```
-> Call `init()` as early as possible (top of your entrypoint, before other imports run requests) so auto-instrumentation can hook the libraries.
+Call `init()` as early as possible — at the top of your entrypoint, before other
+modules run requests — so auto-instrumentation can hook the libraries.
+## What gets traced
+| Category    | Captured automatically                                              |
+| ----------- | ------------------------------------------------------------------- |
+| LLM clients | OpenAI, Anthropic, Bedrock, Cohere, Google, Vertex, LangChain       |
+| HTTP        | `fetch`, `axios`, `node:http` — method, URL, status, latency        |
+| Databases   | Postgres, MySQL, MongoDB, Redis — query text + latency              |
+| Inbound     | Every incoming HTTP request becomes a root trace                    |
+| Your code   | Opt-in per function / tool / class, or whole-app on NestJS          |
+Provider SDKs are **optional peer dependencies** — install only what you use:
+```bash
+npm install openai @anthropic-ai/sdk
+```
 ## Framework middleware
-Each incoming request becomes a root trace, with downstream LLM/DB/HTTP calls nested underneath.
+Each incoming request becomes a root trace, with downstream LLM/DB/HTTP calls nested
+underneath.
 ```ts
 import express from "express";
@@ -47,118 +68,150 @@ const app = express();
 app.use(evalkit.expressMiddleware());
 ```
-Supported adapters: `expressMiddleware()`, `fastifyPlugin()`, `koaMiddleware()`, `honoMiddleware()`, `hapiPlugin()`, and `createNestjsInterceptor()`.
+Adapters: `expressMiddleware()`, `fastifyPlugin()`, `koaMiddleware()`,
+`honoMiddleware()`, `hapiPlugin()`, `createNestjsInterceptor()`.
-## Manual tracing helpers
+### NestJS — trace the whole app
-```ts
-import evalkit from "syntropylabs-evalkit";
+NestJS exposes a DI registry, so the SDK can wrap **every** provider/controller method
+for you. One line in `main.ts` — pass the app; the SDK resolves `DiscoveryService`
+itself (no `@nestjs/core` import needed):
-// Open a span by hand
-const { end } = evalkit.startSpan("embed-documents", { count: 42 });
-try {
-  await embed(docs);
-  end("OK", { "result.count": docs.length });
-} catch (e) {
-  end("ERROR", { "error.message": String(e) });
-  throw e;
-}
+```ts
+evalkit.init({ subscriptionKey: "tk_live_...", serviceName: "orchestrator" });
-await evalkit.flush(); // force-flush before process exit
+const app = await NestFactory.create(AppModule);
+await evalkit.enableNestjsAutoTrace(app);   // ← the only line you add
+await app.listen(5000);
 ```
-## Tracing your own functions & tools (APM)
+Route metadata (`@Get`, `@Body`, guards) is preserved, and the call never throws.
+## Trace your own code
-Auto-instrumentation covers libraries (LLM / HTTP / DB). For **your** code, opt in —
-a function, a tool, a class, or a whole service object:
+Auto-instrumentation covers libraries. For **your** code, opt in — a function, a tool,
+a class, or a whole service object:
 ```ts
 import evalkit, { Traced } from "syntropylabs-evalkit";
-// One function -> function_call span (input / output / latency)
-const rankResults = evalkit.traceFunction("rank-results", rank);
+const rankResults = evalkit.traceFunction("rank-results", rank);   // → function_call span
+const searchWeb = evalkit.traceTool("search_web", (q: string) => runSearch(q)); // → tool_call span
-// One tool -> tool_call span (Input/Output panels + tool metrics)
-const searchWeb = evalkit.traceTool("search_web", (q: string) => runSearch(q));
-// Every method of a class, APM-style
-@Traced()
+@Traced()                            // → every method of the class
 class OrderService {
   place(order: Order) { /* ... */ }
   cancel(id: string)  { /* ... */ }
 }
-// Or one method
-class Service {
-  @evalkit.TraceMethod()
-  async compute() { /* ... */ }
-}
 // Every function of a service object (parity with Python's trace_module)
 export const orders = evalkit.traceObject({ place, cancel }, { prefix: "orders" });
 ```
-### NestJS — trace the whole app automatically
+> A client-side tool the model calls only shows its **output** if you wrap it with
+> `traceTool` — the SDK sees the model's request, not your function's return value.
+> Server-side tools (e.g. OpenAI `web_search`) and LangChain tools are automatic.
-NestJS exposes a DI registry, so the SDK can wrap **every** provider/controller
-method for you — no per-class decorators. One line in `main.ts`: pass the app and
-the SDK resolves `DiscoveryService` itself (no `@nestjs/core` import needed):
+## Manual spans
 ```ts
-const app = await NestFactory.create(AppModule);
-evalkit.init({ subscriptionKey: "tk_live_…", serviceName: "orchestrator" });
-await evalkit.enableNestjsAutoTrace(app);   // ← the only line you add
-await app.listen(5000);
-```
-Route metadata (`@Get`, `@Body`, guards) is preserved, so routing and auth are
-unaffected. The call never throws — if discovery can't be resolved it's a no-op.
-You can still pass a `DiscoveryService` directly if you prefer.
+import { startSpan } from "syntropylabs-evalkit";
-> Auto-discovery is only possible where the framework has a registry (NestJS). For
-> Express / Fastify / Koa / Hono / Hapi, use `traceObject` / `traceFunction` on your
-> own modules — incoming HTTP and LLM/DB/HTTP calls are already auto-traced.
+const { end } = startSpan("embed-documents", { count: 42 });
+try {
+  await embed(docs);
+  end("OK", { "result.count": docs.length });
+} catch (e) {
+  end("ERROR", { "error.message": String(e) });
+  throw e;
+}
-> **Client-side tools you run yourself** only show their output if you wrap them with
-> `traceTool` — the SDK sees the model's *request* but never your function's return
-> value. Server-side tools (OpenAI `web_search`, …) and LangChain tools are captured
-> automatically.
+await evalkit.flush();   // force-flush before process exit
+```
 ## Offline evaluation
+Deterministic, local scoring — runs synchronously, pushed as an `eval_result` span.
 ```ts
 import { evaluate } from "syntropylabs-evalkit";
-// Deterministic, local scoring — runs synchronously, pushed as an eval_result span
 const { scores } = evaluate({
   output: "The answer is AWS and Azure.",
   expectedTools: ["search", "summarize"],
   toolCalls: [{ name: "search" }, { name: "summarize" }],
   constraints: { requiredTerms: ["AWS", "Azure"] },
 });
+// → { tool_trajectory: 1, tool_f1: 1, tool_correctness: 1, response_match: 1, constraint_compliance: 1 }
 ```
-## Scenario generation & simulation
+## Scenario simulation
+Generate synthetic-user scenarios from your agent's prompt and tools, replay each one
+against your real agent, then grade the run with LLM-as-judge evaluators.
 ```ts
 import evalkit from "syntropylabs-evalkit";
-const scenarios = await evalkit.generateScenarios({ /* ... */ });
-const { simulationId, results } = await evalkit.simulateUser({ /* ... */ });
+evalkit.init({ subscriptionKey: process.env.EVALKIT_SUBSCRIPTION_KEY!, serviceName: "support-bot" });
+// 1 — generate scenarios (bring your own key for the generation call)
+const scenarios = await evalkit.generateScenarios({
+  agentInstructions: SYSTEM_PROMPT,
+  tools: ["search_kb", "lookup_order", "create_ticket"],
+  count: 5,
+  provider: "anthropic",
+  apiKey: process.env.ANTHROPIC_API_KEY,
+});
+// 2 — replay each scenario against your real agent
+const { simulationId, runId, results } = await evalkit.simulateUser({
+  scenarios,
+  entrypoint: async (ctx) => {
+    const { text, toolCalls } = await runAgent(ctx.sessionId, ctx.message);
+    return { text, toolCalls };
+  },
+  tags: ["ci"],
+});
+// 3 — grade the run against an evaluator collection (LLM-as-judge, BYOK)
+const result = await evalkit.evaluateSimulation({
+  simulationId,
+  collectionId: "665f0c...",          // Dashboard → Evaluators → Collections
+  provider: "openai",
+  model: "gpt-4o",
+  apiKey: process.env.OPENAI_API_KEY!,
+  maxTokens: 1024,                    // optional judge output cap
+});
+console.log(result.aggregate);        // { averageScore, passRate, ... }
+for (const scn of result.scenarios) {
+  console.log(scn.name, scn.overallScore, scn.passed);
+  for (const m of scn.metrics) console.log("  -", m.ruleName, m.score, m.reason);
+}
 ```
+`evaluateSimulation` returns per-scenario, per-criterion scores with reasons, which
+also appear in the Tracing dashboard.
 ## Configuration
-| Option             | Description                                                   |
-| ------------------ | ------------------------------------------------------------- |
-| `subscriptionKey`  | EvalKit trace-project subscription key (**required**).        |
-| `serviceName`      | Logical service name attached to every trace.                 |
-| `environment`      | `"development"` \| `"staging"` \| `"production"`.             |
-| `baseUrl`          | Override the trace ingest endpoint (defaults to hosted).      |
-| `apiUrl`           | Override the control-plane endpoint (scenario generation).    |
-| `maxBodyBytes`     | Max captured HTTP request/response body size (default 10 MB). |
+| Option            | Description                                                    |
+| ----------------- | ------------------------------------------------------------- |
+| `subscriptionKey` | Trace-project subscription key (**required**).                |
+| `serviceName`     | Logical service name attached to every trace.                 |
+| `environment`     | `"development"` \| `"staging"` \| `"production"`.             |
+| `baseUrl`         | Override the trace ingest endpoint (defaults to hosted).      |
+| `apiUrl`          | Override the control-plane endpoint (scenario / simulation).  |
+| `maxBodyBytes`    | Max captured HTTP body size (default 10 MB).                  |
+See the exported `EvalKitOptions` type for the full set (`appVersion`, `deviceId`,
+`debug`, batch tuning).
+## Links
-See the exported `EvalKitOptions` type for the full set (`appVersion`, `deviceId`, `debug`, batch tuning).
+- Website: https://syntropylabs.ai
+- Documentation: https://syntropylabs.ai/docs
 ## License
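
The `aggregate` object logged in the scenario-simulation example above could be used to gate a CI job. The following is a minimal sketch, not part of the SDK: `gateOnAggregate` and `GateThresholds` are hypothetical names, and the diff does not state the scale of `averageScore` or `passRate`, so the thresholds are assumed to be given on whatever scale the backend reports.

```typescript
// Shape of `result.aggregate`, limited to the two fields the README comment names.
interface Aggregate {
  averageScore: number;
  passRate: number;
}

// Hypothetical thresholds (assumption: same scale as the reported values).
interface GateThresholds {
  minAverageScore: number;
  minPassRate: number;
}

// Returns one message per violated threshold; an empty array means the run passes.
function gateOnAggregate(agg: Aggregate, t: GateThresholds): string[] {
  const failures: string[] = [];
  if (agg.averageScore < t.minAverageScore) {
    failures.push(`averageScore ${agg.averageScore} < ${t.minAverageScore}`);
  }
  if (agg.passRate < t.minPassRate) {
    failures.push(`passRate ${agg.passRate} < ${t.minPassRate}`);
  }
  return failures;
}
```

In a CI script, a non-empty return value would typically be printed and followed by `process.exitCode = 1`.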

package/dist/index.d.mts CHANGED

@@ -146,6 +146,9 @@ interface GenerateScenariosOptions {
     provider?: string;
     apiKey?: string;
     temperature?: number;
+    reasoningEffort?: string;
+    maxCompletionTokens?: number;
+    maxTokens?: number;
     apiUrl?: string;
 }
 interface SimulateUserOptions {
@@ -159,6 +162,53 @@ interface SimulateUserOptions {
     provider?: string;
     apiKey?: string;
 }
+interface EvaluateSimulationOptions {
+    /** The simulationId returned by simulateUser. */
+    simulationId: string;
+    /** Mongo id of the evaluator collection whose rules to run. */
+    collectionId: string;
+    /** Judge provider, e.g. "openai", "anthropic", "google". */
+    provider: string;
+    /** Judge model id, e.g. "gpt-4o". */
+    model: string;
+    /** BYOK judge key — used for the judge call only, never stored. */
+    apiKey: string;
+    /** A specific run of the simulation. Defaults to the most recent run. */
+    runId?: string;
+    /** Judge output token cap (defaults to the backend's value). */
+    maxTokens?: number;
+    /** Override the control-plane URL (default: the value from init). */
+    apiUrl?: string;
+}
+interface SimEvalMetric {
+    ruleId: string;
+    ruleName: string;
+    category: string;
+    score: number;
+    passed: boolean;
+    reason: string;
+}
+interface SimEvalScenario {
+    scenarioId: string;
+    name: string;
+    traceId: string;
+    status: string;
+    overallScore: number;
+    passed: boolean;
+    metrics: SimEvalMetric[];
+    error?: string;
+}
+interface EvaluateSimulationResult {
+    simulationId: string;
+    runId: string;
+    scenarios: SimEvalScenario[];
+    aggregate: {
+        averageScore: number;
+        passRate: number;
+        totalScenarios: number;
+        evaluatedScenarios: number;
+    };
+}
 interface HapiPluginOptions {
     name?: string | ((request: any) => string);
@@ -366,11 +416,13 @@ declare function simulateUser(opts: SimulateUserOptions): Promise<{
     runId: string;
     results: any[];
 }>;
+declare function evaluateSimulation(opts: EvaluateSimulationOptions): Promise<EvaluateSimulationResult>;
 declare const _default: {
     init: typeof init;
     evaluate: typeof evaluate;
     generateScenarios: typeof generateScenarios;
     simulateUser: typeof simulateUser;
+    evaluateSimulation: typeof evaluateSimulation;
     patchOpenAIClient: typeof patchOpenAIClient;
     patchAnthropicClient: typeof patchAnthropicClient;
     patchBedrockClient: typeof patchBedrockClient;
@@ -405,4 +457,4 @@ declare const _default: {
     flush: typeof flush;
 };
-export { type AgentTurnResult, EvalKitClient, EvalKitInterceptor, type EvalKitOptions, type ExpressMiddlewareOptions, type FastifyPluginOptions, type GenerateScenariosOptions, type HapiPluginOptions, type HonoMiddlewareOptions, type KoaMiddlewareOptions, type OfflineEvalInput, type OfflineEvalResult, type OfflineMetric, type SimContext, type SimulateUserOptions, type SpanEvent, type TraceEnvelope, TraceMethod, Traced, createNestjsInterceptor, currentTraceId, _default as default, enableNestjsAutoTrace, evaluate, expressMiddleware, fastifyPlugin, flush, generateScenarios, hapiPlugin, honoMiddleware, init, koaMiddleware, langchainHandler, patchAnthropicClient, patchAnthropicVertexClient, patchAxiosClient, patchBedrockClient, patchCohereClient, patchGoogleAIModel, patchGoogleGenAIModels, patchMongooseClient, patchMysql2Client, patchOpenAIClient, patchPgClient, patchRedisClient, patchVertexGenerativeModel, simulateUser, startHttpTrace, startSpan, startTrace, traceFunction, traceObject, traceTool, withTrace };
+export { type AgentTurnResult, EvalKitClient, EvalKitInterceptor, type EvalKitOptions, type EvaluateSimulationOptions, type EvaluateSimulationResult, type ExpressMiddlewareOptions, type FastifyPluginOptions, type GenerateScenariosOptions, type HapiPluginOptions, type HonoMiddlewareOptions, type KoaMiddlewareOptions, type OfflineEvalInput, type OfflineEvalResult, type OfflineMetric, type SimContext, type SimEvalMetric, type SimEvalScenario, type SimulateUserOptions, type SpanEvent, type TraceEnvelope, TraceMethod, Traced, createNestjsInterceptor, currentTraceId, _default as default, enableNestjsAutoTrace, evaluate, evaluateSimulation, expressMiddleware, fastifyPlugin, flush, generateScenarios, hapiPlugin, honoMiddleware, init, koaMiddleware, langchainHandler, patchAnthropicClient, patchAnthropicVertexClient, patchAxiosClient, patchBedrockClient, patchCohereClient, patchGoogleAIModel, patchGoogleGenAIModels, patchMongooseClient, patchMysql2Client, patchOpenAIClient, patchPgClient, patchRedisClient, patchVertexGenerativeModel, simulateUser, startHttpTrace, startSpan, startTrace, traceFunction, traceObject, traceTool, withTrace };

package/dist/index.d.ts CHANGED

@@ -146,6 +146,9 @@ interface GenerateScenariosOptions {
     provider?: string;
     apiKey?: string;
     temperature?: number;
+    reasoningEffort?: string;
+    maxCompletionTokens?: number;
+    maxTokens?: number;
     apiUrl?: string;
 }
 interface SimulateUserOptions {
@@ -159,6 +162,53 @@ interface SimulateUserOptions {
     provider?: string;
     apiKey?: string;
 }
+interface EvaluateSimulationOptions {
+    /** The simulationId returned by simulateUser. */
+    simulationId: string;
+    /** Mongo id of the evaluator collection whose rules to run. */
+    collectionId: string;
+    /** Judge provider, e.g. "openai", "anthropic", "google". */
+    provider: string;
+    /** Judge model id, e.g. "gpt-4o". */
+    model: string;
+    /** BYOK judge key — used for the judge call only, never stored. */
+    apiKey: string;
+    /** A specific run of the simulation. Defaults to the most recent run. */
+    runId?: string;
+    /** Judge output token cap (defaults to the backend's value). */
+    maxTokens?: number;
+    /** Override the control-plane URL (default: the value from init). */
+    apiUrl?: string;
+}
+interface SimEvalMetric {
+    ruleId: string;
+    ruleName: string;
+    category: string;
+    score: number;
+    passed: boolean;
+    reason: string;
+}
+interface SimEvalScenario {
+    scenarioId: string;
+    name: string;
+    traceId: string;
+    status: string;
+    overallScore: number;
+    passed: boolean;
+    metrics: SimEvalMetric[];
+    error?: string;
+}
+interface EvaluateSimulationResult {
+    simulationId: string;
+    runId: string;
+    scenarios: SimEvalScenario[];
+    aggregate: {
+        averageScore: number;
+        passRate: number;
+        totalScenarios: number;
+        evaluatedScenarios: number;
+    };
+}
 interface HapiPluginOptions {
     name?: string | ((request: any) => string);
@@ -366,11 +416,13 @@ declare function simulateUser(opts: SimulateUserOptions): Promise<{
     runId: string;
     results: any[];
 }>;
+declare function evaluateSimulation(opts: EvaluateSimulationOptions): Promise<EvaluateSimulationResult>;
 declare const _default: {
     init: typeof init;
     evaluate: typeof evaluate;
     generateScenarios: typeof generateScenarios;
     simulateUser: typeof simulateUser;
+    evaluateSimulation: typeof evaluateSimulation;
     patchOpenAIClient: typeof patchOpenAIClient;
     patchAnthropicClient: typeof patchAnthropicClient;
     patchBedrockClient: typeof patchBedrockClient;
@@ -405,4 +457,4 @@ declare const _default: {
     flush: typeof flush;
 };
-export { type AgentTurnResult, EvalKitClient, EvalKitInterceptor, type EvalKitOptions, type ExpressMiddlewareOptions, type FastifyPluginOptions, type GenerateScenariosOptions, type HapiPluginOptions, type HonoMiddlewareOptions, type KoaMiddlewareOptions, type OfflineEvalInput, type OfflineEvalResult, type OfflineMetric, type SimContext, type SimulateUserOptions, type SpanEvent, type TraceEnvelope, TraceMethod, Traced, createNestjsInterceptor, currentTraceId, _default as default, enableNestjsAutoTrace, evaluate, expressMiddleware, fastifyPlugin, flush, generateScenarios, hapiPlugin, honoMiddleware, init, koaMiddleware, langchainHandler, patchAnthropicClient, patchAnthropicVertexClient, patchAxiosClient, patchBedrockClient, patchCohereClient, patchGoogleAIModel, patchGoogleGenAIModels, patchMongooseClient, patchMysql2Client, patchOpenAIClient, patchPgClient, patchRedisClient, patchVertexGenerativeModel, simulateUser, startHttpTrace, startSpan, startTrace, traceFunction, traceObject, traceTool, withTrace };
+export { type AgentTurnResult, EvalKitClient, EvalKitInterceptor, type EvalKitOptions, type EvaluateSimulationOptions, type EvaluateSimulationResult, type ExpressMiddlewareOptions, type FastifyPluginOptions, type GenerateScenariosOptions, type HapiPluginOptions, type HonoMiddlewareOptions, type KoaMiddlewareOptions, type OfflineEvalInput, type OfflineEvalResult, type OfflineMetric, type SimContext, type SimEvalMetric, type SimEvalScenario, type SimulateUserOptions, type SpanEvent, type TraceEnvelope, TraceMethod, Traced, createNestjsInterceptor, currentTraceId, _default as default, enableNestjsAutoTrace, evaluate, evaluateSimulation, expressMiddleware, fastifyPlugin, flush, generateScenarios, hapiPlugin, honoMiddleware, init, koaMiddleware, langchainHandler, patchAnthropicClient, patchAnthropicVertexClient, patchAxiosClient, patchBedrockClient, patchCohereClient, patchGoogleAIModel, patchGoogleGenAIModels, patchMongooseClient, patchMysql2Client, patchOpenAIClient, patchPgClient, patchRedisClient, patchVertexGenerativeModel, simulateUser, startHttpTrace, startSpan, startTrace, traceFunction, traceObject, traceTool, withTrace };
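
The result types added in both declaration files nest `SimEvalMetric` entries under each `SimEvalScenario`. As a rough illustration of how that shape might be consumed, the sketch below flattens a result into the list of failed checks. The two interfaces are copied locally from the declarations above so the block compiles on its own; `FailedCheck` and `failingMetrics` are hypothetical names, not SDK exports.

```typescript
// Local copies of the declared shapes (see the hunks above).
interface SimEvalMetric {
  ruleId: string;
  ruleName: string;
  category: string;
  score: number;
  passed: boolean;
  reason: string;
}

interface SimEvalScenario {
  scenarioId: string;
  name: string;
  traceId: string;
  status: string;
  overallScore: number;
  passed: boolean;
  metrics: SimEvalMetric[];
  error?: string;
}

// Hypothetical flattened record: one entry per metric that did not pass.
interface FailedCheck {
  scenario: string;
  rule: string;
  score: number;
  reason: string;
}

// Walks every scenario and collects the metrics whose `passed` flag is false.
function failingMetrics(scenarios: SimEvalScenario[]): FailedCheck[] {
  const failed: FailedCheck[] = [];
  for (const scn of scenarios) {
    for (const m of scn.metrics) {
      if (!m.passed) {
        failed.push({ scenario: scn.name, rule: m.ruleName, score: m.score, reason: m.reason });
      }
    }
  }
  return failed;
}
```

With the real SDK, the argument would presumably be the `scenarios` array of the value returned by `evaluateSimulation`.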