@evalgate/sdk 2.2.3 → 2.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (50)
  1. package/CHANGELOG.md +31 -0
  2. package/README.md +39 -2
  3. package/dist/assertions.d.ts +186 -6
  4. package/dist/assertions.js +515 -61
  5. package/dist/batch.js +4 -4
  6. package/dist/cache.d.ts +4 -0
  7. package/dist/cache.js +4 -0
  8. package/dist/cli/baseline.d.ts +14 -0
  9. package/dist/cli/baseline.js +43 -3
  10. package/dist/cli/check.d.ts +5 -2
  11. package/dist/cli/check.js +20 -12
  12. package/dist/cli/compare.d.ts +80 -0
  13. package/dist/cli/compare.js +266 -0
  14. package/dist/cli/index.js +244 -101
  15. package/dist/cli/regression-gate.js +23 -0
  16. package/dist/cli/run.js +22 -0
  17. package/dist/cli/start.d.ts +26 -0
  18. package/dist/cli/start.js +130 -0
  19. package/dist/cli/templates.d.ts +24 -0
  20. package/dist/cli/templates.js +314 -0
  21. package/dist/cli/traces.d.ts +109 -0
  22. package/dist/cli/traces.js +152 -0
  23. package/dist/cli/validate.d.ts +37 -0
  24. package/dist/cli/validate.js +252 -0
  25. package/dist/cli/watch.d.ts +19 -0
  26. package/dist/cli/watch.js +175 -0
  27. package/dist/client.js +6 -13
  28. package/dist/constants.d.ts +2 -0
  29. package/dist/constants.js +5 -0
  30. package/dist/index.d.ts +8 -6
  31. package/dist/index.js +26 -6
  32. package/dist/integrations/openai.js +83 -60
  33. package/dist/logger.d.ts +3 -1
  34. package/dist/logger.js +2 -1
  35. package/dist/otel.d.ts +130 -0
  36. package/dist/otel.js +309 -0
  37. package/dist/runtime/eval.d.ts +14 -4
  38. package/dist/runtime/eval.js +127 -2
  39. package/dist/runtime/registry.d.ts +4 -2
  40. package/dist/runtime/registry.js +11 -3
  41. package/dist/runtime/run-report.d.ts +1 -1
  42. package/dist/runtime/run-report.js +7 -4
  43. package/dist/runtime/types.d.ts +38 -0
  44. package/dist/testing.d.ts +8 -0
  45. package/dist/testing.js +45 -10
  46. package/dist/version.d.ts +2 -2
  47. package/dist/version.js +2 -2
  48. package/dist/workflows.d.ts +2 -0
  49. package/dist/workflows.js +184 -102
  50. package/package.json +124 -117
package/CHANGELOG.md CHANGED
@@ -5,8 +5,39 @@ All notable changes to the @evalgate/sdk package will be documented in this file
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+ ## [2.3.0] - 2026-03-04
+
+ ### Breaking
+
+ - **`hasConsistency` / `hasConsistencyAsync` return `{ score, passed }` instead of `{ score, consistent }`** — aligns with every other assertion in the SDK that returns a `passed` field. If you were destructuring `consistent`, rename it to `passed`:
+ ```ts
+ // Before:
+ const { score, consistent } = hasConsistency(outputs);
+ // After:
+ const { score, passed } = hasConsistency(outputs);
+ ```
+ - **`respondedWithinDuration` / `respondedWithinTimeSince` return `AssertionResult` instead of `boolean`** — these now return `{ name, passed, expected, actual, message }` like all other assertions, enabling uniform pipeline usage and failure messages. The deprecated `respondedWithinTime` alias also returns `AssertionResult`.
+ ```ts
+ // Before:
+ const ok = respondedWithinDuration(250, 500); // boolean
+ // After:
+ const { passed } = respondedWithinDuration(250, 500); // AssertionResult
+ ```
+
+ ### Added
+
+ - **`computeBaselineChecksum` / `verifyBaselineChecksum` in main barrel** — previously only reachable via the `@evalgate/sdk/cli/baseline` subpath. Now importable directly from `@evalgate/sdk`.
+ - **`resetSentimentDeprecationWarning` in main barrel** — the one-time deprecation reset utility for `hasSentimentAsync` is now importable from the main entry point, making it easier to test deprecation behavior. The `SentimentAsyncResult` type was already exported.
+
+ ---
+
  ## [2.2.3] - 2026-03-03
 
+ ### Breaking
+
+ - **`PaginatedIterator` API changed from cursor-based to offset-based** — the constructor signature changed from `(cursor) => { items, nextCursor, hasMore }` to `(offset, limit) => { data, hasMore }`. If you were using `PaginatedIterator` directly with a cursor-based fetcher, update your callback to accept `(offset: number, limit: number)` and return `{ data: T[], hasMore: boolean }`. The `autoPaginate` and `autoPaginateGenerator` helpers also use the new offset-based signature. Cursor encoding/decoding utilities (`encodeCursor`, `decodeCursor`) remain available for server-side cursor generation.
+ - **`RequestCache` removed from public exports** — `RequestCache` was an internal HTTP cache with a method-specific API (`set(method, url, data, ttl, params)`) that did not match general-purpose cache expectations. It is no longer exported from the package entry point. If you were importing it directly, use your own cache implementation or rely on the SDK's built-in automatic caching. `CacheTTL` constants remain exported for advanced configuration.
+
  ### Fixed
 
  - **`RequestCache.set` missing default TTL** — entries stored without an explicit TTL were immediately stale on next read. Default is now `CacheTTL.MEDIUM`; callers that omit `ttl` get a live cache entry instead of a cache miss every time.
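The offset-based contract in the `PaginatedIterator` entry above can be sketched as follows. Everything except the documented `(offset, limit) => { data, hasMore }` shape — the fetcher name, sample data, and the drain loop — is hypothetical illustration, not SDK code:

```ts
// Page shape per the changelog: { data: T[], hasMore: boolean }.
type Page<T> = { data: T[]; hasMore: boolean };

const items = ["a", "b", "c", "d", "e"]; // stand-in data source

// New-style fetcher: receives an offset and a page size.
async function fetchPage(offset: number, limit: number): Promise<Page<string>> {
  const data = items.slice(offset, offset + limit);
  return { data, hasMore: offset + limit < items.length };
}

// Minimal drain loop, presumably similar to what autoPaginate does internally.
async function drain(limit: number): Promise<string[]> {
  const out: string[] = [];
  let offset = 0;
  for (;;) {
    const { data, hasMore } = await fetchPage(offset, limit);
    out.push(...data);
    if (!hasMore) break;
    offset += limit;
  }
  return out;
}
```

Offset-based fetchers are trivially resumable; the retained `encodeCursor`/`decodeCursor` helpers matter only for servers that still hand out opaque cursors.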
package/README.md CHANGED
@@ -3,7 +3,7 @@
  [![npm version](https://img.shields.io/npm/v/@evalgate/sdk.svg)](https://www.npmjs.com/package/@evalgate/sdk)
  [![npm downloads](https://img.shields.io/npm/dm/@evalgate/sdk.svg)](https://www.npmjs.com/package/@evalgate/sdk)
  [![TypeScript](https://img.shields.io/badge/TypeScript-strict-blue.svg)](https://www.typescriptlang.org/)
- [![SDK Tests](https://img.shields.io/badge/tests-159%20passed-brightgreen.svg)](#)
+ [![SDK Tests](https://img.shields.io/badge/tests-541%20passed-brightgreen.svg)](#)
  [![Contract Version](https://img.shields.io/badge/report%20schema-v1-blue.svg)](#)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
@@ -157,6 +157,42 @@ Every failure prints a clear next step:
  | `npx evalgate diff --base last --head last` | Compare last two runs |
  | `npx evalgate diff --format github` | GitHub Step Summary with regressions |
 
+ ### Compare — Side-by-Side Result Diff
+
+ **Important:** `evalgate compare` compares **result files**, not models.
+ You run each model/config separately (via `evalgate run --write-results`),
+ then compare the saved JSON artifacts. Nothing is re-executed.
+
+ ```bash
+ # The primary interface — two result files:
+ evalgate compare --base .evalgate/runs/gpt4o-run.json --head .evalgate/runs/claude-run.json
+
+ # Optional labels for the output table (cosmetic, not identifiers):
+ evalgate compare --base gpt4o.json --head claude.json --labels "GPT-4o" "Claude 3.5"
+
+ # N-way compare (3+ files):
+ evalgate compare --runs run-a.json run-b.json run-c.json
+
+ # Machine-readable:
+ evalgate compare --base a.json --head b.json --format json
+ ```
+
+ | Command | Description |
+ |---------|-------------|
+ | `evalgate compare --base <file> --head <file>` | Compare two run result JSON files |
+ | `evalgate compare --runs <f1> <f2> [f3...]` | N-way comparison across multiple runs |
+ | `--labels <l1> <l2>` | Optional human-readable labels for output |
+ | `--sort-by <key>` | Sort specs by: `name` (default), `score`, `duration` |
+ | `--format json` | Machine-readable JSON output |
+
+ **Workflow:**
+ ```
+ evalgate run --write-results # saves .evalgate/runs/run-<id>.json
+ # change model/config/prompt
+ evalgate run --write-results # saves another run file
+ evalgate compare --base <first>.json --head <second>.json
+ ```
+
  ### Legacy Regression Gate (local, no account needed)
 
  | Command | Description |
@@ -389,7 +425,8 @@ console.log(hasNoToxicity("Have a great day!")); // true
  console.log(hasValidCodeSyntax("function f() {}", "js")); // true
 
  // Async — LLM-backed, context-aware
- console.log(await hasSentimentAsync("subtle irony...", "negative")); // true
+ const { matches, confidence } = await hasSentimentAsync("subtle irony...", "negative");
+ console.log(matches, confidence); // true, 0.85
  console.log(await hasNoToxicityAsync("sarcastic attack text")); // false
  ```
 
package/dist/assertions.d.ts CHANGED
@@ -132,8 +132,16 @@ export declare class Expectation {
   */
  toContainCode(language?: string, message?: string): AssertionResult;
  /**
- * Assert value is professional tone (no profanity)
- * @example expect(output).toBeProfessional()
+ * Blocklist check for 7 common profane words. Does NOT analyze tone,
+ * formality, or professional communication quality. For actual tone
+ * analysis, use an LLM-backed assertion.
+ * @see hasSentimentAsync for LLM-based tone checking
+ * @example expect(output).toHaveNoProfanity()
+ */
+ toHaveNoProfanity(message?: string): AssertionResult;
+ /**
+ * @deprecated Use {@link toHaveNoProfanity} instead. This method only
+ * checks for 7 profane words — it does not analyze professional tone.
  */
  toBeProfessional(message?: string): AssertionResult;
  /**
@@ -200,7 +208,63 @@ export declare function hasPII(text: string): boolean;
  * {@link hasSentimentAsync} with an LLM provider for context-aware accuracy.
  */
  export declare function hasSentiment(text: string, expected: "positive" | "negative" | "neutral"): boolean;
+ /**
+ * Lexicon-based sentiment check with confidence score.
+ * Returns the detected sentiment, a confidence score (0–1), and whether
+ * it matches the expected sentiment.
+ *
+ * Confidence is derived from the magnitude of the word-count difference
+ * relative to the total sentiment-bearing words found.
+ *
+ * @example
+ * ```ts
+ * const { sentiment, confidence, matches } = hasSentimentWithScore(
+ *   "This product is absolutely amazing and wonderful!",
+ *   "positive",
+ * );
+ * // sentiment: "positive", confidence: ~0.9, matches: true
+ * ```
+ */
+ export declare function hasSentimentWithScore(text: string, expected: "positive" | "negative" | "neutral"): {
+ sentiment: "positive" | "negative" | "neutral";
+ confidence: number;
+ matches: boolean;
+ };
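The confidence heuristic that the `hasSentimentWithScore` JSDoc above describes (count difference over total sentiment-bearing words) can be sketched like this. The tiny lexicon is invented for illustration and is far smaller than whatever the SDK actually ships:

```ts
// Hypothetical mini-lexicon -- for illustration only.
const POSITIVE = new Set(["amazing", "wonderful", "great", "good"]);
const NEGATIVE = new Set(["terrible", "awful", "bad", "broken"]);

function sentimentWithScore(text: string) {
  const words = text.toLowerCase().match(/\w+/g) ?? [];
  const pos = words.filter((w) => POSITIVE.has(w)).length;
  const neg = words.filter((w) => NEGATIVE.has(w)).length;
  const total = pos + neg;
  const sentiment = pos > neg ? "positive" : neg > pos ? "negative" : "neutral";
  // Confidence: magnitude of the count difference over total sentiment words found.
  const confidence = total === 0 ? 0 : Math.abs(pos - neg) / total;
  return { sentiment, confidence };
}
```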
  export declare function similarTo(text1: string, text2: string, threshold?: number): boolean;
+ /**
+ * Measure consistency across multiple outputs for the same input.
+ * **Fast and approximate** — uses word-overlap (Jaccard) across all pairs.
+ * Returns a score from 0 (completely inconsistent) to 1 (identical).
+ *
+ * @param outputs - Array of LLM outputs to compare (minimum 2)
+ * @param threshold - Minimum score for `passed` to be true (default 0.7)
+ * @returns `{ score, passed }` where `passed` is `score >= threshold`
+ *
+ * @example
+ * ```ts
+ * const { score, passed } = hasConsistency([
+ *   "The capital of France is Paris.",
+ *   "Paris is the capital of France.",
+ *   "France's capital city is Paris.",
+ * ]);
+ * // score ≈ 0.6–0.8; passed = true when score >= the default 0.7
+ * ```
+ */
+ export declare function hasConsistency(outputs: string[], threshold?: number): {
+ score: number;
+ passed: boolean;
+ };
+ /**
+ * LLM-backed consistency check. **Slow and accurate** — asks the LLM to
+ * judge whether multiple outputs convey the same meaning, catching
+ * paraphrased contradictions that word-overlap misses.
+ *
+ * @returns `{ score, passed }` where `score` ranges from 0 to 1 and 1 = perfectly consistent.
+ */
+ export declare function hasConsistencyAsync(outputs: string[], config?: AssertionLLMConfig): Promise<{
+ score: number;
+ passed: boolean;
+ }>;
  export declare function withinRange(value: number, min: number, max: number): boolean;
  export declare function isValidEmail(email: string): boolean;
  export declare function isValidURL(url: string): boolean;
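The pairwise word-overlap scoring that `hasConsistency` documents can be sketched as follows. This illustrates the Jaccard-over-all-pairs idea only; it is not the SDK's actual implementation, and the tokenizer is an assumption:

```ts
// Jaccard similarity of two word sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: string, b: string): number {
  const A = new Set(a.toLowerCase().match(/\w+/g) ?? []);
  const B = new Set(b.toLowerCase().match(/\w+/g) ?? []);
  const inter = [...A].filter((w) => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return union === 0 ? 1 : inter / union;
}

// Average Jaccard similarity across all output pairs -- the "score".
function consistencyScore(outputs: string[]): number {
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      sum += jaccard(outputs[i], outputs[j]);
      pairs++;
    }
  }
  return pairs === 0 ? 1 : sum / pairs;
}
```

This is why the docs call it "fast and approximate": paraphrases share few exact words, so word overlap under-scores them, which is exactly the gap `hasConsistencyAsync` closes.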
@@ -229,7 +293,24 @@ export declare function containsLanguage(text: string, language: string): boolea
  * paraphrasing. Use {@link hasFactualAccuracyAsync} for semantic accuracy.
  */
  export declare function hasFactualAccuracy(text: string, facts: string[]): boolean;
- export declare function respondedWithinTime(startTime: number, maxMs: number): boolean;
+ /**
+ * Check if a measured duration is within the allowed limit.
+ * @param durationMs - The actual elapsed time in milliseconds
+ * @param maxMs - Maximum allowed duration in milliseconds
+ */
+ export declare function respondedWithinDuration(durationMs: number, maxMs: number): AssertionResult;
+ /**
+ * Check if elapsed time since a start timestamp is within the allowed limit.
+ * @param startTime - Timestamp from Date.now() captured before the operation
+ * @param maxMs - Maximum allowed duration in milliseconds
+ */
+ export declare function respondedWithinTimeSince(startTime: number, maxMs: number): AssertionResult;
+ /**
+ * @deprecated Use {@link respondedWithinDuration} (takes measured duration)
+ * or {@link respondedWithinTimeSince} (takes start timestamp) instead.
+ * This function takes a start timestamp, not a duration — the name is misleading.
+ */
+ export declare function respondedWithinTime(startTime: number, maxMs: number): AssertionResult;
  /**
  * Blocklist-based toxicity check (~80 terms across 9 categories).
  * **Fast and approximate** — catches explicit harmful language but has
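To make the `respondedWithinDuration` / `respondedWithinTimeSince` split concrete, here is a minimal stand-in that follows the documented `AssertionResult` shape (`{ name, passed, expected, actual, message }`). The field contents are assumptions; import the real functions from `@evalgate/sdk` rather than copying this:

```ts
type AssertionResult = {
  name: string;
  passed: boolean;
  expected: string;
  actual: string;
  message: string;
};

// Variant 1: the caller measured the duration themselves.
function respondedWithinDuration(durationMs: number, maxMs: number): AssertionResult {
  const passed = durationMs <= maxMs;
  return {
    name: "respondedWithinDuration",
    passed,
    expected: `<= ${maxMs}ms`,
    actual: `${durationMs}ms`,
    message: passed ? "" : `took ${durationMs}ms, limit was ${maxMs}ms`,
  };
}

// Variant 2: the caller passes the Date.now() timestamp captured before the call.
function respondedWithinTimeSince(startTime: number, maxMs: number): AssertionResult {
  return respondedWithinDuration(Date.now() - startTime, maxMs);
}
```

The two names remove the ambiguity the deprecated `respondedWithinTime` had: one takes an elapsed duration, the other a start timestamp.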
@@ -244,17 +325,66 @@ export interface AssertionLLMConfig {
  provider: "openai" | "anthropic";
  apiKey: string;
  model?: string;
+ /** Embedding model for toSemanticallyContain (default: text-embedding-3-small). OpenAI only. */
+ embeddingModel?: string;
  baseUrl?: string;
+ /** Maximum time in ms to wait for an LLM response. Default: 30000 (30s). */
+ timeoutMs?: number;
  }
  export declare function configureAssertions(config: AssertionLLMConfig): void;
  export declare function getAssertionConfig(): AssertionLLMConfig | null;
+ /**
+ * Result object from {@link hasSentimentAsync}.
+ *
+ * Implements `Symbol.toPrimitive` so that legacy callers using
+ * `if (await hasSentimentAsync(...))` get the correct `matches` boolean
+ * instead of an always-truthy object. A one-time deprecation warning is
+ * emitted when boolean coercion is detected.
+ *
+ * **Migration:** Destructure the result instead of using it as a boolean.
+ * ```ts
+ * // ❌ Deprecated (works but warns):
+ * if (await hasSentimentAsync(text, "positive")) { ... }
+ *
+ * // ✅ New pattern:
+ * const { matches } = await hasSentimentAsync(text, "positive");
+ * if (matches) { ... }
+ * ```
+ */
+ export interface SentimentAsyncResult {
+ sentiment: "positive" | "negative" | "neutral";
+ confidence: number;
+ matches: boolean;
+ [Symbol.toPrimitive]: (hint: string) => boolean | number | string;
+ }
+ /** @internal Reset the one-time deprecation flag. For testing only. */
+ export declare function resetSentimentDeprecationWarning(): void;
  /**
  * LLM-backed sentiment check. **Slow and accurate** — uses an LLM to
- * classify sentiment with full context awareness. Requires
- * {@link configureAssertions} or an inline `config` argument.
+ * classify sentiment with full context awareness and return a confidence score.
+ * Requires {@link configureAssertions} or an inline `config` argument.
  * Falls back gracefully with a clear error if no API key is configured.
+ *
+ * Returns `{ sentiment, confidence, matches }` — the async layer now provides
+ * the same rich return shape as {@link hasSentimentWithScore}, but powered by
+ * an LLM instead of keyword counting. The `confidence` field is the LLM's
+ * self-reported confidence (0–1), not a lexical heuristic.
+ *
+ * The returned object implements `Symbol.toPrimitive` so that legacy code
+ * using `if (await hasSentimentAsync(...))` still works correctly (coerces
+ * to `matches`), but a deprecation warning is emitted. Migrate to
+ * destructuring: `const { matches } = await hasSentimentAsync(...)`.
+ *
+ * @example
+ * ```ts
+ * const { sentiment, confidence, matches } = await hasSentimentAsync(
+ *   "This product is revolutionary but overpriced",
+ *   "negative",
+ * );
+ * // sentiment: "negative", confidence: 0.7, matches: true
+ * ```
  */
- export declare function hasSentimentAsync(text: string, expected: "positive" | "negative" | "neutral", config?: AssertionLLMConfig): Promise<boolean>;
+ export declare function hasSentimentAsync(text: string, expected: "positive" | "negative" | "neutral", config?: AssertionLLMConfig): Promise<SentimentAsyncResult>;
  /**
  * LLM-backed toxicity check. **Slow and accurate** — context-aware, handles
  * sarcasm, implicit threats, and culturally specific harmful content that
@@ -269,4 +399,54 @@ export declare function hasFactualAccuracyAsync(text: string, facts: string[], c
  * claims even when they are paraphrased or contradict facts indirectly.
  */
  export declare function hasNoHallucinationsAsync(text: string, groundTruth: string[], config?: AssertionLLMConfig): Promise<boolean>;
+ /**
+ * Embedding-based semantic containment check. Uses OpenAI embeddings and
+ * cosine similarity to determine whether the text semantically contains
+ * the given concept — no LLM prompt, no "does this text contain X" trick.
+ *
+ * This is **real semantic containment**: embed both strings, compute cosine
+ * similarity, and compare against a threshold. "The city of lights" will
+ * have high similarity to "Paris" because their embeddings are close in
+ * vector space.
+ *
+ * Requires `provider: "openai"` in the config. For Anthropic or other
+ * providers without an embedding API, use {@link toSemanticallyContainLLM}.
+ *
+ * @param text - The text to check
+ * @param phrase - The semantic concept to look for
+ * @param config - LLM config (must be OpenAI with embedding support)
+ * @param threshold - Cosine similarity threshold (default: 0.4). Lower values
+ * are more permissive. Typical ranges: 0.3–0.5 for concept containment,
+ * 0.6–0.8 for paraphrase detection, 0.9+ for near-duplicates.
+ * @returns `{ contains, similarity }` — whether the threshold was met and the raw score
+ *
+ * @example
+ * ```ts
+ * const { contains, similarity } = await toSemanticallyContain(
+ *   "The city of lights is beautiful in spring",
+ *   "Paris",
+ *   { provider: "openai", apiKey: process.env.OPENAI_API_KEY },
+ * );
+ * // contains: true, similarity: ~0.52
+ * ```
+ */
+ export declare function toSemanticallyContain(text: string, phrase: string, config?: AssertionLLMConfig, threshold?: number): Promise<{
+ contains: boolean;
+ similarity: number;
+ }>;
+ /**
+ * LLM-prompt-based semantic containment check. Uses an LLM prompt to ask
+ * whether the text conveys a concept. This is a **fallback** for providers
+ * that don't offer an embedding API (e.g., Anthropic).
+ *
+ * Note: This is functionally similar to `followsInstructions` — the LLM is
+ * being asked to judge containment, not compute vector similarity. For
+ * real embedding-based semantic containment, use {@link toSemanticallyContain}.
+ *
+ * @param text - The text to check
+ * @param phrase - The semantic concept to look for
+ * @param config - Optional LLM config
+ * @returns true if the LLM judges the text contains the concept
+ */
+ export declare function toSemanticallyContainLLM(text: string, phrase: string, config?: AssertionLLMConfig): Promise<boolean>;
  export declare function hasValidCodeSyntax(code: string, language: string): boolean;
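The embedding comparison that `toSemanticallyContain` describes reduces to cosine similarity plus a threshold. A sketch, with short arbitrary vectors standing in for real OpenAI embedding vectors:

```ts
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Threshold check mirroring the documented { contains, similarity } return shape.
function semanticallyContains(
  textVec: number[],
  phraseVec: number[],
  threshold = 0.4, // the documented default
) {
  const similarity = cosineSimilarity(textVec, phraseVec);
  return { contains: similarity >= threshold, similarity };
}
```

In the SDK the two vectors would come from the configured `embeddingModel`; the 0.4 default reflects that concept containment (short phrase vs. long text) produces lower similarities than paraphrase comparison.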