@pauly4010/evalai-sdk 1.5.8 → 1.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -5,6 +5,78 @@ All notable changes to the @pauly4010/evalai-sdk package will be documented in t
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [1.7.0] - 2026-02-25
+
+ ### ✨ Added
+
+ #### CLI — `evalai init` Full Project Scaffolder
+
+ - **`evalai init`** — Zero-to-gate in under 5 minutes:
+   - Detects Node repo + package manager (npm/yarn/pnpm)
+   - Runs existing tests to capture real pass/fail + test count
+   - Creates `evals/baseline.json` with provenance metadata
+   - Installs `.github/workflows/evalai-gate.yml` (package-manager aware)
+   - Creates `evalai.config.json`
+   - Prints copy-paste next steps — just commit and push
+   - Idempotent: skips files that already exist
+
+ #### CLI — `evalai upgrade --full` (Tier 1 → Tier 2)
+
+ - **`evalai upgrade --full`** — Upgrade from the built-in gate to the full gate:
+   - Creates `scripts/regression-gate.ts` (full gate with `--update-baseline`)
+   - Adds `eval:regression-gate` + `eval:baseline-update` npm scripts
+   - Creates `.github/workflows/baseline-governance.yml` (label + diff enforcement)
+   - Upgrades `evalai-gate.yml` to project mode
+   - Adds a `CODEOWNERS` entry for `evals/baseline.json`
+
+ #### Gate Output — Machine-Readable Improvements
+
+ - **`detectRunner()`** — Identifies the test runner from `package.json` scripts: vitest, jest, mocha, node:test, ava, tap, or unknown
+ - **BuiltinReport** now always emits `durationMs`, `command`, `runner`, and `baseline` metadata
+ - Report schema updated with optional `durationMs`, `command`, `runner` properties
+
+ #### Init Scaffolder Integration Tests
+
+ - 4 fixtures: npm+jest, pnpm+vitest, yarn+jest, pnpm monorepo
+ - 25 tests: files created, YAML valid, pm-aware workflow, idempotent runs
+ - All fixtures use `node:test` (zero external deps)
+
+ ### 🔧 Changed
+
+ - CLI help text updated to include the `upgrade` command
+ - Gate report includes runner detection and timing metadata
+ - SDK test count: 147 → 172 tests (12 → 15 contract tests)
+
+ ---
+
+ ## [1.6.0] - 2026-02-24
+
+ ### ✨ Added
+
+ #### CLI — Regression Gate & Baseline Management
+
+ - **`evalai baseline init`** — Create a starter `evals/baseline.json` with sample values and provenance metadata
+ - **`evalai baseline update`** — Run confidence tests, golden eval, and latency benchmark, then update the baseline with real scores
+ - **`evalai gate`** — Run the local regression gate with a proper exit-code taxonomy (0=pass, 1=regression, 2=infra_error, 3=confidence_failed, 4=confidence_missing)
+ - **`evalai gate --format json`** — Output `evals/regression-report.json` as machine-readable JSON to stdout
+ - **`evalai gate --format github`** — Output GitHub Step Summary markdown with a delta table
+
+ #### SDK Exports — Regression Gate Constants & Types
+
+ - **`GATE_EXIT`** — Exit code constants (`PASS`, `REGRESSION`, `INFRA_ERROR`, `CONFIDENCE_FAILED`, `CONFIDENCE_MISSING`)
+ - **`GATE_CATEGORY`** — Report category constants (`pass`, `regression`, `infra_error`)
+ - **`REPORT_SCHEMA_VERSION`** — Current schema version for `regression-report.json`
+ - **`ARTIFACTS`** — Well-known artifact paths (`BASELINE`, `REGRESSION_REPORT`, `CONFIDENCE_SUMMARY`, `LATENCY_BENCHMARK`)
+ - **Types**: `RegressionReport`, `RegressionDelta`, `Baseline`, `BaselineTolerance`, `GateExitCode`, `GateCategory`
+ - **Subpath export**: `@pauly4010/evalai-sdk/regression` for tree-shakeable imports
+
+ ### 🔧 Changed
+
+ - CLI help text updated to include the `baseline` and `gate` commands
+ - The SDK becomes the public contract for the regression gate — scripts are an implementation detail
+
+ ---
+
  ## [1.5.8] - 2026-02-22

  ### 🐛 Fixed
package/README.md CHANGED
@@ -1,22 +1,108 @@
  # @pauly4010/evalai-sdk

- [![npm version](https://img.shields.io/npm/v/@pauly4010/evalai-sdk.svg)](https://www.npmjs.com/package/@pauly4010/evalai-sdk)
+ [![npm version](https://img.shields.io/npm/v/@pauly4010/evalai-sdk.svg)](https://www.npmjs.com/package/@pauly4010/evalai-sdk)
  [![npm downloads](https://img.shields.io/npm/dm/@pauly4010/evalai-sdk.svg)](https://www.npmjs.com/package/@pauly4010/evalai-sdk)

  **Stop LLM regressions in CI in minutes.**

- Evaluate locally in 60 seconds. Add an optional CI gate in 2 minutes.
- No lock-in — remove by deleting `evalai.config.json`.
+ Zero to gate in under 5 minutes. No infra. No lock-in. Remove anytime.

  ---

- # 🚀 1) 60 seconds: Run locally (no account)
+ ## Quick Start (2 minutes)

- Install, run, get a score.
- No EvalAI account. No API key. No dashboard required.
+ ```bash
+ npx @pauly4010/evalai-sdk init
+ git add evals/ .github/workflows/evalai-gate.yml evalai.config.json
+ git commit -m "chore: add EvalAI regression gate"
+ git push
+ ```
+
+ That's it. Open a PR and CI blocks regressions automatically.
+
+ `evalai init` detects your project, creates a baseline from your current tests, and installs a GitHub Actions workflow. No manual config needed.
+
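The generated config is tiny. A sketch of `evalai.config.json`, assuming the single-field shape shown in earlier versions of this README (`evalai init` may write additional fields):

```json
{
  "evaluationId": "42"
}
```

Deleting this file removes EvalAI entirely (see "No Lock-in" below).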
+ ---
+
+ ## What `evalai init` does
+
+ 1. **Detects** your Node repo and package manager (npm/yarn/pnpm)
+ 2. **Runs your tests** to capture a real baseline (pass/fail + test count)
+ 3. **Creates `evals/baseline.json`** with provenance metadata
+ 4. **Installs `.github/workflows/evalai-gate.yml`** (package-manager aware)
+ 5. **Creates `evalai.config.json`**
+ 6. **Prints next steps** — just commit and push
+
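Steps 2 and 3 produce a baseline roughly shaped like the sketch below. Every field name here is illustrative; inspect your generated `evals/baseline.json` for the actual schema:

```json
{
  "tests": { "passed": 42, "failed": 0, "total": 42 },
  "provenance": {
    "createdBy": "evalai init",
    "createdAt": "2026-02-25T12:00:00Z",
    "command": "npm test",
    "runner": "jest"
  }
}
```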
+ ---
+
+ ## CLI Commands
+
+ ### Regression Gate (local, no account needed)
+
+ | Command | Description |
+ |---------|-------------|
+ | `npx evalai init` | Full project scaffolder — creates everything you need |
+ | `npx evalai gate` | Run regression gate locally |
+ | `npx evalai gate --format json` | Machine-readable JSON output |
+ | `npx evalai gate --format github` | GitHub Step Summary with delta table |
+ | `npx evalai baseline init` | Create starter `evals/baseline.json` |
+ | `npx evalai baseline update` | Re-run tests and update baseline with real scores |
+ | `npx evalai upgrade --full` | Upgrade from Tier 1 (built-in) to Tier 2 (full gate) |
+
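The workflow installed by `evalai init` is shaped roughly like the sketch below (illustrative only — the generated file is matched to your package manager and may differ by version):

```yaml
name: EvalAI Gate
on: pull_request

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Fails the job on regression and writes a Step Summary delta table
      - run: npx evalai gate --format github
```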
+ ### API Gate (requires account)
+
+ | Command | Description |
+ |---------|-------------|
+ | `npx evalai check` | Gate on quality score from dashboard |
+ | `npx evalai doctor` | Verify CI/CD setup |
+ | `npx evalai share` | Create share link for a run |
+
+ ### Gate Exit Codes
+
+ | Code | Meaning |
+ |------|---------|
+ | 0 | Pass — no regression |
+ | 1 | Regression detected |
+ | 2 | Infra error (baseline missing, tests crashed) |
+ | 3 | Confidence tests failed |
+ | 4 | Confidence data missing |
+
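In a custom CI wrapper, the gate's exit code can be mapped back to a label. A minimal sketch: the constant values come from the SDK's `GATE_EXIT` export (`PASS: 0, REGRESSION: 1, INFRA_ERROR: 2`), but `describeGateExit` is a hypothetical helper for illustration, not an SDK API:

```typescript
// Values match the SDK's GATE_EXIT export (see "SDK Exports" below).
const GATE_EXIT = { PASS: 0, REGRESSION: 1, INFRA_ERROR: 2 } as const;

// Hypothetical helper: label an exit code returned by `npx evalai gate`.
function describeGateExit(code: number): "pass" | "regression" | "infra_error" | "unknown" {
  switch (code) {
    case GATE_EXIT.PASS:
      return "pass";
    case GATE_EXIT.REGRESSION:
      return "regression";
    case GATE_EXIT.INFRA_ERROR:
      return "infra_error";
    default:
      return "unknown";
  }
}

console.log(describeGateExit(2)); // "infra_error"
```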
+ ### Check Exit Codes (API mode)
+
+ | Code | Meaning |
+ |------|---------|
+ | 0 | Pass |
+ | 1 | Score below threshold |
+ | 2 | Regression failure |
+ | 3 | Policy violation |
+ | 4 | API error |
+ | 5 | Bad arguments |
+ | 6 | Low test count |
+ | 7 | Weak evidence |
+ | 8 | Warn (soft regression) |
+
+ ---
+
+ ## How the Gate Works
+
+ **Built-in mode** (any Node project, no config needed):
+ - Runs `<pm> test`, captures exit code + test count
+ - Compares against `evals/baseline.json`
+ - Writes `evals/regression-report.json`
+ - Fails CI if tests regress
+
+ **Project mode** (advanced, for the full regression gate):
+ - If an `eval:regression-gate` script exists in `package.json`, delegates to it
+ - Supports golden eval scores, confidence tests, p95 latency, cost tracking
+ - Full delta table with tolerances
+
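The written report is machine-readable. A sketch of `evals/regression-report.json` — field names are illustrative except `durationMs`, `command`, and `runner`, which the v1.7.0 changelog names explicitly:

```json
{
  "schemaVersion": 1,
  "category": "pass",
  "exitCode": 0,
  "runner": "vitest",
  "command": "pnpm test",
  "durationMs": 8421,
  "baseline": "evals/baseline.json"
}
```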
+ ---
+
+ ## Run a Regression Test Locally (no account)

  ```bash
  npm install @pauly4010/evalai-sdk openai
+ ```
+
+ ```typescript
  import { openAIChatEval } from "@pauly4010/evalai-sdk";

  await openAIChatEval({
@@ -26,14 +112,14 @@ await openAIChatEval({
  { input: "2 + 2 = ?", expectedOutput: "4" },
  ],
  });
- Set:
+ ```

- OPENAI_API_KEY=...
- ✅ Vitest integration (recommended)
- import {
- openAIChatEval,
- extendExpectWithToPassGate,
- } from "@pauly4010/evalai-sdk";
+ Output: `PASS 2/2 (score: 100)`. No account needed. Just a score.
+
+ ### Vitest Integration
+
+ ```typescript
+ import { openAIChatEval, extendExpectWithToPassGate } from "@pauly4010/evalai-sdk";
  import { expect } from "vitest";

  extendExpectWithToPassGate(expect);
@@ -46,278 +132,113 @@ it("passes gate", async () => {
  { input: "2 + 2 = ?", expectedOutput: "4" },
  ],
  });
-
  expect(result).toPassGate();
  });
- Example output
- PASS 2/2 (score: 100)
-
- Tip: Want dashboards and history?
- Set EVALAI_API_KEY and connect this to the platform.
- Failures show:
-
- FAIL 9/10 (score: 90)
- with failed cases and CI guidance.
-
- ⚡ 2) Optional: Add a CI gate (2 minutes)
- When you're ready to gate PRs on quality and regressions:
-
- npx -y @pauly4010/evalai-sdk@^1 init
- Create an evaluation in the dashboard and paste its ID into:
-
- {
- "evaluationId": "42"
- }
- Add to your CI:
-
- - name: EvalAI gate
-   env:
-     EVALAI_API_KEY: ${{ secrets.EVALAI_API_KEY }}
-   run: npx -y @pauly4010/evalai-sdk@^1 check --format github --onFail import --warnDrop 1
- You’ll get:
-
- GitHub annotations
-
- Step summary
-
- Optional dashboard link
-
- PASS / WARN / FAIL (v1.5.7)
- EvalAI introduces a WARN band so teams can see meaningful regressions without always blocking merges.
-
- Behavior
-
- PASS → within thresholds
-
- WARN → regression > warnDrop but < maxDrop
-
- FAIL → regression > maxDrop
-
- Key flags
-
- --warnDrop → soft regression warning
-
- --maxDrop → hard regression fail
-
- --fail-on-flake → fail if unknown test is unstable
-
- This lets teams tune signal vs noise in CI.
-
- 🔒 3) No lock-in
- To stop using EvalAI:
-
- rm evalai.config.json
- Your local openAIChatEval runs continue to work exactly the same.
-
- No account cancellation. No data export required.
-
- 📦 Installation
- npm install @pauly4010/evalai-sdk openai
- # or
- yarn add @pauly4010/evalai-sdk openai
- # or
- pnpm add @pauly4010/evalai-sdk openai
- 🖥️ Environment Support
- This SDK works in both Node.js and browsers, with some Node-only features.
-
- ✅ Works Everywhere (Node.js + Browser)
- Traces API
-
- Evaluations API
-
- LLM Judge API
-
- Annotations API
-
- Developer API (API Keys, Webhooks, Usage)
-
- Organizations API
-
- Assertions Library
-
- Test Suites
-
- Error Handling
-
- CJS/ESM Compatibility
+ ```

- 🟡 Node.js Only
- These require Node.js:
+ ---

- Snapshot Testing
+ ## SDK Exports

- Local Storage Mode
+ ### Regression Gate Constants

- CLI Tool
+ ```typescript
+ import {
+   GATE_EXIT,             // { PASS: 0, REGRESSION: 1, INFRA_ERROR: 2, ... }
+   GATE_CATEGORY,         // { PASS: "pass", REGRESSION: "regression", INFRA_ERROR: "infra_error" }
+   REPORT_SCHEMA_VERSION,
+   ARTIFACTS,             // { BASELINE, REGRESSION_REPORT, CONFIDENCE_SUMMARY, LATENCY_BENCHMARK }
+ } from "@pauly4010/evalai-sdk";

- Export to File
+ // Or tree-shakeable:
+ import { GATE_EXIT } from "@pauly4010/evalai-sdk/regression";
+ ```

- 🔄 Context Propagation
- Node.js: full async context via AsyncLocalStorage
+ ### Types
+
+ ```typescript
+ import type {
+   RegressionReport,
+   RegressionDelta,
+   Baseline,
+   BaselineTolerance,
+   GateExitCode,
+   GateCategory,
+ } from "@pauly4010/evalai-sdk/regression";
+ ```

- Browser: basic support (not safe across all async boundaries)
+ ### Platform Client

- 🧠 AIEvalClient (Platform API)
+ ```typescript
  import { AIEvalClient } from "@pauly4010/evalai-sdk";

- // From env
- const client = AIEvalClient.init();
-
- // Explicit
- const client2 = new AIEvalClient({
-   apiKey: "your-api-key",
-   organizationId: 123,
-   debug: true,
- });
- 🧪 evalai CLI (v1.5.7)
- The CLI gates deployments on quality, regression, and policy.
-
- Quick start
- npx -y @pauly4010/evalai-sdk@^1 check \
-   --evaluationId 42 \
-   --apiKey $EVALAI_API_KEY
- evalai check
- Option Description
- --evaluationId <id> Required. Evaluation to gate on
- --apiKey <key> API key (or EVALAI_API_KEY)
- --format <fmt> human, json, or github
- --onFail import Import failing run to dashboard
- --explain Show score breakdown
- --minScore <n> Fail if score < n
- --warnDrop <n> Warn if regression exceeds n
- --maxDrop <n> Fail if regression exceeds n
- --minN <n> Fail if test count < n
- --allowWeakEvidence Permit weak evidence
- --policy <name> HIPAA, SOC2, GDPR, PCI_DSS, FINRA_4511
- --baseline <mode> published, previous, production
- --fail-on-flake Fail if unknown case is flaky
- --baseUrl <url> Override API base URL
-
- Exit codes
- Code Meaning
- 0 PASS
- 8 WARN
- 1 Score below threshold
- 2 Regression failure
- 3 Policy violation
- 4 API error
- 5 Bad arguments
- 6 Low test count
- 7 Weak evidence
- evalai doctor
- Verify CI setup before running the gate:
-
- npx -y @pauly4010/evalai-sdk@^1 doctor \
-   --evaluationId 42 \
-   --apiKey $EVALAI_API_KEY
- If doctor passes, check will work.
-
- 🧯 Error Handling
- import { EvalAIError, RateLimitError } from "@pauly4010/evalai-sdk";
-
- try {
-   await client.traces.create({ name: "User Query" });
- } catch (err) {
-   if (err instanceof RateLimitError) {
-     console.log("Retry after:", err.retryAfter);
-   } else if (err instanceof EvalAIError) {
-     console.log(err.code, err.message, err.requestId);
-   }
- }
-
-
- 🔍 Traces
- const trace = await client.traces.create({
-   name: "User Query",
-   traceId: "trace-123",
-   metadata: { userId: "456" },
- });
-
-
- 📝 Evaluations
- import { EvaluationTemplates } from "@pauly4010/evalai-sdk";
-
- const evaluation = await client.evaluations.create({
-   name: "Chatbot Responses",
-   type: EvaluationTemplates.OUTPUT_QUALITY,
-   createdBy: userId,
- });
+ const client = AIEvalClient.init(); // from EVALAI_API_KEY env
+ // or
+ const client2 = new AIEvalClient({ apiKey: "...", organizationId: 123 });
+ ```

+ ### Framework Integrations

- 🔌 Framework Integrations
+ ```typescript
  import { traceOpenAI } from "@pauly4010/evalai-sdk/integrations/openai";
- import OpenAI from "openai";
-
- const openai = traceOpenAI(new OpenAI(), client);
-
- await openai.chat.completions.create({
-   model: "gpt-4",
-   messages: [{ role: "user", content: "Hello" }],
- });
-
-
- 🧭 Changelog
- v1.5.8 (Latest)
- Fixed secureRoute TypeScript overload compatibility
-
- Fixed test infrastructure (expect.any, NextRequest constructor)
-
- Fixed 304 response handling in exports API
-
- Improved error catalog test coverage
-
- v1.5.7
- Documentation updates for CJS compatibility
-
- Version alignment across README and changelog
-
- Environment support section updated
-
- v1.5.6
- PASS/WARN/FAIL gate semantics
+ import { traceAnthropic } from "@pauly4010/evalai-sdk/integrations/anthropic";
+ ```

- --warnDrop soft regression band
+ ---

- Flake intelligence + per-case pass rates
+ ## Installation

- --fail-on-flake enforcement
+ ```bash
+ npm install @pauly4010/evalai-sdk
+ # or
+ yarn add @pauly4010/evalai-sdk
+ # or
+ pnpm add @pauly4010/evalai-sdk
+ ```

- Golden regression suite
+ Add `openai` as a peer dependency if using `openAIChatEval`:

- Nightly determinism + performance audits
+ ```bash
+ npm install openai
+ ```

- Audit trail, observability, retention, and migration safety docs
+ ## Environment Support

- CJS compatibility for all subpath exports
+ | Feature | Node.js | Browser |
+ |---------|---------|---------|
+ | Platform APIs (Traces, Evaluations, LLM Judge) | ✅ | ✅ |
+ | Assertions, Test Suites, Error Handling | ✅ | ✅ |
+ | CJS/ESM | ✅ | ✅ |
+ | CLI, Snapshots, File Export | ✅ | — |
+ | Context Propagation | ✅ Full | ⚠️ Basic |

- v1.5.0
- GitHub annotations formatter
+ ## No Lock-in

- JSON formatter
+ ```bash
+ rm evalai.config.json
+ ```

- --onFail import
+ Your local `openAIChatEval` runs continue to work. No account cancellation. No data export required.

- --explain
+ ## Changelog

- evalai doctor
+ See [CHANGELOG.md](CHANGELOG.md) for the full release history.

- CI pinned invocation guidance
+ **v1.7.0** — `evalai init` scaffolder, `evalai upgrade --full`, `detectRunner()`, machine-readable gate output, init test matrix

+ **v1.6.0** — `evalai gate`, `evalai baseline`, regression gate constants & types

- Environment Variable Safety
+ **v1.5.8** — secureRoute fix, test infra fixes, 304 handling fix

- The SDK never assumes `process.env` exists. All environment reads are guarded, so the client can initialize safely in browser, edge, and server runtimes.
+ **v1.5.6** — PASS/WARN/FAIL semantics, flake intelligence, golden regression suite

- If environment variables are unavailable, the SDK falls back to explicit config.
+ **v1.5.0** — GitHub annotations, `--onFail import`, `evalai doctor`

+ ## License

- 📄 License
  MIT

- 🤝 Support
- Documentation:
- https://v0-ai-evaluation-platform-nu.vercel.app/documentation
+ ## Support

- Issues:
- https://github.com/pauly7610/ai-evaluation-platform/issues
- ```
+ - **Docs:** https://v0-ai-evaluation-platform-nu.vercel.app/documentation
+ - **Issues:** https://github.com/pauly7610/ai-evaluation-platform/issues
@@ -0,0 +1,10 @@
+ /**
+  * evalai baseline — Baseline management commands
+  *
+  * Subcommands:
+  *   evalai baseline init   — Create a starter evals/baseline.json
+  *   evalai baseline update — Run tests + update baseline with real scores
+  */
+ export declare function runBaselineInit(cwd: string): number;
+ export declare function runBaselineUpdate(cwd: string): number;
+ export declare function runBaseline(argv: string[]): number;