@pauly4010/evalai-sdk 1.6.0 → 1.8.0
This diff shows the published contents of two publicly available versions of this package, as they appear in their registry. It is provided for informational purposes only.
- package/CHANGELOG.md +114 -0
- package/README.md +198 -236
- package/dist/cli/baseline.js +1 -1
- package/dist/cli/check.js +15 -0
- package/dist/cli/doctor.d.ts +80 -3
- package/dist/cli/doctor.js +576 -43
- package/dist/cli/explain.d.ts +58 -0
- package/dist/cli/explain.js +429 -0
- package/dist/cli/formatters/github.js +5 -0
- package/dist/cli/formatters/types.d.ts +3 -0
- package/dist/cli/formatters/types.js +3 -0
- package/dist/cli/index.js +47 -4
- package/dist/cli/init.d.ts +11 -2
- package/dist/cli/init.js +239 -16
- package/dist/cli/print-config.d.ts +29 -0
- package/dist/cli/print-config.js +251 -0
- package/dist/cli/regression-gate.d.ts +6 -2
- package/dist/cli/regression-gate.js +246 -61
- package/dist/cli/report/build-check-report.d.ts +1 -1
- package/dist/cli/report/build-check-report.js +2 -0
- package/dist/cli/upgrade.d.ts +15 -0
- package/dist/cli/upgrade.js +491 -0
- package/dist/index.d.ts +1 -1
- package/dist/index.js +7 -7
- package/dist/version.d.ts +1 -1
- package/dist/version.js +1 -1
- package/package.json +1 -1
package/CHANGELOG.md
CHANGED

@@ -5,6 +5,120 @@ All notable changes to the @pauly4010/evalai-sdk package will be documented in t
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.8.0] - 2026-02-26
+
+### ✨ Added
+
+#### CLI — `evalai doctor` Rewrite (Comprehensive Checklist)
+
+- **9 itemized checks** with pass/fail/warn/skip status and exact remediation commands:
+  1. Project detection (package.json + lockfile + package manager)
+  2. Config file validity (evalai.config.json)
+  3. Baseline file (evals/baseline.json — schema, staleness)
+  4. Authentication (API key presence, redacted display)
+  5. Evaluation target (evaluationId configured)
+  6. API connectivity (reachable, latency)
+  7. Evaluation access (permissions, baseline presence)
+  8. CI wiring (.github/workflows/evalai-gate.yml)
+  9. Provider env vars (OpenAI/Anthropic/Azure — optional)
+- **Exit codes**: `0` ready, `2` not ready, `3` infrastructure error
+- **`--report`** flag outputs full JSON diagnostic bundle (versions, hashes, latency, all checks)
+- **`--format json`** for machine-readable output
+
+#### CLI — `evalai explain` (New Command)
+
+- **Offline report explainer** — reads `.evalai/last-report.json` or `evals/regression-report.json` with zero flags
+- **Top 3 failing test cases** with input/expected/actual
+- **What changed** — baseline vs current with directional indicators
+- **Root cause classification**: prompt drift, retrieval drift, formatting drift, tool-use drift, safety/cost/latency regression, coverage drop, baseline stale
+- **Prioritized suggested fixes** with actionable commands
+- Works with both `evalai check` reports (CheckReport) and `evalai gate` reports (BuiltinReport)
+- **`--format json`** for CI pipeline consumption
+
+#### Guided Failure Flow
+
+- **`evalai check` now writes `.evalai/last-report.json`** automatically after every run
+- **Failure hint**: prints `Next: evalai explain` on gate failure
+- **GitHub step summary**: adds tip about `evalai explain` and report artifact location on failure
+
+#### CI Template Improvements
+
+- **Doctor preflight step** added to generated workflow (`continue-on-error: true`)
+- **Report artifact upload** now includes both `evals/regression-report.json` and `.evalai/last-report.json`
+
+#### `evalai init` Output Updated
+
+- First recommendation: `npx evalai doctor` (verify setup)
+- Full command reference: doctor, gate, check, explain, baseline update
+
+#### CLI — `evalai print-config` (New Command)
+
+- **Resolved config viewer** — prints every config field with its current value
+- **Source-of-truth annotations**: `[file]`, `[env]`, `[default]`, `[profile]`, `[arg]` for each field
+- **Secrets redacted** — API keys shown as `sk_t...abcd`
+- **Environment summary** — shows all relevant env vars (EVALAI_API_KEY, OPENAI_API_KEY, CI, etc.)
+- **`--format json`** for machine-readable output
+- Accepts `--evaluationId`, `--baseUrl`, etc. to show how CLI args would merge
+
+#### Minimal Green Example
+
+- **`examples/minimal-green/`** — passes on first run, no account needed
+- Zero dependencies, 3 `node:test` tests
+- Clone → init → doctor → gate → ✅
+
+### 🔧 Changed
+
+- `evalai doctor` exit codes changed: was `0`/`1`, now `0`/`2`/`3`
+- SDK README: added Debugging & Diagnostics section with guided flow diagram
+- SDK README: added Doctor Exit Codes table
+- Doctor test count: 4 → 29 tests; added 9 explain tests (38 total new tests)
+
+---
+
+## [1.7.0] - 2026-02-25
+
+### ✨ Added
+
+#### CLI — `evalai init` Full Project Scaffolder
+
+- **`evalai init`** — Zero-to-gate in under 5 minutes:
+  - Detects Node repo + package manager (npm/yarn/pnpm)
+  - Runs existing tests to capture real pass/fail + test count
+  - Creates `evals/baseline.json` with provenance metadata
+  - Installs `.github/workflows/evalai-gate.yml` (package-manager aware)
+  - Creates `evalai.config.json`
+  - Prints copy-paste next steps — just commit and push
+  - Idempotent: skips files that already exist
+
+#### CLI — `evalai upgrade --full` (Tier 1 → Tier 2)
+
+- **`evalai upgrade --full`** — Upgrade from built-in gate to full gate:
+  - Creates `scripts/regression-gate.ts` (full gate with `--update-baseline`)
+  - Adds `eval:regression-gate` + `eval:baseline-update` npm scripts
+  - Creates `.github/workflows/baseline-governance.yml` (label + diff enforcement)
+  - Upgrades `evalai-gate.yml` to project mode
+  - Adds `CODEOWNERS` entry for `evals/baseline.json`
+
+#### Gate Output — Machine-Readable Improvements
+
+- **`detectRunner()`** — Identifies test runner from `package.json` scripts: vitest, jest, mocha, node:test, ava, tap, or unknown
+- **BuiltinReport** now always emits: `durationMs`, `command`, `runner`, `baseline` metadata
+- Report schema updated with optional `durationMs`, `command`, `runner` properties
+
+#### Init Scaffolder Integration Tests
+
+- 4 fixtures: npm+jest, pnpm+vitest, yarn+jest, pnpm monorepo
+- 25 tests: files created, YAML valid, pm-aware workflow, idempotent runs
+- All fixtures use `node:test` (zero external deps)
+
+### 🔧 Changed
+
+- CLI help text updated to include `upgrade` command
+- Gate report includes runner detection and timing metadata
+- SDK test count: 147 → 172 tests (12 → 15 contract tests)
+
+---
+
 ## [1.6.0] - 2026-02-24
 
 ### ✨ Added
package/README.md
CHANGED

@@ -1,323 +1,285 @@
 # @pauly4010/evalai-sdk
 
-[](https://www.npmjs.com/package/@pauly4010/evalai-sdk)
+[](https://www.npmjs.com/package/@pauly4010/evalai-sdk)
 [](https://www.npmjs.com/package/@pauly4010/evalai-sdk)
+[](https://www.typescriptlang.org/)
+[](#)
+[](#)
+[](https://opensource.org/licenses/MIT)
 
 **Stop LLM regressions in CI in minutes.**
 
-
-No lock-in — remove by deleting `evalai.config.json`.
+Zero to gate in under 5 minutes. No infra. No lock-in. Remove anytime.
 
 ---
 
-
-
-Install, run, get a score.
-No EvalAI account. No API key. No dashboard required.
+## Quick Start (2 minutes)
 
 ```bash
-
-
-
-
-
-cases: [
-  { input: "Hello", expectedOutput: "greeting" },
-  { input: "2 + 2 = ?", expectedOutput: "4" },
-],
-});
-Set:
-
-OPENAI_API_KEY=...
-✅ Vitest integration (recommended)
-import {
-  openAIChatEval,
-  extendExpectWithToPassGate,
-} from "@pauly4010/evalai-sdk";
-import { expect } from "vitest";
-
-extendExpectWithToPassGate(expect);
-
-it("passes gate", async () => {
-  const result = await openAIChatEval({
-    name: "chat-regression",
-    cases: [
-      { input: "Hello", expectedOutput: "greeting" },
-      { input: "2 + 2 = ?", expectedOutput: "4" },
-    ],
-  });
-
-  expect(result).toPassGate();
-});
-Example output
-PASS 2/2 (score: 100)
-
-Tip: Want dashboards and history?
-Set EVALAI_API_KEY and connect this to the platform.
-Failures show:
-
-FAIL 9/10 (score: 90)
-with failed cases and CI guidance.
-
-⚡ 2) Optional: Add a CI gate (2 minutes)
-When you're ready to gate PRs on quality and regressions:
-
-npx -y @pauly4010/evalai-sdk@^1 init
-Create an evaluation in the dashboard and paste its ID into:
-
-{
-  "evaluationId": "42"
-}
-Add to your CI:
+npx @pauly4010/evalai-sdk init
+git add evals/ .github/workflows/evalai-gate.yml evalai.config.json
+git commit -m "chore: add EvalAI regression gate"
+git push
+```
 
-
-env:
-  EVALAI_API_KEY: ${{ secrets.EVALAI_API_KEY }}
-run: npx -y @pauly4010/evalai-sdk@^1 check --format github --onFail import --warnDrop 1
-You’ll get:
+That's it. Open a PR and CI blocks regressions automatically.
 
-GitHub
+`evalai init` detects your project, creates a baseline from your current tests, and installs a GitHub Actions workflow. No manual config needed.
 
-
+---
 
-
+## What `evalai init` does
 
-
-
+1. **Detects** your Node repo and package manager (npm/yarn/pnpm)
+2. **Runs your tests** to capture a real baseline (pass/fail + test count)
+3. **Creates `evals/baseline.json`** with provenance metadata
+4. **Installs `.github/workflows/evalai-gate.yml`** (package-manager aware)
+5. **Creates `evalai.config.json`**
+6. **Prints next steps** — just commit and push
 
-
+---
 
-
+## CLI Commands
 
-
+### Regression Gate (local, no account needed)
 
-
+| Command | Description |
+|---------|-------------|
+| `npx evalai init` | Full project scaffolder — creates everything you need |
+| `npx evalai gate` | Run regression gate locally |
+| `npx evalai gate --format json` | Machine-readable JSON output |
+| `npx evalai gate --format github` | GitHub Step Summary with delta table |
+| `npx evalai baseline init` | Create starter `evals/baseline.json` |
+| `npx evalai baseline update` | Re-run tests and update baseline with real scores |
+| `npx evalai upgrade --full` | Upgrade from Tier 1 (built-in) to Tier 2 (full gate) |
 
-
+### API Gate (requires account)
 
-
+| Command | Description |
+|---------|-------------|
+| `npx evalai check` | Gate on quality score from dashboard |
+| `npx evalai share` | Create share link for a run |
 
-
+### Debugging & Diagnostics
 
-
+| Command | Description |
+|---------|-------------|
+| `npx evalai doctor` | Comprehensive preflight checklist — verifies config, baseline, auth, API, CI wiring |
+| `npx evalai explain` | Offline report explainer — top failures, root cause classification, suggested fixes |
+| `npx evalai print-config` | Show resolved config with source-of-truth annotations (file/env/default/arg) |
 
-
+**Guided failure flow:**
 
-
-
+```
+evalai check → fails → "Next: evalai explain"
+        ↓
+evalai explain → root causes + fixes
+```
 
-
-Your local openAIChatEval runs continue to work exactly the same.
+**GitHub Actions step summary** — gate result at a glance:
 
-
+(screenshot)
 
-
-npm install @pauly4010/evalai-sdk openai
-# or
-yarn add @pauly4010/evalai-sdk openai
-# or
-pnpm add @pauly4010/evalai-sdk openai
-🖥️ Environment Support
-This SDK works in both Node.js and browsers, with some Node-only features.
+**`evalai explain` terminal output** — root causes + fix commands:
 
-
-Traces API
+(screenshot)
 
-
+`check` automatically writes `.evalai/last-report.json` so `explain` works with zero flags.
 
-
+`doctor` uses exit codes: **0** = ready, **2** = not ready, **3** = infra error. Use `--report` for a JSON diagnostic bundle.
 
-
+### Gate Exit Codes
 
-
+| Code | Meaning |
+|------|---------|
+| 0 | Pass — no regression |
+| 1 | Regression detected |
+| 2 | Infra error (baseline missing, tests crashed) |
 
-
+### Check Exit Codes (API mode)
 
-
+| Code | Meaning |
+|------|---------|
+| 0 | Pass |
+| 1 | Score below threshold |
+| 2 | Regression failure |
+| 3 | Policy violation |
+| 4 | API error |
+| 5 | Bad arguments |
+| 6 | Low test count |
+| 7 | Weak evidence |
+| 8 | Warn (soft regression) |
 
-
+### Doctor Exit Codes
 
-
+| Code | Meaning |
+|------|---------|
+| 0 | Ready — all checks passed |
+| 2 | Not ready — one or more checks failed |
+| 3 | Infrastructure error |
 
-
+---
 
-
-These require Node.js:
+## How the Gate Works
 
-
+**Built-in mode** (any Node project, no config needed):
+- Runs `<pm> test`, captures exit code + test count
+- Compares against `evals/baseline.json`
+- Writes `evals/regression-report.json`
+- Fails CI if tests regress
 
-
+**Project mode** (advanced, for full regression gate):
+- If `eval:regression-gate` script exists in `package.json`, delegates to it
+- Supports golden eval scores, confidence tests, p95 latency, cost tracking
+- Full delta table with tolerances
 
-
+---
 
-
+## Run a Regression Test Locally (no account)
 
-
-
+```bash
+npm install @pauly4010/evalai-sdk openai
+```
 
-
+```typescript
+import { openAIChatEval } from "@pauly4010/evalai-sdk";
 
-
-
+await openAIChatEval({
+  name: "chat-regression",
+  cases: [
+    { input: "Hello", expectedOutput: "greeting" },
+    { input: "2 + 2 = ?", expectedOutput: "4" },
+  ],
+});
+```
 
-
-const client = AIEvalClient.init();
+Output: `PASS 2/2 (score: 100)`. No account needed. Just a score.
 
-
-const client2 = new AIEvalClient({
-  apiKey: "your-api-key",
-  organizationId: 123,
-  debug: true,
-});
-🧪 evalai CLI (v1.5.7)
-The CLI gates deployments on quality, regression, and policy.
-
-Quick start
-npx -y @pauly4010/evalai-sdk@^1 check \
-  --evaluationId 42 \
-  --apiKey $EVALAI_API_KEY
-evalai check
-Option Description
---evaluationId <id> Required. Evaluation to gate on
---apiKey <key> API key (or EVALAI_API_KEY)
---format <fmt> human, json, or github
---onFail import Import failing run to dashboard
---explain Show score breakdown
---minScore <n> Fail if score < n
---warnDrop <n> Warn if regression exceeds n
---maxDrop <n> Fail if regression exceeds n
---minN <n> Fail if test count < n
---allowWeakEvidence Permit weak evidence
---policy <name> HIPAA, SOC2, GDPR, PCI_DSS, FINRA_4511
---baseline <mode> published, previous, production
---fail-on-flake Fail if unknown case is flaky
---baseUrl <url> Override API base URL
-
-Exit codes
-Code Meaning
-0 PASS
-8 WARN
-1 Score below threshold
-2 Regression failure
-3 Policy violation
-4 API error
-5 Bad arguments
-6 Low test count
-7 Weak evidence
-evalai doctor
-Verify CI setup before running the gate:
-
-npx -y @pauly4010/evalai-sdk@^1 doctor \
-  --evaluationId 42 \
-  --apiKey $EVALAI_API_KEY
-If doctor passes, check will work.
-
-🧯 Error Handling
-import { EvalAIError, RateLimitError } from "@pauly4010/evalai-sdk";
-
-try {
-  await client.traces.create({ name: "User Query" });
-} catch (err) {
-  if (err instanceof RateLimitError) {
-    console.log("Retry after:", err.retryAfter);
-  } else if (err instanceof EvalAIError) {
-    console.log(err.code, err.message, err.requestId);
-  }
-}
-
-
-🔍 Traces
-const trace = await client.traces.create({
-  name: "User Query",
-  traceId: "trace-123",
-  metadata: { userId: "456" },
-});
+### Vitest Integration
 
+```typescript
+import { openAIChatEval, extendExpectWithToPassGate } from "@pauly4010/evalai-sdk";
+import { expect } from "vitest";
 
-
-import { EvaluationTemplates } from "@pauly4010/evalai-sdk";
+extendExpectWithToPassGate(expect);
 
-
-
-
-
+it("passes gate", async () => {
+  const result = await openAIChatEval({
+    name: "chat-regression",
+    cases: [
+      { input: "Hello", expectedOutput: "greeting" },
+      { input: "2 + 2 = ?", expectedOutput: "4" },
+    ],
+  });
+  expect(result).toPassGate();
 });
+```
 
+---
 
-
-import { traceOpenAI } from "@pauly4010/evalai-sdk/integrations/openai";
-import OpenAI from "openai";
-
-const openai = traceOpenAI(new OpenAI(), client);
+## SDK Exports
 
-
-  model: "gpt-4",
-  messages: [{ role: "user", content: "Hello" }],
-});
+### Regression Gate Constants
 
+```typescript
+import {
+  GATE_EXIT, // { PASS: 0, REGRESSION: 1, INFRA_ERROR: 2, ... }
+  GATE_CATEGORY, // { PASS: "pass", REGRESSION: "regression", INFRA_ERROR: "infra_error" }
+  REPORT_SCHEMA_VERSION,
+  ARTIFACTS, // { BASELINE, REGRESSION_REPORT, CONFIDENCE_SUMMARY, LATENCY_BENCHMARK }
+} from "@pauly4010/evalai-sdk";
 
-
-
-
+// Or tree-shakeable:
+import { GATE_EXIT } from "@pauly4010/evalai-sdk/regression";
+```
 
-
+### Types
+
+```typescript
+import type {
+  RegressionReport,
+  RegressionDelta,
+  Baseline,
+  BaselineTolerance,
+  GateExitCode,
+  GateCategory,
+} from "@pauly4010/evalai-sdk/regression";
+```
 
-
+### Platform Client
 
-
+```typescript
+import { AIEvalClient } from "@pauly4010/evalai-sdk";
 
-
-
+const client = AIEvalClient.init(); // from EVALAI_API_KEY env
+// or
+const client = new AIEvalClient({ apiKey: "...", organizationId: 123 });
+```
 
-
+### Framework Integrations
 
-
+```typescript
+import { traceOpenAI } from "@pauly4010/evalai-sdk/integrations/openai";
+import { traceAnthropic } from "@pauly4010/evalai-sdk/integrations/anthropic";
+```
 
-
-PASS/WARN/FAIL gate semantics
+---
 
-
+## Installation
 
-
+```bash
+npm install @pauly4010/evalai-sdk
+# or
+yarn add @pauly4010/evalai-sdk
+# or
+pnpm add @pauly4010/evalai-sdk
+```
 
-
+Add `openai` as a peer dependency if using `openAIChatEval`:
 
-
+```bash
+npm install openai
+```
 
-
+## Environment Support
 
-
+| Feature | Node.js | Browser |
+|---------|---------|---------|
+| Platform APIs (Traces, Evaluations, LLM Judge) | ✅ | ✅ |
+| Assertions, Test Suites, Error Handling | ✅ | ✅ |
+| CJS/ESM | ✅ | ✅ |
+| CLI, Snapshots, File Export | ✅ | — |
+| Context Propagation | ✅ Full | ⚠️ Basic |
 
-
+## No Lock-in
 
-
-
+```bash
+rm evalai.config.json
+```
 
-
+Your local `openAIChatEval` runs continue to work. No account cancellation. No data export required.
 
-
+## Changelog
 
-
+See [CHANGELOG.md](CHANGELOG.md) for the full release history.
 
-evalai doctor
+**v1.8.0** — `evalai doctor` rewrite (9-check checklist), `evalai explain` command, guided failure flow, CI template with doctor preflight
 
-
+**v1.7.0** — `evalai init` scaffolder, `evalai upgrade --full`, `detectRunner()`, machine-readable gate output, init test matrix
 
+**v1.6.0** — `evalai gate`, `evalai baseline`, regression gate constants & types
 
-
+**v1.5.8** — secureRoute fix, test infra fixes, 304 handling fix
 
-
+**v1.5.5** — PASS/WARN/FAIL semantics, flake intelligence, golden regression suite
 
-
+**v1.5.0** — GitHub annotations, `--onFail import`, `evalai doctor`
 
+## License
 
-📄 License
 MIT
 
-
-Documentation:
-https://v0-ai-evaluation-platform-nu.vercel.app/documentation
+## Support
 
-
-https://github.com/pauly7610/ai-evaluation-platform/issues
-```
+- **Docs:** https://v0-ai-evaluation-platform-nu.vercel.app/documentation
+- **Issues:** https://github.com/pauly7610/ai-evaluation-platform/issues
package/dist/cli/baseline.js
CHANGED

@@ -142,7 +142,7 @@ function runBaselineUpdate(cwd) {
 }
 if (!pkg.scripts?.["eval:baseline-update"]) {
     console.error("❌ Missing 'eval:baseline-update' script in package.json.");
-    console.error(
+    console.error('  Add it: "eval:baseline-update": "npx tsx scripts/regression-gate.ts --update-baseline"');
     return 1;
 }
 console.log("📊 Running baseline update...\n");