@dutchmanlabs/evalstudio 0.1.0
- package/README.md +106 -0
- package/VALIDATION.md +69 -0
- package/dist/index.js +2369 -0
- package/package.json +51 -0
package/README.md
ADDED
@@ -0,0 +1,106 @@

# Eval Studio CLI

Local-first CLI for scanning AI agents, generating eval suites through Dutchman Labs, running them locally, and uploading results back to Eval Studio.

## Install and Run

From the monorepo during development:

```bash
npm run build:cli
node packages/cli/dist/index.js --help
node packages/cli/dist/index.js login
```

After install, the quick start is:

```bash
evalstudio login
evalstudio init
evalstudio detect
evalstudio generate
evalstudio run
```

Users can get the same command UX via `npx`:

```bash
npx @dutchmanlabs/evalstudio@latest login
npx @dutchmanlabs/evalstudio@latest init
npx @dutchmanlabs/evalstudio@latest detect
npx @dutchmanlabs/evalstudio@latest generate
npx @dutchmanlabs/evalstudio@latest run
```

Or by installing globally:

```bash
npm install -g @dutchmanlabs/evalstudio
evalstudio --help
```
## Commands

- `evalstudio login`
- `evalstudio init`
- `evalstudio detect`
- `evalstudio scan` (alias)
- `evalstudio generate`
- `evalstudio run`
- `evalstudio status`
- `evalstudio export`

## Local Files

The CLI writes state in the current repo under `.evalstudio/`:

- `.evalstudio/config.json`
- `.evalstudio/scan-results.json`
- `.evalstudio/latest-suite.json`
- `.evalstudio/latest-run.json`
- `.evalstudio/exports/`

Global auth is stored in `~/.evalstudio/config.json`.

`generate` writes the current hosted suite to `.evalstudio/latest-suite.json`.

`run` executes the suite locally, saves the result set to `.evalstudio/latest-run.json`, and then uploads those results to the Dutchman Labs dashboard.

`export` is local-only. It transforms `.evalstudio/latest-run.json` into JSONL, CSV, or pytest artifacts under `.evalstudio/exports/`.
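The JSONL export is the simplest of the three to picture: one JSON object per result, one result per line. A minimal sketch, assuming a hypothetical `latest-run.json` shape with a `results` array (the real schema is not documented here and may differ):

```javascript
// Hypothetical shape for .evalstudio/latest-run.json -- the real
// schema may differ; this only illustrates the JSONL transform.
const latestRun = {
  runId: "run_example",
  results: [
    { caseId: "case_1", prompt: "Hello", output: "Hi there", passed: true },
    { caseId: "case_2", prompt: "Refund?", output: "Sure", passed: false },
  ],
};

// One JSON object per line, as in the JSONL artifact
// written under .evalstudio/exports/.
const jsonl = latestRun.results.map((r) => JSON.stringify(r)).join("\n");

console.log(jsonl.split("\n").length); // → 2 (one line per result)
```

JSONL is convenient here because each line can be appended or streamed independently, unlike a single top-level JSON array.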
## Help

```bash
evalstudio --help
evalstudio help
evalstudio generate --help
evalstudio help run
```
## Demo Target

The canonical sibling demo repo used during validation is:

`/Users/riyadsarsour/Desktop/dutchman/testagent`

That demo agent listens on:

`http://127.0.0.1:3000/api/chat`

Typical demo flow:

```bash
cd /Users/riyadsarsour/Desktop/dutchman/testagent
node /Users/riyadsarsour/Desktop/dutchman/dutchmanlabs/packages/cli/dist/index.js init
node /Users/riyadsarsour/Desktop/dutchman/dutchmanlabs/packages/cli/dist/index.js detect
node /Users/riyadsarsour/Desktop/dutchman/dutchmanlabs/packages/cli/dist/index.js generate
node /Users/riyadsarsour/Desktop/dutchman/dutchmanlabs/packages/cli/dist/index.js run --url http://127.0.0.1:3000/api/chat
node /Users/riyadsarsour/Desktop/dutchman/dutchmanlabs/packages/cli/dist/index.js export
```

Artifacts to inspect after the demo:

- `.evalstudio/scan-results.json`
- `.evalstudio/latest-suite.json`
- `.evalstudio/latest-run.json`
- `.evalstudio/exports/`
package/VALIDATION.md
ADDED
@@ -0,0 +1,69 @@

# Eval Studio CLI Validation

Validation date: 2026-03-31

## Checklist

- [x] Happy path
- [x] Invalid API key
- [x] Generation limit exceeded
- [x] Multiple candidates
- [x] Local endpoint down
- [x] Export after run
## Results

### Happy path

- Environment: release-readiness pass used the local mock backend at `http://127.0.0.1:8791` plus a real local HTTP agent endpoint at `http://127.0.0.1:3000/api/chat`
- Commands exercised: `init`, `detect --candidate 1`, `generate`, `run --url http://127.0.0.1:3000/api/chat --payload '{"input":"{{prompt}}"}'`, `status`, `export`
- Result: end-to-end success on the polished build
- Notes:
  - project created: `proj_72c7f75e`
  - detect found 2 candidates and selected `app/api/chat/route.ts`
  - suite created: `suite_3b2835ae`
  - run uploaded: `run_08bd0e4c`
  - JSONL, CSV, and pytest exports written locally under `.evalstudio/exports/`

Additional note:

- the core product loop was also validated earlier on 2026-03-31 against the real hosted backend with a real API key and local sibling demo repo
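The `--payload '{"input":"{{prompt}}"}'` flag implies that the runner fills a `{{prompt}}` placeholder with each eval case's prompt before POSTing the body to the agent. A minimal sketch of one way that substitution could work (an assumption about the mechanism, not the CLI's actual code):

```javascript
// Hypothetical sketch of filling the {{prompt}} placeholder in a payload
// template. The real CLI's substitution logic may differ.
function renderPayload(template, prompt) {
  // JSON-encode the prompt, then strip the surrounding quotes so it can
  // be spliced into the template safely even if it contains quotes.
  const encoded = JSON.stringify(prompt).slice(1, -1);
  return template.replaceAll("{{prompt}}", encoded);
}

const body = renderPayload('{"input":"{{prompt}}"}', 'Say "hi"');
console.log(JSON.parse(body).input); // → Say "hi"
```

Encoding the prompt first matters: a naive string splice would produce invalid JSON whenever a prompt contains quotes or newlines.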
### Invalid API key

- Environment: temp home directory with a bogus `es_live_...` key against the real hosted backend
- Command exercised: `status`
- Result: CLI shows a human-readable invalid/revoked key error without a stack trace
### Generation limit exceeded

- Environment: local mock backend returning `429 generation_limit_exceeded`
- Command exercised: `generate`
- Result: CLI shows a friendly daily-limit message and points the user to `evalstudio status`
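The validated behavior above amounts to mapping backend error responses onto friendly CLI messages instead of raw HTTP failures. A sketch of that mapping, with hypothetical wording (the real CLI's messages and error-code handling are not shown in this document):

```javascript
// Hypothetical error-to-message mapping; the real CLI's wording and
// internal error codes may differ. The 429 generation_limit_exceeded
// case is the one exercised by this validation step.
function friendlyMessage(status, code) {
  if (status === 429 && code === "generation_limit_exceeded") {
    return "Daily generation limit reached. Run `evalstudio status` to check your usage.";
  }
  if (status === 401) {
    return "Your API key looks invalid or revoked. Run `evalstudio login` again.";
  }
  return `Unexpected error (HTTP ${status}).`;
}

console.log(friendlyMessage(429, "generation_limit_exceeded"));
```

The key point is that the raw status/code pair never reaches the user as a stack trace; every known case resolves to an actionable sentence.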
### Multiple candidates

- Environment: temp validation repo with two detected candidates
- Command exercised: `detect --candidate 1`
- Result: CLI prints a ranked list, supports clean selection by number, and stores the selected candidate in `.evalstudio/config.json`
### Local endpoint down

- Environment: valid local suite with no service listening at the configured URL
- Command exercised: `run --url http://127.0.0.1:3999/api/chat --payload '{"input":"{{prompt}}"}'`
- Result: CLI surfaces an actionable local endpoint error instead of a raw fetch failure
### Export after run

- Environment: local run cache present
- Command exercised: `export`
- Result: CLI writes JSONL, CSV, and pytest artifacts and prints their saved paths
## Additional error smoke tests

- Missing API key: `status` reports `No Eval Studio API key is saved on this machine.`
- Project not initialized: `status` reports `This repo is not initialized for Eval Studio yet.`
- No candidates found: `detect` on an empty repo reports `No likely AI agent candidates were found in this repo.`
- No candidate selected: `generate` on an initialized repo without detection results reports `No candidate selected.`
- No eval suite generated: `run` on an initialized repo without a suite reports `No eval suite is saved for this repo.`
- Malformed local agent response: `run` against a server returning `{"ok":true}` reports `Your local agent responded, but the response shape wasn't recognized.`
- Backend upload failure: `run` against a backend stub that fails result uploads reports `Eval Studio couldn't upload your run results.` and leaves `.evalstudio/latest-run.json` with `"uploadStatus": "pending"`