@netlify/axis 0.2.0 → 0.3.0

Files changed (54)
  1. package/README.md +24 -916
  2. package/dist/cli.js +2 -4
  3. package/dist/cli.js.map +1 -1
  4. package/dist/config/validator.d.ts.map +1 -1
  5. package/dist/config/validator.js +2 -20
  6. package/dist/config/validator.js.map +1 -1
  7. package/dist/index.d.ts +1 -1
  8. package/dist/index.d.ts.map +1 -1
  9. package/dist/index.js +1 -1
  10. package/dist/index.js.map +1 -1
  11. package/dist/report-ui/index.html +51 -45
  12. package/dist/reports/writer.d.ts +1 -1
  13. package/dist/reports/writer.d.ts.map +1 -1
  14. package/dist/reports/writer.js +2 -1
  15. package/dist/reports/writer.js.map +1 -1
  16. package/dist/runner/runner.d.ts.map +1 -1
  17. package/dist/runner/runner.js +18 -29
  18. package/dist/runner/runner.js.map +1 -1
  19. package/dist/scoring/index.d.ts +1 -7
  20. package/dist/scoring/index.d.ts.map +1 -1
  21. package/dist/scoring/index.js +1 -16
  22. package/dist/scoring/index.js.map +1 -1
  23. package/dist/scoring/sparse-index.d.ts +2 -2
  24. package/dist/scoring/sparse-index.d.ts.map +1 -1
  25. package/dist/scoring/sparse-index.js +8 -9
  26. package/dist/scoring/sparse-index.js.map +1 -1
  27. package/dist/skills/resolver.d.ts +2 -4
  28. package/dist/skills/resolver.d.ts.map +1 -1
  29. package/dist/skills/resolver.js +4 -10
  30. package/dist/skills/resolver.js.map +1 -1
  31. package/dist/transcript/categorize.d.ts +1 -3
  32. package/dist/transcript/categorize.d.ts.map +1 -1
  33. package/dist/transcript/categorize.js +0 -27
  34. package/dist/transcript/categorize.js.map +1 -1
  35. package/dist/types/config.d.ts +4 -17
  36. package/dist/types/config.d.ts.map +1 -1
  37. package/dist/types/report.d.ts +2 -0
  38. package/dist/types/report.d.ts.map +1 -1
  39. package/dist/types/scenario.d.ts +1 -2
  40. package/dist/types/scenario.d.ts.map +1 -1
  41. package/dist/types/scoring.d.ts +0 -4
  42. package/dist/types/scoring.d.ts.map +1 -1
  43. package/package.json +5 -3
  44. package/dist/docs-site/_astro/cli.DDWZtG0-.css +0 -1
  45. package/dist/docs-site/cli/index.html +0 -18
  46. package/dist/docs-site/configuration/index.html +0 -121
  47. package/dist/docs-site/content-assets.mjs +0 -1
  48. package/dist/docs-site/content-modules.mjs +0 -1
  49. package/dist/docs-site/data-store.json +0 -9
  50. package/dist/docs-site/index.html +0 -69
  51. package/dist/docs-site/quickstart/index.html +0 -59
  52. package/dist/docs-site/running/index.html +0 -87
  53. package/dist/docs-site/scoring/index.html +0 -135
  54. package/dist/report-ui/mock-data.json +0 -298
package/README.md CHANGED
@@ -2,7 +2,7 @@
2
2
 
3
3
  AXIS is a synthetic testing framework for measuring how well systems support AI agent interaction. Think [Lighthouse](https://developer.chrome.com/docs/lighthouse), but instead of scoring user experience, AXIS scores **agent experience**.
4
4
 
5
- Given a scenario, an agent, and a prompt — AXIS runs the agent, captures a full transcript of the interaction, and produces a graded score across multiple dimensions.
5
+ Given a scenario, an agent, and a prompt — AXIS runs the agent, captures a full transcript of the interaction, and produces a graded score across four independent dimensions: Goal Achievement, Environment, Service, and Agent.
6
6
 
7
7
  ## Why AXIS
8
8
 
@@ -10,20 +10,13 @@ The web has Lighthouse. APIs have contract testing. Performance has k6. But ther
10
10
 
11
11
  As agents become a primary interface for interacting with websites, APIs, and developer platforms, the systems they interact with need to be measured and optimized for that experience — just like we optimize for page load time or accessibility.
12
12
 
13
- AXIS provides:
14
-
15
- - **A universal score** for agent experience across any target system
16
- - **Repeatable, synthetic tests** that can run in CI or on a schedule
17
- - **Agent-agnostic execution** — plug in Claude Code, Codex CLI, or any agent that implements the adapter contract
18
- - **A grading system** that combines user-defined rubrics with automatically measured signals
19
-
20
13
  ## Quick Start
21
14
 
22
15
  ```bash
23
16
  npm install @netlify/axis
24
17
  ```
25
18
 
26
- Create an `axis.config.json`:
19
+ `axis.config.json`:
27
20
 
28
21
  ```json
29
22
  {
@@ -32,7 +25,7 @@ Create an `axis.config.json`:
32
25
  }
33
26
  ```
34
27
 
35
- Create a scenario file at `scenarios/hello-world.json`:
28
+ `scenarios/hello-world.json`:
36
29
 
37
30
  ```json
38
31
  {
@@ -45,933 +38,48 @@ Create a scenario file at `scenarios/hello-world.json`:
45
38
  }
46
39
  ```
47
40
 
48
- Run it:
49
-
50
41
  ```bash
51
42
  axis run
52
43
  ```
53
44
 
54
- AXIS will execute the scenario, score the results, and display a live summary. Reports are automatically saved to `.axis/reports/` for later review.
55
-
56
- ## Configuration
57
-
58
- `axis.config.json` is the project-level config. It defines which agents to run and where to find scenarios.
59
-
60
- ```json
61
- {
62
- "scenarios": "./scenarios",
63
- "agents": [
64
- "claude-code",
65
- {
66
- "adapter": "claude-code",
67
- "model": "sonnet",
68
- "scenarios": ["hello-world", "cms/*"],
69
- "flags": {
70
- "dangerously-skip-permissions": true
71
- }
72
- }
73
- ],
74
- "env": ["ANTHROPIC_API_KEY", "MY_API_TOKEN"],
75
- "defaults": {
76
- "scoring_weights": {
77
- "goal_achievement": 0.4,
78
- "environment": 0.2,
79
- "service": 0.2,
80
- "agent": 0.2
81
- }
82
- }
83
- }
84
- ```
85
-
86
- | Field | Required | Description |
87
- | -------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
88
- | `scenarios` | yes | Path to the scenarios directory (relative to config file) |
89
- | `agents` | yes | Array of agents to run. Each entry is a string (adapter name) or an agent config object |
90
- | `adapters` | no | Custom adapter modules. Keys are names, values are paths to JS/TS modules. See [Custom Adapters](#custom-adapters) |
91
- | `env` | no | Environment variable names to pass through to agent processes. Defaults to `["ANTHROPIC_API_KEY", "CODEX_API_KEY", "GEMINI_API_KEY"]` |
92
- | `mcp_servers` | no | MCP servers available to all agents. See [MCP Servers](#mcp-servers) |
93
- | `skills` | no | Skills available to all agents. See [Skills](#skills) |
94
- | `defaults.concurrency` | no | Maximum parallel jobs. Defaults to unlimited (all jobs run simultaneously) |
95
- | `defaults.scoring_weights` | no | Weights for the composite AXIS result. Defaults to `0.4 / 0.2 / 0.2 / 0.2` (goal / env / svc / agent) |
96
-
97
- ### Agent Config
98
-
99
- When an agent entry is an object, it supports these fields:
100
-
101
- | Field | Required | Description |
102
- | ----------- | -------- | ------------------------------------------------------------------------------- |
103
- | `adapter` | yes | Adapter type (`"claude-code"`, `"codex"`, `"gemini"`, or a custom adapter name) |
104
- | `command` | no | Executable command for custom adapters (e.g. `"aider"`, `"./my-agent.sh"`) |
105
- | `model` | no | Model override passed to the adapter |
106
- | `scenarios` | no | Subset of scenarios to run. Supports exact keys and glob patterns (`"cms/*"`) |
107
- | `skills` | no | Per-agent skills (merged with top-level `skills`). See [Skills](#skills) |
108
- | `flags` | no | Adapter-specific CLI flags (e.g. `{"dangerously-skip-permissions": true}`) |
109
-
110
- Multiple entries with the same adapter are auto-named: `claude-code`, `claude-code-2`, etc.
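The auto-naming rule in the removed line above can be illustrated with a minimal sketch. `autoNameAgents` is a hypothetical helper, not part of the package's API, and it assumes the suffix counter starts at 2 for the second entry, as the `claude-code-2` example suggests.

```typescript
// Hypothetical helper (not from @netlify/axis): names agent entries that share an adapter.
function autoNameAgents(adapters: string[]): string[] {
  // Prototype-free object so adapter names never collide with built-in keys.
  const seen: Record<string, number> = Object.create(null);
  return adapters.map((adapter) => {
    const count = (seen[adapter] ?? 0) + 1;
    seen[adapter] = count;
    // The first occurrence keeps the bare adapter name; later ones get a numeric suffix.
    return count === 1 ? adapter : `${adapter}-${count}`;
  });
}

const agentNames = autoNameAgents(["claude-code", "claude-code", "codex", "claude-code"]);
console.log(agentNames.join(", "));
```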
111
-
112
- ### Agent Isolation
113
-
114
- AXIS runs every scenario job in a fully isolated environment. This ensures results are reproducible regardless of where you run them — your local machine, a colleague's laptop, or CI.
115
-
116
- **Workspace** — Each job gets its own temporary directory. Setup scripts, agent execution, and teardown all run inside this workspace. It is automatically cleaned up after the job completes.
117
-
118
- **HOME** — The `HOME` environment variable points to the job's workspace, not your real home directory. This prevents agents from picking up user-specific configuration (e.g. `~/.claude/settings.json`, `~/.claude/CLAUDE.md`) that would cause results to differ between machines.
119
-
120
- **Environment variables** — Agent processes receive a minimal set of environment variables:
121
-
122
- - **System variables** (always included): `PATH`, `USER`, `SHELL`, `LANG`, `TERM`, `TMPDIR`
123
- - **User-configured variables** (via `env`): defaults to `["ANTHROPIC_API_KEY", "CODEX_API_KEY", "GEMINI_API_KEY"]`
124
-
125
- Everything else from the host environment is stripped. To pass additional variables (e.g. API tokens needed by setup scripts), add them to `env`:
126
-
127
- ```json
128
- {
129
- "env": ["ANTHROPIC_API_KEY", "MY_SERVICE_TOKEN", "DATABASE_URL"]
130
- }
131
- ```
132
-
133
- #### Claude Code Isolation
134
-
135
- The `claude-code` adapter sets these additional defaults to prevent the Claude CLI from loading host configuration:
136
-
137
- | Variable | Default | Purpose |
138
- | --------------------------------- | --------------------- | ---------------------------------------------------------- |
139
- | `CLAUDE_CONFIG_DIR` | `<workspace>/.claude` | Config discovery uses the clean workspace, not `~/.claude` |
140
- | `CLAUDE_CODE_DISABLE_AUTO_MEMORY` | `1` | No memory side effects between runs |
141
- | `DISABLE_AUTOUPDATER` | `1` | No update checks during test execution |
142
- | `DISABLE_TELEMETRY` | `1` | No telemetry from synthetic test runs |
143
-
144
- To override any of these defaults, add the variable to `env` and set it in your shell environment before running `axis run`. Variables in `env` take precedence over adapter defaults.
145
-
146
- ### MCP Servers
147
-
148
- MCP (Model Context Protocol) servers give agents access to external tools during execution. Servers are defined at the top level of `axis.config.json` and are available to **all** agents — every agent gets the same set of tools so they're tested on equal footing.
149
-
150
- Two transport types are supported:
151
-
152
- - **stdio** — Spawns a local process that communicates over stdin/stdout
153
- - **http** — Connects to a remote MCP server over HTTP
154
-
155
- ```json
156
- {
157
- "scenarios": "./scenarios",
158
- "agents": ["claude-code", "gemini", "codex"],
159
- "mcp_servers": {
160
- "filesystem": {
161
- "type": "stdio",
162
- "command": "npx",
163
- "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"],
164
- "env": { "LOG_LEVEL": "info" }
165
- },
166
- "remote-tools": {
167
- "type": "http",
168
- "url": "https://mcp.example.com/tools",
169
- "headers": { "Authorization": "Bearer ${API_TOKEN}" }
170
- }
171
- }
172
- }
173
- ```
174
-
175
- | Field | Type | Required | Description |
176
- | --------- | --------------------- | -------- | -------------------------------------------- |
177
- | `type` | `"stdio"` or `"http"` | yes | Transport type |
178
- | `command` | string | stdio | Command to spawn |
179
- | `args` | string[] | no | Arguments for the command |
180
- | `env` | object | no | Environment variables for the server process |
181
- | `url` | string | http | URL of the remote MCP endpoint |
182
- | `headers` | object | no | HTTP headers (e.g. authentication) |
183
-
184
- AXIS writes the appropriate native config file for each adapter before spawning the agent:
185
-
186
- - **Claude Code**: `.mcp.json` in the workspace root (project-scoped)
187
- - **Codex**: `config.toml` in `CODEX_HOME`
188
- - **Gemini**: `settings.json` in `GEMINI_CLI_HOME`
189
-
190
- MCP server processes inherit the agent's environment. If a server needs env vars not in the default passthrough list, either add them to the server's `env` field or to the top-level `env` array.
191
-
192
- ### Skills
193
-
194
- Skills augment what agents can do during test runs. They follow the [SKILL.md standard](https://github.com/anthropics/claude-code/blob/main/SKILLS.md) — each skill is a directory containing a `SKILL.md` file and optional supporting files (scripts, references, assets).
195
-
196
- Skills can be defined at the top level (shared across all agents) or per-agent (via `AgentConfig.skills`). Both are merged at runtime — top-level first, then per-agent, deduplicated.
197
-
198
- ```json
199
- {
200
- "scenarios": "./scenarios",
201
- "agents": ["claude-code", { "adapter": "codex", "skills": ["./skills/codex-specific"] }],
202
- "skills": ["netlify/axis-skill-deploy", "./skills/custom-lint"]
203
- }
204
- ```
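The merge order described in the removed text (top-level first, then per-agent, deduplicated) could look like the sketch below. `mergeSkills` is a hypothetical name, and deduplicating by exact source string is an assumption; the removed README does not say how duplicates are detected.

```typescript
// Hypothetical helper: merges top-level and per-agent skill sources.
function mergeSkills(topLevel: string[], perAgent: string[] = []): string[] {
  const merged = topLevel.concat(perAgent);
  // Keep only the first occurrence of each source, so top-level order wins.
  return merged.filter((skill, index) => merged.indexOf(skill) === index);
}

const codexSkills = mergeSkills(
  ["netlify/axis-skill-deploy", "./skills/custom-lint"],
  ["./skills/codex-specific", "./skills/custom-lint"],
);
console.log(codexSkills);
```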
205
-
206
- Three source formats are supported:
207
-
208
- | Format | Example | Resolution |
209
- | ---------------- | ------------------------------------------------ | -------------------------------------------------- |
210
- | Local path | `"./skills/deploy"` | Resolved relative to the config file directory |
211
- | GitHub shorthand | `"netlify/axis-skill-deploy"` | Cloned from `github.com/netlify/axis-skill-deploy` |
212
- | GitHub URL | `"https://github.com/netlify/axis-skill-deploy"` | Cloned from the URL |
213
-
214
- Skills are resolved once during pre-flight and copied into each adapter's native discovery path before spawning the agent:
215
-
216
- | Adapter | Discovery Path | Notes |
217
- | ----------- | ------------------------------------ | ----------------------------------- |
218
- | Claude Code | `{workspace}/.claude/skills/{name}/` | Native Claude Code skill discovery |
219
- | Codex | `{workspace}/.agents/skills/{name}/` | Native Codex skill discovery |
220
- | Gemini | `{GEMINI_CLI_HOME}/skills/{name}/` | Native Gemini CLI skill discovery |
221
- | CLI | _(skipped)_ | No native skill discovery mechanism |
222
-
223
- Skills are purely environmental — AXIS copies the skill directory as-is without modifying content or altering the agent's prompt. The `.git`, `.github`, and `node_modules` directories are excluded from copies.
224
-
225
- **Caching**: Remote skills (GitHub shorthand and URLs) are cached in `.axis/skills-cache/`. Use `--refresh-skills` to force a re-clone.
226
-
227
- **SKILL.md discovery**: AXIS looks for `SKILL.md` at the root of the resolved directory. If not found, it checks one level of subdirectories (for repos where the skill lives in a subdirectory).
228
-
229
- ## Scenarios
230
-
231
- A scenario is the complete unit of test. It defines the prompt, rubric, and any setup/teardown needed. Scenarios live as individual `.json` files in the configured scenarios directory.
232
-
233
- ```json
234
- {
235
- "name": "Create a new blog post via CMS",
236
-
237
- "setup": [{ "action": "run_script", "command": "npm run seed -- fixtures/empty-blog.json" }],
238
-
239
- "prompt": "Navigate to the CMS at https://cms.example.com. Create a new blog post titled \"Hello World\" with at least two paragraphs of content. Publish it and verify it appears on the public site.",
240
-
241
- "rubric": [
242
- { "check": "Blog post exists on public site", "weight": 0.5 },
243
- { "check": "Post has title 'Hello World'", "weight": 0.25 },
244
- { "check": "Post contains at least two paragraphs", "weight": 0.25 }
245
- ],
246
-
247
- "teardown": [{ "action": "run_script", "command": "npm run cleanup -- fixtures/empty-blog.json" }]
248
- }
249
- ```
250
-
251
- | Field | Required | Description |
252
- | ---------- | -------- | ----------------------------------------------------------------------------------- |
253
- | `name` | yes | Human-readable scenario title |
254
- | `prompt` | yes | Task description for the agent |
255
- | `rubric` | yes | Success criteria — either a freeform string or an array of checks (weight optional) |
256
- | `setup` | no | Lifecycle actions to run before the agent executes |
257
- | `teardown` | no | Lifecycle actions to run after scoring completes |
258
- | `agents` | no | When set, only these agents run this scenario (overrides the global agents list) |
259
-
260
- ### Agent Override
261
-
262
- A scenario can specify which agents should run it. When the `agents` field is present, it completely overrides the global agents list — only the named agents are paired with that scenario.
263
-
264
- ```json
265
- {
266
- "name": "Gemini-specific test",
267
- "prompt": "Test something that only Gemini supports",
268
- "rubric": "Verify it works",
269
- "agents": ["gemini"]
270
- }
271
- ```
272
-
273
- Agent names match the adapter name in the config (e.g. `"claude-code"`, `"gemini"`, `"codex"`). Scenarios without the `agents` field run with all configured agents as usual.
274
-
275
- ### Rubric Formats
276
-
277
- **Array rubric** — each check is scored independently by the LLM judge. Weights are optional — unweighted checks split the remaining budget equally:
278
-
279
- ```json
280
- "rubric": [
281
- { "check": "Blog post exists on public site", "weight": 0.5 },
282
- { "check": "Post has title 'Hello World'", "weight": 0.25 },
283
- { "check": "Post contains at least two paragraphs", "weight": 0.25 }
284
- ]
285
- ```
286
-
287
- **String rubric** — freeform description evaluated holistically by the LLM judge:
288
-
289
- ```json
290
- "rubric": "The agent should create a blog post with the title 'Hello World' containing at least two paragraphs of content, and verify it appears on the public site."
291
- ```
292
-
293
- ### Scenario Keys
294
-
295
- Scenario keys are derived from their file path relative to the scenarios directory. For example:
296
-
297
- - `scenarios/hello-world.json` → `hello-world`
298
- - `scenarios/cms/create-post.json` → `cms/create-post`
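A minimal sketch of the key derivation shown in the two examples above. `scenarioKey` is a hypothetical helper; the only behavior taken from the removed README is that the key is the path relative to the scenarios directory with the `.json` extension dropped.

```typescript
// Hypothetical helper: derives a scenario key from a scenario file path.
function scenarioKey(scenariosDir: string, filePath: string): string {
  // Normalize separators, a leading "./", and trailing slashes before comparing.
  const normalize = (p: string): string =>
    p.replace(/\\/g, "/").replace(/^\.\//, "").replace(/\/+$/, "");
  const prefix = normalize(scenariosDir) + "/";
  const file = normalize(filePath);
  if (file.slice(0, prefix.length) !== prefix) {
    throw new Error(`${filePath} is not inside ${scenariosDir}`);
  }
  // Drop the directory prefix and the .json extension.
  return file.slice(prefix.length).replace(/\.json$/, "");
}

console.log(scenarioKey("./scenarios", "scenarios/hello-world.json"));
console.log(scenarioKey("./scenarios", "scenarios/cms/create-post.json"));
```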
299
-
300
- ### Lifecycle Actions
301
-
302
- Setup and teardown support `run_script` actions that execute shell commands. Commands run in the job's isolated workspace directory. Teardown runs **after** scoring completes, so the LLM judge can verify resources before they're cleaned up.
303
-
304
- ## CLI Reference
305
-
306
- ### `axis run`
307
-
308
- Run scenarios against configured agents.
309
-
310
- ```bash
311
- axis run [options]
312
- ```
313
-
314
- | Option | Description |
315
- | --------------------------- | -------------------------------------------------------------------------------------------------------------------- |
316
- | `-c, --config <path>` | Path to axis.config.json (default: `axis.config.json`) |
317
- | `-s, --scenario <key>` | Run a specific scenario by key (e.g. `hello-world`, `cms/create-post`) |
318
- | `-a, --agent <name>` | Run with a specific agent only |
319
- | `--json` | Output results as JSON to stdout (no live display) |
320
- | `--concurrency <n>` | Max parallel jobs (default: unlimited). Overrides `defaults.concurrency` |
321
- | `-v, --verbose` | Show detailed per-step logging |
322
- | `--debug` | Show debug output (workspace paths, env, lifecycle) |
323
- | `-o, --output-dir <dir>` | Also write `axis-report-[timestamp].json` to this directory |
324
- | `--no-score` | Skip scoring (raw results only, saves LLM cost) |
325
- | `--refresh-skills` | Force re-clone of cached remote skills |
326
- | `--compare-baseline [name]` | Compare results against a baseline after scoring (default: `default`). Exits with code 1 if regressions are detected |
327
-
328
- In interactive mode, AXIS renders a live display showing scenario progress, scoring status, and per-job AXIS results. Each agent row shows a ticking elapsed-time timer and a smooth count-up token estimate (e.g. `(12.3s) ~1,234 tok`) — tokens are conservative by design so the displayed number never has to reverse, and both values remain visible once the job completes. When all jobs complete, the final summary persists in the terminal.
329
-
330
- Reports are automatically saved to `.axis/reports/` on every run.
331
-
332
- ### `axis reports`
333
-
334
- View past AXIS reports.
335
-
336
- ```bash
337
- axis reports [reportId] [scenarioKey] [options]
338
- ```
339
-
340
- | Argument / Option | Description |
341
- | ----------------------- | ------------------------------------------------------------------------- |
342
- | `[reportId]` | Report ID or `latest` (omit to list all) |
343
- | `[scenarioKey]` | Drill into a specific scenario's detailed result |
344
- | `-c, --config <path>` | Path to axis.config.json (default: `axis.config.json`) |
345
- | `-a, --agent <name...>` | Filter scenario detail to specific agent(s), repeatable (defaults to all) |
346
- | `--json` | Output as JSON |
347
- | `-n, --limit <count>` | Max reports to list (default: `10`) |
348
-
349
- Examples:
350
-
351
- ```bash
352
- axis reports # List recent reports
353
- axis reports latest # View most recent report summary
354
- axis reports 2025-04-13-183042 # View specific report
355
- axis reports latest hello-world # View all agents for a scenario
356
- axis reports latest hello-world -a codex # View result for a specific agent
357
- axis reports latest hello-world -a codex -a claude-code # View results for multiple agents
358
- ```
359
-
360
- ## Scoring
361
-
362
- AXIS scores every run across **four independent dimensions**, each measuring a different aspect of the agent's interaction. The dimensions roll up into a composite **AXIS Result** (0–100).
363
-
364
- | Dimension | What it measures | How |
365
- | -------------------- | ----------------------------------------- | --------------------------------------- |
366
- | **Goal Achievement** | Did the agent accomplish the task? | LLM judge evaluates against your rubric |
367
- | **Environment** | OS/filesystem/tooling interaction quality | Interaction audit + category scoring |
368
- | **Service** | Service/API interaction quality | Interaction audit + category scoring |
369
- | **Agent** | Agent reasoning/planning quality | Interaction audit + category scoring |
370
-
371
- ### Goal Achievement _(LLM-judged)_
372
-
373
- An LLM judge evaluates the agent's transcript and final result against the user-defined rubric.
374
-
375
- - **Array rubric**: Each check is scored 0–10, weighted, and normalized to 0–100
376
- - **String rubric**: The judge produces a single holistic 0–10 score, scaled to 0–100
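The array-rubric arithmetic described above (per-check 0–10 scores, weighted, normalized to 0–100, with unweighted checks splitting the remaining budget equally) can be sketched as follows. The function names, the division by total weight, and the rounding are assumptions for illustration; the removed README does not specify them.

```typescript
interface RubricCheck {
  check: string;
  weight?: number;
}

// Hypothetical helper: unweighted checks split whatever budget the explicit weights leave.
function resolveWeights(rubric: RubricCheck[]): number[] {
  const explicit = rubric.reduce((sum, c) => sum + (c.weight ?? 0), 0);
  const unweighted = rubric.filter((c) => c.weight === undefined).length;
  const share = unweighted > 0 ? Math.max(0, 1 - explicit) / unweighted : 0;
  return rubric.map((c) => c.weight ?? share);
}

// Hypothetical helper: combines per-check judge scores (0-10) into a 0-100 score.
function goalAchievement(rubric: RubricCheck[], judgeScores: number[]): number {
  const weights = resolveWeights(rubric);
  const total = weights.reduce((a, b) => a + b, 0);
  if (total === 0) return 0;
  const weighted = judgeScores.reduce((sum, score, i) => sum + score * weights[i], 0);
  // Weighted mean on the 0-10 scale, scaled to 0-100 (rounding is an assumption).
  return Math.round((weighted / total) * 10);
}

const rubric: RubricCheck[] = [
  { check: "Blog post exists on public site", weight: 0.5 },
  { check: "Post has title 'Hello World'" },
  { check: "Post contains at least two paragraphs" },
];
console.log(resolveWeights(rubric)); // the two unweighted checks share the remaining 0.5
console.log(goalAchievement(rubric, [10, 8, 4]));
```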
377
-
378
- The judge has full context: the scenario prompt, a condensed transcript, and the agent's final result. It is instructed to independently verify claimed outcomes when possible (e.g. visiting URLs, checking endpoints).
379
-
380
- **How the judge is invoked:** The judge runs using the **same adapter and config** as the agent being evaluated — same CLI, same model, same flags. For example, if a scenario runs with `claude-code` using `claude-sonnet-4-5-20250929`, the judge also spawns `claude` with that model. This means scoring requires the agent's underlying CLI to be capable of returning structured JSON in response to the judge prompt. The built-in adapters (`claude-code`, `codex`, `gemini`) all work out of the box. Custom adapters will work as long as the underlying agent can follow the judge's instructions and return valid JSON.
381
-
382
- > **Note:** A dedicated `judge` config (to use a different adapter/model for scoring) is on the roadmap but not yet supported. For now, consider using a capable model in your agent config if scoring accuracy is important.
383
-
384
- ### Environment, Service & Agent _(interaction-based)_
385
-
386
- Every transcript entry is classified into one of three categories:
387
-
388
- - **Environment**: OS, filesystem, and dev tooling — `Bash`, `Read`, `Write`, `Edit`, `Glob`, `Grep`, git, npm, pip, cargo, etc.
389
- - **Agent**: Internal agent operations — `assistant` entries (thinking/planning), `system` entries
390
- - **Service**: Everything else — `WebFetch`, `WebSearch`, MCP tools, API calls, any tool use not classified as environment
391
-
392
- Each category is scored through a multi-step evaluation pipeline:
393
-
394
- 1. **Sparse index** — The full transcript is compressed into a scannable reference (~10 KB vs 50 KB+ raw). Each interaction gets a one-line summary with its category, tool name, outcome, size, and duration.
395
-
396
- 2. **Triage** _(LLM call)_ — The sparse index is sent to a judge that flags up to 30 interactions for deep review and identifies patterns (retry loops, redundant calls, wasted reasoning).
397
-
398
- 3. **Deep evaluation** _(LLM call)_ — All interactions are sent with full content to a judge that scores each on three dimensions:
399
- - **Success** (0–1): Did it complete without errors?
400
- - **Weight** (0–1): Was the context size proportional to what was needed?
401
- - **Context relevance** (0–1): How much of the information was actionable?
402
-
403
- **Speed** (0–1) is computed heuristically from interaction duration — no LLM needed. Thresholds are calibrated per category (environment operations are expected to be near-instant, service calls get more tolerance for network latency).
404
-
405
- The judge also evaluates **necessity** per category: were all interactions needed, or were some unnecessary?
406
-
407
- Interactions the LLM does not evaluate receive default scores (`success: 1.0, speed: 1.0, weight: 1.0, relevance: 1.0`).
408
-
409
- 4. **Category scoring** — Per-category dimension scores are aggregated and folded with the necessity judgment, then mapped through a log-normal CDF to produce a 0–100 score. The log-normal mapping means bad-to-mediocre improvement scores more than good-to-great. Speed uses a severity-weighted average so that slow outliers pull the score down disproportionately rather than being hidden by many fast interactions. Other dimensions are weighted by context size.
410
-
411
- ### Composite AXIS Result
412
-
413
- ```
414
- AXIS Result = Goal Achievement × 0.40 + Environment × 0.20 + Service × 0.20 + Agent × 0.20
415
- ```
416
-
417
- Weights are configurable via `defaults.scoring_weights` in the config. Use `--no-score` to skip scoring entirely and get raw results only.
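As a worked instance of the composite formula above, the sketch below applies the documented default weights to four example dimension scores. `axisResult` and the sample scores are hypothetical; only the weights and the weighted-sum shape come from the removed README text.

```typescript
interface DimensionScores {
  goal_achievement: number;
  environment: number;
  service: number;
  agent: number;
}

// Default weights as documented: 0.4 / 0.2 / 0.2 / 0.2.
const DEFAULT_WEIGHTS: DimensionScores = {
  goal_achievement: 0.4,
  environment: 0.2,
  service: 0.2,
  agent: 0.2,
};

// Hypothetical helper: weighted sum of the four 0-100 dimension scores.
function axisResult(scores: DimensionScores, weights: DimensionScores = DEFAULT_WEIGHTS): number {
  return (
    scores.goal_achievement * weights.goal_achievement +
    scores.environment * weights.environment +
    scores.service * weights.service +
    scores.agent * weights.agent
  );
}

// 80 * 0.4 + 90 * 0.2 + 70 * 0.2 + 60 * 0.2 = 32 + 18 + 14 + 12
const composite = axisResult({ goal_achievement: 80, environment: 90, service: 70, agent: 60 });
console.log(composite.toFixed(1));
```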
418
-
419
- ### Scoring Pipeline
420
-
421
- Scoring runs **in parallel with execution**. As each job completes, its result is immediately scored while other jobs continue running. The scoring pipeline makes 3 LLM calls per run: goal achievement, triage, and deep evaluation. The triage and goal achievement calls run in parallel, then deep eval runs with the triage results. Teardown only runs after scoring finishes, so the LLM judge can verify deployed resources.
422
-
423
- ### Token Usage
424
-
425
- AXIS reports token usage at two levels: a **live estimate** during execution and the **exact count** after the agent finishes.
426
-
427
- **Live estimate** — During a run, AXIS estimates tokens from the agent's stdout stream using a conservative `chars / 5` ratio. This drives the `~1,234 tok` counter in the live display. The estimate intentionally underestimates so the counter never has to reverse. Only text that flows through stdout is counted — the agent's system prompt, tool definitions, and input context are invisible to this estimator.
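The `chars / 5` live estimate described above could be implemented along these lines. `LiveTokenEstimator` is a hypothetical class, and flooring the result is an assumption; the removed README only gives the ratio and says the counter never reverses.

```typescript
// Hypothetical estimator: counts stdout characters and divides by 5.
class LiveTokenEstimator {
  private chars = 0;

  // Feed one stdout chunk and get the updated estimate back.
  push(chunk: string): number {
    this.chars += chunk.length;
    return this.estimate();
  }

  // The character count only grows, so the estimate never decreases.
  estimate(): number {
    return Math.floor(this.chars / 5);
  }
}

const estimator = new LiveTokenEstimator();
console.log(estimator.push("hello world"));    // 11 chars so far
console.log(estimator.push("12345678901234")); // 25 chars so far
```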
428
-
429
- **Exact count** — After the agent process exits, the adapter reads the real token usage from the CLI's final output event and reports it as `metadata.tokenUsage`. This is the authoritative number stored in reports. The live counter animates up to this final value.
430
-
431
- | Adapter | Token source | Fields reported |
432
- | ------------- | -------------------------- | ----------------------------------- |
433
- | `claude-code` | `result` event → `usage` | `input`, `output`, `cacheReadInput` |
434
- | `codex` | `turn.completed` → `usage` | `input`, `output`, `cacheReadInput` |
435
- | `gemini` | `result` event → `stats` | `input`, `output` |
436
-
437
- **Baseline system prompt overhead** — Agent CLIs inject their own system prompts (coding persona, tool definitions, safety rules) before your scenario prompt reaches the model. This means every run has a baseline input token cost regardless of prompt length. For example, Gemini CLI uses ~14,000 input tokens and Codex ~10,000 input tokens just for system prompt overhead on a trivial prompt. These tokens are real API cost but are not visible in the AXIS stdout stream — they appear only in the final `tokenUsage` numbers. Keep this in mind when comparing token counts across agents: differences in input tokens often reflect system prompt size rather than agent behavior.
438
-
439
- ## Reports
440
-
441
- Every `axis run` automatically writes a structured report to `.axis/reports/`.
442
-
443
- ```
444
- .axis/reports/
445
- 2025-04-13-183042/
446
- report.json # Lightweight manifest (no transcripts)
447
- scenarios/
448
- hello-world/
449
- claude-code.json # Full result + transcript + scores
450
- cms/create-post/
451
- claude-code.json
452
- ```
453
-
454
- **`report.json`** contains the summary, per-result scores, duration, and token usage — everything needed to render a summary table without loading full transcripts.
455
-
456
- **`scenarios/{key}/{agent}.json`** contains the complete result including the full transcript, rubric, prompt, and scoring rationale. This is what you see when drilling into a specific scenario with `axis reports <id> <scenarioKey>`.
457
-
458
- **`scenarios/{key}/{agent}.raw.ndjson`** — When `--debug` is enabled, the raw NDJSON lines from the agent's stdout are written alongside the scenario result. This is useful for debugging adapter parsing issues — you can diff the raw CLI output against the parsed transcript.
459
-
460
- **`scenarios/{key}/{agent}.sparse-index.txt`** — When `--debug` is enabled and scoring is active, the sparse index used for triage and deep evaluation is written as a human-readable text file. Each line represents one interaction with its category, type, and outcome — useful for understanding how the scoring pipeline classified and evaluated the agent's work.
461
-
462
- Report IDs are UTC timestamps in `YYYY-MM-DD-HHmmss` format (e.g. `2025-04-13-183042`).
463
-
464
- Reports are local artifacts — consider adding `.axis/reports/` and `.axis/skills-cache/` to your `.gitignore`. If you use baselines (see [Baselines](#baselines)), keep `.axis/baselines/` tracked so your team shares the same regression targets.
465
-
466
- ## Baselines
467
-
468
- Baselines are snapshots of AXIS results that you can compare future runs against. They're the foundation for regression detection — if a code change causes an agent's score to drop, the baseline diff will flag it.
45
+ AXIS executes the scenario, scores the result, and writes a report to `.axis/reports/`.
469
46
 
470
- Most projects only need one baseline, so baseline commands default to a single baseline named `default`. You can pass an explicit name for multi-baseline workflows (e.g. one per model version or per branch).
47
+ ## Documentation
471
48
 
472
- Baselines are **accumulated**. Running `axis run -s hello-world` and setting that as a baseline only updates `hello-world` entries — previously stored scenarios are preserved. This means you can build up a baseline incrementally across multiple focused runs.
49
+ Full documentation lives at **https://axis-docs.netlify.app**:
473
50
 
474
- Baselines are stored at `.axis/baselines/{name}.json` and are designed to be committed to version control so your team shares the same regression targets.
475
-
476
- ### `axis baseline set`
477
-
478
- Create or update a baseline from a report.
479
-
480
- ```bash
481
- axis baseline set [name] [options]
482
- ```
483
-
484
- | Option | Description |
485
- | --------------------- | ----------------------------------------- |
486
- | `--from <reportId>` | Use a specific report (default: `latest`) |
487
- | `-c, --config <path>` | Path to axis.config.json |
488
-
489
- ```bash
490
- axis run # Generate a scored report
491
- axis baseline set # Save latest report as the default baseline
492
- axis run -s hello-world # Run a single scenario
493
- axis baseline set # Updates only hello-world, preserves others
494
-
495
- # Multi-baseline workflows:
496
- axis baseline set claude-4 # Save as a named baseline
497
- axis baseline set gemini-2 # Keep a separate baseline per model
498
- ```
499
-
500
- ### `axis baseline list`
501
-
502
- List all baselines with their scenario/agent counts.
503
-
504
- ```bash
505
- axis baseline list
506
- ```
507
-
508
- ### `axis baseline show`
509
-
510
- Display the contents of a baseline.
511
-
512
- ```bash
513
- axis baseline show [name] [--json]
514
- ```
515
-
516
- ### `axis baseline diff`
517
-
518
- Compare a report against a baseline. Exits with code 1 if any regressions are detected.
519
-
520
- ```bash
521
- axis baseline diff [name] [options]
522
- ```
523
-
524
- | Option | Description |
525
- | --------------------- | --------------------------------------------- |
526
- | `--report <reportId>` | Compare a specific report (default: `latest`) |
527
- | `--json` | Output diff as JSON |
528
- | `-c, --config <path>` | Path to axis.config.json |
529
-
530
- Only scenarios and agents present in **both** the baseline and report are compared. New scenarios (in the report but not the baseline) are counted as informational. Missing scenarios (in the baseline but not the report) are ignored — partial runs don't trigger regressions.
531
-
532
- A noise tolerance of 1 point applies: score deltas of 0 or 1 are classified as "unchanged".
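The comparison rules above can be sketched in TypeScript. Several things here are assumptions rather than the package's actual implementation: the function names, the `improved` and `regressed` labels, the flat `scenario/agent` key shape, and the reading of the 1-point tolerance as applying to the absolute delta.

```typescript
type Delta = "improved" | "regressed" | "unchanged";

const NOISE_TOLERANCE = 1; // from the README: deltas of 0 or 1 count as unchanged

// Assumption: the tolerance applies to the absolute score difference.
function classifyDelta(baselineScore: number, reportScore: number): Delta {
  const delta = reportScore - baselineScore;
  if (Math.abs(delta) <= NOISE_TOLERANCE) return "unchanged";
  return delta > 0 ? "improved" : "regressed";
}

// Keys are hypothetical "scenario/agent" identifiers mapped to scores.
function diffScores(baseline: Record<string, number>, report: Record<string, number>) {
  const compared: Record<string, Delta> = {};
  const added: string[] = []; // in the report only: informational
  for (const [key, score] of Object.entries(report)) {
    const base = baseline[key];
    if (base === undefined) {
      added.push(key);
      continue;
    }
    compared[key] = classifyDelta(base, score);
  }
  // Keys present only in the baseline are ignored, so partial runs
  // never produce regressions.
  return { compared, added };
}

console.log(diffScores({ "a/claude-code": 80, "b/claude-code": 50 }, { "a/claude-code": 70, "c/claude-code": 90 }));
```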
533
-
534
- ### `axis baseline delete`
535
-
536
- Delete a baseline.
537
-
538
- ```bash
539
- axis baseline delete [name]
540
- ```
541
-
542
- ### Comparing on Run
543
-
544
- Use `--compare-baseline` with `axis run` to automatically diff after scoring:
545
-
546
- ```bash
547
- axis run --compare-baseline # Diff against the default baseline
548
- axis run --compare-baseline=claude-4 # Diff against a named baseline
549
- ```
550
-
551
- This runs the scenarios, scores them, saves the report, then diffs against the baseline. If any regressions are found, the command exits with code 1 — useful for CI gating.
552
-
553
- ## Agent Adapters
554
-
555
- AXIS is agent-agnostic. Any agent can be used as long as its adapter implements the `AgentAdapter` interface.
556
-
557
- ### CLI Resolution
558
-
559
- The built-in adapters (`claude-code`, `codex`, `gemini`) automatically resolve their CLI tools at startup. If the CLI is installed globally or available on `PATH`, AXIS uses it directly. If not, AXIS falls back to `npx` to download and run the package on demand. This means you don't need to install agent CLIs globally — AXIS handles it for you.
560
-
561
- | Adapter | CLI Command | npx Package |
562
- | ------------- | ----------- | --------------------------- |
563
- | `claude-code` | `claude` | `@anthropic-ai/claude-code` |
564
- | `codex` | `codex` | `@openai/codex` |
565
- | `gemini` | `gemini` | `@google/gemini-cli` |
566
-
567
- Custom adapters handle their own command resolution — see [Custom Adapters](#custom-adapters).
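The fallback described in this section amounts to a two-way choice. The sketch below illustrates it under stated assumptions: `resolveCli` and the `isAvailable` predicate are hypothetical names, the PATH lookup is left to the caller, and the real resolver may pass additional arguments to `npx`. The `{ command, prefixArgs }` shape follows the `resolveCommand` example in the Custom Adapters section.

```typescript
interface ResolvedCommand {
  command: string;
  prefixArgs: string[];
}

// Hypothetical resolver: use the CLI directly when it is available,
// otherwise run the package through npx.
function resolveCli(
  cliCommand: string,
  npxPackage: string,
  isAvailable: (command: string) => boolean,
): ResolvedCommand {
  if (isAvailable(cliCommand)) {
    return { command: cliCommand, prefixArgs: [] };
  }
  return { command: "npx", prefixArgs: [npxPackage] };
}

// Pretend only `claude` is installed on this machine.
const installed = new Set(["claude"]);
const onPath = (command: string): boolean => installed.has(command);

console.log(resolveCli("claude", "@anthropic-ai/claude-code", onPath));
console.log(resolveCli("codex", "@openai/codex", onPath));
```

Passing the availability check in as a function keeps the sketch free of filesystem access, which also makes the decision easy to test.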
568
-
569
- ### Adapter Interface
570
-
571
- ```typescript
572
- interface AgentAdapter {
573
- readonly name: string;
574
- run(input: AgentInput): Promise<AgentOutput>;
575
- }
576
-
577
- interface AgentInput {
578
- prompt: string;
579
- config: AgentConfig;
580
- scenario: Scenario;
581
- workingDirectory: string;
582
- env?: Record<string, string>;
583
- registerCleanup?: (fn: () => void) => void;
584
- }
585
-
586
- interface AgentOutput {
587
- transcript: TranscriptEntry[];
588
- result: string | null;
589
- metadata: {
590
- startTime: string;
591
- endTime: string;
592
- durationMs: number;
593
- tokenUsage?: { input: number; output: number; cacheReadInput?: number };
594
- totalCostUsd?: number;
595
- exitCode: number;
596
- sessionId?: string;
597
- error?: string;
598
- };
599
- }
600
-
601
- interface TranscriptEntry {
602
- type: "assistant" | "user" | "tool_use" | "tool_result" | "system" | "error";
603
- timestamp: string;
604
- content: Record<string, unknown>;
605
- }
606
- ```
607
-
608
- ### Built-in: Claude Code Adapter
609
-
610
- The `claude-code` adapter spawns the `claude` CLI with `--output-format stream-json --verbose`. It parses the NDJSON stream from stdout, maps events to `TranscriptEntry` format, and captures metadata (tokens, timing, cost, exit code).
611
-
612
- Adapter-specific flags can be passed via the `flags` config:
613
-
614
- ```json
615
- {
616
- "adapter": "claude-code",
617
- "model": "sonnet",
618
- "flags": {
619
- "dangerously-skip-permissions": true
620
- }
621
- }
622
- ```
623
-
624
- The `dangerously-skip-permissions` flag defaults to `true` for automated testing.
625
-
626
- ### Custom Adapters
627
-
628
- AXIS supports custom adapters for testing any CLI-based agent. Create an adapter module using the `createAgentAdapter` factory, declare it in your config, and reference it in your agents.
629
-
630
- **1. Create the adapter module** (`adapters/my-agent.ts`):
631
-
632
- ```typescript
633
- import { createAgentAdapter } from "@netlify/axis";
634
-
635
- // The type parameter defines the shape of per-run mutable state that your
636
- // streamConfig handlers and getResult callback share.
637
- export default createAgentAdapter<{ stdout: string }>({
638
- // Unique name for this adapter — used in logs and error messages.
639
- name: "my-agent",
640
-
641
- // How to find the CLI binary. Return the command and any prefix args
642
- // (e.g. from npx resolution). Omit to use the built-in npx fallback
643
- // with `cliCommand`, or to read `command` from the agent config at runtime.
644
- resolveCommand: () => ({ command: "my-agent-cli", prefixArgs: [] }),
645
-
646
- // Build the CLI arguments for the agent process. Receives the full AgentInput
647
- // so you can read the prompt, config flags, model, etc.
648
- buildArgs: (input) => {
649
- const args: string[] = [];
650
- const flags = input.config.flags ?? {};
651
- for (const [key, value] of Object.entries(flags)) {
652
- if (value === true) args.push(`--${key}`);
653
- else if (value !== false) args.push(`--${key}`, String(value));
654
- }
655
- args.push(input.prompt);
656
- return args;
657
- },
658
-
659
- // Factory that creates a fresh state object for each run. This state is
660
- // passed to streamConfig handlers and getResult so they can accumulate
661
- // data across the agent's output stream.
662
- initialState: () => ({ stdout: "" }),
663
-
664
- // How to process the agent's stdout stream. Two modes:
665
- // "aggregate" — raw chunks, good for plain-text CLI output
666
- // "lines" — one line at a time, good for NDJSON-streaming agents
667
- streamConfig: {
668
- mode: "aggregate",
669
- onChunk: (chunk, ctx) => {
670
- ctx.state.stdout += chunk;
671
- },
672
- },
673
-
674
- // Called after the agent process exits. Build the final result string
675
- // and optionally push transcript entries or return metadata overrides.
676
- // Return { result: null } when the agent produced no usable output.
677
- getResult: (ctx) => {
678
- const result = ctx.state.stdout.trim() || null;
679
- if (result) {
680
- ctx.transcript.push({
681
- type: "assistant",
682
- timestamp: ctx.endTime.toISOString(),
683
- content: { text: result },
684
- });
685
- }
686
- return { result };
687
- },
688
- });
689
- ```
690
-
691
- **2. Declare in config** (`axis.config.json`):
692
-
693
- ```json
694
- {
695
- "adapters": {
696
- "my-agent": "./adapters/my-agent.ts"
697
- },
698
- "scenarios": "./scenarios",
699
- "agents": [{ "adapter": "my-agent" }]
700
- }
701
- ```
702
-
703
- Adapter paths are resolved relative to the config file. The module must export a valid `AgentAdapter` as its default export or a named `adapter` export.
704
-
705
- **3. Run as usual:**
706
-
707
- ```bash
708
- npx @netlify/axis run
709
- ```
710
-
711
- The `aggregate` mode in `streamConfig` captures raw stdout as the result and produces a minimal transcript — a single `assistant` entry. For richer transcripts (tool use, reasoning, token usage), use `lines` mode with NDJSON parsing. See the built-in adapters for examples.
712
-
713
- For programmatic registration (without config), use the `registerAdapter` export:
714
-
715
- ```typescript
716
- import { registerAdapter, createAgentAdapter } from "@netlify/axis";
717
-
718
- const adapter = createAgentAdapter({
719
- /* spec */
720
- });
721
- registerAdapter("my-agent", adapter);
722
- ```
723
-
724
- ### Built-in: Codex Adapter
725
-
726
- The `codex` adapter spawns `codex exec --json` and parses the NDJSON event stream. It maps Codex events (`item.started`, `item.completed`, `turn.completed`, etc.) to `TranscriptEntry` format, extracts the final `agent_message` as the result, and captures token usage from `turn.completed` events.
727
-
728
- ```json
729
- {
730
- "adapter": "codex",
731
- "model": "o4-mini",
732
- "flags": {
733
- "full-auto": true,
734
- "sandbox": "workspace-write"
735
- }
736
- }
737
- ```
738
-
739
- The `full-auto` flag defaults to `true` for automated testing.
740
-
741
- #### Codex Isolation
742
-
743
- The `codex` adapter sets these environment defaults to prevent loading host configuration:
744
-
745
- | Variable | Default | Purpose |
746
- | ------------------------- | -------------------- | ----------------------------------------------------- |
747
- | `CODEX_HOME` | `<workspace>/.codex` | Config/state uses the clean workspace, not `~/.codex` |
748
- | `CODEX_DISABLE_TELEMETRY` | `1` | No telemetry from synthetic test runs |
749
-
750
- #### Codex Event Mapping
751
-
752
- | Codex Event | Transcript Type | Notes |
753
- | ----------------------------------------------------------- | --------------------- | --------------------------------- |
754
- | `item.completed` (agent_message) | `assistant` | Final text extracted as result |
755
- | `item.started` (command_execution) | `tool_use` | Command about to run |
756
- | `item.completed` (command_execution) | `tool_result` | Command output |
757
- | `item.completed` (reasoning) | `assistant` | Agent reasoning |
758
- | `item.completed` (file_changes, web_search, mcp_tool_calls) | `tool_result` | Tool outputs |
759
- | `error` / `turn.failed` | `error` | Error events |
760
- | `turn.completed` | _(not in transcript)_ | Token usage extracted to metadata |
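The mapping table above can be read as a small lookup function. This is a sketch, not the adapter's source: the `CodexEvent` shape (a `type` plus an optional `item.type`) and the `mapCodexEvent` name are assumptions, and the item type strings are copied from the table as written.

```typescript
type TranscriptType = "assistant" | "tool_use" | "tool_result" | "error";

// Assumed minimal event shape; real Codex events carry more fields.
interface CodexEvent {
  type: string;
  item?: { type: string };
}

// Returns the transcript type for an event, or null when the event
// does not produce a transcript entry (for example `turn.completed`).
function mapCodexEvent(event: CodexEvent): TranscriptType | null {
  if (event.type === "error" || event.type === "turn.failed") return "error";
  const itemType = event.item?.type;
  if (event.type === "item.started" && itemType === "command_execution") {
    return "tool_use";
  }
  if (event.type === "item.completed") {
    switch (itemType) {
      case "agent_message":
      case "reasoning":
        return "assistant";
      case "command_execution":
      case "file_changes":
      case "web_search":
      case "mcp_tool_calls":
        return "tool_result";
    }
  }
  return null;
}

console.log(mapCodexEvent({ type: "item.started", item: { type: "command_execution" } }));
// prints "tool_use"
```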
761
-
762
- ### Built-in: Gemini Adapter
763
-
764
- The `gemini` adapter spawns the `gemini` CLI (Google's Gemini CLI) with `--output-format stream-json` and parses the NDJSON event stream. It maps Gemini events to `TranscriptEntry` format, extracts the final assistant message as the result, and captures token usage from the `result` event.
765
-
766
- ```json
767
- {
768
- "adapter": "gemini",
769
- "model": "gemini-2.5-pro",
770
- "flags": {
771
- "yolo": true,
772
- "sandbox": "docker"
773
- }
774
- }
775
- ```
776
-
777
- The `yolo` flag defaults to `true` for automated testing (auto-approves all tool actions).
778
-
779
- **Requirements:** Set `GEMINI_API_KEY` (from Google AI Studio). The Gemini CLI is resolved automatically (see [CLI Resolution](#cli-resolution)).
780
-
781
- #### Gemini Isolation
782
-
783
- The `gemini` adapter sets these environment defaults to prevent loading host configuration:
784
-
785
- | Variable | Default | Purpose |
786
- | -------------------------- | --------------------- | ------------------------------------------------------ |
787
- | `GEMINI_CLI_HOME` | `<workspace>/.gemini` | Config/state uses the clean workspace, not `~/.gemini` |
788
- | `GEMINI_TELEMETRY_ENABLED` | `false` | No telemetry from synthetic test runs |
789
-
790
- The adapter also writes `settings.json` into `GEMINI_CLI_HOME` with context discovery disabled:
791
-
792
- ```json
793
- {
794
- "context": {
795
- "discoveryMaxDirs": 0,
796
- "memoryBoundaryMarkers": []
797
- }
798
- }
799
- ```
800
-
801
- This prevents Gemini from scanning the workspace directory tree on startup. Without it, Gemini will explore the project structure before addressing the prompt, adding unnecessary tool calls and latency — especially noticeable in AXIS workspaces which are ephemeral temp directories with no meaningful project structure. MCP server configuration is merged into the same `settings.json` when configured.
802
-
803
- #### Gemini Event Mapping
804
-
805
- | Gemini Event | Transcript Type | Notes |
806
- | -------------------------- | --------------------- | ------------------------------------------ |
807
- | `message` (role=assistant) | `assistant` | Last assistant message extracted as result |
808
- | `message` (role=user) | `tool_result` | User messages (typically tool outputs) |
809
- | `tool_use` | `tool_use` | Tool invocation with parameters |
810
- | `tool_result` | `tool_result` | Tool output with status |
811
- | `error` | `error` | Error events with severity |
812
- | `init` | _(not in transcript)_ | Session ID extracted to metadata |
813
- | `result` | _(not in transcript)_ | Token usage extracted to metadata |
51
+ - [Overview](https://axis-docs.netlify.app/) — what AXIS measures and why
52
+ - [Quick Start](https://axis-docs.netlify.app/quickstart) — install through your first scored run
53
+ - [Configuration](https://axis-docs.netlify.app/configuration) — `axis.config.json`, scenarios, MCP servers, skills
54
+ - [CLI Reference](https://axis-docs.netlify.app/cli) — `axis run`, `axis reports`, `axis baseline`
55
+ - [Running Tests](https://axis-docs.netlify.app/running) — execution model, workspace isolation, custom adapters, CI integration
56
+ - [Scoring Framework](https://axis-docs.netlify.app/scoring) — the four dimensions, signals, calibration
814
57
 
815
58
  ## Programmatic API
816
59
 
817
60
  `@netlify/axis` exports its core functionality for use as a library:
818
61
 
819
- ```typescript
820
- import {
821
- // Core execution
822
- run,
823
-
824
- // Configuration
825
- loadConfig,
826
- discoverScenarios,
827
-
828
- // Scoring
829
- scoreResults,
830
- scoreRunResult,
831
- buildScoredOutput,
832
- buildSparseIndex,
833
- categorizeInteraction,
834
-
835
- // Transcript
836
- normalizeTranscript,
837
- toTranscriptAnalysis,
838
-
839
- // Reports
840
- writeReportToStore,
841
- listReports,
842
- readReport,
843
- readScenarioResult,
844
-
845
- // Baselines
846
- setBaseline,
847
- readBaseline,
848
- listBaselines,
849
- deleteBaseline,
850
- diffBaseline,
851
- DEFAULT_BASELINE_NAME,
852
-
853
- // Adapters
854
- getAdapter,
855
- registerAdapter,
856
- createAgentAdapter,
857
- } from "@netlify/axis";
858
- ```
859
-
860
- ### Running scenarios programmatically
861
-
862
- ```typescript
863
- import { run } from "@netlify/axis";
864
-
865
- const output = await run({
866
- configPath: "axis.config.json",
867
- scenarioFilter: ["hello-world"],
868
- agentFilter: ["claude-code"],
869
- onResult: async (result) => {
870
- // Called per-job as each completes (before teardown)
871
- console.log(`${result.scenarioKey}: ${result.output.result}`);
872
- },
873
- });
874
-
875
- console.log(`Completed: ${output.summary.completed}/${output.summary.total}`);
876
- ```
877
-
878
- ### Scoring results
879
-
880
62
  ```typescript
881
63
  import { run, scoreResults } from "@netlify/axis";
882
64
 
883
65
  const output = await run({ configPath: "axis.config.json" });
884
- const scored = await scoreResults(output, {
885
- weights: { goal_achievement: 0.4, environment: 0.2, service: 0.2, agent: 0.2 },
886
- });
66
+ const scored = await scoreResults(output);
887
67
 
888
68
  console.log(`Average AXIS Result: ${scored.summary.averageAxisScore}`);
889
69
  ```
890
70
 
891
- ## Roadmap
892
-
893
- - [x] **Core runner** — scenario parsing, agent lifecycle, transcript capture
894
- - [x] **Agent contract spec** — formal interface definition
895
- - [x] **Claude Code adapter** — first agent implementation
896
- - [x] **LLM-as-judge scoring** — rubric evaluation pipeline
897
- - [x] **System-measured scoring** — efficiency and resilience metrics from transcripts
898
- - [x] **CLI output** — human-readable score reports (ink-based live display)
899
- - [x] **JSON output** — structured results for CI
900
- - [x] **Parallel scenario execution** — run scenario suites concurrently
901
- - [x] **Concurrency control** — `--concurrency` flag to limit parallel jobs
902
- - [x] **Custom adapter API** — `createAgentAdapter` factory + config-based loading for any CLI agent
903
- - [x] **Codex adapter** — NDJSON stream adapter for OpenAI Codex CLI
904
- - [x] **Gemini adapter** — NDJSON stream adapter for Google Gemini CLI
905
- - [x] **MCP server setup** — wire `mcp_servers` config through to adapters
906
- - [x] **Skills setup** — wire `skills` config through to adapters
907
- - [x] **Baseline snapshots** — save runs as baselines, diff future runs for regression detection
908
- - [x] **Transcript normalization** — adapter-agnostic signal extraction (tool names, URLs, domains, classifications)
909
- - [x] **Interaction-based scoring** — 4-dimension evaluation (Goal Achievement + Environment + Service + Agent) with sparse index, triage, and deep eval pipeline
910
- - [ ] **Human interruption detection** — detect and penalize agent requests for human intervention
911
- - [ ] **Score thresholds** — CI gating with configurable pass/fail thresholds
912
- - [ ] **Report cleanup** — pruning old reports with `axis reports prune`
913
- - [ ] **Markdown report output** — `axis reports <id> --format md`
914
- - [ ] **Historical trending** — score regression detection and alerting
915
- - [ ] **AXIS Badge** — embeddable score badge for READMEs (like Lighthouse badges)
916
-
917
- ## Execution Plan
918
-
919
- How we build this, broken into phases. Each phase produces a working increment.
920
-
921
- ### Phase 1: Foundation ✅
922
-
923
- The skeleton. Get a scenario running end-to-end with a single agent, even if scoring is basic.
71
+ The package also exports `loadConfig`, `discoverScenarios`, `setBaseline`, `diffBaseline`, `createAgentAdapter`, `registerAdapter`, and the underlying scoring primitives (`buildSparseIndex`, `categorizeInteraction`, `normalizeTranscript`). See [`src/index.ts`](./src/index.ts) for the full surface.
924
72
 
925
- - [x] **Project scaffolding** — TypeScript + Node.js, npm, `@netlify/axis` package structure
926
- - [x] **Config loader** — parse and validate `axis.config.json`, discover scenario files from configured directory
927
- - [x] **Scenario schema** — JSON Schema definition for scenario files, validation, loading
928
- - [x] **Agent contract interface** — TypeScript interface/types for agent adapters (input, output, transcript format)
929
- - [x] **Claude Code adapter** — first concrete adapter: shells out to `claude` CLI, captures structured JSON output, normalizes to contract
930
- - [x] **Setup/teardown executor** — runs `run_script` actions (shell commands) defined in scenarios
931
- - [x] **Runner MVP** — parallel execution: load config → discover scenarios → resolve agent → run setup → launch agent via adapter → capture transcript → run teardown
932
- - [x] **Basic CLI** — `axis run` entry point that executes scenarios and prints raw results
933
-
934
- ### Phase 2: Scoring ✅
935
-
936
- Make the output meaningful. Turn raw transcripts into graded scores.
937
-
938
- - [x] **Goal achievement judge** — LLM evaluates transcript + rubric, per-check weighted scoring
939
- - [x] **Interaction classification** — deterministic tool-name-to-category mapping (environment / service / agent)
940
- - [x] **Sparse index** — compressed transcript representation for efficient LLM evaluation
941
- - [x] **Triage + deep eval pipeline** — LLM flags interactions for deep review, scores on success/speed/weight/relevance/necessity
942
- - [x] **Category scoring** — log-normal CDF mapping, per-category dimension aggregation
943
- - [x] **Score aggregation** — weighted composite AXIS result across 4 dimensions
944
- - [x] **CLI report** — ink-based live terminal display with per-job scores
945
- - [x] **JSON report** — structured output file for programmatic consumption
946
- - [x] **Report storage** — persistent `.axis/reports/` with manifest + per-scenario detail files
947
- - [x] **Report viewer** — `axis reports` command for listing and drilling into past results
948
-
949
- ### Phase 3: Execution Control & Extensibility ✅
950
-
951
- Fine-grained control over how jobs run, and proof the adapter contract is truly agent-agnostic.
952
-
953
- - [x] **Concurrency control** — `--concurrency <n>` flag and `defaults.concurrency` config to limit parallel jobs
954
- - [x] **Custom adapter API** — `createAgentAdapter` factory + `registerAdapter` for any CLI-based agent
955
- - [x] **Codex adapter** — NDJSON stream adapter for OpenAI Codex CLI
956
- - [x] **Gemini adapter** — NDJSON stream adapter for Google Gemini CLI
957
- - [x] **MCP server setup** — wire `mcp_servers` agent config through to adapter execution
958
- - [x] **Skills setup** — wire `skills` agent config through to adapter execution
959
-
960
- ### Phase 4: CI & Regression
961
-
962
- Make it runnable in CI with automated pass/fail gating and regression detection.
963
-
964
- - [ ] **Score thresholds** — `defaults.thresholds` config with per-category minimums, non-zero exit when below
965
- - [x] **Baseline snapshots** — `axis baseline set` saves a run, `axis baseline diff` compares future runs against it
966
- - [ ] **Report cleanup** — `axis reports prune --keep <n>` to manage disk usage
73
+ ## Roadmap
967
74
 
968
- ### Phase 5: Output & Ecosystem
75
+ Delivered: scenario runner, four-dimension scoring pipeline, baselines with regression detection, MCP/skills wiring, custom adapter API, and built-in adapters for Claude Code, Codex, and Gemini.
969
76
 
970
- Richer output formats and platform features.
77
+ Planned:
971
78
 
972
- - [ ] **Configurable judge** — separate adapter/model for scoring (e.g. run with a custom adapter, judge with `claude-code`)
973
- - [ ] **Markdown report output** — `axis reports <id> --format md` for pasting into PRs/docs
974
- - [ ] **Scenario library** — curated and community-contributed scenario templates
975
- - [ ] **Historical trending** — store results over time, detect regressions
976
- - [ ] **AXIS Badge** — embeddable score badge (like Lighthouse)
977
- - [ ] **Dashboard** — web UI for browsing results, trends, and comparisons
79
+ - **Configurable judge** — separate adapter/model for scoring, independent of the agent under test
80
+ - **Score thresholds** — CI gating with configurable pass/fail thresholds
81
+ - **Human interruption detection** — penalize agent requests for human intervention
82
+ - **Report cleanup** — `axis reports prune` for managing disk usage
83
+ - **Markdown report output** — `axis reports <id> --format md` for PR/doc embedding
84
+ - **Historical trending** — score regression detection over time
85
+ - **AXIS Badge** — embeddable score badge for READMEs