@cleocode/skills 2026.4.0 → 2026.4.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/_shared/manifest-operations.md +1 -2
- package/skills/_shared/skill-chaining-patterns.md +3 -7
- package/skills/_shared/subagent-protocol-base.cant +1 -1
- package/skills/ct-cleo/SKILL.md +56 -65
- package/skills/ct-cleo/references/orchestrator-constraints.md +0 -13
- package/skills/ct-cleo/references/session-protocol.md +3 -12
- package/skills/ct-codebase-mapper/SKILL.md +7 -7
- package/skills/ct-grade/SKILL.md +12 -46
- package/skills/ct-grade/agents/scenario-runner.md +11 -21
- package/skills/ct-grade/references/ab-test-methodology.md +14 -14
- package/skills/ct-grade/references/domains.md +72 -74
- package/skills/ct-grade/references/grade-spec.md +8 -11
- package/skills/ct-grade/references/scenario-playbook.md +77 -106
- package/skills/ct-grade-v2-1/SKILL.md +30 -32
- package/skills/ct-grade-v2-1/agents/scenario-runner.md +14 -34
- package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +4 -1
- package/skills/ct-grade-v2-1/references/ab-testing.md +28 -88
- package/skills/ct-grade-v2-1/references/grade-spec-v2.md +5 -5
- package/skills/ct-grade-v2-1/references/playbook-v2.md +115 -183
- package/skills/ct-grade-v2-1/references/token-tracking.md +7 -9
- package/skills/ct-memory/SKILL.md +16 -35
- package/skills/ct-orchestrator/SKILL.md +58 -68
- package/skills/ct-skill-validator/SKILL.md +1 -1
- package/skills/ct-skill-validator/agents/ecosystem-checker.md +2 -2
- package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +19 -20
- package/skills/manifest.json +1 -1
- package/skills/signaldock-connect/SKILL.md +132 -0
- package/skills/signaldock-connect/assets/agent-card.json +48 -0
- package/skills/signaldock-connect/references/api-endpoints.md +131 -0
- package/skills.json +1 -1
package/skills/ct-grade-v2-1/SKILL.md

@@ -4,22 +4,22 @@ description: >-
 CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent
 session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency,
 S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes:
-(1) scenario — run playbook scenarios S1-S5
-comparison of
+(1) scenario — run playbook scenarios S1-S5 via CLI; (2) ab — blind A/B
+comparison of different CLI configurations for same domain operations with token cost
 measurement; (3) blind — spawn two agents with different configurations, blind-comparator
 picks winner, analyzer produces recommendation. Use when grading agent sessions, running
-grade playbook scenarios, comparing
-usage across
+grade playbook scenarios, comparing behavioral differences, measuring token
+usage across configurations, or performing multi-run blind A/B evaluation with statistical
 analysis and comparative report. Triggers on: grade session, evaluate agent behavior,
-A/B test CLEO
-protocol compliance scoring
-argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [
+A/B test CLEO configurations, run grade scenario, token usage analysis, behavioral rubric,
+protocol compliance scoring.
+argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [runs=N] [session-id=<id>]"
 allowed-tools: ["Bash(python *)", "Bash(cleo-dev *)", "Bash(cleo *)", "Bash(kill *)", "Bash(lsof *)", "Agent", "Read", "Write", "Glob"]
 ---
 
 # ct-grade v2.1 — CLEO Grading and A/B Testing
 
-Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between
+Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between different CLI configurations.
 
 ## On Every /ct-grade Invocation
 
@@ -48,7 +48,7 @@ echo "Grade viewer stopped."
 | Mode | Purpose | Key Output |
 |---|---|---|
 | `scenario` | Run playbook scenarios S1-S5 as graded sessions | GradeResult per scenario |
-| `ab` | Run same domain operations
+| `ab` | Run same domain operations with two configurations, compare | comparison.json + token delta |
 | `blind` | Two agents run same task, blind comparator picks winner | analysis.json + winner |
 
 ## Parameters
@@ -57,7 +57,7 @@ echo "Grade viewer stopped."
 |---|---|---|---|
 | `mode` | `scenario\|ab\|blind` | `scenario` | Operating mode |
 | `scenario` | `s1\|s2\|s3\|s4\|s5\|all` | `all` | Grade playbook scenario(s) to run |
-| `interface` | `
+| `interface` | `cli` | `cli` | Interface to exercise (CLI only) |
 | `domains` | comma list | `tasks,session` | Domains to test in `ab` mode |
 | `runs` | integer | `3` | Runs per configuration for statistical confidence |
 | `session-id` | string | — | Grade a specific existing session (skips execution) |
@@ -70,12 +70,12 @@ echo "Grade viewer stopped."
 /ct-grade session-id=<id>
 ```
 
-**Run scenario S4 (Full Lifecycle)
+**Run scenario S4 (Full Lifecycle):**
 ```
-/ct-grade mode=scenario scenario=s4 
+/ct-grade mode=scenario scenario=s4
 ```
 
-**A/B compare
+**A/B compare two configurations for tasks + session domains (3 runs each):**
 ```
 /ct-grade mode=ab domains=tasks,session runs=3
 ```
@@ -93,10 +93,10 @@ echo "Grade viewer stopped."
 
 1. Set up output dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode scenario --scenario <id> --output-dir <dir>`
 2. For each scenario, spawn a `scenario-runner` agent:
-   - Agent start: `
+   - Agent start: `cleo session start --scope global --name "<scenario-id>" --grade`
    - Agent executes the scenario operations (see [references/playbook-v2.md](references/playbook-v2.md))
-   - Agent end: `
-   - Agent runs: `
+   - Agent end: `cleo session end`
+   - Agent runs: `ct grade <sessionId>`
    - Agent saves: `GradeResult` to `<output-dir>/<scenario>/grade.json`
 3. Capture `total_tokens` + `duration_ms` from task notification → `timing.json`
 4. Run: `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode scenario`
@@ -105,16 +105,16 @@ echo "Grade viewer stopped."
 
 1. Set up run dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode ab --output-dir <dir>`
 2. For each target domain, spawn TWO agents in the SAME turn:
-   - **Arm A
-   - **Arm B
+   - **Arm A**: `agents/scenario-runner.md` with configuration A
+   - **Arm B**: `agents/scenario-runner.md` with configuration B
    - Capture tokens from both task notifications immediately
-3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which is
+3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which configuration is which)
 4. Comparator writes `comparison.json`
 5. Run `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode ab`
 
 ### Mode: blind
 
-Same as `ab` but configurations may differ
+Same as `ab` but configurations may differ (e.g., different session scopes, different agent prompts). The comparator is always blind to configuration identity.
 
 ---
 
@@ -127,7 +127,7 @@ timing = {
   "total_tokens": task.total_tokens,  # from task notification — EPHEMERAL
   "duration_ms": task.duration_ms,    # from task notification
   "arm": "arm-A",
-  "interface": "
+  "interface": "cli",
   "scenario": "s4",
   "run": 1,
   "executor_start": start_iso,
@@ -154,11 +154,9 @@ If running without task notifications (no total_tokens available):
 | S2 Discovery Efficiency | 20 | `find:list` ratio ≥80% (+15), `tasks.show` used (+5) |
 | S3 Task Hygiene | 20 | Starts 20, -5 per add without description, -3 if subtask no exists check |
 | S4 Error Protocol | 20 | Starts 20, -5 per unrecovered E_NOT_FOUND, -5 if duplicates |
-| S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10),
+| S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10), progressive disclosure used (+10) |
 
-**Grade letters:** A
-
-**Note:** CLI-only sessions always score 0 on S5 — `metadata.gateway` is not set by the CLI adapter. MCP earns +10 automatically.
+**Grade letters:** A>=90, B>=75, C>=60, D>=45, F<45
 
 ---
 
@@ -227,11 +225,11 @@ Shows historical grades from GRADES.jsonl, A/B summaries from any workspace subd
 
 ---
 
-##
+## CLI Grade Operations
 
-|
-|
-| `
-| `
-| `
-| `
+| Command | Description |
+|---------|-------------|
+| `ct grade <sessionId>` | Grade a specific session |
+| `ct grade --list` | List past grade results |
+| `ct session start --scope global --name "<n>" --grade` | Start graded session |
+| `ct session end` | End session |
package/skills/ct-grade-v2-1/agents/scenario-runner.md

@@ -1,12 +1,11 @@
 # Scenario Runner Agent
 
-You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the
+You are a CLEO grade scenario executor. Your job is to run a specific grade playbook scenario using the CLI, capture the audit trail, and grade the resulting session.
 
 ## Inputs
 
 You will receive:
 - `SCENARIO`: Which scenario to run (s1|s2|s3|s4|s5|s6|s7|s8|s9|s10)
-- `INTERFACE`: Which interface to use (mcp|cli)
 - `OUTPUT_DIR`: Where to write results
 - `PROJECT_DIR`: Path to the CLEO project (for cleo-dev --cwd)
 - `RUN_NUMBER`: Integer (1, 2, 3...) for repeated runs
@@ -17,30 +16,24 @@ You will receive:
 
 Note the ISO timestamp before any operations.
 
-### Step 2: Start a graded session 
+### Step 2: Start a graded session
 
-```
-
+```bash
+cleo-dev --cwd <PROJECT_DIR> session start --grade --name "grade-<SCENARIO>-run<RUN>" --scope global
 ```
 
 Save the returned `sessionId`.
 
 If this fails (DB migration error, ENOENT, or non-zero exit):
 - Write `grade.json: { "error": "DB_UNAVAILABLE", "totalScore": null }`
-- Write `timing.json: { "error": "DB_UNAVAILABLE", "total_tokens": null, "duration_ms": null, "
+- Write `timing.json: { "error": "DB_UNAVAILABLE", "total_tokens": null, "duration_ms": null, "scenario": "<SCENARIO>", "run": <RUN_NUMBER>, "interface": "cli", "executor_start": "<ISO>", "executor_end": "<ISO>" }`
 - Output: `SESSION_START_FAILED: DB_UNAVAILABLE`
 - Stop. Do NOT abort silently.
 
 ### Step 3: Execute scenario operations
 
-Follow the exact operation sequence from the scenario playbook.
-
-**MCP operations** use the query/mutate gateway:
-```
-query tasks find { "status": "active" }
-```
+Follow the exact operation sequence from the scenario playbook. All operations use the CLI.
 
-**CLI operations** use cleo-dev (prefer) or cleo, with PROJECT_DIR as cwd if provided:
 ```bash
 cleo-dev --cwd <PROJECT_DIR> find --status active
 ```
@@ -49,14 +42,14 @@ Scenario sequences are in [../references/playbook-v2.md](../references/playbook-
 
 ### Step 4: End the session
 
-```
-
+```bash
+cleo-dev --cwd <PROJECT_DIR> session end
 ```
 
 ### Step 5: Grade the session
 
-```
-
+```bash
+cleo-dev --cwd <PROJECT_DIR> check grade --session "<saved-id>"
 ```
 
 Save the full GradeResult JSON.
@@ -65,7 +58,7 @@ Save the full GradeResult JSON.
 
 Record every operation you executed as a JSONL file. Each line:
 ```json
-{"seq": 1, "
+{"seq": 1, "domain": "tasks", "operation": "find", "params": {}, "success": true, "interface": "cli", "timestamp": "..."}
 ```
 
 ### Step 7: Write output files
@@ -89,10 +82,9 @@ Write to `<OUTPUT_DIR>/<SCENARIO>/arm-<INTERFACE>/`:
 **timing.json** — Fill in what you can; orchestrator fills `total_tokens` and `duration_ms`:
 ```json
 {
-  "arm": "<INTERFACE>",
   "scenario": "<SCENARIO>",
   "run": <RUN_NUMBER>,
-  "interface": "
+  "interface": "cli",
   "session_id": "<session-id>",
   "executor_start": "<ISO>",
   "executor_end": "<ISO>",
@@ -109,19 +101,8 @@ Note: `total_tokens` and `duration_ms` are filled by the orchestrator from the t
 
 After receiving the grade result, record the exchange to persist token measurements:
 
-```
-
-  "action": "record",
-  "sessionId": "<session-id>",
-  "transport": "mcp",
-  "domain": "admin",
-  "operation": "grade",
-  "metadata": {
-    "scenario": "<SCENARIO>",
-    "interface": "<INTERFACE>",
-    "run": <RUN_NUMBER>
-  }
-}
+```bash
+cleo-dev --cwd <PROJECT_DIR> admin token record --session "<session-id>" --domain admin --operation grade --metadata '{"scenario":"<SCENARIO>","run":<RUN_NUMBER>}'
 ```
 
 Save the returned `id` as `token_usage_id` in timing.json.
@@ -170,7 +151,6 @@ Do NOT do these during scenario execution — they will lower the grade intentio
 When complete, summarize:
 ```
 SCENARIO: <id>
-INTERFACE: <interface>
 RUN: <n>
 SESSION_ID: <id>
 TOTAL_SCORE: <n>/100
package/skills/ct-grade-v2-1/grade-viewer/eval-report.md

@@ -3,9 +3,12 @@
 **Generated:** 2026-03-07 23:47 UTC
 **Source:** `/tmp/ct-grade-eval`
 
+> **DEPRECATED**: This report was generated when MCP was still supported. MCP has been removed.
+> All operations now use the CLI exclusively. These results are retained for historical reference only.
+
 ---
 
-## MCP vs CLI Blind A/B Results
+## Historical: MCP vs CLI Blind A/B Results
 
 **Overall winner: MCP**
 
package/skills/ct-grade-v2-1/references/ab-testing.md

@@ -1,18 +1,21 @@
 # Blind A/B Testing Protocol
 
-Methodology for blind comparison of
+Methodology for blind comparison of grade scenario results in CLEO.
+
+> **Note**: MCP support was removed. All operations now use the CLI exclusively.
+> This protocol compares different CLI configurations, binary versions, or parameter sets.
 
 ---
 
 ## Agent-Based Execution (Canonical)
 
-The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end via the
+The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end via the CLI. This captures real token data from task notifications.
 
 ### Execution Flow
 
 1. Run `python scripts/setup_run.py` to create run structure and print the execution plan
-2. Follow the plan: spawn scenario-runner agents in parallel (arm-A
-3. Immediately capture `total_tokens` from each task notification
+2. Follow the plan: spawn scenario-runner agents in parallel (arm-A, arm-B with different configurations)
+3. Immediately capture `total_tokens` from each task notification to `timing.json`
 4. Spawn blind-comparator agent after both arms complete
 5. Run `python scripts/token_tracker.py --run-dir <dir>` to aggregate tokens
 6. Run `python scripts/generate_report.py --run-dir <dir>` for final report
@@ -22,10 +25,10 @@ The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end v
 ```python
 # After EACH agent task completes, fill timing.json immediately:
 timing = {
-    "total_tokens": task.total_tokens,  # EPHEMERAL
+    "total_tokens": task.total_tokens,  # EPHEMERAL -- capture now or lose it
     "duration_ms": task.duration_ms,
     "arm": "arm-A",
-    "interface": "
+    "interface": "cli",
     "scenario": "s4",
     "run": 1,
 }
@@ -35,30 +38,7 @@ Token data priority:
 1. `total_tokens` from Claude Code Agent task notification (canonical)
 2. OTel `claude_code.token.usage` (when `CLAUDE_CODE_ENABLE_TELEMETRY=1`)
 3. `output_chars / 3.5` (JSON response estimate)
-4. `entryCount
-
----
-
-## Subprocess-Based Execution (Fallback)
-
-For automated testing without agent delegation, use `run_ab_test.py`. This invokes CLEO via subprocess and requires a migrated `tasks.db`.
-
----
-
-## What We're Testing
-
-| Side | Interface | Mechanism |
-|------|-----------|-----------|
-| **A** (MCP) | JSON-RPC via stdio to CLEO MCP server | `node dist/mcp/index.js` with JSON-RPC messages |
-| **B** (CLI) | Shell commands via subprocess | `cleo-dev <domain> <operation> [params]` |
-
-Both sides call the same underlying `src/dispatch/` layer. The A/B test isolates:
-- **Output format differences** — MCP returns structured JSON envelopes; CLI may add ANSI/formatting
-- **Response size** — character counts as token proxy
-- **Latency** — wall-clock time per operation
-- **Data equivalence** — do they return the same logical data?
-
-Blind assignment means the comparator does not know which result came from MCP vs CLI when producing the quality verdict.
+4. `entryCount x 150` (coarse proxy from GRADES.jsonl)
 
 ---
 
@@ -89,11 +69,11 @@ ab-results/
 ## Blind Assignment
 
 The `run_ab_test.py` script randomly shuffles which side gets labeled "A" vs "B" for each run. The comparator agent sees only:
-- Output labeled "A" 
-- Output labeled "B" 
+- Output labeled "A"
+- Output labeled "B"
 - The original request prompt
 
-The `meta.json` records the true identity
+The `meta.json` records the true identity per run. `generate_report.py` de-blinds after all comparisons are done.
 
 ---
 
@@ -104,7 +84,7 @@ The `meta.json` records the true identity (`a_is_mcp: true|false`) per run. `gen
 | `output_chars` | `len(response_json_str)` |
 | `estimated_tokens` | `output_chars / 4` (approximation) |
 | `duration_ms` | wall clock from subprocess start to end |
-| `success` |
+| `success` | exit code 0 |
 | `data_equivalent` | compare key fields between A and B response |
 
 ---
@@ -136,17 +116,17 @@ After N runs, `generate_report.py` computes:
 
 ```json
 {
-  "wins": { "
-  "win_rate": { "
+  "wins": { "arm_a": 0, "arm_b": 0, "tie": 0 },
+  "win_rate": { "arm_a": 0.0, "arm_b": 0.0 },
   "token_delta": {
-    "
-    "
+    "mean_a_chars": 0,
+    "mean_b_chars": 0,
     "delta_chars": 0,
     "delta_pct": "+0%"
   },
   "latency_delta": {
-    "
-    "
+    "mean_a_ms": 0,
+    "mean_b_ms": 0,
     "delta_ms": 0
   },
   "data_equivalence_rate": 1.0,
@@ -167,48 +147,18 @@ The blind comparator evaluates each side on:
 | **Completeness** | Does the response contain all expected fields? |
 | **Structure** | Is the response well-formed JSON? Clean envelope? |
 | **Usability** | Can an agent consume this without post-processing? |
-| **Verbosity** | Lower is better
+| **Verbosity** | Lower is better -- same data, fewer chars = more efficient |
 
-Rubric scores are 1
+Rubric scores are 1-5 per criterion. Winner is the side with higher weighted total.
 
 ---
 
-##
+## CLI Invocation
 
-
+All operations use the CLI:
 
-```python
-# Protocol sequence
-# 1. Send initialize
-# 2. Send tools/call (query or mutate)
-# 3. Read response lines until tool result found
-# 4. Terminate process
-
-MCP_INIT = {
-    "jsonrpc": "2.0", "id": 0, "method": "initialize",
-    "params": {
-        "protocolVersion": "2024-11-05",
-        "capabilities": {},
-        "clientInfo": {"name": "ct-grade-ab-test", "version": "2.1.0"}
-    }
-}
-
-MCP_CALL = {
-    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
-    "params": {
-        "name": "query",  # or "mutate"
-        "arguments": {
-            "domain": "<domain>",
-            "operation": "<operation>",
-            "params": {}
-        }
-    }
-}
-```
-
-**CLI equivalent:**
 ```bash
-cleo-dev <
+cleo-dev <command> [args] --json
 ```
 
 ---
@@ -217,17 +167,7 @@ cleo-dev <domain> <operation> [args] --json
 
 | Outcome | Meaning | Action |
 |---------|---------|--------|
-|
-|
+| Arm A wins consistently | Configuration A output is cleaner/more complete | Investigate differences |
+| Arm B wins consistently | Configuration B output is more complete or parseable | Investigate differences |
 | Tie | Both equivalent | Focus on latency and token cost |
-|
-| Data divergence detected | MCP and CLI returning different data | File bug — should be dispatch-level consistent |
-
----
-
-## Parity Scenarios
-
-The P1-P3 parity scenarios (see playbook-v2.md) run a curated set of operations specifically chosen to stress:
-- **P1**: tasks domain — high-frequency agent operations
-- **P2**: session domain — lifecycle operations agents use at start/end
-- **P3**: admin domain — help, dash, health (first calls in any session)
+| Data divergence detected | Arms returning different data | File bug -- should be consistent |
package/skills/ct-grade-v2-1/references/grade-spec-v2.md

@@ -82,16 +82,16 @@ Measures whether the agent recovers from `E_NOT_FOUND` (exit code 4) and avoids
 
 ### S5: Progressive Disclosure Use (20 pts)
 
-Measures whether the agent uses CLEO's progressive disclosure system
+Measures whether the agent uses CLEO's progressive disclosure system.
 
 | Points | Condition | Evidence string |
 |--------|-----------|-----------------|
 | +10 | At least one help/skill call: `admin.help`, `tools.skill.show`, `tools.skill.list`, `tools.skill.find` | `Progressive disclosure used (Nx)` |
-| +10 |
+| +10 | Progressive disclosure used for efficient access | `Progressive disclosure active` |
 
 **Flags:**
 - `No admin.help or skill lookup calls (load ct-cleo for guidance)`
-- `No
+- `No progressive disclosure calls (use admin.help or skill lookups)`
 
 **Scoring:** Starts at 0. Range: 0–20.
 
@@ -123,7 +123,7 @@ Grade results in v2.1 carry optional token metadata alongside the standard Grade
     "session": 600,
     "admin": 400
   },
-  "
+  "queryTokens": 2100,
   "cliTokens": 1100,
   "auditEntries": 47
 }
@@ -164,4 +164,4 @@ The rubric recognizes all 10 canonical domains in audit entries. Key domain-to-d
 | `nexus` | S5 (gateway tracking only) |
 | `sticky` | S5 (gateway tracking only) |
 
-All 10 domains contribute to
+All 10 domains contribute to the progressive disclosure count in S5 — any help or skill lookup call regardless of domain earns the +10.