@cleocode/cleo 2026.3.2 → 2026.3.4

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@cleocode/cleo",
- "version": "2026.3.2",
+ "version": "2026.3.4",
  "description": "CLEO V2 - TypeScript task management CLI for AI coding agents",
  "mcpName": "io.github.kryptobaseddev/cleo-mcp-server",
  "type": "module",
@@ -24,9 +24,9 @@
  "dev": "tsc --noEmit --watch",
  "dev:setup": "bash dev/setup-ts-dev.sh",
  "dev:watch": "node build.mjs --watch",
- "test": "vitest run",
- "test:watch": "vitest",
- "test:coverage": "vitest run --coverage",
+ "test": "LOG_LEVEL=silent NODE_NO_WARNINGS=1 vitest run",
+ "test:watch": "LOG_LEVEL=silent NODE_NO_WARNINGS=1 vitest",
+ "test:coverage": "LOG_LEVEL=silent NODE_NO_WARNINGS=1 vitest run --coverage",
  "lint": "tsc --noEmit",
  "prepare": "npm run build",
  "prepack": "npm run build",
@@ -0,0 +1,214 @@
+ ---
+ name: ct-grade
+ description: Session grading for agent behavioral analysis. Use when evaluating agent session quality, running grade scenarios, or interpreting grade results. Triggers on grading tasks, session quality checks, or behavioral analysis needs.
+ version: 1.0.0
+ tier: 2
+ core: false
+ category: quality
+ protocol: null
+ dependencies: []
+ sharedResources: []
+ compatibility:
+ - claude-code
+ - cursor
+ - windsurf
+ - gemini-cli
+ license: MIT
+ ---
+
+ # Session Grading Guide
+
+ Session grading evaluates agent behavioral patterns against the CLEO protocol. It reads the audit log for a completed session and applies a 5-dimension rubric to produce a score (0-100), letter grade (A-F), and diagnostic flags.
+
+ ## When to Use Grade Mode
+
+ Use grading when you need to:
+ - Evaluate how well an agent followed CLEO protocol during a session
+ - Identify behavioral anti-patterns (skipped discovery, missing session.end, etc.)
+ - Track improvement over time across multiple sessions
+ - Validate that orchestrated subagents followed protocol
+
+ Grading requires audit data. Sessions must be started with the `--grade` flag to enable audit log capture.
+
+ ## Starting a Grade Session
+
+ ### CLI
+
+ ```bash
+ # Start a session with grading enabled
+ ct session start --scope epic:T001 --name "Feature work" --grade
+
+ # The --grade flag enables detailed audit logging
+ # All MCP and CLI operations are recorded for later analysis
+ ```
+
+ ### MCP
+
+ ```
+ cleo_mutate({ domain: "session", operation: "start",
+ params: { scope: "epic:T001", name: "Feature work", grade: true }})
+ ```
+
+ ## Running Scenarios
+
+ The grading rubric evaluates 5 behavioral scenarios that map to protocol compliance:
+
+ ### 1. Fresh Discovery
+ Tests whether the agent checks existing sessions and tasks before starting work. Evaluates `session.list` and `tasks.find` calls at session start.
+
+ ### 2. Task Hygiene
+ Tests whether task creation follows protocol: descriptions provided, parent existence verified before subtask creation, no duplicate tasks.
+
+ ### 3. Error Recovery
+ Tests whether the agent handles errors correctly: follows up `E_NOT_FOUND` with recovery lookups (`tasks.find` or `tasks.exists`), avoids duplicate creates after failures.
+
+ ### 4. Full Lifecycle
+ Tests session discipline end-to-end: session listed before task ops, session properly ended, MCP-first usage patterns.
+
+ ### 5. Multi-Domain Analysis
+ Tests progressive disclosure: use of `admin.help` or skill lookups, preference for `cleo_query` (MCP) over CLI for programmatic access.
+
+ ## Evaluating Results
+
+ ### CLI
+
+ ```bash
+ # Grade a specific session
+ ct grade <sessionId>
+
+ # List all past grade results
+ ct grade --list
+ ```
+
+ ### MCP
+
+ ```
+ # Grade a session
+ cleo_query({ domain: "admin", operation: "grade",
+ params: { sessionId: "abc-123" }})
+
+ # List past grades
+ cleo_query({ domain: "admin", operation: "grade.list" })
+ ```
+
+ ## Understanding the 5 Dimensions
+
+ Each dimension scores 0-20 points, totaling 0-100.
+
+ ### S1: Session Discipline (20 pts)
+
+ | Points | Criteria |
+ |--------|----------|
+ | 10 | `session.list` called before first task operation |
+ | 10 | `session.end` called when work is complete |
+
+ **What it measures**: Does the agent check existing sessions before starting, and properly close sessions when done?
+
+ ### S2: Discovery Efficiency (20 pts)
+
+ | Points | Criteria |
+ |--------|----------|
+ | 0-15 | `find:list` ratio >= 80% earns full 15; scales linearly below |
+ | 5 | `tasks.show` used for detail retrieval |
+
+ **What it measures**: Does the agent prefer `tasks.find` (low context cost) over `tasks.list` (high context cost) for discovery?
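
Reviewer note: the S2 table above can be read as a small scoring function. The sketch below is illustrative only and is not taken from the package source. It assumes the `find:list` ratio means `find / (find + list)`, that "scales linearly below" means `15 * ratio / 0.8`, and that a session with no discovery calls earns no ratio points. The function and parameter names are hypothetical.

```python
def discovery_efficiency_score(find_calls: int, list_calls: int, show_calls: int) -> float:
    """Illustrative S2 score: up to 15 ratio points plus 5 for any tasks.show call."""
    total = find_calls + list_calls
    # Assumption: ratio = find / (find + list); no discovery calls means a ratio of 0.
    ratio = find_calls / total if total else 0.0
    # Full 15 points at a ratio of 0.8 or above, linear scaling below (assumed formula).
    ratio_points = 15.0 if ratio >= 0.8 else 15.0 * ratio / 0.8
    # Flat 5 points when tasks.show was used at least once.
    show_points = 5.0 if show_calls > 0 else 0.0
    return ratio_points + show_points
```

Under these assumptions, a session with 8 `tasks.find` calls, 2 `tasks.list` calls, and at least one `tasks.show` call reaches the full 20 points.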
115
+
116
+ ### S3: Task Hygiene (20 pts)
117
+
118
+ Starts at 20 and deducts for violations:
119
+
120
+ | Deduction | Violation |
121
+ |-----------|-----------|
122
+ | -5 each | `tasks.add` without a description |
123
+ | -3 | Subtasks created without `tasks.exists` parent check |
124
+
125
+ **What it measures**: Does the agent create well-formed tasks with descriptions and verify parents before creating subtasks?
126
+
127
+ ### S4: Error Protocol (20 pts)
128
+
129
+ Starts at 20 and deducts for violations:
130
+
131
+ | Deduction | Violation |
132
+ |-----------|-----------|
133
+ | -5 each | `E_NOT_FOUND` error not followed by recovery lookup within 5 ops |
134
+ | -5 | Duplicate task creates detected (same title in session) |
135
+
136
+ **What it measures**: Does the agent recover gracefully from errors and avoid creating duplicate tasks?
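
Reviewer note: the "within 5 ops" rule in the S4 table is a sliding-window check over the audit log. The following is a minimal sketch under stated assumptions, not the package's implementation. The audit entry shape (`operation`, `error`, and `title` keys) is hypothetical, the duplicate deduction is applied once per session as the table suggests, and clamping the score at 0 is an assumption.

```python
RECOVERY_OPS = {"tasks.find", "tasks.exists"}


def error_protocol_score(entries: list[dict], window: int = 5) -> int:
    """Illustrative S4 score: start at 20 and deduct for unrecovered errors and duplicates."""
    score = 20
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            # Look only at the next `window` operations after the failing one.
            following = entries[i + 1 : i + 1 + window]
            if not any(e.get("operation") in RECOVERY_OPS for e in following):
                score -= 5  # -5 for each unrecovered E_NOT_FOUND
    # Duplicate creates: the same title passed to tasks.add more than once in the session.
    titles = [
        e["title"]
        for e in entries
        if e.get("operation") == "tasks.add" and e.get("title")
    ]
    if len(titles) != len(set(titles)):
        score -= 5  # flat deduction, applied once
    # Assumption: the dimension score never drops below zero.
    return max(score, 0)
```

A recovery lookup that arrives on the sixth operation after the error falls outside the window, so it does not prevent the deduction in this sketch.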
137
+
138
+ ### S5: Progressive Disclosure Use (20 pts)
139
+
140
+ | Points | Criteria |
141
+ |--------|----------|
142
+ | 10 | `admin.help` or skill lookup calls made |
143
+ | 10 | `cleo_query` (MCP gateway) used for programmatic access |
144
+
145
+ **What it measures**: Does the agent use progressive disclosure (help/skills) and prefer MCP over CLI?
146
+
147
+ ## Interpreting Scores
148
+
149
+ ### Letter Grades
150
+
151
+ | Grade | Score Range | Meaning |
152
+ |-------|-----------|---------|
153
+ | **A** | 90-100 | Excellent protocol adherence. Agent follows all best practices. |
154
+ | **B** | 75-89 | Good. Minor gaps in one or two dimensions. |
155
+ | **C** | 60-74 | Acceptable. Several protocol violations need attention. |
156
+ | **D** | 45-59 | Below expectations. Significant anti-patterns present. |
157
+ | **F** | 0-44 | Failing. Major protocol violations across multiple dimensions. |
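
Reviewer note: the letter-grade table above maps directly onto a threshold lookup. The sketch below uses the score ranges from the table; the function name and the choice to reject out-of-range scores are assumptions made for illustration.

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the thresholds in the table."""
    if not 0 <= score <= 100:
        # Assumption: scores outside 0-100 are treated as invalid input.
        raise ValueError(f"score out of range: {score}")
    # Lower bound of each band, checked from highest to lowest.
    for floor, grade in ((90, "A"), (75, "B"), (60, "C"), (45, "D")):
        if score >= floor:
            return grade
    return "F"
```

For example, the `85/100` score used in the skill's output description falls in the B band.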
+
+ ### Reading the Output
+
+ The grade result includes:
+ - **score/maxScore**: Raw numeric score (e.g., `85/100`)
+ - **percent**: Percentage score
+ - **grade**: Letter grade (A-F)
+ - **dimensions**: Per-dimension breakdown with score, max, and evidence
+ - **flags**: Specific violations or improvement suggestions
+ - **entryCount**: Number of audit entries analyzed
+
+ ### Flags
+
+ Flags are actionable diagnostic messages. Each flag identifies a specific behavioral issue:
+
+ - `session.list never called` -- Check existing sessions before starting new ones
+ - `session.end never called` -- Always end sessions when done
+ - `tasks.list used Nx` -- Prefer `tasks.find` for discovery
+ - `tasks.add without description` -- Always provide task descriptions
+ - `Subtasks created without tasks.exists parent check` -- Verify parent exists first
+ - `E_NOT_FOUND not followed by recovery lookup` -- Follow errors with `tasks.find` or `tasks.exists`
+ - `No admin.help or skill lookup calls` -- Load `ct-cleo` for protocol guidance
+ - `No MCP query calls` -- Prefer `cleo_query` over CLI
+
+ ## Common Anti-patterns
+
+ | Anti-pattern | Impact | Fix |
+ |-------------|--------|-----|
+ | Skipping `session.list` at start | -10 S1 | Always check existing sessions first |
+ | Forgetting `session.end` | -10 S1 | End sessions when work is complete |
+ | Using `tasks.list` instead of `tasks.find` | -up to 15 S2 | Use `find` for discovery, `list` only for known parent children |
+ | Creating tasks without descriptions | -5 each S3 | Always provide a description with `tasks.add` |
+ | Ignoring `E_NOT_FOUND` errors | -5 each S4 | Follow up with `tasks.find` or `tasks.exists` |
+ | Creating duplicate tasks | -5 S4 | Check for existing tasks before creating new ones |
+ | Never using `admin.help` | -10 S5 | Use progressive disclosure for protocol guidance |
+ | CLI-only usage (no MCP) | -10 S5 | Prefer `cleo_query`/`cleo_mutate` for programmatic access |
+
+ ## Grade Result Schema
+
+ Grade results are stored in `.cleo/metrics/GRADES.jsonl` as append-only JSONL. Each entry conforms to `schemas/grade.schema.json` with these fields:
+
+ - `sessionId` (string, required) -- Session that was graded
+ - `taskId` (string, optional) -- Associated task ID
+ - `totalScore` (number, 0-100) -- Aggregate score
+ - `maxScore` (number, default 100) -- Maximum possible score
+ - `dimensions` (object) -- Per-dimension `{ score, max, evidence[] }`
+ - `flags` (string[]) -- Specific violations or suggestions
+ - `timestamp` (ISO 8601) -- When the grade was computed
+ - `entryCount` (number) -- Audit entries analyzed
+ - `evaluator` (`auto` | `manual`) -- How the grade was computed
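
Reviewer note: because the grade store is append-only JSONL, a consumer can read it line by line without loading a single large document. The sketch below is one possible reader, not part of the package. It relies only on the `sessionId` field listed above; the helper names are hypothetical, and treating the last entry per session as the current one is an assumption about how re-graded sessions should be interpreted.

```python
import json
from typing import Iterable


def parse_grades(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL grade entries, skipping blank lines."""
    grades = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        grades.append(json.loads(line))
    return grades


def latest_per_session(grades: list[dict]) -> dict[str, dict]:
    """Keep the last entry seen for each sessionId (assumed to be the most recent grade)."""
    latest: dict[str, dict] = {}
    for grade in grades:
        # Append-only file: a later line for the same session replaces the earlier one here.
        latest[grade["sessionId"]] = grade
    return latest
```

In use, `parse_grades` would be given an open file handle for `.cleo/metrics/GRADES.jsonl`, since iterating a text file yields one line at a time.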
+
+ ## MCP Operations
+
+ | Gateway | Domain | Operation | Description |
+ |---------|--------|-----------|-------------|
+ | `cleo_query` | `admin` | `grade` | Grade a session (`params: { sessionId }`) |
+ | `cleo_query` | `admin` | `grade.list` | List all past grade results |
@@ -59,6 +59,19 @@ Always check `success` and exit code values. Treat non-zero exit code as failure
 
  Agents MUST NOT provide hours/days/week estimates. Use `small`, `medium`, `large` sizing.
 
+ ## Session Quick Reference
+
+ | Goal | Operation | Gateway |
+ |------|-----------|---------|
+ | Check active session | `session status` | query |
+ | Resume context from last session | `session handoff.show` | query |
+ | Start working | `session start` (scope required) | mutate |
+ | Stop working | `session end` | mutate |
+
+ For advanced session ops (find, suspend, resume, debrief, decisions): see `.cleo/agent-outputs/T5124-session-decision-tree.md`
+
+ **Budget note**: `session list` defaults to 10 results (~500-2000 tokens). Prefer `session find` for discovery (~200-400 tokens).
+
  ## Escalation
 
  For deeper guidance beyond this minimal protocol:
@@ -10,7 +10,7 @@ Template files used by `init.sh` to initialize new projects with the CLEO task m
  | `archive.template.json` | `.cleo/todo-archive.json` | Completed tasks archive (immutable after archival) |
  | `log.template.json` | `.cleo/todo-log.json` | Append-only change log for audit trail |
  | `config.template.json` | `.cleo/config.json` | Project configuration (archive, validation, display settings) |
- | `AGENT-INJECTION.md` | CLAUDE.md, AGENTS.md, GEMINI.md | Default injection template (works with any CLI agent instructions doc) |
+ | `CLEO-INJECTION.md` | AGENTS.md (via `@~/.cleo/templates/CLEO-INJECTION.md`) | Global injection template for task management protocol |
 
  ## Placeholder Contract
 
@@ -70,10 +70,10 @@ ajv validate -s "$TODO_DIR/schemas/todo.schema.json" -d "$TODO_DIR/todo.json"
  - Placeholders: NONE
  - Schema: `../schemas/config.schema.json`
 
- ### AGENT-INJECTION.md
- - Contains: concise CLEO CLI instructions with session protocol
+ ### CLEO-INJECTION.md
+ - Contains: CLEO protocol instructions with MCP-first operations and session protocol
  - Placeholders: NONE
- - Usage: Default injection template for CLAUDE.md (or any agent doc via --target flag)
+ - Usage: Global injection template referenced from AGENTS.md via `@~/.cleo/templates/CLEO-INJECTION.md`
 
  ## Schema Path Rewriting
 
@@ -26,24 +26,16 @@
  !setup-otel.sh
  !DATA-SAFETY-IMPLEMENTATION-SUMMARY.md
 
- # Step 4: Allow schemas directory
- !schemas/
- !schemas/**
-
- # Step 5: Allow templates directory
- !templates/
- !templates/**
-
- # Step 6: Allow ADRs directory (architecture decisions — project-level, not CLEO-specific)
+ # Step 4: Allow ADRs directory (architecture decisions — project-level, not CLEO-specific)
  !adrs/
  !adrs/**
 
- # Step 7: Allow RCASD lifecycle provenance directory
+ # Step 5: Allow RCASD lifecycle provenance directory
  # Structure: rcasd/{epic#}/research/, consensus/, architecture/, specs/, contributions/
  !rcasd/
  !rcasd/**
 
- # Step 8: Allow agent-outputs directory
+ # Step 6: Allow agent-outputs directory
  !agent-outputs/
  !agent-outputs/**
 
@@ -20,7 +20,7 @@ if [ -n "$BAD_FILES" ]; then
  fi
 
  # ── 2. Core file gitignore protection ─────────────────────────────────
- PROTECTED_FILES=".cleo/tasks.db .cleo/config.json .cleo/.gitignore .cleo/project-info.json .cleo/project-context.json"
+ PROTECTED_FILES=".cleo/config.json .cleo/.gitignore .cleo/project-info.json .cleo/project-context.json"
  IGNORED_CORE=""
 
  for f in $PROTECTED_FILES; do