@cleocode/skills 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (171)
  1. package/dispatch-config.json +404 -0
  2. package/index.d.ts +178 -0
  3. package/index.js +405 -0
  4. package/package.json +14 -0
  5. package/profiles/core.json +7 -0
  6. package/profiles/full.json +10 -0
  7. package/profiles/minimal.json +7 -0
  8. package/profiles/recommended.json +7 -0
  9. package/provider-skills-map.json +97 -0
  10. package/skills/_shared/cleo-style-guide.md +84 -0
  11. package/skills/_shared/manifest-operations.md +810 -0
  12. package/skills/_shared/placeholders.json +433 -0
  13. package/skills/_shared/skill-chaining-patterns.md +237 -0
  14. package/skills/_shared/subagent-protocol-base.md +223 -0
  15. package/skills/_shared/task-system-integration.md +232 -0
  16. package/skills/_shared/testing-framework-config.md +110 -0
  17. package/skills/ct-cleo/SKILL.md +490 -0
  18. package/skills/ct-cleo/references/anti-patterns.md +19 -0
  19. package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
  20. package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
  21. package/skills/ct-cleo/references/session-protocol.md +162 -0
  22. package/skills/ct-codebase-mapper/SKILL.md +82 -0
  23. package/skills/ct-contribution/SKILL.md +521 -0
  24. package/skills/ct-contribution/templates/contribution-init.json +21 -0
  25. package/skills/ct-dev-workflow/SKILL.md +423 -0
  26. package/skills/ct-docs-lookup/SKILL.md +66 -0
  27. package/skills/ct-docs-review/SKILL.md +175 -0
  28. package/skills/ct-docs-write/SKILL.md +108 -0
  29. package/skills/ct-documentor/SKILL.md +231 -0
  30. package/skills/ct-epic-architect/SKILL.md +305 -0
  31. package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
  32. package/skills/ct-epic-architect/references/commands.md +201 -0
  33. package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
  34. package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
  35. package/skills/ct-epic-architect/references/output-format.md +92 -0
  36. package/skills/ct-epic-architect/references/patterns.md +284 -0
  37. package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
  38. package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
  39. package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
  40. package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
  41. package/skills/ct-grade/SKILL.md +230 -0
  42. package/skills/ct-grade/agents/analysis-reporter.md +203 -0
  43. package/skills/ct-grade/agents/blind-comparator.md +157 -0
  44. package/skills/ct-grade/agents/scenario-runner.md +134 -0
  45. package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  46. package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
  47. package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
  48. package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
  49. package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
  50. package/skills/ct-grade/eval-viewer/viewer.html +219 -0
  51. package/skills/ct-grade/evals/evals.json +94 -0
  52. package/skills/ct-grade/references/ab-test-methodology.md +150 -0
  53. package/skills/ct-grade/references/domains.md +137 -0
  54. package/skills/ct-grade/references/grade-spec.md +236 -0
  55. package/skills/ct-grade/references/scenario-playbook.md +234 -0
  56. package/skills/ct-grade/references/token-tracking.md +120 -0
  57. package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
  58. package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
  59. package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
  60. package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
  61. package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
  62. package/skills/ct-grade/scripts/generate_report.py +283 -0
  63. package/skills/ct-grade/scripts/run_ab_test.py +504 -0
  64. package/skills/ct-grade/scripts/run_all.py +287 -0
  65. package/skills/ct-grade/scripts/setup_run.py +183 -0
  66. package/skills/ct-grade/scripts/token_tracker.py +630 -0
  67. package/skills/ct-grade-v2-1/SKILL.md +237 -0
  68. package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
  69. package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
  70. package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
  71. package/skills/ct-grade-v2-1/evals/evals.json +74 -0
  72. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
  73. package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
  74. package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
  75. package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
  76. package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
  77. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
  78. package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
  79. package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
  80. package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
  81. package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
  82. package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
  83. package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
  84. package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
  85. package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
  86. package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
  87. package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
  88. package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
  89. package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
  90. package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
  91. package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
  92. package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
  93. package/skills/ct-memory/SKILL.md +84 -0
  94. package/skills/ct-orchestrator/INSTALL.md +61 -0
  95. package/skills/ct-orchestrator/README.md +69 -0
  96. package/skills/ct-orchestrator/SKILL.md +380 -0
  97. package/skills/ct-orchestrator/manifest-entry.json +19 -0
  98. package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
  99. package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
  100. package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
  101. package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
  102. package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
  103. package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
  104. package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
  105. package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
  106. package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
  107. package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
  108. package/skills/ct-research-agent/SKILL.md +226 -0
  109. package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
  110. package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
  111. package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
  112. package/skills/ct-skill-creator/SKILL.md +356 -0
  113. package/skills/ct-skill-creator/agents/analyzer.md +276 -0
  114. package/skills/ct-skill-creator/agents/comparator.md +204 -0
  115. package/skills/ct-skill-creator/agents/grader.md +225 -0
  116. package/skills/ct-skill-creator/assets/eval_review.html +146 -0
  117. package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
  118. package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
  119. package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
  120. package/skills/ct-skill-creator/manifest-entry.json +17 -0
  121. package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
  122. package/skills/ct-skill-creator/references/frontmatter.md +83 -0
  123. package/skills/ct-skill-creator/references/invocation-control.md +165 -0
  124. package/skills/ct-skill-creator/references/output-patterns.md +86 -0
  125. package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
  126. package/skills/ct-skill-creator/references/schemas.md +430 -0
  127. package/skills/ct-skill-creator/references/workflows.md +28 -0
  128. package/skills/ct-skill-creator/scripts/__init__.py +1 -0
  129. package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
  130. package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
  131. package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
  132. package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
  133. package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
  134. package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
  135. package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
  136. package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
  137. package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
  138. package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
  139. package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
  140. package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
  141. package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
  142. package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
  143. package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
  144. package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
  145. package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
  146. package/skills/ct-skill-creator/scripts/utils.py +47 -0
  147. package/skills/ct-skill-validator/SKILL.md +178 -0
  148. package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
  149. package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
  150. package/skills/ct-skill-validator/evals/eval_set.json +14 -0
  151. package/skills/ct-skill-validator/evals/evals.json +52 -0
  152. package/skills/ct-skill-validator/manifest-entry.json +20 -0
  153. package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
  154. package/skills/ct-skill-validator/references/validation-rules.md +168 -0
  155. package/skills/ct-skill-validator/scripts/__init__.py +0 -0
  156. package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
  157. package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
  158. package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
  159. package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
  160. package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
  161. package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
  162. package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
  163. package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
  164. package/skills/ct-skill-validator/scripts/validate.py +422 -0
  165. package/skills/ct-spec-writer/SKILL.md +189 -0
  166. package/skills/ct-stickynote/README.md +14 -0
  167. package/skills/ct-stickynote/SKILL.md +46 -0
  168. package/skills/ct-task-executor/SKILL.md +296 -0
  169. package/skills/ct-validator/SKILL.md +216 -0
  170. package/skills/manifest.json +469 -0
  171. package/skills.json +281 -0
@@ -0,0 +1,236 @@
# CLEO Grade Specification (Current)

**Source**: `src/core/sessions/session-grade.ts`
**Spec**: `docs/specs/CLEO-GRADE-SPEC.md`
**Status**: Active

This document reflects the **current rubric implementation** as of v2026.3.x. It is derived from the live source code, not the original spec (which may be outdated).

---

## Data Flow

```
session.start(grade: true)
  -> CLEO_SESSION_GRADE=true (env)
  -> audit middleware logs ALL ops (query + mutate)
session.end
  -> query check grade { sessionId }   # canonical registry surface
  -> query admin grade { sessionId }   # runtime compatibility alias
  -> Reads audit_log from tasks.db
  -> Applies 5-dimension rubric
  -> Appends to .cleo/metrics/GRADES.jsonl
```

Audit log entries in `tasks.db` contain:
- `domain`, `operation`, `timestamp`
- `params` (the operation parameters)
- `result.success` (boolean)
- `result.exitCode` (number; 4 = E_NOT_FOUND)
- `metadata.gateway` (`"query"` | `"mutate"`)
- `metadata.taskId` (if set)

---

## Dimension 1: Session Discipline (20 pts)

**Source**: Lines 73-105 in session-grade.ts

```
sessionListCalls = entries where domain='session' AND operation='list'
sessionEndCalls  = entries where domain='session' AND operation='end'
taskOps          = entries where domain='tasks'
```

| Points | Condition |
|--------|-----------|
| +10 | `session.list` called AND first list timestamp ≤ first tasks op timestamp |
| +10 | `session.end` called at least once |

**Flags on violation:**
- `session.list never called (check existing sessions before starting)` — no session.list found
- `session.list called after task ops (should check sessions first)` — exists but in the wrong order
- `session.end never called (always end sessions when done)` — no session.end

**Edge cases:**
- No task ops: `firstTaskTime = Infinity`, so session.list always satisfies the ordering check
- Range: 0-20

---

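The Session Discipline logic above can be sketched as a small Python function. This is an illustrative sketch, not the real implementation: the entry dicts mirror the audit-log fields described earlier, but the function name and data shapes are assumed rather than taken from `session-grade.ts`.

```python
def score_session_discipline(entries):
    """Sketch of Dimension 1: +10 for session.list before any tasks op, +10 for session.end."""
    flags = []
    list_times = [e["timestamp"] for e in entries
                  if e["domain"] == "session" and e["operation"] == "list"]
    end_called = any(e["domain"] == "session" and e["operation"] == "end"
                     for e in entries)
    task_times = [e["timestamp"] for e in entries if e["domain"] == "tasks"]
    # No task ops -> first task time is Infinity, so the ordering check always passes.
    first_task = min(task_times) if task_times else float("inf")

    score = 0
    if list_times and min(list_times) <= first_task:
        score += 10
    elif not list_times:
        flags.append("session.list never called (check existing sessions before starting)")
    else:
        flags.append("session.list called after task ops (should check sessions first)")
    if end_called:
        score += 10
    else:
        flags.append("session.end never called (always end sessions when done)")
    return score, flags
```

A compliant session (list, work, end) yields `(20, [])`; a session that starts with a tasks op and never ends yields `(0, ...)` with two flags.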
## Dimension 2: Discovery Efficiency (20 pts)

**Source**: Lines 107-144

```
findCalls = entries where domain='tasks' AND operation='find'
listCalls = entries where domain='tasks' AND operation='list'
showCalls = entries where domain='tasks' AND operation='show'
totalDiscoveryCalls = findCalls.length + listCalls.length
```

| Points | Condition |
|--------|-----------|
| +10 baseline | `totalDiscoveryCalls === 0` (no discovery needed) |
| proportional 0-15 | `round(15 * findRatio)` where `findRatio = findCalls / totalDiscoveryCalls` |
| full 15 | when `findRatio >= 0.80` |
| +5 | `tasks.show` used at least once |

**Cap**: `Math.min(20, score)` — maximum 20 points

**Flags on violation:**
- `tasks.list used Nx (prefer tasks.find for discovery)` — when findRatio < 0.80

**Note**: `tasks.list` with filters (e.g. listing a known parent's children) is acceptable but still counts toward the ratio.

---

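The Dimension 2 scoring rules above can be expressed as a short sketch (assumed names and shapes, not the real source; Python's `round` stands in for JS `Math.round`, which differs only on exact .5 values):

```python
def score_discovery_efficiency(entries):
    """Sketch of Dimension 2: reward tasks.find over tasks.list, plus tasks.show usage."""
    tasks = [e for e in entries if e["domain"] == "tasks"]
    finds = [e for e in tasks if e["operation"] == "find"]
    lists = [e for e in tasks if e["operation"] == "list"]
    shows = [e for e in tasks if e["operation"] == "show"]
    total = len(finds) + len(lists)
    if total == 0:
        score = 10                        # baseline: no discovery needed
    else:
        ratio = len(finds) / total
        score = 15 if ratio >= 0.80 else round(15 * ratio)
    if shows:
        score += 5
    return min(20, score)                 # capped at 20
```

For example, 3 finds + 1 list + 1 show gives `round(15 * 0.75) + 5 = 16`.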
## Dimension 3: Task Hygiene (20 pts)

**Source**: Lines 146-180

```
addCalls    = entries where domain='tasks' AND operation='add' AND result.success=true
existsCalls = entries where domain='tasks' AND operation='exists'
subtaskAdds = addCalls where params.parent is truthy
```

**Starts at 20, deducts penalties:**

| Deduction | Condition |
|-----------|-----------|
| -5 per add | `tasks.add` succeeded but `params.description` is empty/missing |
| -3 | subtaskAdds > 0 AND existsCalls = 0 (no parent verification) |

**Evidence (no deduction):**
- `All N tasks.add calls had descriptions` — when no description violations
- `Parent existence verified before subtask creation` — subtask adds with an exists check

**Floor**: `Math.max(0, score)` — cannot go below 0

---

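The deduction model above can be sketched as follows (hypothetical function, mirroring the documented rules):

```python
def score_task_hygiene(entries):
    """Sketch of Dimension 3: start at 20, deduct for bad adds, floor at 0."""
    adds = [e for e in entries
            if e["domain"] == "tasks" and e["operation"] == "add"
            and e["result"]["success"]]
    exists_calls = [e for e in entries
                    if e["domain"] == "tasks" and e["operation"] == "exists"]
    score = 20
    for add in adds:
        if not add["params"].get("description"):
            score -= 5                    # empty/missing description
    subtask_adds = [a for a in adds if a["params"].get("parent")]
    if subtask_adds and not exists_calls:
        score -= 3                        # parent never verified
    return max(0, score)                  # floor at 0
```

Two description-less adds, one of them a subtask with no `tasks.exists` check, score 20 - 5 - 5 - 3 = 7, matching the anti-pattern example in the scenario playbook.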
## Dimension 4: Error Protocol (20 pts)

**Source**: Lines 182-215

```
notFoundErrors = entries where result.success=false AND result.exitCode=4
```

For each E_NOT_FOUND error at index `errIdx`:
```
nextEntries = sessionEntries.slice(errIdx+1, errIdx+5)   // the next 4 entries
hasRecovery = nextEntries contains (domain='tasks' AND operation IN ['find','exists'])
```

**Duplicate detection:**
```
creates    = entries where domain='tasks' AND operation='add' AND result.success=true
titles     = creates.map(e => e.params.title.toLowerCase().trim())
duplicates = titles.length - new Set(titles).size
```

**Starts at 20, deducts penalties:**

| Deduction | Condition |
|-----------|-----------|
| -5 per error | E_NOT_FOUND not followed by `tasks.find` or `tasks.exists` within the next 4 entries |
| -5 | Any duplicate task creates detected (title collision within the session) |

**Floor**: `Math.max(0, score)`

---

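Both Dimension 4 checks — the 4-entry recovery window and title-collision duplicate detection — can be sketched together (assumed names and shapes, not the real source):

```python
def score_error_protocol(entries):
    """Sketch of Dimension 4: penalize unrecovered E_NOT_FOUND and duplicate creates."""
    score = 20
    for i, e in enumerate(entries):
        if not e["result"]["success"] and e["result"]["exitCode"] == 4:
            window = entries[i + 1:i + 5]        # the next 4 entries
            recovered = any(n["domain"] == "tasks"
                            and n["operation"] in ("find", "exists")
                            for n in window)
            if not recovered:
                score -= 5
    creates = [e for e in entries
               if e["domain"] == "tasks" and e["operation"] == "add"
               and e["result"]["success"]]
    titles = [c["params"]["title"].lower().strip() for c in creates]
    if len(titles) != len(set(titles)):
        score -= 5                               # duplicate titles in session
    return max(0, score)
```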
## Dimension 5: Progressive Disclosure Use (20 pts)

**Source**: Lines 217-249

```
helpCalls = entries where:
    (domain='admin' AND operation='help')
    OR (domain='tools' AND operation IN ['skill.show','skill.list'])
    OR (domain='skills' AND operation IN ['list','show'])

mcpQueryCalls = entries where metadata.gateway = 'query'
```

| Points | Condition |
|--------|-----------|
| +10 | `helpCalls.length > 0` |
| +10 | `mcpQueryCalls.length > 0` |

**Flags on violation:**
- `No admin.help or skill lookup calls (load ct-cleo for guidance)`
- `No MCP query calls (prefer query over CLI for programmatic access)`

**Important**: The `metadata.gateway` field equals `'query'` for MCP query operations; CLI operations do not set this field. This is how the grade distinguishes MCP from CLI usage.

---

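The two Dimension 5 bonuses can be sketched in a few lines (hypothetical names; the help-call predicate mirrors the three documented domain/operation patterns):

```python
def score_disclosure_use(entries):
    """Sketch of Dimension 5: +10 for any help/skill lookup, +10 for any MCP query call."""
    def is_help(e):
        return ((e["domain"] == "admin" and e["operation"] == "help")
                or (e["domain"] == "tools" and e["operation"] in ("skill.show", "skill.list"))
                or (e["domain"] == "skills" and e["operation"] in ("list", "show")))
    has_help = any(is_help(e) for e in entries)
    # CLI ops never set metadata.gateway, so they cannot earn this bonus.
    has_query = any(e.get("metadata", {}).get("gateway") == "query" for e in entries)
    return (10 if has_help else 0) + (10 if has_query else 0)
```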
## Grade Letter Mapping

| Grade | Threshold | Score Range |
|-------|-----------|-------------|
| A | ≥90% | 90-100 |
| B | ≥75% | 75-89 |
| C | ≥60% | 60-74 |
| D | ≥45% | 45-59 |
| F | <45% | 0-44 |

---

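The letter mapping above reduces to a simple threshold walk (a sketch; the function name is assumed):

```python
def grade_letter(score):
    """Map a 0-100 total score to a letter per the thresholds above."""
    for letter, threshold in (("A", 90), ("B", 75), ("C", 60), ("D", 45)):
        if score >= threshold:
            return letter
    return "F"
```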
## GradeResult Schema

```typescript
interface GradeResult {
  sessionId: string;
  taskId?: string;
  totalScore: number;                    // 0-100
  maxScore: number;                      // 100
  dimensions: {
    sessionDiscipline: DimensionScore;   // score, max: 20, evidence[]
    discoveryEfficiency: DimensionScore;
    taskHygiene: DimensionScore;
    errorProtocol: DimensionScore;
    disclosureUse: DimensionScore;
  };
  flags: string[];
  timestamp: string;                     // ISO 8601
  entryCount: number;
  evaluator?: 'auto' | 'manual';
}
```

---

## Edge Cases

| Case | Behavior |
|------|----------|
| No audit entries | All scores 0, flag: `No audit entries found for session` |
| No task ops | S1 list check always passes (firstTaskTime=Infinity) |
| No discovery calls | S2 awards 10 baseline points |
| No task.add calls | S3 starts at 20 with no deductions |
| No errors | S4 starts at 20 with no deductions |
| GRADES.jsonl missing | readGrades() returns [] |
| Write failure | Silently ignored (best-effort persistence) |

---

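The last two edge-case rows (missing file reads as empty, writes are best-effort) can be sketched in Python; the function names echo `readGrades()` from the table, but these are illustrative stand-ins, not the real TypeScript implementation:

```python
import json
import os

def read_grades(path=".cleo/metrics/GRADES.jsonl"):
    # Missing file -> empty grade history (matches the edge-case table)
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def append_grade(result, path=".cleo/metrics/GRADES.jsonl"):
    # Best-effort persistence: write failures are silently ignored
    try:
        with open(path, "a") as f:
            f.write(json.dumps(result) + "\n")
    except OSError:
        pass
```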
## MCP vs CLI Detection in S5

The grading system detects MCP usage via `metadata.gateway === 'query'`. This means:
- **MCP interface**: all query operations set `metadata.gateway = 'query'` → S5 gets +10
- **CLI interface**: CLI operations do NOT set `metadata.gateway` → S5 loses +10
- **Mixed**: any single MCP query call is enough for the +10

This is why A/B tests between the MCP and CLI interfaces reliably show S5 differences.

---

## API Surface Update

- Canonical reads now live under `check.grade` and `check.grade.list`.
- `admin.grade` and `admin.grade.list` remain as compatibility handlers during the registry transition.
- Token telemetry should be read through `admin.token` with `action=summary|list|show` rather than inferred from the split legacy operations.
- Web clients should use `POST /api/query` and `POST /api/mutate`; default HTTP responses carry LAFS metadata in `X-Cleo-*` headers.
@@ -0,0 +1,234 @@
# Grade Scenario Playbook (Updated)

**Based on**: `docs/specs/GRADE-SCENARIO-PLAYBOOK.md` + the current session-grade.ts implementation
**Status**: Active — reflects the current rubric

Each scenario targets specific grade dimensions. Run via `agents/scenario-runner.md`.

Use **cleo-dev** (local dev build) or **cleo** (production) for MCP operations.
Use the MCP `query`/`mutate` gateway for MCP-interface runs and the `cleo-dev` CLI for CLI-interface runs.

---

## S1: Fresh Discovery

**Purpose**: Validates S1 (Session Discipline) and S2 (Discovery Efficiency).
**Target score**: ~90/100 MCP, ~80/100 CLI (S1 full, S2 full, S5 partial — no admin.help)

### Operation Sequence (MCP)

```
1. query session list                       — S1: must be first
2. query admin dash                         — project overview
3. query tasks find { "status": "active" }  — S2: find not list
4. query tasks show { "taskId": "T<any>" }  — S2: show used
5. mutate session end                       — S1: session.end
```

### Operation Sequence (CLI)

```bash
cleo-dev session list
cleo-dev dash
cleo-dev find --status active
cleo-dev show T<any>
cleo-dev session end
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first (+10), session.end present (+10) |
| S2 | 20/20 | find used exclusively (+15), show used (+5) |
| S3 | 20/20 | No task adds (no deductions) |
| S4 | 20/20 | No errors |
| S5 (MCP) | 10/20 | query gateway used (+10), no admin.help call |
| S5 (CLI) | 0/20 | No MCP query calls, no admin.help |

**MCP total: ~90/100 (A)**
**CLI total: ~80/100 (B)**

### Anti-pattern Variant (for testing grader sensitivity)

```
query tasks find { "status": "active" }   ← task op BEFORE session.list
query session list                        ← too late for S1
(no session.end)
```
Expected S1: 0 — flags: `session.list called after task ops`, `session.end never called`

---

## S2: Task Creation Hygiene

**Purpose**: Validates S3 (Task Hygiene) and S1.
**Target score**: ~70/100 MCP, ~60/100 CLI (S1 full, S3 full, S5 partial for MCP, 0 for CLI)

### Operation Sequence (MCP)

```
1. query session list                       — S1
2. query tasks exists { "taskId": "T100" }  — S3: parent verify
3. mutate tasks add { "title": "Implement auth",
     "description": "Add JWT authentication to API endpoints",
     "parent": "T100" }                     — S3: desc + parent
4. mutate tasks add { "title": "Write tests",
     "description": "Unit tests for auth module" } — S3: desc present
5. mutate session end                       — S1
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first, session.end present |
| S3 | 20/20 | All adds have descriptions, parent verified via exists |
| S5 (MCP) | 10/20 | query gateway used |
| S5 (CLI) | 0/20 | no MCP query, no help |

**MCP total: ~70/100 (C)**
**CLI total: ~60/100 (C)**

### Anti-pattern Variant

```
mutate tasks add { "title": "Implement auth", "parent": "T100" }   ← no desc, no exists check
mutate tasks add { "title": "Write tests" }                        ← no desc
```
Expected S3: 7 (20 - 5 - 5 - 3 = 7)

---

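The S3 arithmetic for the anti-pattern variant above can be checked directly against the rubric rules. This is a sketch — the entry dicts are hypothetical stand-ins for audit-log records, not the grader's actual data:

```python
# Two bad adds: both missing descriptions, one a subtask with no exists check
anti_pattern = [
    {"domain": "tasks", "operation": "add", "result": {"success": True},
     "params": {"title": "Implement auth", "parent": "T100"}},
    {"domain": "tasks", "operation": "add", "result": {"success": True},
     "params": {"title": "Write tests"}},
]

score = 20
# -5 per successful add with an empty/missing description
score -= sum(5 for e in anti_pattern
             if e["operation"] == "add" and not e["params"].get("description"))
# -3 when a subtask was added but tasks.exists was never called
subtask_added = any(e["params"].get("parent") for e in anti_pattern)
exists_called = any(e["operation"] == "exists" for e in anti_pattern)
if subtask_added and not exists_called:
    score -= 3

print(score)  # 7
```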
## S3: Error Recovery

**Purpose**: Validates S4 (Error Protocol).

### Operation Sequence (MCP)

```
1. query session list                       — S1
2. query tasks show { "taskId": "T99999" }  — triggers E_NOT_FOUND
3. query tasks find { "query": "T99999" }   — S4: recovery within 4 ops
4. mutate tasks add { "title": "New feature",
     "description": "Implement the feature that was not found" } — S3: desc present
5. mutate session end                       — S1
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | Proper session lifecycle |
| S3 | 20/20 | Task created with description |
| S4 | 20/20 | E_NOT_FOUND followed by a recovery lookup within 4 entries |
| S5 (MCP) | 10/20 | query gateway used |

**MCP total: ~90/100 (A)**

### Anti-pattern: Unrecovered Error

```
query tasks show { "taskId": "T99999" }     ← E_NOT_FOUND
mutate tasks add { "title": "Something else",
     "description": "Unrelated" }           ← no recovery lookup
```
S4 deduction: -5 (no tasks.find within the next 4 entries)

### Anti-pattern: Duplicate Creates

```
mutate tasks add { "title": "New feature", "description": "First attempt" }
mutate tasks add { "title": "New feature", "description": "Second attempt" }
```
S4 deduction: -5 (1 duplicate detected)

---

## S4: Full Lifecycle

**Purpose**: Validates all 5 dimensions. The gold-standard session.
**Target score**: 100/100 (A) for MCP, ~80/100 (B) for CLI

### Operation Sequence (MCP)

```
1. query session list                                           — S1
2. query admin help                                             — S5: progressive disclosure
3. query admin dash                                             — overview
4. query tasks find { "status": "pending" }                     — S2: find not list
5. query tasks show { "taskId": "T200" }                        — S2: show for detail
6. mutate tasks update { "taskId": "T200", "status": "active" } — begin work
   (agent does work here)
7. mutate tasks complete { "taskId": "T200" }                   — mark done
8. query tasks find { "status": "pending" }                     — check next
9. mutate session end { "note": "Completed T200" }              — S1
```

### Scoring Targets (MCP)

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first (+10), session.end present (+10) |
| S2 | 20/20 | find:list 100% (+15), show used (+5) |
| S3 | 20/20 | No adds — no deductions |
| S4 | 20/20 | No errors, no duplicates |
| S5 | 20/20 | admin.help (+10), query gateway (+10) |

**MCP total: 100/100 (A)**
**CLI total: ~80/100 (B)** — loses S5 entirely

---

## S5: Multi-Domain Analysis

**Purpose**: Validates cross-domain operations and advanced S5.
**Target score**: 100/100 (MCP), ~80/100 (CLI)

### Operation Sequence (MCP)

```
 1. query session list                                  — S1
 2. query admin help                                    — S5
 3. query tasks find { "parent": "T500" }               — S2: epic subtasks
 4. query tasks show { "taskId": "T501" }               — S2: inspect
 5. query session context.drift                         — multi-domain
 6. query session decision.log { "taskId": "T501" }     — decision history
 7. mutate session record.decision { "taskId": "T501",
      "decision": "Use adapter pattern",
      "rationale": "Decouples provider logic" }         — record decision
 8. mutate tasks update { "taskId": "T501", "status": "active" }
 9. mutate tasks complete { "taskId": "T501" }
10. query tasks find { "parent": "T500", "status": "pending" } — next subtask
11. mutate session end                                  — S1
```

### Scoring Targets

| Dim | Expected | Reason |
|-----|----------|--------|
| S1 | 20/20 | session.list first, session.end present |
| S2 | 20/20 | find used exclusively, show used |
| S3 | 20/20 | No task.add — no deductions |
| S4 | 20/20 | No errors |
| S5 | 20/20 | admin.help (+10), query gateway (+10) |

**MCP total: 100/100 (A)**

---

## Scenario Quick Reference

| Scenario | Primary Dims Tested | MCP Expected | CLI Expected |
|---|---|---|---|
| S1 | S1, S2 | ~90 (A) | ~80 (B) |
| S2 | S1, S3 | ~70 (C) | ~60 (C) |
| S3 | S1, S3, S4 | ~90 (A) | ~80 (B) |
| S4 | All 5 | 100 (A) | ~80 (B) |
| S5 | All 5, cross-domain | 100 (A) | ~80 (B) |

**Key insight**: the CLI interface consistently scores 0 on S5 (Progressive Disclosure) because:
1. CLI operations don't set `metadata.gateway = 'query'` (no +10)
2. a `cleo-dev admin help` CLI call is not detected as an `admin.help` MCP call (no +10)

This is by design — the rubric rewards MCP-first behavior.
@@ -0,0 +1,120 @@
# Token Tracking Reference

**Skill**: ct-grade
**Version**: 1.0.0
**Status**: APPROVED

---

## Overview

Token tracking matters for three reasons in ct-grade evaluations:

1. **Cost tracking**: Each A/B run consumes real tokens. Knowing the cost per run helps budget multi-scenario evaluations.
2. **MCP vs CLI comparison**: The primary value of ct-grade is comparing MCP efficiency against CLI. Token consumption is a direct measure of interface efficiency — fewer tokens for the same score means better efficiency.
3. **Score-per-token efficiency**: A session scoring 85/100 with 2,000 tokens outperforms one scoring 90/100 with 8,000 tokens on an efficiency basis. The eval-viewer surfaces this ratio as `score_per_1k_tokens`.

**Important constraint**: Claude Code does not expose per-call token counts during agent execution. There is no API to ask "how many tokens did this operation consume" in real time. Token counts arrive only via OpenTelemetry telemetry (if configured) or must be approximated from response payload size. This is why ct-grade uses a three-layer estimation system.

---

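The efficiency ratio from point 3 above is simple to compute (a sketch; the function name mirrors the viewer's `score_per_1k_tokens` label, but the exact rounding is assumed):

```python
def score_per_1k_tokens(score, tokens):
    # Grade points earned per 1,000 tokens spent
    return round(score / (tokens / 1000), 2)

# The worked example: the cheaper session wins on efficiency.
print(score_per_1k_tokens(85, 2000))   # 42.5
print(score_per_1k_tokens(90, 8000))   # 11.25
```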
## Three Estimation Methods

Token estimates are produced by one of three methods, ordered by confidence:

| Method | Confidence | When Available | How |
|--------|-----------|----------------|-----|
| OTel telemetry | REAL | When `CLAUDE_CODE_ENABLE_TELEMETRY=1` + OTel configured | Reads `~/.cleo/metrics/otel/*.jsonl`, field `claude_code.token.usage` |
| Response chars ÷ 4 | ESTIMATED | After A/B test runs | Counts response payload characters, divides by 4 (a common industry approximation) |
| Coarse op averages | COARSE | Always | Multiplies op counts by the `OP_TOKEN_AVERAGES` lookup table |

The eval-viewer labels every token figure with its confidence level so you know how to interpret the number.

---

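The middle row of the table — characters divided by 4 — amounts to a one-liner (a sketch with an assumed function name and return shape):

```python
def estimate_tokens_from_chars(payload: str) -> dict:
    # chars / 4 is the rough industry approximation for English/JSON text
    return {"tokens": len(payload) // 4, "confidence": "ESTIMATED"}
```

For example, a 400-character JSON response estimates to roughly 100 tokens.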
+ ## OTel Setup
36
+
37
+ OTel telemetry provides the most accurate token counts (REAL confidence). It requires a one-time shell setup.
38
+
39
+ ```bash
40
+ # One-time setup — add to ~/.bashrc or ~/.zshrc
41
+ source /path/to/.cleo/setup-otel.sh
42
+
43
+ # What the script sets:
44
+ export CLAUDE_CODE_ENABLE_TELEMETRY=1
45
+ export OTEL_METRICS_EXPORTER=otlp
46
+ export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
47
+ export OTEL_EXPORTER_OTLP_ENDPOINT="file://${HOME}/.cleo/metrics/otel/"
48
+ ```
49
+
50
+ After sourcing, restart your shell or run `source ~/.bashrc` (or `~/.zshrc`).
51
+
52
+ Once configured, Claude Code writes session token metrics to `~/.cleo/metrics/otel/` as JSONL files. The ct-grade analysis scripts read these files and match them to run sessions by timestamp overlap. The relevant field is `claude_code.token.usage` which contains `input`, `output`, and `cache_read` sub-fields.
53
+
54
+ **Verification**: After a graded session, check that files exist under `~/.cleo/metrics/otel/`. If the directory is empty, telemetry is not active for your current shell session.
55
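
Aggregating the JSONL files could look roughly like this. The record shape used here (a `metric` name field plus token sub-fields) is an assumption for illustration; the real export layout depends on the OTLP file exporter:

```python
import glob
import json
import os

def parse_otel_lines(lines):
    """Sum input/output/cache_read across claude_code.token.usage records."""
    totals = {"input": 0, "output": 0, "cache_read": 0}
    for line in lines:
        record = json.loads(line)
        if record.get("metric") == "claude_code.token.usage":
            for key in totals:
                totals[key] += record.get(key, 0)
    return totals

def sum_token_usage(otel_dir):
    """Aggregate every *.jsonl file under the OTel metrics directory."""
    totals = {"input": 0, "output": 0, "cache_read": 0}
    for path in glob.glob(os.path.join(otel_dir, "*.jsonl")):
        with open(path) as fh:
            part = parse_otel_lines(fh)
            for key in totals:
                totals[key] += part[key]
    return totals
```

A real matcher would additionally filter records to the run's time window before summing, per the timestamp-overlap matching described above.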

---

## Per-Operation Token Budget Table

The coarse estimation layer uses the following lookup table (`OP_TOKEN_AVERAGES`). These averages were measured across real CLEO sessions and are used when neither OTel nor char-counting is available.

| Operation | Estimated Tokens | Notes |
|-----------|-----------------|-------|
| tasks.find | ~750 | Depends on result count |
| tasks.list | ~3,000 | Heavy — prefer tasks.find |
| tasks.show | ~600 | Single task with full details |
| tasks.exists | ~300 | Boolean + minimal data |
| tasks.tree | ~800 | Hierarchy view |
| tasks.plan | ~900 | Next task recommendations |
| session.status | ~350 | Quick status check |
| session.list | ~400 | Session list |
| session.briefing.show | ~500 | Handoff briefing |
| admin.dash | ~500 | Project overview |
| admin.help | ~800 | Full operation reference |
| admin.health | ~300 | Health check |
| admin.stats | ~600 | Statistics summary |
| memory.find | ~600 | Search results |
| memory.timeline | ~500 | Timeline entries |
| tools.skill.list | ~400 | Skill manifest |
| tools.skill.show | ~350 | Single skill details |

These figures are averages. Actual token counts vary based on the number of results returned, note field length, and payload verbosity. The coarse method is accurate within ±50% and is only used as a last resort when better data is unavailable.
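
A worked example of the coarse arithmetic, using a subset of the table above (the session mix shown is hypothetical):

```python
# Subset of the OP_TOKEN_AVERAGES table above (measured averages).
OP_TOKEN_AVERAGES = {
    "tasks.find": 750,
    "tasks.list": 3000,
    "tasks.show": 600,
    "session.status": 350,
    "admin.dash": 500,
}

def coarse_estimate(op_counts, default=500):
    """Multiply each operation count by its table average (COARSE, ±50%)."""
    return sum(OP_TOKEN_AVERAGES.get(op, default) * n for op, n in op_counts.items())

# A session with 3 finds, 1 list, and 2 status checks:
session = {"tasks.find": 3, "tasks.list": 1, "session.status": 2}
print(coarse_estimate(session))  # 3*750 + 3000 + 2*350 = 5950
```

Note how a single `tasks.list` dominates the budget, which is why the table recommends `tasks.find` instead.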

---

## Confidence Labels

Every token figure in the eval-viewer is annotated with one of three confidence labels:

| Label | Source | Accuracy |
|-------|--------|----------|
| `REAL` | OTel telemetry (`claude_code.token.usage`) | Exact — from Claude Code instrumentation |
| `ESTIMATED` | Response chars ÷ 4 | ±20% — good for JSON payloads |
| `COARSE` | Operation count × `OP_TOKEN_AVERAGES` | ±50% — fallback only |

When reading eval-viewer reports, treat REAL figures as authoritative, ESTIMATED figures as directionally accurate, and COARSE figures as rough order-of-magnitude only.

**Recommendation**: Enable OTel telemetry before running multi-scenario or multi-run evaluations. The additional setup is minimal, and the REAL confidence data significantly improves the reliability of MCP vs CLI efficiency comparisons.

---

## Chars ÷ 4 Rationale

The chars ÷ 4 approximation is applied to response payload character counts when OTel data is unavailable but operation responses were captured in `operations.jsonl`.

This approximation matches CLEO's own `src/core/metrics/token-estimation.ts` and is the same approximation used by OpenAI and Anthropic in their documentation. It is accurate within ±20% for JSON payloads.

The approximation works because English text and JSON structure average roughly 4 characters per token across typical LLM tokenizers (cl100k_base, o200k_base). JSON keys, punctuation, and whitespace are slightly more token-dense than prose, but the ±20% margin accounts for this variance.

For ct-grade specifically, both arms of an A/B test experience the same approximation error, so relative comparisons between MCP and CLI remain valid even when absolute counts are slightly off.
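
The cancellation argument can be made concrete with a toy calculation (the payload sizes below are hypothetical, not measured values):

```python
def estimate_tokens(payload: str) -> int:
    """Chars ÷ 4 approximation (ESTIMATED confidence, roughly ±20%)."""
    return len(payload) // 4

# Hypothetical response payloads for the two arms of an A/B run:
mcp_payload = "x" * 8000
cli_payload = "x" * 12000

# Even if chars/4 over- or under-counts, both arms share the same factor,
# so the MCP : CLI ratio matches the ratio of the true token counts.
ratio = estimate_tokens(mcp_payload) / estimate_tokens(cli_payload)
print(round(ratio, 3))  # 0.667
```

Any constant multiplicative bias divides out of the ratio, which is why COARSE or ESTIMATED figures still support valid relative comparisons.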

---

## References

- `src/core/metrics/token-estimation.ts` — CLEO's token estimation implementation
- `docs/specs/CLEO-METRICS-VALIDATION-SYSTEM-SPEC.md` — Metrics system specification
- `.cleo/setup-otel.sh` — OTel environment setup script
- `packages/ct-skills/skills/ct-grade/scripts/token_tracker.py` — Token aggregation script
- `packages/ct-skills/skills/ct-grade/scripts/generate_report.py` — Report generator (uses confidence labels)