@cleocode/skills 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dispatch-config.json +404 -0
- package/index.d.ts +178 -0
- package/index.js +405 -0
- package/package.json +14 -0
- package/profiles/core.json +7 -0
- package/profiles/full.json +10 -0
- package/profiles/minimal.json +7 -0
- package/profiles/recommended.json +7 -0
- package/provider-skills-map.json +97 -0
- package/skills/_shared/cleo-style-guide.md +84 -0
- package/skills/_shared/manifest-operations.md +810 -0
- package/skills/_shared/placeholders.json +433 -0
- package/skills/_shared/skill-chaining-patterns.md +237 -0
- package/skills/_shared/subagent-protocol-base.md +223 -0
- package/skills/_shared/task-system-integration.md +232 -0
- package/skills/_shared/testing-framework-config.md +110 -0
- package/skills/ct-cleo/SKILL.md +490 -0
- package/skills/ct-cleo/references/anti-patterns.md +19 -0
- package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
- package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
- package/skills/ct-cleo/references/session-protocol.md +162 -0
- package/skills/ct-codebase-mapper/SKILL.md +82 -0
- package/skills/ct-contribution/SKILL.md +521 -0
- package/skills/ct-contribution/templates/contribution-init.json +21 -0
- package/skills/ct-dev-workflow/SKILL.md +423 -0
- package/skills/ct-docs-lookup/SKILL.md +66 -0
- package/skills/ct-docs-review/SKILL.md +175 -0
- package/skills/ct-docs-write/SKILL.md +108 -0
- package/skills/ct-documentor/SKILL.md +231 -0
- package/skills/ct-epic-architect/SKILL.md +305 -0
- package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
- package/skills/ct-epic-architect/references/commands.md +201 -0
- package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
- package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
- package/skills/ct-epic-architect/references/output-format.md +92 -0
- package/skills/ct-epic-architect/references/patterns.md +284 -0
- package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
- package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
- package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
- package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
- package/skills/ct-grade/SKILL.md +230 -0
- package/skills/ct-grade/agents/analysis-reporter.md +203 -0
- package/skills/ct-grade/agents/blind-comparator.md +157 -0
- package/skills/ct-grade/agents/scenario-runner.md +134 -0
- package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
- package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
- package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
- package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
- package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
- package/skills/ct-grade/eval-viewer/viewer.html +219 -0
- package/skills/ct-grade/evals/evals.json +94 -0
- package/skills/ct-grade/references/ab-test-methodology.md +150 -0
- package/skills/ct-grade/references/domains.md +137 -0
- package/skills/ct-grade/references/grade-spec.md +236 -0
- package/skills/ct-grade/references/scenario-playbook.md +234 -0
- package/skills/ct-grade/references/token-tracking.md +120 -0
- package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
- package/skills/ct-grade/scripts/generate_report.py +283 -0
- package/skills/ct-grade/scripts/run_ab_test.py +504 -0
- package/skills/ct-grade/scripts/run_all.py +287 -0
- package/skills/ct-grade/scripts/setup_run.py +183 -0
- package/skills/ct-grade/scripts/token_tracker.py +630 -0
- package/skills/ct-grade-v2-1/SKILL.md +237 -0
- package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
- package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
- package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
- package/skills/ct-grade-v2-1/evals/evals.json +74 -0
- package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
- package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
- package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
- package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
- package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
- package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
- package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
- package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
- package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
- package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
- package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
- package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
- package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
- package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
- package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
- package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
- package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
- package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
- package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
- package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
- package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
- package/skills/ct-memory/SKILL.md +84 -0
- package/skills/ct-orchestrator/INSTALL.md +61 -0
- package/skills/ct-orchestrator/README.md +69 -0
- package/skills/ct-orchestrator/SKILL.md +380 -0
- package/skills/ct-orchestrator/manifest-entry.json +19 -0
- package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
- package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
- package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
- package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
- package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
- package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
- package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
- package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
- package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
- package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
- package/skills/ct-research-agent/SKILL.md +226 -0
- package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
- package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
- package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
- package/skills/ct-skill-creator/SKILL.md +356 -0
- package/skills/ct-skill-creator/agents/analyzer.md +276 -0
- package/skills/ct-skill-creator/agents/comparator.md +204 -0
- package/skills/ct-skill-creator/agents/grader.md +225 -0
- package/skills/ct-skill-creator/assets/eval_review.html +146 -0
- package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
- package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
- package/skills/ct-skill-creator/manifest-entry.json +17 -0
- package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
- package/skills/ct-skill-creator/references/frontmatter.md +83 -0
- package/skills/ct-skill-creator/references/invocation-control.md +165 -0
- package/skills/ct-skill-creator/references/output-patterns.md +86 -0
- package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
- package/skills/ct-skill-creator/references/schemas.md +430 -0
- package/skills/ct-skill-creator/references/workflows.md +28 -0
- package/skills/ct-skill-creator/scripts/__init__.py +1 -0
- package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
- package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
- package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
- package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
- package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
- package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
- package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
- package/skills/ct-skill-creator/scripts/utils.py +47 -0
- package/skills/ct-skill-validator/SKILL.md +178 -0
- package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
- package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
- package/skills/ct-skill-validator/evals/eval_set.json +14 -0
- package/skills/ct-skill-validator/evals/evals.json +52 -0
- package/skills/ct-skill-validator/manifest-entry.json +20 -0
- package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
- package/skills/ct-skill-validator/references/validation-rules.md +168 -0
- package/skills/ct-skill-validator/scripts/__init__.py +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
- package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
- package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
- package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
- package/skills/ct-skill-validator/scripts/validate.py +422 -0
- package/skills/ct-spec-writer/SKILL.md +189 -0
- package/skills/ct-stickynote/README.md +14 -0
- package/skills/ct-stickynote/SKILL.md +46 -0
- package/skills/ct-task-executor/SKILL.md +296 -0
- package/skills/ct-validator/SKILL.md +216 -0
- package/skills/manifest.json +469 -0
- package/skills.json +281 -0
@@ -0,0 +1,233 @@
# Blind A/B Testing Protocol

Methodology for blind comparison of MCP vs CLI interface usage in CLEO.

---

## Agent-Based Execution (Canonical)

The canonical A/B approach uses Claude Code Agents to run scenarios end-to-end via the live MCP/CLI interfaces. This avoids subprocess initialization issues and captures real token data from task notifications.

### Execution Flow

1. Run `python scripts/setup_run.py` to create the run structure and print the execution plan
2. Follow the plan: spawn scenario-runner agents in parallel (arm-A MCP, arm-B CLI)
3. Immediately capture `total_tokens` from each task notification → `timing.json`
4. Spawn the blind-comparator agent after both arms complete
5. Run `python scripts/token_tracker.py --run-dir <dir>` to aggregate tokens
6. Run `python scripts/generate_report.py --run-dir <dir>` for the final report
### Token Data from Task Notifications

```python
# After EACH agent task completes, fill timing.json immediately:
timing = {
    "total_tokens": task.total_tokens,  # EPHEMERAL — capture now or lose it
    "duration_ms": task.duration_ms,
    "arm": "arm-A",
    "interface": "mcp",
    "scenario": "s4",
    "run": 1,
}
```

Token data priority (highest first):

1. `total_tokens` from the Claude Code Agent task notification (canonical)
2. OTel `claude_code.token.usage` (when `CLAUDE_CODE_ENABLE_TELEMETRY=1`)
3. `output_chars / 3.5` (JSON response estimate)
4. `entryCount × 150` (coarse proxy from GRADES.jsonl)
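The fallback chain above can be sketched as a small helper. This is an illustrative re-derivation of the documented priority order, not the scripts' actual API; the parameter names are assumptions:

```python
def estimate_tokens(total_tokens=None, otel_usage=None,
                    output_chars=None, entry_count=None):
    """Resolve a token count using the documented priority order."""
    if total_tokens is not None:        # 1. task notification (canonical)
        return total_tokens, "task_notification"
    if otel_usage is not None:          # 2. OTel token.usage counter
        return otel_usage, "otel"
    if output_chars is not None:        # 3. chars / 3.5 estimate
        return round(output_chars / 3.5), "output_chars"
    if entry_count is not None:         # 4. coarse audit-entry proxy
        return entry_count * 150, "entry_count"
    return 0, "none"
```

The returned method label makes it easy to record which rung of the ladder produced each figure.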
---

## Subprocess-Based Execution (Fallback)

For automated testing without agent delegation, use `run_ab_test.py`. This invokes CLEO via subprocess and requires a migrated `tasks.db`.

---

## What We're Testing

| Side | Interface | Mechanism |
|------|-----------|-----------|
| **A** (MCP) | JSON-RPC via stdio to the CLEO MCP server | `node dist/mcp/index.js` with JSON-RPC messages |
| **B** (CLI) | Shell commands via subprocess | `cleo-dev <domain> <operation> [params]` |

Both sides call the same underlying `src/dispatch/` layer. The A/B test isolates:

- **Output format differences** — MCP returns structured JSON envelopes; the CLI may add ANSI codes/formatting
- **Response size** — character counts as a token proxy
- **Latency** — wall-clock time per operation
- **Data equivalence** — do they return the same logical data?

Blind assignment means the comparator does not know which result came from MCP vs CLI when producing the quality verdict.
---

## Test Structure

```
ab-results/
  <timestamp>/
    meta.json           -- test parameters, domain, operations, runs
    run-001/
      side-a/
        request.json    -- what was sent
        response.json   -- raw response
        metrics.json    -- output_chars, duration_ms, success
      side-b/
        request.json
        response.json
        metrics.json
      comparison.json   -- blind comparator output (winner: A|B|TIE)
    run-002/
      ...
    summary.json        -- aggregated stats across all runs
    report.md           -- human-readable comparative analysis
```
---

## Blind Assignment

The `run_ab_test.py` script randomly shuffles which side gets labeled "A" vs "B" for each run. The comparator agent sees only:

- Output labeled "A" (could be MCP or CLI)
- Output labeled "B" (could be MCP or CLI)
- The original request prompt

`meta.json` records the true identity (`a_is_mcp: true|false`) per run. `generate_report.py` de-blinds after all comparisons are done.
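A minimal sketch of the shuffle-and-de-blind bookkeeping, assuming the `a_is_mcp` flag described above (a simplification of what `run_ab_test.py` does, not its actual code):

```python
import random

def assign_labels(rng=None):
    """Randomly decide whether the MCP result is labeled 'A' for this run."""
    rng = rng or random.Random()
    return {"a_is_mcp": rng.random() < 0.5}

def deblind(meta, winner_label):
    """Map a blind verdict ('A' | 'B' | 'TIE') back to mcp/cli/tie."""
    if winner_label == "TIE":
        return "tie"
    a_is_mcp = meta["a_is_mcp"]
    if winner_label == "A":
        return "mcp" if a_is_mcp else "cli"
    return "cli" if a_is_mcp else "mcp"
```

Keeping the mapping in one place means the comparator never needs to see `meta.json` at all.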
---

## Metrics Captured Per Run

| Metric | How captured |
|--------|--------------|
| `output_chars` | `len(response_json_str)` |
| `estimated_tokens` | `output_chars / 4` (approximation) |
| `duration_ms` | wall clock from subprocess start to end |
| `success` | `response.success === true` (MCP) or exit code 0 (CLI) |
| `data_equivalent` | key fields compared between the A and B responses |
---

## Data Equivalence Check

For each operation, define "equivalent" as the key response fields matching:

```python
EQUIVALENCE_FIELDS = {
    "tasks.find": ["data.tasks[].id", "data.total"],
    "tasks.show": ["data.id", "data.status", "data.title"],
    "tasks.list": ["data.tasks[].id"],
    "session.list": ["data.sessions[].id"],
    "session.status": ["data.currentSession.id", "data.hasActiveSession"],
    "admin.dash": ["data.stats.total", "data.stats.active"],
    "admin.health": ["data.healthy"],
    "admin.stats": ["data.totalTasks"],
}
```

Equivalence is checked before the blind comparison so that data divergence is flagged independently of the quality judgment.
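One way to resolve those dotted paths (including the `[]` list wildcard) against a parsed response. This extractor is a sketch under assumed path semantics, not the shipped comparator:

```python
def extract(obj, path):
    """Resolve a dotted path like 'data.tasks[].id' against nested dicts/lists."""
    current = [obj]
    for part in path.split("."):
        nxt = []
        for node in current:
            if part.endswith("[]"):
                # fan out over the list stored under this key
                items = node.get(part[:-2], []) if isinstance(node, dict) else []
                nxt.extend(items)
            elif isinstance(node, dict):
                nxt.append(node.get(part))
        current = nxt
    return current

def equivalent(resp_a, resp_b, fields):
    """True when every listed field resolves to the same values in both responses."""
    return all(extract(resp_a, f) == extract(resp_b, f) for f in fields)
```

With this shape, `equivalent(a, b, EQUIVALENCE_FIELDS["tasks.find"])` compares only IDs and totals, ignoring envelope differences.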
---

## Statistical Analysis

After N runs, `generate_report.py` computes:

```json
{
  "wins": { "mcp": 0, "cli": 0, "tie": 0 },
  "win_rate": { "mcp": 0.0, "cli": 0.0 },
  "token_delta": {
    "mean_mcp_chars": 0,
    "mean_cli_chars": 0,
    "delta_chars": 0,
    "delta_pct": "+0%"
  },
  "latency_delta": {
    "mean_mcp_ms": 0,
    "mean_cli_ms": 0,
    "delta_ms": 0
  },
  "data_equivalence_rate": 1.0,
  "per_operation": { ... }
}
```

**Recommended minimum runs:** 3 per operation for trend detection, 10+ for statistical confidence.
---

## Comparator Rubric

The blind comparator evaluates each side on:

| Criterion | Description |
|-----------|-------------|
| **Completeness** | Does the response contain all expected fields? |
| **Structure** | Is the response well-formed JSON? Clean envelope? |
| **Usability** | Can an agent consume this without post-processing? |
| **Verbosity** | Lower is better — same data, fewer chars = more efficient |

Rubric scores are 1–5 per criterion. The winner is the side with the higher weighted total.
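The weighted-total decision can be sketched as below. The spec does not state the weights, so equal weights are assumed here purely for illustration:

```python
def rubric_winner(scores_a, scores_b, weights=None):
    """Pick 'A', 'B', or 'TIE' from per-criterion 1-5 scores.

    weights: optional per-criterion multipliers; equal weights assumed by default.
    """
    criteria = ("completeness", "structure", "usability", "verbosity")
    weights = weights or {c: 1.0 for c in criteria}
    total_a = sum(scores_a[c] * weights[c] for c in criteria)
    total_b = sum(scores_b[c] * weights[c] for c in criteria)
    if total_a == total_b:
        return "TIE"
    return "A" if total_a > total_b else "B"
```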
---

## MCP Server Invocation Details

The `run_ab_test.py` script calls the CLEO MCP server via stdio JSON-RPC:

```python
# Protocol sequence
# 1. Send initialize
# 2. Send tools/call (query or mutate)
# 3. Read response lines until the tool result is found
# 4. Terminate the process

MCP_INIT = {
    "jsonrpc": "2.0", "id": 0, "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "ct-grade-ab-test", "version": "2.1.0"}
    }
}

MCP_CALL = {
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {
        "name": "query",  # or "mutate"
        "arguments": {
            "domain": "<domain>",
            "operation": "<operation>",
            "params": {}
        }
    }
}
```

**CLI equivalent:**

```bash
cleo-dev <domain> <operation> [args] --json
```
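The stdio transport is newline-delimited JSON, so the message framing and result scan reduce to a few lines. A sketch of that plumbing only (the actual server spawn via `node dist/mcp/index.js` is omitted):

```python
import json

def frame(messages):
    """Serialize JSON-RPC messages for a newline-delimited stdio transport."""
    return "".join(json.dumps(m) + "\n" for m in messages)

def find_result(stdout_text, call_id):
    """Scan stdout lines for the JSON-RPC result matching the call id."""
    for line in stdout_text.splitlines():
        try:
            msg = json.loads(line)
        except ValueError:
            continue  # skip non-JSON server log lines
        if isinstance(msg, dict) and msg.get("id") == call_id and "result" in msg:
            return msg["result"]
    return None
```

In the full flow, `frame([MCP_INIT, MCP_CALL])` is written to the server's stdin and `find_result(stdout, 1)` pulls out the tool result.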
---

## Interpreting Results

| Outcome | Meaning | Action |
|---------|---------|--------|
| MCP wins consistently | MCP output is cleaner/more complete | Recommend MCP-first in agent protocols |
| CLI wins consistently | CLI output is more complete or parseable | Investigate MCP envelope overhead |
| Tie | Both equivalent | Focus on latency and token cost |
| MCP tokens > CLI tokens | MCP envelope adds overhead | Quantify and document in CLEO-GRADE-SPEC |
| Data divergence detected | MCP and CLI return different data | File a bug — results should be consistent at the dispatch level |

---

## Parity Scenarios

The P1–P3 parity scenarios (see playbook-v2.md) run a curated set of operations specifically chosen to stress:

- **P1**: tasks domain — high-frequency agent operations
- **P2**: session domain — lifecycle operations agents use at session start/end
- **P3**: admin domain — help, dash, health (the first calls in any session)

@@ -0,0 +1,156 @@
# CLEO Domains SSoT

The 10 canonical domains for A/B test construction and grade analysis.
Source: `docs/specs/CLEO-OPERATION-CONSTITUTION.md` + `src/dispatch/registry.ts`.

---

## Domain Summary

| Domain | Gateway | Tier-0 ops | Key purpose |
|--------|---------|------------|-------------|
| `tasks` | query+mutate | show, list, find, exists, tree, add, update, complete, cancel, delete | Task CRUD, hierarchy, deps |
| `session` | query+mutate | status, list, show, history, decision.log, start, end, resume, gc | Session lifecycle |
| `memory` | query+mutate | (tier 1+) show, find, timeline, fetch, observe | Cognitive memory (brain.db) |
| `check` | query+mutate | schema, protocol, task, manifest, test.run | Validation and compliance |
| `pipeline` | query+mutate | stage.validate, stage.status, manifest.*, release.* | RCASD-IVTR+C lifecycle, releases |
| `orchestrate` | query+mutate | status, next, ready, waves, spawn, spawn.execute | Multi-agent coordination |
| `tools` | query+mutate | skill.list, skill.show, skill.find, provider.list, issue.add.bug | Skills, providers |
| `admin` | query+mutate | version, health, dash, help, stats, grade, grade.list | Config, diagnostics |
| `nexus` | query+mutate | (tier 2) status, list, show, register, sync | Cross-project coordination |
| `sticky` | query+mutate | list, show, add, convert, archive, purge | Quick capture notes |

---

## Tier-0 Operations (A/B test defaults)

These operations are available without progressive disclosure. Use them as the default test set.
### tasks (17 query + 15 mutate)

**Query (tier 0):**

- `show` — single task details
- `list` — tasks with filters (HEAVY — test against `find`)
- `find` — search tasks (LIGHTWEIGHT — preferred)
- `exists` — check that a task ID exists
- `tree` — hierarchy tree
- `blockers` — blocking deps
- `depends` — dependency graph
- `analyze` — task metrics
- `next` — suggest the next task
- `plan` — composite planning view
- `relates` — related tasks
- `current` — currently active task

**Mutate (tier 0):**

- `add` — create task
- `update` — modify task
- `complete` — mark done
- `cancel` — cancel task
- `delete` — permanently remove
- `archive` — soft delete
- `restore` — restore from a terminal state
- `start` — begin working
- `stop` — stop working
### session (11 query + 8 mutate)

**Query (tier 0):**

- `status` — current session status
- `list` — list sessions
- `show` — session details
- `history` — session history
- `decision.log` — decision log
- `context.drift` — detect drift
- `handoff.show` — handoff data
- `briefing.show` — session-start context
- `find` — lightweight session discovery

**Mutate (tier 0):**

- `start` — begin a new session
- `end` — end the current session
- `resume` — resume a suspended session
- `suspend` — suspend without ending
- `gc` — garbage-collect stale sessions
- `record.decision` — record a decision
- `record.assumption` — record an assumption
### admin (tier-0 subset)

**Query:**

- `version` — CLEO version
- `health` — system health
- `config.show` — configuration
- `stats` — project statistics
- `context` — project context
- `runtime` — runtime info
- `dash` — dashboard overview
- `log` — audit log
- `help` — progressive disclosure entry point
- `doctor` — health-check diagnostics

**Mutate:**

- `init` — initialize CLEO
- `config.set` — set config
- `backup` — create a backup
- `sync` — synchronize data stores
- `cleanup` — clean stale data
- `fix` — auto-fix doctor checks
- `detect` — refresh project-context.json

### tools (tier-0 subset)

**Query:**

- `skill.list` — list installed skills
- `skill.show` — skill details
- `skill.find` — search skills
- `skill.dispatch` — dispatch execution
- `skill.verify` — verify a skill
- `provider.list` — list providers
- `provider.detect` — detect providers

**Mutate:**

- `skill.install` — install a skill
- `skill.enable` / `skill.disable` — toggle
- `skill.configure` — configure params
- `skill.refresh` — refresh the catalog
- `provider.inject` — inject provider config
---

## For A/B Testing

### Recommended test operation sets

**Fast smoke test (5 ops):**

```
tasks.find, tasks.show, session.status, admin.dash, admin.health
```

**Standard parity test (15 ops):**

```
tasks.find, tasks.show, tasks.list, tasks.tree, tasks.plan,
session.status, session.list, session.briefing.show,
admin.dash, admin.health, admin.help, admin.stats,
tools.skill.list, tools.provider.list, admin.doctor
```

**Full tier-0 sweep (all tier-0 query ops across all domains):**
Use the `--tier 0 --gateway query` flags in `run_ab_test.py`.
---

## Known Token Cost Ranking

Ordered by typical output size (most expensive first):

1. `tasks.list` (no filter) — AVOID in agents; use `tasks.find`
2. `admin.help --tier 2` — large operation catalog
3. `memory.find` — FTS5 results
4. `tasks.plan` — composite view
5. `admin.dash` — multi-domain overview
6. `admin.doctor` — comprehensive health
7. `tasks.tree` — hierarchy visualization
8. `session.history` — session log
9. `tasks.find` (10 results) — standard discovery
10. `admin.stats` — aggregate counts

@@ -0,0 +1,167 @@
# CLEO Grade Specification v2

Updated for CLEO v2026.3+ with 10 canonical domains and 262 operations.
Source of truth: `src/core/sessions/session-grade.ts` + `docs/specs/CLEO-GRADE-SPEC.md`.

---

## Rubric: 5 Dimensions (100 pts max)

### S1: Session Discipline (20 pts)

Measures whether the agent checks existing sessions before starting work and properly ends sessions.

| Points | Condition | Evidence string |
|--------|-----------|-----------------|
| +10 | `session.list` called before the first `tasks.*` operation | `session.list called before first task op` |
| +10 | `session.end` called at least once | `session.end called` |

**Flags on violation:**

- `session.list never called (check existing sessions before starting)`
- `session.list called after task ops (should check sessions first)`
- `session.end never called (always end sessions when done)`

**Scoring:** Starts at 0. Range: 0–20.
---

### S2: Discovery Efficiency (20 pts)

Measures whether the agent uses `tasks.find` (lightweight, minimal fields) over `tasks.list` (heavy, full notes arrays).

| Points | Condition | Evidence string |
|--------|-----------|-----------------|
| +15 | `find / (find + list)` ratio >= 80% | `find:list ratio N% >= 80%` |
| partial | Proportional if ratio < 80%: `round(15 * ratio)` | — |
| +10 | Zero discovery calls (benefit of the doubt) | `No discovery calls needed` |
| +5 | `tasks.show` used at least once | `tasks.show used Nx for detail` |

**Flags:** `tasks.list used Nx (prefer tasks.find for discovery)`

**Scoring:** Capped at 20. Range: 0–20.
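The S2 arithmetic in the table reduces to a few lines. This is a re-derivation from the spec above, not the code in `session-grade.ts`:

```python
def score_s2(find_calls, list_calls, show_calls):
    """Discovery-efficiency score per the S2 table; capped at 20."""
    total = find_calls + list_calls
    if total == 0:
        score = 10                        # no discovery calls: benefit of the doubt
    else:
        ratio = find_calls / total
        score = 15 if ratio >= 0.8 else round(15 * ratio)
    if show_calls > 0:
        score += 5                        # tasks.show used for detail
    return min(score, 20)
```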
---

### S3: Task Hygiene (20 pts)

Measures whether tasks are created with proper descriptions and whether subtask parents are verified.

| Points | Condition | Evidence string |
|--------|-----------|-----------------|
| -5 each | `tasks.add` succeeded without a description | flag per violation |
| -3 | Subtasks created (with `parent` param) but no preceding `tasks.exists` | `Subtasks created without tasks.exists parent check` |
| (none) | All adds have descriptions | `All N tasks.add calls had descriptions` |
| (none) | Subtasks preceded by `tasks.exists` | `Parent existence verified before subtask creation` |

**Flags:**

- `tasks.add without description (taskId: <id>)`
- `Subtasks created without tasks.exists parent check`

**Scoring:** Starts at 20, deducting penalties. Floor: 0.
---

### S4: Error Protocol (20 pts)

Measures whether the agent recovers from `E_NOT_FOUND` (exit code 4) and avoids duplicate creates.

| Points | Condition | Evidence string |
|--------|-----------|-----------------|
| -5 each | `E_NOT_FOUND` not followed by `tasks.find` or `tasks.exists` within the next 4 entries | flag per violation |
| -5 | Duplicate task creates (same title, case-insensitive) in the session | `N potentially duplicate task create(s) detected` |
| (none) | Error followed by recovery | `E_NOT_FOUND followed by recovery lookup` |
| (none) | No violations | `No error protocol violations` |

**Recovery window:** Checks `entries[errIdx+1 : errIdx+5]` for `tasks.find` or `tasks.exists`.

**Duplicate detection:** Compares the lowercased, trimmed titles of all successful `tasks.add` calls.

**Scoring:** Starts at 20, deducting penalties. Floor: 0.
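The recovery-window check follows directly from the slice described above. Audit entries are modeled here as plain dicts with assumed `op`/`error` keys, which is an illustrative shape rather than the real audit schema:

```python
RECOVERY_OPS = {"tasks.find", "tasks.exists"}

def unrecovered_errors(entries):
    """Return indices of E_NOT_FOUND entries with no recovery lookup
    among the next four audit entries."""
    violations = []
    for i, entry in enumerate(entries):
        if entry.get("error") != "E_NOT_FOUND":
            continue
        window = entries[i + 1 : i + 5]   # entries[errIdx+1 : errIdx+5]
        if not any(e.get("op") in RECOVERY_OPS for e in window):
            violations.append(i)
    return violations
```

Each returned index costs 5 points under the S4 table.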
---

### S5: Progressive Disclosure Use (20 pts)

Measures whether the agent uses CLEO's progressive disclosure system and the MCP query gateway.

| Points | Condition | Evidence string |
|--------|-----------|-----------------|
| +10 | At least one help/skill call: `admin.help`, `tools.skill.show`, `tools.skill.list`, `tools.skill.find` | `Progressive disclosure used (Nx)` |
| +10 | At least one MCP query gateway call (`metadata.gateway === "query"`) | `query (MCP) used Nx` |

**Flags:**

- `No admin.help or skill lookup calls (load ct-cleo for guidance)`
- `No MCP query calls (prefer query over CLI for programmatic access)`

**Scoring:** Starts at 0. Range: 0–20.
---

## Grade Letter Mapping

| Grade | Threshold | Profile |
|-------|-----------|---------|
| A | >= 90% | All dimensions near max; zero or minimal flags |
| B | >= 75% | Minor violations in one or two dimensions |
| C | >= 60% | Several protocol gaps |
| D | >= 45% | Multiple anti-patterns |
| F | < 45% | Severe protocol violations across most dimensions |
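The threshold column maps directly to a helper (derived from the table, not from `session-grade.ts`):

```python
def letter_grade(score, max_score=100):
    """Map a rubric score to a letter per the threshold table."""
    pct = 100 * score / max_score
    for letter, threshold in (("A", 90), ("B", 75), ("C", 60), ("D", 45)):
        if pct >= threshold:
            return letter
    return "F"
```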
---

## Token Metadata (v2.1 addition)

Grade results in v2.1 carry optional token metadata alongside the standard GradeResult — not a scored dimension, but captured for efficiency analysis:

```json
{
  "_tokenMeta": {
    "estimationMethod": "otel|output_chars",
    "totalEstimatedTokens": 4200,
    "perDomain": {
      "tasks": 1800,
      "session": 600,
      "admin": 400
    },
    "mcpQueryTokens": 2100,
    "cliTokens": 1100,
    "auditEntries": 47
  }
}
```

This field is appended by the `run_scenario.py` and `run_ab_test.py` scripts. It does NOT affect the 0–100 score.
---

## Edge Cases

| Scenario | Handling |
|----------|----------|
| No audit entries | All scores 0; flag `No audit entries found for session (use --grade flag when starting session)` |
| No task operations | The S1 `session.list` check passes (the list is always "before" task ops when there are none) |
| No discovery calls | S2 awards the 10-point baseline (benefit of the doubt) |
| No adds | S3 starts at 20 with no deductions |
| No errors | S4 starts at 20 with no deductions |
| No grade file | `readGrades()` returns `[]` |
---

## Updated Domain Recognition (v2.1)

The rubric recognizes all 10 canonical domains in audit entries. Key domain-to-dimension mappings:

| Domain | Affects |
|--------|---------|
| `session` | S1 (list/end), S5 (gateway) |
| `tasks` | S1 (first task-op timing), S2 (find/list/show), S3 (add/exists), S4 (error recovery) |
| `admin` | S5 (`admin.help` progressive disclosure) |
| `tools` | S5 (skill.show, skill.list, skill.find) |
| `memory` | S5 (gateway tracking only) |
| `pipeline` | S5 (gateway tracking only) |
| `check` | S5 (gateway tracking only) |
| `orchestrate` | S5 (gateway tracking only) |
| `nexus` | S5 (gateway tracking only) |
| `sticky` | S5 (gateway tracking only) |

All 10 domains contribute to the `mcpQueryCalls` count in S5 — any MCP query gateway call, regardless of domain, earns the +10.