@cleocode/skills 2.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dispatch-config.json +404 -0
- package/index.d.ts +178 -0
- package/index.js +405 -0
- package/package.json +14 -0
- package/profiles/core.json +7 -0
- package/profiles/full.json +10 -0
- package/profiles/minimal.json +7 -0
- package/profiles/recommended.json +7 -0
- package/provider-skills-map.json +97 -0
- package/skills/_shared/cleo-style-guide.md +84 -0
- package/skills/_shared/manifest-operations.md +810 -0
- package/skills/_shared/placeholders.json +433 -0
- package/skills/_shared/skill-chaining-patterns.md +237 -0
- package/skills/_shared/subagent-protocol-base.md +223 -0
- package/skills/_shared/task-system-integration.md +232 -0
- package/skills/_shared/testing-framework-config.md +110 -0
- package/skills/ct-cleo/SKILL.md +490 -0
- package/skills/ct-cleo/references/anti-patterns.md +19 -0
- package/skills/ct-cleo/references/loom-lifecycle.md +136 -0
- package/skills/ct-cleo/references/orchestrator-constraints.md +55 -0
- package/skills/ct-cleo/references/session-protocol.md +162 -0
- package/skills/ct-codebase-mapper/SKILL.md +82 -0
- package/skills/ct-contribution/SKILL.md +521 -0
- package/skills/ct-contribution/templates/contribution-init.json +21 -0
- package/skills/ct-dev-workflow/SKILL.md +423 -0
- package/skills/ct-docs-lookup/SKILL.md +66 -0
- package/skills/ct-docs-review/SKILL.md +175 -0
- package/skills/ct-docs-write/SKILL.md +108 -0
- package/skills/ct-documentor/SKILL.md +231 -0
- package/skills/ct-epic-architect/SKILL.md +305 -0
- package/skills/ct-epic-architect/references/bug-epic-example.md +172 -0
- package/skills/ct-epic-architect/references/commands.md +201 -0
- package/skills/ct-epic-architect/references/feature-epic-example.md +210 -0
- package/skills/ct-epic-architect/references/migration-epic-example.md +244 -0
- package/skills/ct-epic-architect/references/output-format.md +92 -0
- package/skills/ct-epic-architect/references/patterns.md +284 -0
- package/skills/ct-epic-architect/references/refactor-epic-example.md +412 -0
- package/skills/ct-epic-architect/references/research-epic-example.md +226 -0
- package/skills/ct-epic-architect/references/shell-escaping.md +86 -0
- package/skills/ct-epic-architect/references/skill-aware-execution.md +195 -0
- package/skills/ct-grade/SKILL.md +230 -0
- package/skills/ct-grade/agents/analysis-reporter.md +203 -0
- package/skills/ct-grade/agents/blind-comparator.md +157 -0
- package/skills/ct-grade/agents/scenario-runner.md +134 -0
- package/skills/ct-grade/eval-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
- package/skills/ct-grade/eval-viewer/generate_grade_review.py +1138 -0
- package/skills/ct-grade/eval-viewer/generate_grade_viewer.py +544 -0
- package/skills/ct-grade/eval-viewer/generate_review.py +283 -0
- package/skills/ct-grade/eval-viewer/grade-review.html +1574 -0
- package/skills/ct-grade/eval-viewer/viewer.html +219 -0
- package/skills/ct-grade/evals/evals.json +94 -0
- package/skills/ct-grade/references/ab-test-methodology.md +150 -0
- package/skills/ct-grade/references/domains.md +137 -0
- package/skills/ct-grade/references/grade-spec.md +236 -0
- package/skills/ct-grade/references/scenario-playbook.md +234 -0
- package/skills/ct-grade/references/token-tracking.md +120 -0
- package/skills/ct-grade/scripts/__pycache__/audit_analyzer.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/run_ab_test.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/run_all.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/__pycache__/token_tracker.cpython-314.pyc +0 -0
- package/skills/ct-grade/scripts/audit_analyzer.py +279 -0
- package/skills/ct-grade/scripts/generate_report.py +283 -0
- package/skills/ct-grade/scripts/run_ab_test.py +504 -0
- package/skills/ct-grade/scripts/run_all.py +287 -0
- package/skills/ct-grade/scripts/setup_run.py +183 -0
- package/skills/ct-grade/scripts/token_tracker.py +630 -0
- package/skills/ct-grade-v2-1/SKILL.md +237 -0
- package/skills/ct-grade-v2-1/agents/analysis-reporter.md +203 -0
- package/skills/ct-grade-v2-1/agents/blind-comparator.md +157 -0
- package/skills/ct-grade-v2-1/agents/scenario-runner.md +179 -0
- package/skills/ct-grade-v2-1/evals/evals.json +74 -0
- package/skills/ct-grade-v2-1/grade-viewer/__pycache__/build_op_stats.cpython-314.pyc +0 -0
- package/skills/ct-grade-v2-1/grade-viewer/__pycache__/generate_grade_review.cpython-314.pyc +0 -0
- package/skills/ct-grade-v2-1/grade-viewer/build_op_stats.py +174 -0
- package/skills/ct-grade-v2-1/grade-viewer/eval-analysis.json +41 -0
- package/skills/ct-grade-v2-1/grade-viewer/eval-report.md +34 -0
- package/skills/ct-grade-v2-1/grade-viewer/generate_grade_review.py +1023 -0
- package/skills/ct-grade-v2-1/grade-viewer/generate_grade_viewer.py +548 -0
- package/skills/ct-grade-v2-1/grade-viewer/grade-review-eval.html +613 -0
- package/skills/ct-grade-v2-1/grade-viewer/grade-review.html +1532 -0
- package/skills/ct-grade-v2-1/grade-viewer/viewer.html +620 -0
- package/skills/ct-grade-v2-1/manifest-entry.json +31 -0
- package/skills/ct-grade-v2-1/references/ab-testing.md +233 -0
- package/skills/ct-grade-v2-1/references/domains-ssot.md +156 -0
- package/skills/ct-grade-v2-1/references/grade-spec-v2.md +167 -0
- package/skills/ct-grade-v2-1/references/playbook-v2.md +393 -0
- package/skills/ct-grade-v2-1/references/token-tracking.md +202 -0
- package/skills/ct-grade-v2-1/scripts/generate_report.py +419 -0
- package/skills/ct-grade-v2-1/scripts/run_ab_test.py +493 -0
- package/skills/ct-grade-v2-1/scripts/run_scenario.py +396 -0
- package/skills/ct-grade-v2-1/scripts/setup_run.py +207 -0
- package/skills/ct-grade-v2-1/scripts/token_tracker.py +175 -0
- package/skills/ct-memory/SKILL.md +84 -0
- package/skills/ct-orchestrator/INSTALL.md +61 -0
- package/skills/ct-orchestrator/README.md +69 -0
- package/skills/ct-orchestrator/SKILL.md +380 -0
- package/skills/ct-orchestrator/manifest-entry.json +19 -0
- package/skills/ct-orchestrator/orchestrator-prompt.txt +17 -0
- package/skills/ct-orchestrator/references/SUBAGENT-PROTOCOL-BLOCK.md +66 -0
- package/skills/ct-orchestrator/references/autonomous-operation.md +167 -0
- package/skills/ct-orchestrator/references/lifecycle-gates.md +98 -0
- package/skills/ct-orchestrator/references/orchestrator-compliance.md +271 -0
- package/skills/ct-orchestrator/references/orchestrator-handoffs.md +85 -0
- package/skills/ct-orchestrator/references/orchestrator-patterns.md +164 -0
- package/skills/ct-orchestrator/references/orchestrator-recovery.md +113 -0
- package/skills/ct-orchestrator/references/orchestrator-spawning.md +271 -0
- package/skills/ct-orchestrator/references/orchestrator-tokens.md +180 -0
- package/skills/ct-research-agent/SKILL.md +226 -0
- package/skills/ct-skill-creator/.cleo/.context-state.json +13 -0
- package/skills/ct-skill-creator/.cleo/logs/cleo.2026-03-07.1.log +24 -0
- package/skills/ct-skill-creator/.cleo/tasks.db +0 -0
- package/skills/ct-skill-creator/SKILL.md +356 -0
- package/skills/ct-skill-creator/agents/analyzer.md +276 -0
- package/skills/ct-skill-creator/agents/comparator.md +204 -0
- package/skills/ct-skill-creator/agents/grader.md +225 -0
- package/skills/ct-skill-creator/assets/eval_review.html +146 -0
- package/skills/ct-skill-creator/eval-viewer/__pycache__/generate_review.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/eval-viewer/generate_review.py +471 -0
- package/skills/ct-skill-creator/eval-viewer/viewer.html +1325 -0
- package/skills/ct-skill-creator/manifest-entry.json +17 -0
- package/skills/ct-skill-creator/references/dynamic-context.md +228 -0
- package/skills/ct-skill-creator/references/frontmatter.md +83 -0
- package/skills/ct-skill-creator/references/invocation-control.md +165 -0
- package/skills/ct-skill-creator/references/output-patterns.md +86 -0
- package/skills/ct-skill-creator/references/provider-deployment.md +175 -0
- package/skills/ct-skill-creator/references/schemas.md +430 -0
- package/skills/ct-skill-creator/references/workflows.md +28 -0
- package/skills/ct-skill-creator/scripts/__init__.py +1 -0
- package/skills/ct-skill-creator/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/aggregate_benchmark.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/generate_report.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/improve_description.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/init_skill.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/quick_validate.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/run_eval.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/run_loop.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/__pycache__/utils.cpython-314.pyc +0 -0
- package/skills/ct-skill-creator/scripts/aggregate_benchmark.py +401 -0
- package/skills/ct-skill-creator/scripts/generate_report.py +326 -0
- package/skills/ct-skill-creator/scripts/improve_description.py +247 -0
- package/skills/ct-skill-creator/scripts/init_skill.py +306 -0
- package/skills/ct-skill-creator/scripts/package_skill.py +110 -0
- package/skills/ct-skill-creator/scripts/quick_validate.py +97 -0
- package/skills/ct-skill-creator/scripts/run_eval.py +310 -0
- package/skills/ct-skill-creator/scripts/run_loop.py +328 -0
- package/skills/ct-skill-creator/scripts/utils.py +47 -0
- package/skills/ct-skill-validator/SKILL.md +178 -0
- package/skills/ct-skill-validator/agents/ecosystem-checker.md +151 -0
- package/skills/ct-skill-validator/assets/valid-skill-example.md +13 -0
- package/skills/ct-skill-validator/evals/eval_set.json +14 -0
- package/skills/ct-skill-validator/evals/evals.json +52 -0
- package/skills/ct-skill-validator/manifest-entry.json +20 -0
- package/skills/ct-skill-validator/references/cleo-ecosystem-rules.md +163 -0
- package/skills/ct-skill-validator/references/validation-rules.md +168 -0
- package/skills/ct-skill-validator/scripts/__init__.py +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/audit_body.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/check_ecosystem.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/generate_validation_report.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/__pycache__/validate.cpython-314.pyc +0 -0
- package/skills/ct-skill-validator/scripts/audit_body.py +242 -0
- package/skills/ct-skill-validator/scripts/check_ecosystem.py +169 -0
- package/skills/ct-skill-validator/scripts/check_manifest.py +172 -0
- package/skills/ct-skill-validator/scripts/generate_validation_report.py +442 -0
- package/skills/ct-skill-validator/scripts/validate.py +422 -0
- package/skills/ct-spec-writer/SKILL.md +189 -0
- package/skills/ct-stickynote/README.md +14 -0
- package/skills/ct-stickynote/SKILL.md +46 -0
- package/skills/ct-task-executor/SKILL.md +296 -0
- package/skills/ct-validator/SKILL.md +216 -0
- package/skills/manifest.json +469 -0
- package/skills.json +281 -0
|
@@ -0,0 +1,237 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: ct-grade
|
|
3
|
+
description: >-
|
|
4
|
+
CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent
|
|
5
|
+
session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency,
|
|
6
|
+
S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes:
|
|
7
|
+
(1) scenario — run playbook scenarios S1-S5 against MCP or CLI; (2) ab — blind A/B
|
|
8
|
+
comparison of CLEO MCP gateway vs CLI for same domain operations with token cost
|
|
9
|
+
measurement; (3) blind — spawn two agents with different configurations, blind-comparator
|
|
10
|
+
picks winner, analyzer produces recommendation. Use when grading agent sessions, running
|
|
11
|
+
grade playbook scenarios, comparing MCP vs CLI behavioral differences, measuring token
|
|
12
|
+
usage across interface types, or performing multi-run blind A/B evaluation with statistical
|
|
13
|
+
analysis and comparative report. Triggers on: grade session, evaluate agent behavior,
|
|
14
|
+
A/B test CLEO interfaces, run grade scenario, token usage analysis, behavioral rubric,
|
|
15
|
+
protocol compliance scoring, MCP vs CLI comparison.
|
|
16
|
+
argument-hint: "[mode=scenario|ab|blind] [scenario=s1-s5|all] [interface=mcp|cli|both] [runs=N] [session-id=<id>]"
|
|
17
|
+
allowed-tools: ["Bash(python *)", "Bash(cleo-dev *)", "Bash(cleo *)", "Bash(kill *)", "Bash(lsof *)", "Agent", "Read", "Write", "Glob"]
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
# ct-grade v2.1 — CLEO Grading and A/B Testing
|
|
21
|
+
|
|
22
|
+
Session grading and A/B behavioral analysis for CLEO protocol compliance. Three operating modes cover everything from single-session scoring to multi-run blind comparisons between MCP and CLI interfaces.
|
|
23
|
+
|
|
24
|
+
## On Every /ct-grade Invocation
|
|
25
|
+
|
|
26
|
+
Before parsing arguments, start the grade viewer server:
|
|
27
|
+
|
|
28
|
+
```bash
|
|
29
|
+
# Kill any existing viewer on port 3119
|
|
30
|
+
lsof -ti :3119 | xargs kill -TERM 2>/dev/null || true
|
|
31
|
+
|
|
32
|
+
# Start grade viewer in background
|
|
33
|
+
python $CLAUDE_SKILL_DIR/grade-viewer/generate_grade_review.py . \
|
|
34
|
+
--port 3119 --no-browser &
|
|
35
|
+
echo "Grade viewer: http://localhost:3119"
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
When user says "end grading", "stop", "done", or "close viewer":
|
|
39
|
+
```bash
|
|
40
|
+
lsof -ti :3119 | xargs kill -TERM 2>/dev/null || true
|
|
41
|
+
echo "Grade viewer stopped."
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
---
|
|
45
|
+
|
|
46
|
+
## Operating Modes
|
|
47
|
+
|
|
48
|
+
| Mode | Purpose | Key Output |
|
|
49
|
+
|---|---|---|
|
|
50
|
+
| `scenario` | Run playbook scenarios S1-S5 as graded sessions | GradeResult per scenario |
|
|
51
|
+
| `ab` | Run same domain operations via MCP AND CLI, compare | comparison.json + token delta |
|
|
52
|
+
| `blind` | Two agents run same task, blind comparator picks winner | analysis.json + winner |
|
|
53
|
+
|
|
54
|
+
## Parameters
|
|
55
|
+
|
|
56
|
+
| Parameter | Values | Default | Description |
|
|
57
|
+
|---|---|---|---|
|
|
58
|
+
| `mode` | `scenario\|ab\|blind` | `scenario` | Operating mode |
|
|
59
|
+
| `scenario` | `s1\|s2\|s3\|s4\|s5\|all` | `all` | Grade playbook scenario(s) to run |
|
|
60
|
+
| `interface` | `mcp\|cli\|both` | `both` | Which interface to exercise |
|
|
61
|
+
| `domains` | comma list | `tasks,session` | Domains to test in `ab` mode |
|
|
62
|
+
| `runs` | integer | `3` | Runs per configuration for statistical confidence |
|
|
63
|
+
| `session-id` | string | — | Grade a specific existing session (skips execution) |
|
|
64
|
+
| `output-dir` | path | `ab_results/<ts>` | Where to write all run artifacts |
|
|
65
|
+
|
|
66
|
+
## Quick Start
|
|
67
|
+
|
|
68
|
+
**Grade an existing session:**
|
|
69
|
+
```
|
|
70
|
+
/ct-grade session-id=<id>
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
**Run scenario S4 (Full Lifecycle) on MCP:**
|
|
74
|
+
```
|
|
75
|
+
/ct-grade mode=scenario scenario=s4 interface=mcp
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
**A/B compare MCP vs CLI for tasks + session domains (3 runs each):**
|
|
79
|
+
```
|
|
80
|
+
/ct-grade mode=ab domains=tasks,session runs=3
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
**Full blind A/B test across all scenarios:**
|
|
84
|
+
```
|
|
85
|
+
/ct-grade mode=blind scenario=all runs=3
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
|
|
90
|
+
## Execution Flow
|
|
91
|
+
|
|
92
|
+
### Mode: scenario
|
|
93
|
+
|
|
94
|
+
1. Set up output dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode scenario --scenario <id> --output-dir <dir>`
|
|
95
|
+
2. For each scenario, spawn a `scenario-runner` agent:
|
|
96
|
+
- Agent start: `mutate session start { "grade": true, "name": "<scenario-id>-<interface>" }`
|
|
97
|
+
- Agent executes the scenario operations (see [references/playbook-v2.md](references/playbook-v2.md))
|
|
98
|
+
- Agent end: `mutate session end`
|
|
99
|
+
- Agent runs: `query admin grade { "sessionId": "<id>" }`
|
|
100
|
+
- Agent saves: `GradeResult` to `<output-dir>/<scenario>/grade.json`
|
|
101
|
+
3. Capture `total_tokens` + `duration_ms` from task notification → `timing.json`
|
|
102
|
+
4. Run: `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode scenario`
|
|
103
|
+
|
|
104
|
+
### Mode: ab
|
|
105
|
+
|
|
106
|
+
1. Set up run dir with `python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode ab --output-dir <dir>`
|
|
107
|
+
2. For each target domain, spawn TWO agents in the SAME turn:
|
|
108
|
+
- **Arm A** (MCP): `agents/scenario-runner.md` with `INTERFACE=mcp`
|
|
109
|
+
- **Arm B** (CLI): `agents/scenario-runner.md` with `INTERFACE=cli`
|
|
110
|
+
- Capture tokens from both task notifications immediately
|
|
111
|
+
3. Pass both outputs to `agents/blind-comparator.md` (does NOT know which is MCP vs CLI)
|
|
112
|
+
4. Comparator writes `comparison.json`
|
|
113
|
+
5. Run `python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode ab`
|
|
114
|
+
|
|
115
|
+
### Mode: blind
|
|
116
|
+
|
|
117
|
+
Same as `ab` but configurations may differ beyond MCP/CLI (e.g., different session scopes, different agent prompts). The comparator is always blind to configuration identity.
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
## Token Capture — MANDATORY
|
|
122
|
+
|
|
123
|
+
After EVERY Agent task notification, immediately update `timing.json`:
|
|
124
|
+
|
|
125
|
+
```python
|
|
126
|
+
timing = {
|
|
127
|
+
"total_tokens": task.total_tokens, # from task notification — EPHEMERAL
|
|
128
|
+
"duration_ms": task.duration_ms, # from task notification
|
|
129
|
+
"arm": "arm-A",
|
|
130
|
+
"interface": "mcp",
|
|
131
|
+
"scenario": "s4",
|
|
132
|
+
"run": 1,
|
|
133
|
+
"executor_start": start_iso,
|
|
134
|
+
"executor_end": end_iso,
|
|
135
|
+
}
|
|
136
|
+
# Write to: <output-dir>/<scenario>/arm-<interface>/timing.json
|
|
137
|
+
```
|
|
138
|
+
|
|
139
|
+
**`total_tokens` is EPHEMERAL** — it cannot be recovered if missed. Capture it immediately.
|
|
140
|
+
|
|
141
|
+
If running without task notifications (no total_tokens available):
|
|
142
|
+
- Fall back: `output_chars / 3.5` from operations.jsonl (JSON responses)
|
|
143
|
+
- Record `"method": "output_chars_estimate"` in timing.json
|
|
144
|
+
|
|
145
|
+
---
|
|
146
|
+
|
|
147
|
+
## Grade Rubric Summary
|
|
148
|
+
|
|
149
|
+
5 dimensions × 20 pts = 100 max. See [references/grade-spec-v2.md](references/grade-spec-v2.md) for full scoring logic.
|
|
150
|
+
|
|
151
|
+
| Dim | Points | What it measures |
|
|
152
|
+
|---|---|---|
|
|
153
|
+
| S1 Session Discipline | 20 | `session.list` before task ops (+10), `session.end` present (+10) |
|
|
154
|
+
| S2 Discovery Efficiency | 20 | `find:list` ratio ≥80% (+15), `tasks.show` used (+5) |
|
|
155
|
+
| S3 Task Hygiene | 20 | Starts 20, -5 per add without description, -3 if subtask no exists check |
|
|
156
|
+
| S4 Error Protocol | 20 | Starts 20, -5 per unrecovered E_NOT_FOUND, -5 if duplicates |
|
|
157
|
+
| S5 Progressive Disclosure | 20 | `admin.help`/skill lookup (+10), MCP `query` gateway used (+10) |
|
|
158
|
+
|
|
159
|
+
**Grade letters:** A≥90, B≥75, C≥60, D≥45, F<45
|
|
160
|
+
|
|
161
|
+
**Note:** CLI-only sessions cannot earn the MCP gateway half of S5 (+10) — `metadata.gateway` is not set by the CLI adapter, while MCP earns that +10 automatically. The `admin.help`/skill-lookup half (+10) remains reachable from either interface.
|
|
162
|
+
|
|
163
|
+
---
|
|
164
|
+
|
|
165
|
+
## Output Structure
|
|
166
|
+
|
|
167
|
+
```
|
|
168
|
+
<output-dir>/
|
|
169
|
+
run-manifest.json # run config, arms, timing summary
|
|
170
|
+
report.md # human-readable comparative report
|
|
171
|
+
token-summary.json # aggregated token stats across all runs
|
|
172
|
+
<scenario-or-domain>/
|
|
173
|
+
arm-A/
|
|
174
|
+
grade.json # GradeResult (from admin.grade)
|
|
175
|
+
timing.json # token + duration data
|
|
176
|
+
operations.jsonl # operations executed (one per line)
|
|
177
|
+
arm-B/
|
|
178
|
+
grade.json
|
|
179
|
+
timing.json
|
|
180
|
+
operations.jsonl
|
|
181
|
+
comparison.json # blind comparator output
|
|
182
|
+
analysis.json # analyzer output
|
|
183
|
+
```
|
|
184
|
+
|
|
185
|
+
---
|
|
186
|
+
|
|
187
|
+
## Agents
|
|
188
|
+
|
|
189
|
+
| Agent | Role | Input | Output |
|
|
190
|
+
|---|---|---|---|
|
|
191
|
+
| [agents/scenario-runner.md](agents/scenario-runner.md) | Executes grade scenario | scenario, interface | grade.json, timing.json |
|
|
192
|
+
| [agents/blind-comparator.md](agents/blind-comparator.md) | Blind A/B judge | outputs A and B | comparison.json |
|
|
193
|
+
| [agents/analysis-reporter.md](agents/analysis-reporter.md) | Post-hoc synthesis | all comparison.json | analysis.json |
|
|
194
|
+
|
|
195
|
+
---
|
|
196
|
+
|
|
197
|
+
## Scripts
|
|
198
|
+
|
|
199
|
+
```bash
|
|
200
|
+
# Set up run directory and print execution plan
|
|
201
|
+
python $CLAUDE_SKILL_DIR/scripts/setup_run.py --mode <mode> --scenario <s> --output-dir <dir>
|
|
202
|
+
|
|
203
|
+
# Aggregate token data after runs complete
|
|
204
|
+
python $CLAUDE_SKILL_DIR/scripts/token_tracker.py --run-dir <dir>
|
|
205
|
+
|
|
206
|
+
# Generate final report (markdown)
|
|
207
|
+
python $CLAUDE_SKILL_DIR/scripts/generate_report.py --run-dir <dir> --mode <mode>
|
|
208
|
+
```
|
|
209
|
+
|
|
210
|
+
---
|
|
211
|
+
|
|
212
|
+
## Viewers
|
|
213
|
+
|
|
214
|
+
### Grade Results Viewer (A/B run artifacts) — port 3119
|
|
215
|
+
```bash
|
|
216
|
+
python $CLAUDE_SKILL_DIR/grade-viewer/generate_grade_viewer.py --run-dir <ab-run-dir>
|
|
217
|
+
python $CLAUDE_SKILL_DIR/grade-viewer/generate_grade_viewer.py --run-dir <ab-run-dir> --static results.html
|
|
218
|
+
```
|
|
219
|
+
Shows per-scenario grade cards with dimension bars, A/B comparison tables, token economy stats, blind comparator results, and recommendations. Refreshes on browser reload.
|
|
220
|
+
|
|
221
|
+
### General Grade Review (GRADES.jsonl browsing) — port 3119
|
|
222
|
+
```bash
|
|
223
|
+
python $CLAUDE_SKILL_DIR/grade-viewer/generate_grade_review.py <workspace>
|
|
224
|
+
python $CLAUDE_SKILL_DIR/grade-viewer/generate_grade_review.py <workspace> --static grade-report.html
|
|
225
|
+
```
|
|
226
|
+
Shows historical grades from GRADES.jsonl, A/B summaries from any workspace subdirectory.
|
|
227
|
+
|
|
228
|
+
---
|
|
229
|
+
|
|
230
|
+
## MCP Grade Operations
|
|
231
|
+
|
|
232
|
+
| Gateway | Domain | Operation | Params |
|
|
233
|
+
|---|---|---|---|
|
|
234
|
+
| `query` | `admin` | `grade` | `{ "sessionId": "<id>" }` |
|
|
235
|
+
| `query` | `admin` | `grade.list` | — |
|
|
236
|
+
| `mutate` | `session` | `start` | `{ "grade": true, "name": "<name>", "scope": "global" }` |
|
|
237
|
+
| `mutate` | `session` | `end` | — |
|
|
@@ -0,0 +1,203 @@
|
|
|
1
|
+
# Analysis Reporter Agent
|
|
2
|
+
|
|
3
|
+
You are a post-hoc analyzer for CLEO A/B evaluation results. You synthesize all comparison.json and grade.json files from a completed run into a final `analysis.json` and `report.md`.
|
|
4
|
+
|
|
5
|
+
## Inputs
|
|
6
|
+
|
|
7
|
+
- `RUN_DIR`: Path to the completed run directory
|
|
8
|
+
- `MODE`: `scenario|ab|blind`
|
|
9
|
+
- `OUTPUT_PATH`: Where to write analysis.json (default: `<RUN_DIR>/analysis.json`)
|
|
10
|
+
- `REPORT_PATH`: Where to write report.md (default: `<RUN_DIR>/report.md`)
|
|
11
|
+
|
|
12
|
+
## What You Read
|
|
13
|
+
|
|
14
|
+
From `<RUN_DIR>`:
|
|
15
|
+
```
|
|
16
|
+
run-manifest.json
|
|
17
|
+
token-summary.json (from token_tracker.py)
|
|
18
|
+
<scenario-or-domain>/
|
|
19
|
+
arm-A/grade.json
|
|
20
|
+
arm-A/timing.json
|
|
21
|
+
arm-A/operations.jsonl
|
|
22
|
+
arm-B/grade.json
|
|
23
|
+
arm-B/timing.json
|
|
24
|
+
arm-B/operations.jsonl
|
|
25
|
+
comparison.json
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
## Analysis Process
|
|
29
|
+
|
|
30
|
+
### 1. Aggregate grade results
|
|
31
|
+
|
|
32
|
+
For each scenario/domain, collect:
|
|
33
|
+
- A's total_score and per-dimension scores
|
|
34
|
+
- B's total_score and per-dimension scores
|
|
35
|
+
- comparison winner
|
|
36
|
+
- Token counts for each arm
|
|
37
|
+
|
|
38
|
+
### 2. Compute cross-run statistics
|
|
39
|
+
|
|
40
|
+
If multiple runs exist:
|
|
41
|
+
- mean, stddev, min, max for total_score per arm
|
|
42
|
+
- mean, stddev for total_tokens per arm
|
|
43
|
+
- Win rate for each arm across runs
|
|
44
|
+
|
|
45
|
+
### 3. Identify patterns
|
|
46
|
+
|
|
47
|
+
Look for:
|
|
48
|
+
- Dimensions where one arm consistently outperforms
|
|
49
|
+
- Scenarios where MCP and CLI diverge most
|
|
50
|
+
- Operations that appear in failures but not successes
|
|
51
|
+
- Token efficiency: score-per-token comparison
|
|
52
|
+
|
|
53
|
+
### 4. Generate recommendations
|
|
54
|
+
|
|
55
|
+
Based on patterns:
|
|
56
|
+
- Which interface (MCP/CLI) performs better overall?
|
|
57
|
+
- Which dimensions need protocol improvement?
|
|
58
|
+
- Which scenarios expose the most variance?
|
|
59
|
+
- What specific anti-patterns appear most?
|
|
60
|
+
|
|
61
|
+
## Output: analysis.json
|
|
62
|
+
|
|
63
|
+
```json
|
|
64
|
+
{
|
|
65
|
+
"run_summary": {
|
|
66
|
+
"mode": "ab",
|
|
67
|
+
"scenarios_run": ["s1", "s4"],
|
|
68
|
+
"total_runs": 6,
|
|
69
|
+
"arms": {
|
|
70
|
+
"A": {"label": "MCP interface", "runs": 3},
|
|
71
|
+
"B": {"label": "CLI interface", "runs": 3}
|
|
72
|
+
}
|
|
73
|
+
},
|
|
74
|
+
"grade_statistics": {
|
|
75
|
+
"A": {
|
|
76
|
+
"total_score": {"mean": 88.3, "stddev": 4.5, "min": 83, "max": 93},
|
|
77
|
+
"dimensions": {
|
|
78
|
+
"sessionDiscipline": {"mean": 18.3, "stddev": 2.3},
|
|
79
|
+
"discoveryEfficiency": {"mean": 18.0, "stddev": 1.5},
|
|
80
|
+
"taskHygiene": {"mean": 18.7, "stddev": 2.1},
|
|
81
|
+
"errorProtocol": {"mean": 18.7, "stddev": 2.3},
|
|
82
|
+
"disclosureUse": {"mean": 14.7, "stddev": 4.5}
|
|
83
|
+
}
|
|
84
|
+
},
|
|
85
|
+
"B": {
|
|
86
|
+
"total_score": {"mean": 71.7, "stddev": 8.1, "min": 62, "max": 80},
|
|
87
|
+
"dimensions": {
|
|
88
|
+
"sessionDiscipline": {"mean": 14.0, "stddev": 5.3},
|
|
89
|
+
"discoveryEfficiency": {"mean": 17.3, "stddev": 2.1},
|
|
90
|
+
"taskHygiene": {"mean": 18.0, "stddev": 2.0},
|
|
91
|
+
"errorProtocol": {"mean": 16.7, "stddev": 3.8},
|
|
92
|
+
"disclosureUse": {"mean": 5.7, "stddev": 4.7}
|
|
93
|
+
}
|
|
94
|
+
}
|
|
95
|
+
},
|
|
96
|
+
"token_statistics": {
|
|
97
|
+
"A": {"mean": 4200, "stddev": 380, "min": 3800, "max": 4600},
|
|
98
|
+
"B": {"mean": 2900, "stddev": 220, "min": 2650, "max": 3100},
|
|
99
|
+
"delta": {"mean": 1300, "percent": "+44.8%"},
|
|
100
|
+
"score_per_1k_tokens": {"A": 21.0, "B": 24.7}
|
|
101
|
+
},
|
|
102
|
+
"win_rates": {
|
|
103
|
+
"A_wins": 5,
|
|
104
|
+
"B_wins": 1,
|
|
105
|
+
"ties": 0,
|
|
106
|
+
"A_win_rate": 0.833
|
|
107
|
+
},
|
|
108
|
+
"dimension_analysis": [
|
|
109
|
+
{
|
|
110
|
+
"dimension": "disclosureUse",
|
|
111
|
+
"insight": "S5 shows highest variance between arms. MCP arm uses admin.help consistently; CLI arm often skips it.",
|
|
112
|
+
"A_mean": 14.7,
|
|
113
|
+
"B_mean": 5.7,
|
|
114
|
+
"delta": 9.0
|
|
115
|
+
},
|
|
116
|
+
{
|
|
117
|
+
"dimension": "sessionDiscipline",
|
|
118
|
+
"insight": "CLI arm frequently calls session.list after task ops, violating S1 ordering.",
|
|
119
|
+
"A_mean": 18.3,
|
|
120
|
+
"B_mean": 14.0,
|
|
121
|
+
"delta": 4.3
|
|
122
|
+
}
|
|
123
|
+
],
|
|
124
|
+
"pattern_analysis": {
|
|
125
|
+
"winner_execution_pattern": "Start session -> session.list -> admin.help -> tasks.find -> tasks.show -> work -> session.end",
|
|
126
|
+
"loser_execution_pattern": "Start session -> tasks.find (skip session.list) -> work -> session.end (skip admin.help)",
|
|
127
|
+
"common_failures": [
|
|
128
|
+
"session.list called after first task op (violates S1 +10)",
|
|
129
|
+
"admin.help not called (violates S5 +10)",
|
|
130
|
+
"tasks.list used instead of tasks.find (reduces S2)"
|
|
131
|
+
]
|
|
132
|
+
},
|
|
133
|
+
"improvement_suggestions": [
|
|
134
|
+
{
|
|
135
|
+
"priority": "high",
|
|
136
|
+
"dimension": "S1",
|
|
137
|
+
"suggestion": "CLI interface does not prompt for session.list before task ops. Add a pre-task-op reminder.",
|
|
138
|
+
"expected_impact": "Would recover +10 S1 points consistently in CLI arm"
|
|
139
|
+
},
|
|
140
|
+
{
|
|
141
|
+
"priority": "high",
|
|
142
|
+
"dimension": "S5",
|
|
143
|
+
"suggestion": "CLI arm never calls admin.help. Skill should explicitly prompt 'call admin.help at session start'.",
|
|
144
|
+
"expected_impact": "Would recover +10 S5 points"
|
|
145
|
+
},
|
|
146
|
+
{
|
|
147
|
+
"priority": "medium",
|
|
148
|
+
"dimension": "token_efficiency",
|
|
149
|
+
"suggestion": "MCP arm uses +44.8% more tokens but scores +16.6 points higher. Net score-per-token still favors MCP for protocol-critical work.",
|
|
150
|
+
"expected_impact": "Context for choosing interface based on task priority"
|
|
151
|
+
}
|
|
152
|
+
]
|
|
153
|
+
}
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
## Output: report.md
|
|
157
|
+
|
|
158
|
+
Write a human-readable comparative report with:
|
|
159
|
+
|
|
160
|
+
1. **Executive Summary** — winner, score delta, token delta
|
|
161
|
+
2. **Per-Scenario Results** — table of A vs B scores per scenario
|
|
162
|
+
3. **Dimension Breakdown** — where each arm excels/fails
|
|
163
|
+
4. **Token Economy** — total_tokens comparison, score-per-token
|
|
164
|
+
5. **Pattern Analysis** — common success/failure patterns
|
|
165
|
+
6. **Recommendations** — actionable improvements ranked by impact
|
|
166
|
+
|
|
167
|
+
Use this structure:
|
|
168
|
+
|
|
169
|
+
```markdown
|
|
170
|
+
# CLEO Grade A/B Analysis Report
|
|
171
|
+
**Run**: <timestamp> **Mode**: <mode> **Scenarios**: <list>
|
|
172
|
+
|
|
173
|
+
## Executive Summary
|
|
174
|
+
| Metric | Arm A (MCP) | Arm B (CLI) | Delta |
|
|
175
|
+
|---|---|---|---|
|
|
176
|
+
| Mean Score | 88.3/100 | 71.7/100 | +16.6 |
|
|
177
|
+
| Grade | A | C | — |
|
|
178
|
+
| Mean Tokens | 4,200 | 2,900 | +1,300 (+44.8%) |
|
|
179
|
+
| Score/1k tokens | 21.0 | 24.7 | -3.7 |
|
|
180
|
+
| Win Rate | 83.3% | 16.7% | — |
|
|
181
|
+
|
|
182
|
+
**Winner: Arm A (MCP)** — Higher protocol adherence in 5/6 runs.
|
|
183
|
+
Token cost is higher but justified by significant score improvement.
|
|
184
|
+
|
|
185
|
+
## Per-Scenario Results
|
|
186
|
+
...
|
|
187
|
+
|
|
188
|
+
## Dimension Analysis
|
|
189
|
+
...
|
|
190
|
+
|
|
191
|
+
## Recommendations
|
|
192
|
+
...
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
After writing both files, output:
|
|
196
|
+
```
|
|
197
|
+
ANALYSIS: <analysis.json path>
|
|
198
|
+
REPORT: <report.md path>
|
|
199
|
+
WINNER_ARM: <A|B|tie>
|
|
200
|
+
WINNER_CONFIG: <mcp|cli|other>
|
|
201
|
+
MEAN_DELTA: <+N points>
|
|
202
|
+
TOKEN_DELTA: <+N tokens>
|
|
203
|
+
```
|
|
@@ -0,0 +1,157 @@
|
|
|
1
|
+
# Blind Comparator Agent
|
|
2
|
+
|
|
3
|
+
You are a blind comparator for CLEO behavioral evaluation. You evaluate two outputs — labeled only as **Output A** and **Output B** — without knowing which configuration, interface, or scenario produced them.
|
|
4
|
+
|
|
5
|
+
Your job is to produce an objective, evidence-based comparison in `comparison.json` format.
|
|
6
|
+
|
|
7
|
+
## Critical Rules
|
|
8
|
+
|
|
9
|
+
1. **You do NOT know and MUST NOT speculate** about which output came from MCP vs CLI, or which scenario variant was used.
|
|
10
|
+
2. **Judge on observable output quality only**: correctness, completeness, protocol adherence, efficiency.
|
|
11
|
+
3. **Be specific**: every score must have evidence from the actual outputs.
|
|
12
|
+
4. **Score independently first**, then declare a winner.
|
|
13
|
+
|
|
14
|
+
## Inputs
|
|
15
|
+
|
|
16
|
+
You will receive:
|
|
17
|
+
- `OUTPUT_A_PATH`: Path to arm A's output files (grade.json, operations.jsonl)
|
|
18
|
+
- `OUTPUT_B_PATH`: Path to arm B's output files (grade.json, operations.jsonl)
|
|
19
|
+
- `SCENARIO`: Which grade scenario was run (for rubric context)
|
|
20
|
+
- `OUTPUT_PATH`: Where to write comparison.json
|
|
21
|
+
|
|
22
|
+
## Evaluation Dimensions
|
|
23
|
+
|
|
24
|
+
For each output, assess:
|
|
25
|
+
|
|
26
|
+
### 1. Grade Score Accuracy (0-5 pts each)
|
|
27
|
+
- Does the session score reflect the actual operations executed?
|
|
28
|
+
- Are flags appropriate for the violations observed?
|
|
29
|
+
- Is the score consistent with the evidence in the grade result?
|
|
30
|
+
|
|
31
|
+
### 2. Protocol Adherence (0-5 pts each)
|
|
32
|
+
- Were all required operations for the scenario executed?
|
|
33
|
+
- Were operations in the correct order?
|
|
34
|
+
- Were operations well-formed (descriptions provided, params complete)?
|
|
35
|
+
|
|
36
|
+
### 3. Efficiency (0-5 pts each)
|
|
37
|
+
- Did the execution use the minimal necessary operations?
|
|
38
|
+
- Was `tasks.find` preferred over `tasks.list`?
|
|
39
|
+
- Were redundant calls avoided?
|
|
40
|
+
|
|
41
|
+
### 4. Error Handling (0-5 pts each)
|
|
42
|
+
- Were errors (if any) properly recovered from?
|
|
43
|
+
- Were unnecessary errors avoided (i.e., none triggered needlessly)?
|
|
44
|
+
|
|
45
|
+
## Process
|
|
46
|
+
|
|
47
|
+
1. Read `grade.json` from both output dirs
|
|
48
|
+
2. Read `operations.jsonl` from both output dirs
|
|
49
|
+
3. Score each dimension for A and B independently
|
|
50
|
+
4. Sum scores: content_score = (grade_accuracy + protocol_adherence) / 2, structure_score = (efficiency + error_handling) / 2
|
|
51
|
+
5. Declare winner (or tie if within 0.5 points)
|
|
52
|
+
6. Write comparison.json
|
|
53
|
+
|
|
54
|
+
## Output Format
|
|
55
|
+
|
|
56
|
+
Write `comparison.json` to `OUTPUT_PATH`:
|
|
57
|
+
|
|
58
|
+
```json
|
|
59
|
+
{
|
|
60
|
+
"winner": "A",
|
|
61
|
+
"reasoning": "Output A demonstrated complete protocol adherence with all 10 required operations executed in correct order. Output B missed the session.list-before-task-ops ordering, reducing its S1 score.",
|
|
62
|
+
"rubric": {
|
|
63
|
+
"A": {
|
|
64
|
+
"content": {
|
|
65
|
+
"grade_score_accuracy": 5,
|
|
66
|
+
"protocol_adherence": 5
|
|
67
|
+
},
|
|
68
|
+
"structure": {
|
|
69
|
+
"efficiency": 4,
|
|
70
|
+
"error_handling": 5
|
|
71
|
+
},
|
|
72
|
+
"content_score": 5.0,
|
|
73
|
+
"structure_score": 4.5,
|
|
74
|
+
"overall_score": 9.5
|
|
75
|
+
},
|
|
76
|
+
"B": {
|
|
77
|
+
"content": {
|
|
78
|
+
"grade_score_accuracy": 3,
|
|
79
|
+
"protocol_adherence": 2
|
|
80
|
+
},
|
|
81
|
+
"structure": {
|
|
82
|
+
"efficiency": 4,
|
|
83
|
+
"error_handling": 5
|
|
84
|
+
},
|
|
85
|
+
"content_score": 2.5,
|
|
86
|
+
"structure_score": 4.5,
|
|
87
|
+
"overall_score": 7.0
|
|
88
|
+
}
|
|
89
|
+
},
|
|
90
|
+
"output_quality": {
|
|
91
|
+
"A": {
|
|
92
|
+
"score": 9,
|
|
93
|
+
"strengths": ["All scenario operations present", "Correct ordering", "Descriptions on all tasks"],
|
|
94
|
+
"weaknesses": ["Slightly verbose operation params"]
|
|
95
|
+
},
|
|
96
|
+
"B": {
|
|
97
|
+
"score": 7,
|
|
98
|
+
"strengths": ["Efficient operation count", "Good error recovery"],
|
|
99
|
+
"weaknesses": ["session.list came after first task op (-10 S1)", "No admin.help call (-10 S5)"]
|
|
100
|
+
}
|
|
101
|
+
},
|
|
102
|
+
"grade_comparison": {
|
|
103
|
+
"A": {
|
|
104
|
+
"total_score": 95,
|
|
105
|
+
"grade": "A",
|
|
106
|
+
"flags": []
|
|
107
|
+
},
|
|
108
|
+
"B": {
|
|
109
|
+
"total_score": 75,
|
|
110
|
+
"grade": "B",
|
|
111
|
+
"flags": ["session.list called after task ops", "No admin.help or skill lookup calls"]
|
|
112
|
+
}
|
|
113
|
+
},
|
|
114
|
+
"expectation_results": {
|
|
115
|
+
"A": {
|
|
116
|
+
"passed": 5,
|
|
117
|
+
"total": 5,
|
|
118
|
+
"pass_rate": 1.0,
|
|
119
|
+
"details": [
|
|
120
|
+
{"text": "session.list before any task op", "passed": true},
|
|
121
|
+
{"text": "session.end called", "passed": true},
|
|
122
|
+
{"text": "tasks.find used for discovery", "passed": true},
|
|
123
|
+
{"text": "admin.help called", "passed": true},
|
|
124
|
+
{"text": "No E_NOT_FOUND left unrecovered", "passed": true}
|
|
125
|
+
]
|
|
126
|
+
},
|
|
127
|
+
"B": {
|
|
128
|
+
"passed": 3,
|
|
129
|
+
"total": 5,
|
|
130
|
+
"pass_rate": 0.60,
|
|
131
|
+
"details": [
|
|
132
|
+
{"text": "session.list before any task op", "passed": false},
|
|
133
|
+
{"text": "session.end called", "passed": true},
|
|
134
|
+
{"text": "tasks.find used for discovery", "passed": true},
|
|
135
|
+
{"text": "admin.help called", "passed": false},
|
|
136
|
+
{"text": "No E_NOT_FOUND left unrecovered", "passed": true}
|
|
137
|
+
]
|
|
138
|
+
}
|
|
139
|
+
}
|
|
140
|
+
}
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
## Tie Handling
|
|
144
|
+
|
|
145
|
+
If overall scores are within 0.5 points, declare `"winner": "tie"` and note both performed equivalently.
|
|
146
|
+
|
|
147
|
+
## Final Summary
|
|
148
|
+
|
|
149
|
+
After writing comparison.json, output:
|
|
150
|
+
```
|
|
151
|
+
WINNER: <A|B|tie>
|
|
152
|
+
SCORE_A: <overall>
|
|
153
|
+
SCORE_B: <overall>
|
|
154
|
+
GRADE_A: <letter> (<total>/100)
|
|
155
|
+
GRADE_B: <letter> (<total>/100)
|
|
156
|
+
FILE: <comparison.json path>
|
|
157
|
+
```
|