npm - @ericrisco/rsc - Versions diffs - 0.1.31 → 0.1.33 - Mend

@ericrisco/rsc 0.1.31 → 0.1.33

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (40)

package/README.md +4 -4
package/manifest.json +24 -5
package/package.json +1 -1
package/scripts/lib/domains.js +1 -1
package/skills/analyze/SKILL.md +1 -0
package/skills/author-skill/SKILL.md +20 -0
package/skills/author-skill/references/description-recipe.md +2 -0
package/skills/debug/SKILL.md +1 -1
package/skills/implement/SKILL.md +72 -2
package/skills/implement/references/per-task-review.md +46 -0
package/skills/implement/scripts/review-package +59 -0
package/skills/implement/scripts/sdd-workspace +47 -0
package/skills/implement/scripts/task-brief +77 -0
package/skills/parallel/SKILL.md +29 -0
package/skills/plan/references/plan-template.md +18 -0
package/skills/roast-me/SKILL.md +124 -0
package/skills/roast-me/evals/README.md +76 -0
package/skills/roast-me/evals/cases.yaml +75 -0
package/skills/roast-me/prompts/analyze.md +90 -0
package/skills/roast-me/prompts/compute.md +100 -0
package/skills/roast-me/prompts/roast.md +181 -0
package/skills/roast-me/tools/adapters/__init__.py +1 -0
package/skills/roast-me/tools/adapters/__pycache__/__init__.cpython-312.pyc +0 -0
package/skills/roast-me/tools/adapters/__pycache__/base.cpython-312.pyc +0 -0
package/skills/roast-me/tools/adapters/__pycache__/claude.cpython-312.pyc +0 -0
package/skills/roast-me/tools/adapters/__pycache__/codex.cpython-312.pyc +0 -0
package/skills/roast-me/tools/adapters/__pycache__/gemini.cpython-312.pyc +0 -0
package/skills/roast-me/tools/adapters/__pycache__/registry.cpython-312.pyc +0 -0
package/skills/roast-me/tools/adapters/base.py +53 -0
package/skills/roast-me/tools/adapters/claude.py +140 -0
package/skills/roast-me/tools/adapters/codex.py +113 -0
package/skills/roast-me/tools/adapters/gemini.py +121 -0
package/skills/roast-me/tools/adapters/registry.py +68 -0
package/skills/roast-me/tools/extract_prompts.py +520 -0
package/skills/sdd/SKILL.md +23 -0
package/skills/ship/SKILL.md +9 -1
package/skills/specify/SKILL.md +26 -1
package/skills/suggest/SKILL.md +1 -1
package/skills/tasks/SKILL.md +25 -0
package/skills/worktrees/SKILL.md +25 -0

package/skills/roast-me/SKILL.md ADDED

@@ -0,0 +1,124 @@
+---
+name: roast-me
+description: "Use when you want honest, comedic feedback on how you prompt — analyzing your own past sessions to score prompt quality and compute efficiency, surface your worst habits, and generate a model-selection cheat sheet. Triggers: 'roast me', 'roast my prompting', 'audit my prompting habits', 'how good are my prompts', 'am I prompting well', 'what are my bad prompt habits', 'analiza mis prompts', 'puntua els meus prompts', 'how much am I wasting on model costs'. NOT reviewing your code or output quality (use code-review for that) and NOT a general prompt-engineering tutorial (use prompt-engineering for that)."
+tags: [prompting, self-audit, compute-efficiency, ai-hygiene, learning]
+recommends: [prompt-engineering, code-review, context-budget]
+profiles: [full]
+origin: risco
+---
+# roast-me — audit your prompting, score yourself, stop burning money
+You are the **prompt auditor**. Your target is not the user's code — it is their *prompting behaviour*. You read their recent agent transcripts, run structured analysis passes, produce dual scores (Prompt Quality + Compute Efficiency), name the worst habits with named techniques to fix them, and track the trend over time.
+Every run follows five phases. Execute them in order.
+## Phase 1 — Extract
+Parse `$ARGUMENTS` for a day count. Accept `--days N`, `days=N`, or a bare number (e.g. `3` means three days). Default: 7. Also accept `--runtime auto|claude|codex|gemini` (default `auto`).
+Run the extractor:
+```
+python3 <skill_dir>/tools/extract_prompts.py --days <N> --runtime <runtime>
+```
+Where `<skill_dir>` is the directory containing this SKILL.md. Use your runtime's mechanism to resolve it (environment variable, `__file__` equivalent, or the skill's known install path).
+Wait for completion. Read the output JSON path that the script prints. Report to the user:
+```
+Scanned <sessions> sessions across <projects> projects
+Extracted <N> prompts (<errors> with errors, <recovered> auto-recovered, <unrecovered> impactful)
+```
+If `total_prompts` is 0: tell the user "No transcript data found for that window. Try a longer window (`--days 30`) or check that your assistant's transcript directory exists." Then stop.
+**Key distinction**: always report `effective_error_rate` (errors NOT auto-recovered), never the raw error rate. Auto-recovered errors are the agent doing its job — not your fault.
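A minimal sketch of the Phase 1 day-count parsing, not taken from the package. It covers the three accepted forms (`--days N`, `days=N`, a bare number) and the default of 7. The function name `parse_days` and the regex are assumptions made for illustration; the real skill may resolve `$ARGUMENTS` differently.

```python
import re

DEFAULT_DAYS = 7  # default window stated in Phase 1


def parse_days(arguments: str) -> int:
    """Return the day count found in an argument string, or the default.

    Hypothetical helper: the name and parsing strategy are illustrative only.
    """
    # Forms 1 and 2: "--days N", "--days=N" or "days=N"
    match = re.search(r"(?:--days[ =]|\bdays=)(\d+)", arguments)
    if match:
        return int(match.group(1))
    # Form 3: a bare number token, e.g. "3" means three days
    for token in arguments.split():
        if token.isdigit():
            return int(token)
    return DEFAULT_DAYS
```

With this sketch, `parse_days("--runtime claude")` falls through to the default because no token is numeric.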
+## Phase 2 — Analyze Prompt Quality
+Read the extracted JSON. Batch the prompts into groups of ~30. For each batch, use your runtime's subagent/Task mechanism to run a parallel analysis pass with the prompt in `prompts/analyze.md`, passing the batch as JSON.
+Collect results. Group flagged issues by category and severity.
+**Filter rule**: keep only issues where the impact was real — agent went in the wrong direction, user had to correct, dangerous action attempted, or significant wasted work (>10 tool calls). Discard issues where `error_was_recovered` is true.
+Report category counts as a progress update.
+If zero issues are flagged, proceed to Phase 3 anyway — the roast should honour good prompting.
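The batching and filter rule in Phase 2 can be sketched as follows. This is an illustration, not code from the package: the batch size of exactly 30 stands in for "~30", and treating an issue's `index` as a position inside its batch is an assumption the source does not confirm.

```python
from typing import Iterator

BATCH_SIZE = 30  # the skill says "~30"; the exact value is an assumption


def batches(prompts: list[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive groups of at most `size` prompt records."""
    for start in range(0, len(prompts), size):
        yield prompts[start:start + size]


def keep_issue(issue: dict, batch: list[dict]) -> bool:
    """Apply the filter rule: discard issues whose error was auto-recovered.

    Assumes issue["index"] points at a record within the same batch.
    """
    record = batch[issue["index"]]
    return not record.get("error_was_recovered", False)
```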
+## Phase 3 — Analyze Compute Efficiency
+Read the same JSON. Batch into ~30-prompt groups. Spawn parallel subagent passes with `prompts/compute.md`.
+Aggregate across batches:
+- All `overuse_cases` (deduplicate by index)
+- All `thinking_overuse_cases`
+- All `correctly_used_heavy_model` examples
+- Summed totals: `total_overuse_count`, `total_savings_usd`, `thinking_overuse_count`
+- `worst_category` = most frequent `task_type` in overuse_cases
+Keep only `high` and `medium` confidence overuse cases.
+Report:
+```
+Compute analysis: <X> confirmed overuse cases | $Y.YY potential savings | Z reasoning overuse
+```
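A possible shape for the Phase 3 aggregation step, written as a sketch rather than the package's actual logic. It assumes each batch result follows the object layout described in `prompts/compute.md` and that `index` values are unique across batches, which is what "deduplicate by index" seems to imply. The function name `aggregate` is hypothetical.

```python
from collections import Counter

KEPT_CONFIDENCE = ("high", "medium")


def aggregate(batch_results: list[dict]) -> dict:
    """Merge per-batch compute results into one summary (illustrative only)."""
    seen_indexes = set()
    overuse = []
    thinking = []
    for result in batch_results:
        for case in result.get("overuse_cases", []):
            # Keep only high and medium confidence cases
            if case.get("confidence") not in KEPT_CONFIDENCE:
                continue
            # Deduplicate by index (assumes indexes are global, not per batch)
            if case["index"] in seen_indexes:
                continue
            seen_indexes.add(case["index"])
            overuse.append(case)
        thinking.extend(result.get("thinking_overuse_cases", []))
    # worst_category = most frequent task_type among the kept overuse cases
    counts = Counter(case["task_type"] for case in overuse)
    worst = counts.most_common(1)[0][0] if counts else None
    return {
        "overuse_cases": overuse,
        "thinking_overuse_cases": thinking,
        "total_overuse_count": len(overuse),
        "thinking_overuse_count": len(thinking),
        "worst_category": worst,
    }
```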
+## Phase 4 — Generate Roast
+Spawn a single subagent with `prompts/roast.md`. Pass it:
+- Aggregated issue counts by category and severity
+- Top ~15 worst prompt examples (highest severity + real impact)
+- Stats metadata (especially `effective_error_rate`)
+- A sample of ~10 issue-free prompts for the "What You Do Well" section
+- `compute_stats` from the extraction metadata
+- Aggregated compute analysis (overuse cases, thinking overuse, correctly-used examples, totals)
+Collect the roast report. Extract the dual score (Prompt Quality 0–100, Compute Efficiency 0–100) and the grade letters.
+## Phase 5 — Score, Track, Present
+Save results to `~/.roast-me-history.json`. Read existing history (if any), append a new entry:
+```json
+{
+  "date": "YYYY-MM-DD",
+  "runtime": "auto",
+  "days_analyzed": 7,
+  "prompt_quality_score": 73,
+  "prompt_quality_grade": "C",
+  "compute_efficiency_score": 35,
+  "compute_efficiency_grade": "F",
+  "total_prompts": 200,
+  "issues_flagged": 30,
+  "effective_error_rate": 0.08,
+  "correction_rate": 0.06,
+  "focus_of_week": "The 3W Rule",
+  "compute_total_cost_usd": 22.50,
+  "compute_wasted_cost_usd": 8.10,
+  "compute_overuse_count": 30,
+  "model_distribution": {"heavy": 0.4, "balanced": 0.4, "light": 0.2}
+}
+```
+Write the updated history back to `~/.roast-me-history.json`.
+If previous entries exist, append a trend line after the main report:
+```
+Score History:
+  Date        Prompt Quality    Compute Efficiency    Focus
+  2026-06-01  68/100 (D)        --/-- (new)           Context anchoring
+  2026-06-08  73/100 (C) +5↑   35/100 (F)            The 3W Rule
+```
+Output the roast report as formatted markdown. If history exists, append the trend line.
+Done.
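The read, append, write cycle in Phase 5 could look roughly like this. It is a sketch under one loud assumption: that `~/.roast-me-history.json` holds a top-level JSON array of entries, which the skill text does not state. `append_history` and `score_delta` are hypothetical names.

```python
import json
from pathlib import Path

HISTORY_PATH = Path.home() / ".roast-me-history.json"


def append_history(entry: dict, path: Path = HISTORY_PATH) -> list[dict]:
    """Append one run to the history file and return the full history.

    Assumes the file, when present, contains a JSON array of entries.
    """
    history = []
    if path.exists():
        history = json.loads(path.read_text(encoding="utf-8"))
    history.append(entry)
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")
    return history


def score_delta(history: list[dict], key: str = "prompt_quality_score"):
    """Difference between the latest and the previous score, or None."""
    if len(history) < 2:
        return None
    return history[-1][key] - history[-2][key]
```

A delta like this is one way to produce the `+5↑` marker shown in the trend line when a previous entry exists.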
+## Orientation (always)
+Close every turn with the **compass block** (📍 where you are · ✅ what you did · 🧭 why · ➡️ next, ending in a question), calibrated to the dial in `02-DOCS/wiki/harness/user-profile.md`. **Never end abruptly.** Full protocol: skill `orient` → `skills/orient/references/orientation-contract.md`. (Defer the "should I install the missing skill?" question to `suggest`.)

package/skills/roast-me/evals/README.md ADDED

@@ -0,0 +1,76 @@
+# Eval harness — `roast-me`
+Evaluates the `roast-me` skill on two axes: **triggering** (does it fire on
+the right prompts and stay quiet on near-misses) and **capability** (does loading
+it produce a correct, useful roast run with both scores, the right habit list,
+and the trend tracking). Cases live in `cases.yaml`. These run via an **agent
+harness** — a human or driver agent feeds prompts to the assistant and judges
+the result against the rubrics.
+## What is in `cases.yaml`
+- `should_trigger` (7) — prompts that MUST invoke `roast-me`. Includes:
+  - verbatim "roast me"
+  - time-window variant ("last two weeks")
+  - cost-only angle (no mention of prompting quality)
+  - two non-English triggers (Spanish, Catalan)
+  - non-obvious phrasing ("honest score", "am I wasting money")
+- `should_not_trigger` (5) — near-misses that must route elsewhere, each with
+  a real sibling `route_to`: `code-review`, `prompt-engineering`, `analyze`,
+  `chatbot`, `llm-pipeline`.
+- `capability` (2) — scenarios with `must_include` rubrics:
+  1. A full 14-day run with realistic data: error rates, tier distribution,
+     top issues, scoring, and history tracking.
+  2. Unknown runtime degradation: exits 0, clear message, no crash.
+## A. Triggering eval
+1. Load **only** `roast-me` into the agent (no other rsc skills, so routing is honest).
+2. For each `should_trigger` prompt: fresh session, paste verbatim, record
+   whether the five-phase pipeline starts. Run **3–5 trials** per prompt.
+3. For each `should_not_trigger` prompt: same, but a **pass** = `roast-me`
+   does NOT fire. Sanity-check that the `route_to` sibling genuinely owns
+   that prompt.
+4. Score: a prompt passes if the **majority of its trials** go the expected way.
+**Pass bar**: >= 90% trigger accuracy across all 12 prompts (at most 1 misbehaving).
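The triggering score above reduces to two small checks: a per-prompt majority vote and an overall accuracy threshold. The sketch below is illustrative and not part of the harness; the function names are hypothetical, and treating a tie as a failure is an assumption (the README only says "majority").

```python
def prompt_passes(trials: list[bool]) -> bool:
    """A prompt passes when the majority of its trials went the expected way.

    A tie does not count as a majority (assumption).
    """
    return sum(trials) > len(trials) / 2


def meets_trigger_bar(results: dict[str, list[bool]], bar: float = 0.90) -> bool:
    """True when the share of passing prompts reaches the pass bar."""
    passed = sum(prompt_passes(trials) for trials in results.values())
    return passed / len(results) >= bar
```

With 12 prompts, 11 passing gives about 91.7% and clears the bar, while 10 passing gives about 83.3% and does not, which matches "at most 1 misbehaving".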
+## B. Capability eval
+1. **Without the skill**: fresh session, skill NOT loaded, give the `scenario`
+   prompt. Save output A.
+2. **With the skill**: fresh session, `roast-me` loaded, same prompt. Save output B.
+3. Grade each output against that scenario's `must_include` points.
+4. Repeat across **3 trials** per scenario per condition and average.
+**Pass bar**: WITH the skill covers >= 80% of `must_include` points. WITHOUT
+should be materially lower (target >= 30-point gap). If the skill does not
+beat the baseline, the skill or the rubrics need work.
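As a rough illustration of the capability pass bar, the helper below averages per-trial coverage for both conditions and checks the 80% bar and the 30-point gap target. It is a sketch only: `capability_verdict` is a hypothetical name, and coverage is assumed to be expressed as a fraction of `must_include` points met per trial.

```python
from statistics import mean


def capability_verdict(with_skill: list[float], without_skill: list[float]) -> dict:
    """Compare average must_include coverage with and without the skill."""
    avg_with = mean(with_skill) * 100
    avg_without = mean(without_skill) * 100
    gap = avg_with - avg_without
    return {
        "with_pct": round(avg_with, 1),
        "gap_points": round(gap, 1),
        "passes_bar": avg_with >= 80,      # WITH the skill covers >= 80%
        "meets_gap_target": gap >= 30,     # target gap versus the baseline
    }
```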
+## Key differentiators WITH the skill loaded
+- **Effective vs raw error rate**: the skill uses `effective_error_rate`
+  (unrecovered errors only), never the raw rate. A high raw / low effective
+  split is praised, not penalised.
+- **Parallel subagents**: the skill explicitly batches prompts into ~30-item
+  groups and spawns parallel analysis passes per batch — a baseline answer
+  typically analyses sequentially.
+- **Dual scoring formula**: Prompt Quality (base 70 ± issue penalties/bonuses)
+  and Compute Efficiency (base 80 − overuse penalties + mix-down bonus) are
+  computed independently to the documented formula.
+- **Trend tracking**: history is written to `~/.roast-me-history.json` and a
+  trend line is printed if prior entries exist.
+- **Runtime degradation**: unknown `--runtime` exits 0 with a clear message —
+  no crash, no invented data.
+- **Original rsc voice**: no copied phrasing from other skill ecosystems.
+## Judging notes
+- This is LLM-as-judge / human-in-the-loop, not deterministic. Use a consistent
+  grader (same model + rubric) across A/B.
+- `scripts/eval-lint.sh` checks case-count minimums; it does not grade prose.
+- Re-run after any edit to `SKILL.md` or its `description` — both axes are
+  wording-sensitive.
+- The key confusable: `roast-me` reviews the user's **prompting behaviour**
+  (how they write prompts), not their code and not the agent's output quality.
+  Any eval that conflates these is testing the wrong thing.

package/skills/roast-me/evals/cases.yaml ADDED

@@ -0,0 +1,75 @@
+skill: roast-me
+# Eval cases for the roast-me skill.
+# roast-me audits the USER's own prompting behaviour — it reads their past
+# agent session transcripts, produces dual scores (Prompt Quality + Compute
+# Efficiency), names their worst habits, and tracks a trend over time.
+# It does NOT review code, does NOT review agent output quality, and is NOT
+# a general prompt-engineering tutorial.
+should_trigger:
+  - prompt: "Roast me."
+    why: "Canonical verbatim invocation — the skill's own trigger phrase. Must fire immediately and run the five-phase pipeline."
+  - prompt: "Audit my prompting habits over the last two weeks."
+    why: "Situation-first phrasing with a custom time window — roast-me owns both the 'audit prompting' job and the --days N parameter."
+  - prompt: "Am I prompting the AI well or am I wasting money?"
+    why: "Dual concern (prompt quality + cost) phrased as a question — maps directly to the two scores roast-me produces. No mention of 'roast' keyword."
+  - prompt: "Analiza mis prompts de la última semana y dime dónde soy malo."
+    why: "Spanish-language trigger — rsc users write Spanish; the description must fire on non-English phrasings. Tests the non-English trigger."
+  - prompt: "Puntua els meus prompts i digue'm quins hàbits hauria de canviar."
+    why: "Catalan-language trigger — a non-obvious language variant. The description lists this phrasing explicitly."
+  - prompt: "How much am I burning on model costs? Show me where I overspend."
+    why: "Compute-efficiency angle only, no prompting-quality framing — roast-me's compute score and cheat sheet are the direct answer."
+  - prompt: "Give me an honest score on how I prompt Claude."
+    why: "Non-obvious phrasing ('honest score') without the roast keyword — tests situation matching over keyword matching."
+should_not_trigger:
+  - prompt: "Review this pull request and flag any bugs before we merge."
+    route_to: "code-review"
+    why: "Reviewing code output is code-review's job. roast-me reviews how the user prompts, not the code they produced."
+  - prompt: "Write me a guide on how to prompt AI assistants effectively."
+    route_to: "prompt-engineering"
+    why: "A general tutorial on prompting technique is prompt-engineering. roast-me reads live transcripts and produces a personalised score, not a tutorial."
+  - prompt: "Summarise what our agent did in yesterday's session."
+    route_to: "analyze"
+    why: "Summarising what an agent did (output review) is the analyze skill's job or a generic task. roast-me looks at how the USER prompted, not what the agent produced."
+  - prompt: "I need a system prompt template for a customer-support chatbot."
+    route_to: "chatbot"
+    why: "Authoring a system prompt for a product is a different job from auditing a user's own prompting habits. No transcript data involved."
+  - prompt: "Help me reduce my Anthropic API costs by switching to cheaper models in my app."
+    route_to: "llm-pipeline"
+    why: "Optimising model selection in production application code is llm-pipeline. roast-me analyses the user's interactive session spend, not application-level API calls."
+capability:
+  - scenario: "User runs '/roast-me --days 14'. The extractor finds 180 Claude Code prompts: 12 with unrecovered errors, 25 auto-recovered, and 15 corrections. Model tier split is 60% heavy, 30% balanced, 10% light. Top issues: 8 VAGUE openers and 5 SCOPE_CREEP cases. Show what a correct run looks like."
+    must_include:
+      - "Phase 1: runs extract_prompts.py with --days 14 and reports Scanned N sessions, Extracted 180 prompts (12 impactful errors, 25 auto-recovered)"
+      - "Reports effective_error_rate (12/180 = ~6.7%) not the raw rate (37/180 = ~20.6%); explicitly notes that auto-recovered errors are the agent doing its job"
+      - "Phase 2: batches into ~30-prompt groups and spawns parallel subagent passes with prompts/analyze.md, passing the batch as JSON"
+      - "Filters aggressively: only keeps issues where error_was_recovered is false or a correction immediately followed — discards the 25 auto-recovered"
+      - "Phase 3: spawns parallel compute-analysis subagents with prompts/compute.md; aggregates overuse_cases with high/medium confidence only"
+      - "Phase 4: spawns roast subagent with prompts/roast.md and all aggregated data; extracts dual score (Prompt Quality 0-100, Compute Efficiency 0-100)"
+      - "Prompt Quality score: starts at 70, deducted for VAGUE and SCOPE_CREEP issues, adjusted for rates; must land in a D-F range given 8 high + 5 medium issues"
+      - "Compute Efficiency score: 60% heavy tier with confirmed overuse triggers a low score (F range) and a personalized tier-selection cheat sheet"
+      - "Phase 5: appends to ~/.roast-me-history.json with date, both scores, grades, and focus_of_week; prints trend line if history exists"
+      - "Top 3 habits section names VAGUE openers and SCOPE_CREEP with concrete before/after rewrites and named techniques"
+      - "What You Do Well section is present and finds genuine positives from the issue-free prompts"
+      - "Focus of the Week names a single actionable change — likely a compute-related one given the large gap between quality and efficiency scores"
+  - scenario: "User runs '/roast-me --runtime xyz' where xyz is an unrecognised runtime. Show the correct degradation behaviour."
+    must_include:
+      - "Prints a clear message: 'Unknown runtime xyz. Known runtimes: claude, codex, gemini. Use auto to try all.'"
+      - "Exits 0 — no crash, no stack trace, no exception shown to the user"
+      - "Writes an empty result JSON and prints its path so downstream tooling does not break"
+      - "Does NOT invent records, does NOT guess data, does NOT proceed to the analysis phases"
+      - "Tells the user what to do next (try --runtime auto or check the available runtime IDs)"
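The effective versus raw error rate arithmetic in the first capability scenario can be checked with a few lines. This is a worked sketch, not package code; `error_rates` is a hypothetical helper, and the counts (180 prompts, 12 unrecovered, 25 auto-recovered) come from the scenario text.

```python
def error_rates(total: int, unrecovered: int, recovered: int) -> tuple[float, float]:
    """Return (raw_rate, effective_rate) as fractions of all prompts."""
    if total == 0:
        return 0.0, 0.0
    raw = (unrecovered + recovered) / total   # every prompt followed by an error
    effective = unrecovered / total           # only errors the agent did not recover
    return raw, effective


raw, effective = error_rates(total=180, unrecovered=12, recovered=25)
print(f"effective {effective:.1%}, raw {raw:.1%}")
```

The gap between the two numbers is the share of errors the agent recovered from on its own, which the skill treats as normal behaviour rather than a prompting fault.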

package/skills/roast-me/prompts/analyze.md ADDED

@@ -0,0 +1,90 @@
+# Prompt Quality Analysis
+You are analysing user prompts extracted from AI assistant session logs. Your job is to find the prompts that — due to how they were written — caused measurable negative outcomes. You are NOT here to polish good prompts. You are here to find the ones that genuinely cost the user time or effort.
+## Input
+You will receive a JSON array of prompt records. Each record has:
+- `prompt_text` — the user's message (may be truncated)
+- `prompt_length` — full length before truncation
+- `prompt_position` — 1-based index in the conversation (1 = opening prompt)
+- `total_prompts_in_session` — how many user prompts were in this session
+- `has_xml_tags` — whether the prompt uses XML structural tags
+- `has_file_paths` — whether explicit file paths are mentioned
+- `has_code_blocks` — whether code blocks appear in the prompt
+- `followed_by_error` — a tool error occurred after this prompt
+- `error_was_recovered` — the agent recovered from the error without user help
+- `followed_by_correction` — the user had to correct the agent's next action
+- `correction_text` — what the correction said
+- `error_tool` — which tool produced the error
+- `error_text` — the error message
+- `context_before` — the preceding assistant message (for context)
+## The Golden Rule: Impact First
+**Only flag issues where the writing of the prompt was a meaningful cause of a bad outcome.** The user is a productive senior developer. Most short prompts, most exploration errors, and most terse follow-ups are completely normal and correct.
+### Do NOT flag (be generous):
+- Short continuations: "yes", "ok", "commit", "looks good", "go ahead" — these are normal turn-taking, not bad prompts
+- Deep-in-session brevity (high `prompt_position`) — context is established; terseness is appropriate
+- Any prompt where `followed_by_error = true` AND `error_was_recovered = true` — the agent handled it; the prompt did its job
+- Simple, clear requests that worked cleanly (no error, no correction)
+- System or slash-command messages (content like `<command-message>` wrapper tags)
+- File-not-found during exploration — this is normal agent behaviour
+- Errors caused by environment issues, not by prompt ambiguity
+### DO flag (only these):
+- Prompt was so vague that the agent went in a completely wrong direction and the user had to redirect
+- Missing context (file, error message, expected behaviour) that **directly caused** an unrecovered error or wasted significant work (>10 tool calls)
+- User had to correct the agent immediately after — meaning the prompt was genuinely misleading
+- Multiple unrelated tasks in one prompt that caused one of them to fail
+- Prompt that triggered a dangerous or irreversible action (mass deletion, dropping data, force-pushing to main)
+## Issue Categories
+| Code | When to apply |
+|------|---------------|
+| `VAGUE` | "fix it", "make it work", "clean this up" — zero specifics, agent had no signal to start correctly |
+| `NO_CONTEXT` | Missing file path, error message, or expected vs actual behaviour — would have changed where the agent looked |
+| `NEGATIVE` | Specifies only what NOT to do without saying what TO do — agent guessed wrong as a result |
+| `NO_CRITERIA` | No way to know when the task is done — agent produced something, user rejected it, but the prompt gave no target |
+| `WALL_OF_TEXT` | Long unstructured paragraph for a complex multi-step task that should use headings, lists, or XML sections |
+| `SCOPE_CREEP` | Multiple unrelated tasks crammed together — one failed because the other consumed all the agent's attention |
+| `SELF_CONTRADICT` | The user's next message corrected the direction — meaning the prompt was internally inconsistent or misleading |
+| `NO_STRUCTURE` | Complex multi-step or multi-file task with no structure at all — agent had to guess the sequencing |
+| `CAUSED_FAILURE` | Prompt wording directly caused a tool error or a wrong action (dangerous, irreversible, or significantly wasteful) |
+## Severity Guide
+- **high** — prompt directly caused wasted work (>20 tool calls), a dangerous action, or required significant correction effort
+- **medium** — prompt caused moderate inefficiency (5–20 wasted tool calls) or noticeable misdirection
+- **low** — minor improvement possible; the prompt mostly worked but a small change would have helped
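One way to read the numeric thresholds in the severity guide is sketched below. It is an interpretation, not the prompt's actual grading logic: the guide also weighs correction effort and misdirection, which a tool-call count cannot capture, and the function name and `dangerous` flag are hypothetical.

```python
def severity(wasted_tool_calls: int, dangerous: bool = False) -> str:
    """Map wasted work to a severity label using the guide's thresholds."""
    if dangerous or wasted_tool_calls > 20:
        return "high"      # >20 wasted tool calls or a dangerous action
    if wasted_tool_calls >= 5:
        return "medium"    # 5-20 wasted tool calls
    return "low"           # the prompt mostly worked
```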
+## Output Format
+Return a JSON array. Include one object per prompt that has real issues. Skip issue-free prompts — most records should produce no entry.
+```json
+[
+  {
+    "index": 0,
+    "issues": ["VAGUE", "NO_CONTEXT"],
+    "severity": "high",
+    "impact": "Agent spent 28 tool calls scanning the wrong directory before the user provided the file path",
+    "explanation": "Opening prompt says 'fix the login bug' with no file, error, or expected behaviour — agent had to guess everything",
+    "technique": "The 3W Rule: What (the problem), Where (file/service), Why (expected vs actual). Front-load all three.",
+    "rewrite_suggestion": "Fix the auth middleware in src/middleware/auth.ts — valid JWT tokens are returning 401. Expected: token verified and request passed through. Error: [paste stack trace]",
+    "original_prompt_snippet": "first 200 chars of prompt_text"
+  }
+]
+```
+Field notes:
+- `impact` — be specific: wasted tool calls, dangerous action, correction loops. Not "unclear".
+- `technique` — a named, reusable pattern the user can consciously apply next time.
+- `rewrite_suggestion` — a concrete rewrite of THIS prompt applying the technique.
+Be constructive. The goal is a skill lesson attached to a real example, not a catalogue of imperfections.

package/skills/roast-me/prompts/compute.md ADDED

@@ -0,0 +1,100 @@
+# Compute Efficiency Analysis
+You are analysing a batch of user prompts to identify cases where a heavier (more expensive) AI model tier was used for work that a lighter tier handles equally well. Your output feeds the "Compute Efficiency" score and the model-selection cheat sheet in the roast report.
+## Tier Ladder (provider-neutral)
+This skill uses three tier labels. The mapping to specific model names shifts over releases — the tiers are stable even when the names change.
+| Tier | Description | Example models (2026) | Relative cost |
+|------|-------------|----------------------|---------------|
+| `light` | Fast, cheap, low reasoning | Haiku, GPT-3.5, Gemini Flash | 1× |
+| `balanced` | General purpose, solid reasoning | Sonnet, GPT-4 mini, Gemini Pro | 3–5× |
+| `heavy` | Frontier reasoning, long-horizon work | Opus, Fable/Mythos, GPT-4 full | 10–50× |
+Overuse = using `heavy` where `balanced` would suffice, or using `balanced`/`heavy` where `light` would suffice.
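The overuse definition amounts to comparing positions on the tier ladder. The sketch below illustrates that comparison; it is not code from the package, the rank values and the name `is_overuse` are assumptions, and any unrecognised tier (including `unknown`) is skipped rather than guessed.

```python
TIER_RANK = {"light": 0, "balanced": 1, "heavy": 2}


def is_overuse(used: str, recommended: str) -> bool:
    """True when the tier used sits above the recommended tier on the ladder."""
    if used not in TIER_RANK or recommended not in TIER_RANK:
        return False  # unidentified tier: skip it, do not guess
    return TIER_RANK[used] > TIER_RANK[recommended]
```

This only captures the ordering; deciding which tier a task actually needs is still a judgment the analysis pass has to make.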
+## Input
+You will receive a JSON array of prompt records. Each record includes:
+- `prompt_text` — the user's message
+- `prompt_length` — character count
+- `task_complexity` — heuristic classification: `simple`, `moderate`, `complex`
+- `recommended_tier` — what the extractor heuristic suggests: `light`, `balanced`, `heavy`
+- `model_tier` — the tier that was actually used (or `unknown`)
+- `compute_was_overkill` — boolean from the extractor heuristic (sanity-check, not gospel)
+- Standard prompting context fields (`has_xml_tags`, `followed_by_error`, `error_was_recovered`, etc.)
+## What counts as overuse
+### Definite overuse (flag with `high` confidence):
+- `heavy` model for: a one-word confirmation ("yes", "ok", "commit"), a read-only lookup, a formatting/linting fix, a simple file rename
+- `heavy` + extended reasoning for: anything classified `simple`
+- `balanced` or `heavy` for: a bare file existence check, grepping a single pattern, listing directory contents
+### Probable overuse (flag with `medium` confidence):
+- `heavy` model for: a single-file edit with clear, contained scope
+- `heavy` + extended reasoning for: a short explanatory question with an obvious answer
+- `balanced` for: a simple file read or "what does X mean?" question
+### Not overuse (do NOT flag):
+- `heavy` for: multi-file refactors, architectural decisions, debugging complex concurrency or performance issues, long autonomous runs (the model needs to hold a lot in context)
+- `heavy` + extended reasoning for: genuinely ambiguous debugging, performance root-cause analysis, security audit, or migration planning
+- `unknown` tier — if the model cannot be identified, skip it; do not guess
+## Extended Reasoning Overuse
+If a prompt shows signs that extended/chain-of-thought reasoning was used (the model took much longer than expected, or reasoning tokens appear in context), flag it as `thinking_overuse` if the task is clearly simple or moderate.
+## Output Format
+Return a single JSON object:
+```json
+{
+  "overuse_cases": [
+    {
+      "index": 2,
+      "task_type": "simple_confirmation",
+      "model_tier_used": "heavy",
+      "recommended_tier": "light",
+      "confidence": "high",
+      "reasoning": "Single-word confirmation 'yes' — model has no work to do; light tier is identical in outcome",
+      "example_prompt": "yes",
+      "estimated_tier_ratio": 10
+    }
+  ],
+  "thinking_overuse_cases": [
+    {
+      "index": 7,
+      "confidence": "medium",
+      "reasoning": "Simple file rename using heavy model with extended reasoning — no branching logic required"
+    }
+  ],
+  "correctly_used_heavy_model": [
+    {
+      "index": 14,
+      "reasoning": "Multi-file authentication refactor across 8 files with security implications — heavy tier justified"
+    }
+  ],
+  "total_overuse_count": 3,
+  "total_savings_estimate": "high",
+  "thinking_overuse_count": 1,
+  "worst_category": "simple_confirmation"
+}
+```
+`task_type` values (pick the closest):
+- `simple_confirmation` — yes/no/ok/lgtm/commit/ship-it
+- `read_only_lookup` — read/show/list/cat/find
+- `style_fix` — format/lint/prettier/semicolons
+- `single_file_edit` — one file, clear contained scope
+- `simple_question` — what is X / explain Y (short answer expected)
+- `multi_file_work` — spans multiple files (usually NOT overuse at heavy tier)
+- `architectural_decision` — design/plan/migrate/strategy (usually NOT overuse)
+- `debug_complex` — race conditions, memory leaks, performance (usually NOT overuse)
+`estimated_tier_ratio`: approximate cost ratio (e.g. 10 means the heavy tier cost ~10× the recommended tier). Use rough multiples: light→balanced ≈ 3–5×; light→heavy ≈ 10–50×; balanced→heavy ≈ 3–10×.
+Only return `high` and `medium` confidence cases. If you are uncertain, omit the record — false positives harm the user's trust in the report.

package/skills/roast-me/prompts/roast.md ADDED

@@ -0,0 +1,181 @@
+# Roast Report Generator
+You are producing a prompting audit report that is **educational, funny, and actionable**. Comedy-roast energy: every joke should leave the user knowing something they did not know before. Never insult. Always teach.
+## Input
+You will receive:
+1. **Issue summary** — aggregated prompt quality issues by category and severity, with counts.
+2. **Worst prompt examples** — up to 15 prompts with high/medium severity issues, including `impact`, `technique`, and `rewrite_suggestion` from the analysis phase.
+3. **Stats metadata** — extraction numbers: `effective_error_rate`, `correction_rate`, `avg_length`, `xml_usage_rate`, `file_path_rate`, `total_prompts`.
+4. **Good prompt examples** — up to 10 prompts with no issues (for "What You Do Well").
+5. **Compute stats** — `tier_distribution`, `heuristic_overuse_count`, total prompts, and any cost data if available.
+6. **Compute analysis** — overuse cases, thinking overuse, correctly-used heavy-model examples, and totals.
+## Fairness Rule
+The stats include two error rates:
+- `error_rate` — raw rate of prompts followed by any tool error (includes normal exploration)
+- `effective_error_rate` — only errors that were NOT auto-recovered
+**Use `effective_error_rate` as the primary metric.** High raw error rate with low effective rate = the agent is doing its job. Praise this, do not penalise it.
+## Scoring
+Compute two independent scores from the data you receive.
+### Prompt Quality Score (0–100)
+Start at 70 (baseline for an active user who already ships things).
+Adjustments:
+- -2 per high-severity issue (cap at -30 total from high-severity)
+- -1 per medium-severity issue (cap at -15 total from medium-severity)
+- +5 if xml_usage_rate > 0.20
+- +5 if file_path_rate > 0.50
+- +5 if effective_error_rate < 0.15
+- +5 if correction_rate < 0.10
+- -10 if any prompt caused a destructive/irreversible action (DROP, mass delete, force-push to main)
+Clamp to 0–100.
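The Prompt Quality adjustments listed above can be written out as a small function. This is a sketch for checking the arithmetic, not the roast subagent's implementation; the function and parameter names are hypothetical, while the base, caps, thresholds and bonuses are taken from the list.

```python
def prompt_quality_score(
    high_issues: int,
    medium_issues: int,
    xml_usage_rate: float,
    file_path_rate: float,
    effective_error_rate: float,
    correction_rate: float,
    destructive_action: bool = False,
) -> int:
    """Apply the documented adjustments to the baseline of 70."""
    score = 70
    score -= min(2 * high_issues, 30)   # -2 per high issue, capped at -30
    score -= min(medium_issues, 15)     # -1 per medium issue, capped at -15
    if xml_usage_rate > 0.20:
        score += 5
    if file_path_rate > 0.50:
        score += 5
    if effective_error_rate < 0.15:
        score += 5
    if correction_rate < 0.10:
        score += 5
    if destructive_action:
        score -= 10
    return max(0, min(100, score))      # clamp to 0-100
```

For example, 8 high and 5 medium issues with both rate bonuses but no XML or file-path bonus gives 70 - 16 - 5 + 10 = 59.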
+### Compute Efficiency Score (0–100)
+Start at 80.
+Adjustments:
+- Subtract `(confirmed_overuse_count / total_prompts) * 60` (overuse rate penalty)
+- Subtract `(thinking_overuse_count / total_prompts) * 20` (reasoning overuse penalty)
+- +10 if lighter tiers (balanced or light) appear in the tier distribution alongside any heavy usage (bonus for mixing down)
+- +10 if the majority of prompts used balanced or light tiers
+Clamp to 0–100.
+Grade mapping (apply to both scores): A (90+), B (80–89), C (70–79), D (60–69), F (<60).
+Display as:
+```
+## Prompt Quality: 73/100 (C) | Compute Efficiency: 35/100 (F)
+```
+Follow each score with a one-liner. Make it funny. Make it teach something.
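The Compute Efficiency formula and the shared grade mapping can be sketched the same way. This is illustrative only: the names are hypothetical, rounding the result to an integer is an assumption (the prompt does not say how fractions are handled), and "majority" is read as a combined balanced plus light share above 50%.

```python
def compute_efficiency_score(
    confirmed_overuse: int,
    thinking_overuse: int,
    total_prompts: int,
    tier_distribution: dict[str, float],
) -> int:
    """Apply the documented penalties and bonuses to the baseline of 80."""
    score = 80.0
    score -= confirmed_overuse / total_prompts * 60   # overuse rate penalty
    score -= thinking_overuse / total_prompts * 20    # reasoning overuse penalty
    heavy = tier_distribution.get("heavy", 0.0)
    lighter = tier_distribution.get("balanced", 0.0) + tier_distribution.get("light", 0.0)
    if heavy > 0 and lighter > 0:
        score += 10   # lighter tiers appear alongside heavy usage
    if lighter > 0.5:
        score += 10   # majority of prompts on balanced or light
    return max(0, min(100, round(score)))  # rounding is an assumption


def grade(score: int) -> str:
    """Grade mapping shared by both scores."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"
```

With 50 confirmed overuse cases and 10 reasoning overuse cases out of 100 prompts on a 60/30/10 heavy/balanced/light split, the sketch gives 80 - 30 - 2 + 10 = 58, an F.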
+---
+## Output Structure
+### 1. Dual Score & Grade
+Show both scores with grades and one-liner commentary as shown above.
+### 2. Top 3 Habits to Break
+Pick three by real impact — wasted tool calls, dangerous actions, correction frequency. Do NOT manufacture criticism if there are fewer than three real issues. For each:
+- **The habit** — a named pattern (e.g., "The Vague Opener")
+- **Impact** — what actually went wrong, specifically (tool calls wasted, corrections needed, dangerous action)
+- **The technique** — a named, reusable prompting move to fix it
+- **Before / After** — actual quote from their prompts → a concrete rewrite
+### 3. Stats Dashboard
+Format as a table. Lead with `effective_error_rate` and show the raw rate in parentheses for context.
+| Metric | Value | Verdict |
+|--------|-------|---------|
+| Effective error rate | X% (Y% raw, Z% auto-recovered) | [verdict] |
+| Correction rate | X% | [verdict] |
+| Avg prompt length | X chars | [verdict] |
+| Structured prompts (XML / markdown) | X% | [verdict] |
+| File paths included | X% | [verdict] |
+### 4. Compute Efficiency Report
+#### 4a. Where the Money Went
+Show the tier distribution and overuse summary. If cost data is available, show dollar figures. If not, use counts and tier ratios.
+| Metric | Value |
+|--------|-------|
+| Tier split | heavy X% / balanced Y% / light Z% |
+| Confirmed overuse cases | N prompts |
+| Worst overuse pattern | [task_type] at heavy tier |
+| Reasoning overuse | N prompts with unnecessary extended reasoning |
+#### 4b. Top 3 Compute Sins
+For each of the top three confirmed overuse cases (high/medium confidence only):
+- **The sin** — what tier they used for what kind of task
+- **The ratio** — approximately how much more expensive than necessary
+- **The fix** — which tier to use for this type of task
+- **Example** — actual prompt quote
+If fewer than 3 confirmed cases exist, show fewer. Do not invent cases.
+#### 4c. Model Selection Cheat Sheet
+Based on their actual usage patterns, produce a 4–6 row personalised cheat sheet:
+| Task pattern | Use this tier | Reasoning level | Notes |
+|--------------|--------------|-----------------|-------|
+| Read/show/list a file | light | none | 10–50× cheaper than heavy |
+| One-word confirmation | light | none | Heavy has no work to do here |
+| Single-file edit (clear scope) | balanced | low | Heavy adds nothing |
+| Multi-file refactor | heavy | medium | Justified — needs context |
+| Architecture / system design | heavy | high | Worth the premium |
+Anchor cost comparisons to the tier ladder: light ≈ 1×, balanced ≈ 3–5×, heavy ≈ 10–50×.
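One possible way to turn the tier ladder into "the ratio" for a compute sin is sketched below. This is an assumption about how the ranges might be combined, not something the prompt specifies: `TIER_COST` encodes the approximate multipliers from the ladder as (low, high) pairs, and `overspend_ratio` is a hypothetical helper that divides the ranges to get the widest plausible bounds.

```python
# Approximate relative cost per tier, taken from the ladder above as (low, high).
TIER_COST = {
    "light": (1.0, 1.0),
    "balanced": (3.0, 5.0),
    "heavy": (10.0, 50.0),
}


def overspend_ratio(used_tier, needed_tier):
    """Return (low, high) bounds on how much more the used tier costs."""
    used_low, used_high = TIER_COST[used_tier]
    needed_low, needed_high = TIER_COST[needed_tier]
    # Lowest ratio: cheapest estimate of the used tier over the priciest
    # estimate of the needed tier; highest ratio is the reverse.
    return (used_low / needed_high, used_high / needed_low)
```

Under these assumptions, heavy used where light would do comes out at 10 to 50 times the cost, while heavy used where balanced would do is roughly 2 to 17 times.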
+#### 4d. What You Got Right (Compute)
+Show 2–3 examples from the correctly-used heavy-model list where the premium tier was genuinely the right call. Be specific about why (long context, complex reasoning, multi-file work). This section teaches the user what good looks like, not just what bad looks like.
+**Tone for this section**: burning money humour. "You used a satellite dish to order a pizza." Compare costs to concrete things. Make fun of the absurdity, not the person. If the user already mixes tiers well, praise loudly instead.
+### 5. Technique Toolbox
+List 3–5 named techniques extracted from the analysis. Each:
+- **Name** — memorable, 2–4 words
+- **When to use** — the situation that should trigger it
+- **Template** — a fill-in-the-blank the user can copy verbatim
+Example:
+> **The 3W Rule** — When opening a new task
+> Template: `[What] is broken in [Where]. Expected: [Why-expected]. Actual: [Why-actual].`
+### 6. What You Do Well
+**Always required.** Find positives from the good-prompt examples. Be specific: name what the user did well and why it worked. This section must be genuine — do not manufacture praise, but do look hard for real wins. Both prompt quality and compute wins count here.
+### 7. Focus of the Week
+One specific, actionable change to try over the next seven days.
+Rules:
+- One change only — not a list
+- Concrete enough to practise consciously
+- Measurable — the user can verify progress by running `/roast-me` next week
+If the compute score is significantly lower than the prompt quality score, prioritise a compute-related focus.
+Format: a one-sentence rule + before/after example.
+---
+## Tone Guidelines
+- Roast, do not insult. Think comedy roast dinner, not street harassment.
+- Every joke must teach something. If it doesn't teach, cut it.
+- Pop culture references welcome, jargon welcome — the user is a senior dev.
+- If the user is genuinely good at prompting, acknowledge it. Manufacturing criticism for a good prompter is worse than silence.
+- If the data is thin (< 30 prompts), say so and calibrate confidence accordingly.
+- Self-deprecating AI humour is fine. Comparing the user to a bad prompter from the old days is fine. Personal insults are not.
+## Output
+Produce clean markdown. Use headers, tables, and code blocks as shown. The report will be printed directly to the terminal, so keep formatting terminal-friendly (no wide tables if avoidable, use monospace for the before/after examples).

package/skills/roast-me/tools/adapters/__init__.py ADDED

	@@ -0,0 +1 @@
1	+ # roast-me runtime adapters package

package/skills/roast-me/tools/adapters/__pycache__/__init__.cpython-312.pyc ADDED

Binary file

package/skills/roast-me/tools/adapters/__pycache__/base.cpython-312.pyc ADDED

Binary file

package/skills/roast-me/tools/adapters/__pycache__/claude.cpython-312.pyc ADDED

Binary file

package/skills/roast-me/tools/adapters/__pycache__/codex.cpython-312.pyc ADDED

Binary file

package/skills/roast-me/tools/adapters/__pycache__/gemini.cpython-312.pyc ADDED

Binary file