@ericrisco/rsc 0.1.31 → 0.1.33

Files changed (40)
  1. package/README.md +4 -4
  2. package/manifest.json +24 -5
  3. package/package.json +1 -1
  4. package/scripts/lib/domains.js +1 -1
  5. package/skills/analyze/SKILL.md +1 -0
  6. package/skills/author-skill/SKILL.md +20 -0
  7. package/skills/author-skill/references/description-recipe.md +2 -0
  8. package/skills/debug/SKILL.md +1 -1
  9. package/skills/implement/SKILL.md +72 -2
  10. package/skills/implement/references/per-task-review.md +46 -0
  11. package/skills/implement/scripts/review-package +59 -0
  12. package/skills/implement/scripts/sdd-workspace +47 -0
  13. package/skills/implement/scripts/task-brief +77 -0
  14. package/skills/parallel/SKILL.md +29 -0
  15. package/skills/plan/references/plan-template.md +18 -0
  16. package/skills/roast-me/SKILL.md +124 -0
  17. package/skills/roast-me/evals/README.md +76 -0
  18. package/skills/roast-me/evals/cases.yaml +75 -0
  19. package/skills/roast-me/prompts/analyze.md +90 -0
  20. package/skills/roast-me/prompts/compute.md +100 -0
  21. package/skills/roast-me/prompts/roast.md +181 -0
  22. package/skills/roast-me/tools/adapters/__init__.py +1 -0
  23. package/skills/roast-me/tools/adapters/__pycache__/__init__.cpython-312.pyc +0 -0
  24. package/skills/roast-me/tools/adapters/__pycache__/base.cpython-312.pyc +0 -0
  25. package/skills/roast-me/tools/adapters/__pycache__/claude.cpython-312.pyc +0 -0
  26. package/skills/roast-me/tools/adapters/__pycache__/codex.cpython-312.pyc +0 -0
  27. package/skills/roast-me/tools/adapters/__pycache__/gemini.cpython-312.pyc +0 -0
  28. package/skills/roast-me/tools/adapters/__pycache__/registry.cpython-312.pyc +0 -0
  29. package/skills/roast-me/tools/adapters/base.py +53 -0
  30. package/skills/roast-me/tools/adapters/claude.py +140 -0
  31. package/skills/roast-me/tools/adapters/codex.py +113 -0
  32. package/skills/roast-me/tools/adapters/gemini.py +121 -0
  33. package/skills/roast-me/tools/adapters/registry.py +68 -0
  34. package/skills/roast-me/tools/extract_prompts.py +520 -0
  35. package/skills/sdd/SKILL.md +23 -0
  36. package/skills/ship/SKILL.md +9 -1
  37. package/skills/specify/SKILL.md +26 -1
  38. package/skills/suggest/SKILL.md +1 -1
  39. package/skills/tasks/SKILL.md +25 -0
  40. package/skills/worktrees/SKILL.md +25 -0
@@ -0,0 +1,124 @@
1
+ ---
2
+ name: roast-me
3
+ description: "Use when you want honest, comedic feedback on how you prompt — analyzing your own past sessions to score prompt quality and compute efficiency, surface your worst habits, and generate a model-selection cheat sheet. Triggers: 'roast me', 'roast my prompting', 'audit my prompting habits', 'how good are my prompts', 'am I prompting well', 'what are my bad prompt habits', 'analiza mis prompts', 'puntua els meus prompts', 'how much am I wasting on model costs'. NOT reviewing your code or output quality (use code-review for that) and NOT a general prompt-engineering tutorial (use prompt-engineering for that)."
4
+ tags: [prompting, self-audit, compute-efficiency, ai-hygiene, learning]
5
+ recommends: [prompt-engineering, code-review, context-budget]
6
+ profiles: [full]
7
+ origin: risco
8
+ ---
9
+
10
+ # roast-me — audit your prompting, score yourself, stop burning money
11
+
12
+ You are the **prompt auditor**. Your target is not the user's code — it is their *prompting behaviour*. You read their recent agent transcripts, run structured analysis passes, produce dual scores (Prompt Quality + Compute Efficiency), name the worst habits with named techniques to fix them, and track the trend over time.
13
+
14
+ Every run follows five phases. Execute them in order.
15
+
16
+ ## Phase 1 — Extract
17
+
18
+ Parse `$ARGUMENTS` for a day count. Accept `--days N`, `days=N`, or a bare number (e.g. `3` means three days). Default: 7. Also accept `--runtime auto|claude|codex|gemini` (default `auto`).
19
+
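A minimal sketch, not part of the published package, of how the three accepted day-count forms and the `--runtime` flag could be parsed. The function name `parse_arguments` is hypothetical, and the sketch passes unrecognised runtime values through unchanged on the assumption that the extractor handles them.

```python
import re

_INT = re.compile(r"[0-9]+")


def parse_arguments(raw: str) -> tuple[int, str]:
    """Return (days, runtime) from a raw argument string.

    Hypothetical helper: the skill text only lists the accepted forms.
    """
    days, runtime = 7, "auto"  # defaults named in the skill text
    tokens = raw.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--days" and i + 1 < len(tokens) and _INT.fullmatch(tokens[i + 1]):
            days = int(tokens[i + 1])
            i += 1  # skip the value we just consumed
        elif re.fullmatch(r"days=[0-9]+", tok):
            days = int(tok.split("=", 1)[1])
        elif _INT.fullmatch(tok):  # bare number, e.g. "3"
            days = int(tok)
        elif tok == "--runtime" and i + 1 < len(tokens):
            runtime = tokens[i + 1]
            i += 1
        i += 1
    return days, runtime
```

Under these assumptions, `parse_arguments("days=3 --runtime codex")` yields `(3, "codex")`, and an empty string falls back to `(7, "auto")`.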
20
+ Run the extractor:
21
+
22
+ ```
23
+ python3 <skill_dir>/tools/extract_prompts.py --days <N> --runtime <runtime>
24
+ ```
25
+
26
+ Where `<skill_dir>` is the directory containing this SKILL.md. Use your runtime's mechanism to resolve it (environment variable, `__file__` equivalent, or the skill's known install path).
27
+
28
+ Wait for completion. Read the output JSON path that the script prints. Report to the user:
29
+
30
+ ```
31
+ Scanned <sessions> sessions across <projects> projects
32
+ Extracted <N> prompts (<errors> with errors, <recovered> auto-recovered, <unrecovered> impactful)
33
+ ```
34
+
35
+ If `total_prompts` is 0: tell the user "No transcript data found for that window. Try a longer window (`--days 30`) or check that your assistant's transcript directory exists." Then stop.
36
+
37
+ **Key distinction**: always report `effective_error_rate` (errors NOT auto-recovered), never the raw error rate. Auto-recovered errors are the agent doing its job — not your fault.
38
+
39
+ ## Phase 2 — Analyze Prompt Quality
40
+
41
+ Read the extracted JSON. Batch the prompts into groups of ~30. For each batch, use your runtime's subagent/Task mechanism to run a parallel analysis pass with the prompt in `prompts/analyze.md`, passing the batch as JSON.
42
+
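The batching step can be sketched as follows. This is an illustration only: `make_batches` is a hypothetical name, and the fixed size of 30 stands in for the "~30" the skill text asks for.

```python
from typing import Any


def make_batches(prompts: list[Any], size: int = 30) -> list[list[Any]]:
    """Split prompts into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [prompts[i:i + size] for i in range(0, len(prompts), size)]
```

Each resulting batch would then be serialised (for example with `json.dumps`) and handed to one analysis pass.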
43
+ Collect results. Group flagged issues by category and severity.
44
+
45
+ **Filter rule**: keep only issues where the impact was real — agent went in the wrong direction, user had to correct, dangerous action attempted, or significant wasted work (>10 tool calls). Discard issues where `error_was_recovered` is true.
46
+
47
+ Report category counts as a progress update.
48
+
49
+ If zero issues are flagged, proceed to Phase 3 anyway — the roast should honour good prompting.
50
+
51
+ ## Phase 3 — Analyze Compute Efficiency
52
+
53
+ Read the same JSON. Batch into ~30-prompt groups. Spawn parallel subagent passes with `prompts/compute.md`.
54
+
55
+ Aggregate across batches:
56
+ - All `overuse_cases` (deduplicate by index)
57
+ - All `thinking_overuse_cases`
58
+ - All `correctly_used_heavy_model` examples
59
+ - Summed totals: `total_overuse_count`, `total_savings_usd`, `thinking_overuse_count`
60
+ - `worst_category` = most frequent `task_type` in overuse_cases
61
+
62
+ Keep only `high` and `medium` confidence overuse cases.
63
+
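A rough sketch of the aggregation described above, not taken from the package. It assumes `index` values are unique across batches, and it recomputes the counts from the filtered lists rather than summing the per-batch totals. Savings are left out because the per-batch schema in `prompts/compute.md` reports `total_savings_estimate` as a label, not a dollar amount.

```python
from collections import Counter


def aggregate_compute(batch_results: list[dict]) -> dict:
    """Merge per-batch compute results into one summary.

    Assumes `index` is unique across batches (an assumption; the skill
    text does not say whether indices are global or per batch).
    """
    seen: set[int] = set()
    overuse: list[dict] = []
    thinking: list[dict] = []
    justified: list[dict] = []
    for result in batch_results:
        for case in result.get("overuse_cases", []):
            # Keep only high/medium confidence and drop repeated indices.
            if case.get("confidence") not in ("high", "medium"):
                continue
            if case["index"] in seen:
                continue
            seen.add(case["index"])
            overuse.append(case)
        thinking.extend(result.get("thinking_overuse_cases", []))
        justified.extend(result.get("correctly_used_heavy_model", []))
    task_types = Counter(case["task_type"] for case in overuse)
    worst = task_types.most_common(1)[0][0] if task_types else None
    return {
        "overuse_cases": overuse,
        "thinking_overuse_cases": thinking,
        "correctly_used_heavy_model": justified,
        "total_overuse_count": len(overuse),
        "thinking_overuse_count": len(thinking),
        "worst_category": worst,
    }
```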
64
+ Report:
65
+
66
+ ```
67
+ Compute analysis: <X> confirmed overuse cases | $Y.YY potential savings | Z reasoning overuse
68
+ ```
69
+
70
+ ## Phase 4 — Generate Roast
71
+
72
+ Spawn a single subagent with `prompts/roast.md`. Pass it:
73
+ - Aggregated issue counts by category and severity
74
+ - Top ~15 worst prompt examples (highest severity + real impact)
75
+ - Stats metadata (especially `effective_error_rate`)
76
+ - A sample of ~10 issue-free prompts for the "What You Do Well" section
77
+ - `compute_stats` from the extraction metadata
78
+ - Aggregated compute analysis (overuse cases, thinking overuse, correctly-used examples, totals)
79
+
80
+ Collect the roast report. Extract the dual score (Prompt Quality 0–100, Compute Efficiency 0–100) and the grade letters.
81
+
82
+ ## Phase 5 — Score, Track, Present
83
+
84
+ Save results to `~/.roast-me-history.json`. Read existing history (if any), append a new entry:
85
+
86
+ ```json
87
+ {
88
+ "date": "YYYY-MM-DD",
89
+ "runtime": "auto",
90
+ "days_analyzed": 7,
91
+ "prompt_quality_score": 73,
92
+ "prompt_quality_grade": "C",
93
+ "compute_efficiency_score": 35,
94
+ "compute_efficiency_grade": "F",
95
+ "total_prompts": 200,
96
+ "issues_flagged": 30,
97
+ "effective_error_rate": 0.08,
98
+ "correction_rate": 0.06,
99
+ "focus_of_week": "The 3W Rule",
100
+ "compute_total_cost_usd": 22.50,
101
+ "compute_wasted_cost_usd": 8.10,
102
+ "compute_overuse_count": 30,
103
+ "model_distribution": {"heavy": 0.4, "balanced": 0.4, "light": 0.2}
104
+ }
105
+ ```
106
+
107
+ Write the updated history back to `~/.roast-me-history.json`.
108
+
109
+ If previous entries exist, append a trend line after the main report:
110
+
111
+ ```
112
+ Score History:
113
+ Date Prompt Quality Compute Efficiency Focus
114
+ 2026-06-01 68/100 (D+) --/-- (new) Context anchoring
115
+ 2026-06-08 73/100 (C) +5↑ 35/100 (F) The 3W Rule
116
+ ```
117
+
118
+ Output the roast report as formatted markdown. If history exists, append the trend line.
119
+
120
+ Done.
121
+
122
+ ## Orientation (always)
123
+
124
+ Close every turn with the **compass block** (📍 where you are · ✅ what you did · 🧭 why · ➡️ next, ending with a question), calibrated to the dial in `02-DOCS/wiki/harness/user-profile.md`. **Never end abruptly.** Full protocol: skill `orient` → `skills/orient/references/orientation-contract.md`. (Defer the "should I install the missing skill?" question to `suggest`.)
@@ -0,0 +1,76 @@
1
+ # Eval harness — `roast-me`
2
+
3
+ Evaluates the `roast-me` skill on two axes: **triggering** (does it fire on
4
+ the right prompts and stay quiet on near-misses) and **capability** (does loading
5
+ it produce a correct, useful roast run with both scores, the right habit list,
6
+ and the trend tracking). Cases live in `cases.yaml`. These run via an **agent
7
+ harness** — a human or driver agent feeds prompts to the assistant and judges
8
+ the result against the rubrics.
9
+
10
+ ## What is in `cases.yaml`
11
+
12
+ - `should_trigger` (7) — prompts that MUST invoke `roast-me`. Includes:
13
+ - verbatim "roast me"
14
+ - time-window variant ("last two weeks")
15
+ - cost-only angle (no mention of prompting quality)
16
+ - two non-English triggers (Spanish, Catalan)
17
+ - non-obvious phrasing ("honest score", "am I wasting money")
18
+ - `should_not_trigger` (5) — near-misses that must route elsewhere, each with
19
+ a real sibling `route_to`: `code-review`, `prompt-engineering`, `analyze`,
20
+ `chatbot`, `llm-pipeline`.
21
+ - `capability` (2) — scenarios with `must_include` rubrics:
22
+ 1. A full 14-day run with realistic data: error rates, tier distribution,
23
+ top issues, scoring, and history tracking.
24
+ 2. Unknown runtime degradation: exits 0, clear message, no crash.
25
+
26
+ ## A. Triggering eval
27
+
28
+ 1. Load **only** `roast-me` into the agent (no other rsc skills, so routing is honest).
29
+ 2. For each `should_trigger` prompt: fresh session, paste verbatim, record
30
+ whether the five-phase pipeline starts. Run **3–5 trials** per prompt.
31
+ 3. For each `should_not_trigger` prompt: same, but a **pass** = `roast-me`
32
+ does NOT fire. Sanity-check that the `route_to` sibling genuinely owns
33
+ that prompt.
34
+ 4. Score: a prompt passes if the **majority of its trials** go the expected way.
35
+
36
+ **Pass bar**: >= 90% trigger accuracy across all 12 prompts (at most 1 misbehaving).
37
+
38
+ ## B. Capability eval
39
+
40
+ 1. **Without the skill**: fresh session, skill NOT loaded, give the `scenario`
41
+ prompt. Save output A.
42
+ 2. **With the skill**: fresh session, `roast-me` loaded, same prompt. Save output B.
43
+ 3. Grade each output against that scenario's `must_include` points.
44
+ 4. Repeat across **3 trials** per scenario per condition and average.
45
+
46
+ **Pass bar**: WITH the skill covers >= 80% of `must_include` points. WITHOUT
47
+ should be materially lower (target >= 30-point gap). If the skill does not
48
+ beat the baseline, the skill or the rubrics need work.
49
+
50
+ ## Key differentiators WITH the skill loaded
51
+
52
+ - **Effective vs raw error rate**: the skill uses `effective_error_rate`
53
+ (unrecovered errors only), never the raw rate. A high raw / low effective
54
+ split is praised, not penalised.
55
+ - **Parallel subagents**: the skill explicitly batches prompts into ~30-item
56
+ groups and spawns parallel analysis passes per batch — a baseline answer
57
+ typically analyses sequentially.
58
+ - **Dual scoring formula**: Prompt Quality (base 70 ± issue penalties/bonuses)
59
+ and Compute Efficiency (base 80 − overuse penalties + mix-down bonus) are
60
+ computed independently to the documented formula.
61
+ - **Trend tracking**: history is written to `~/.roast-me-history.json` and a
62
+ trend line is printed if prior entries exist.
63
+ - **Runtime degradation**: unknown `--runtime` exits 0 with a clear message —
64
+ no crash, no invented data.
65
+ - **Original rsc voice**: no copied phrasing from other skill ecosystems.
66
+
67
+ ## Judging notes
68
+
69
+ - This is LLM-as-judge / human-in-the-loop, not deterministic. Use a consistent
70
+ grader (same model + rubric) across A/B.
71
+ - `scripts/eval-lint.sh` checks case-count minimums; it does not grade prose.
72
+ - Re-run after any edit to `SKILL.md` or its `description` — both axes are
73
+ wording-sensitive.
74
+ - The key confusable: `roast-me` reviews the user's **prompting behaviour**
75
+ (how they write prompts), not their code and not the agent's output quality.
76
+ Any eval that conflates these is testing the wrong thing.
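The triggering pass bar from the eval README above can be expressed as a small scoring sketch. The function names are hypothetical, and treating a tie as a failure (strict majority) is an assumption, since the README only says "majority".

```python
def prompt_passes(trials: list[bool]) -> bool:
    """A prompt passes when a strict majority of trials behaved as expected."""
    return sum(trials) * 2 > len(trials)


def trigger_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of prompts that pass; each value lists per-trial outcomes."""
    if not results:
        raise ValueError("no prompts to score")
    passed = sum(prompt_passes(trials) for trials in results.values())
    return passed / len(results)


def meets_pass_bar(results: dict[str, list[bool]], bar: float = 0.90) -> bool:
    """True when trigger accuracy reaches the bar (90% by default)."""
    return trigger_accuracy(results) >= bar
```

With 12 prompts, 11 passing gives about 91.7% and clears the bar, while 10 passing gives about 83.3% and does not, which is where "at most 1 misbehaving" comes from.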
@@ -0,0 +1,75 @@
1
+ skill: roast-me
2
+
3
+ # Eval cases for the roast-me skill.
4
+ # roast-me audits the USER's own prompting behaviour — it reads their past
5
+ # agent session transcripts, produces dual scores (Prompt Quality + Compute
6
+ # Efficiency), names their worst habits, and tracks a trend over time.
7
+ # It does NOT review code, does NOT review agent output quality, and is NOT
8
+ # a general prompt-engineering tutorial.
9
+
10
+ should_trigger:
11
+ - prompt: "Roast me."
12
+ why: "Canonical verbatim invocation — the skill's own trigger phrase. Must fire immediately and run the five-phase pipeline."
13
+
14
+ - prompt: "Audit my prompting habits over the last two weeks."
15
+ why: "Situation-first phrasing with a custom time window — roast-me owns both the 'audit prompting' job and the --days N parameter."
16
+
17
+ - prompt: "Am I prompting the AI well or am I wasting money?"
18
+ why: "Dual concern (prompt quality + cost) phrased as a question — maps directly to the two scores roast-me produces. No mention of 'roast' keyword."
19
+
20
+ - prompt: "Analiza mis prompts de la última semana y dime dónde soy malo."
21
+ why: "Spanish-language trigger — rsc users write Spanish; the description must fire on non-English phrasings. Tests the non-English trigger."
22
+
23
+ - prompt: "Puntua els meus prompts i digue'm quins hàbits hauria de canviar."
24
+ why: "Catalan-language trigger — a non-obvious language variant. The description lists this phrasing explicitly."
25
+
26
+ - prompt: "How much am I burning on model costs? Show me where I overspend."
27
+ why: "Compute-efficiency angle only, no prompting-quality framing — roast-me's compute score and cheat sheet are the direct answer."
28
+
29
+ - prompt: "Give me an honest score on how I prompt Claude."
30
+ why: "Non-obvious phrasing ('honest score') without the roast keyword — tests situation matching over keyword matching."
31
+
32
+ should_not_trigger:
33
+ - prompt: "Review this pull request and flag any bugs before we merge."
34
+ route_to: "code-review"
35
+ why: "Reviewing code output is code-review's job. roast-me reviews how the user prompts, not the code they produced."
36
+
37
+ - prompt: "Write me a guide on how to prompt AI assistants effectively."
38
+ route_to: "prompt-engineering"
39
+ why: "A general tutorial on prompting technique is prompt-engineering. roast-me reads live transcripts and produces a personalised score, not a tutorial."
40
+
41
+ - prompt: "Summarise what our agent did in yesterday's session."
42
+ route_to: "analyze"
43
+ why: "Summarising what an agent did (output review) is analyze's job or a generic task. roast-me looks at how the USER prompted, not what the agent produced."
44
+
45
+ - prompt: "I need a system prompt template for a customer-support chatbot."
46
+ route_to: "chatbot"
47
+ why: "Authoring a system prompt for a product is a different job from auditing a user's own prompting habits. No transcript data involved."
48
+
49
+ - prompt: "Help me reduce my Anthropic API costs by switching to cheaper models in my app."
50
+ route_to: "llm-pipeline"
51
+ why: "Optimising model selection in production application code is llm-pipeline. roast-me analyses the user's interactive session spend, not application-level API calls."
52
+
53
+ capability:
54
+ - scenario: "User runs '/roast-me --days 14'. The extractor finds 180 Claude Code prompts: 12 with unrecovered errors, 25 auto-recovered, and 15 corrections. Model tier split is 60% heavy, 30% balanced, 10% light. Top issues: 8 VAGUE openers and 5 SCOPE_CREEP cases. Show what a correct run looks like."
55
+ must_include:
56
+ - "Phase 1: runs extract_prompts.py with --days 14 and reports Scanned N sessions, Extracted 180 prompts (12 impactful errors, 25 auto-recovered)"
57
+ - "Reports effective_error_rate (12/180 = ~6.7%) not the raw rate (37/180 = ~20.6%); explicitly notes that auto-recovered errors are the agent doing its job"
58
+ - "Phase 2: batches into ~30-prompt groups and spawns parallel subagent passes with prompts/analyze.md, passing the batch as JSON"
59
+ - "Filters aggressively: only keeps issues where error_was_recovered is false or a correction immediately followed — discards the 25 auto-recovered"
60
+ - "Phase 3: spawns parallel compute-analysis subagents with prompts/compute.md; aggregates overuse_cases with high/medium confidence only"
61
+ - "Phase 4: spawns roast subagent with prompts/roast.md and all aggregated data; extracts dual score (Prompt Quality 0-100, Compute Efficiency 0-100)"
62
+ - "Prompt Quality score: starts at 70, deducted for VAGUE and SCOPE_CREEP issues, adjusted for rates; must land in a D-F range (59–69 under the documented formula) given 8 high + 5 medium issues"
63
+ - "Compute Efficiency score: 60% heavy tier with confirmed overuse triggers a low score (F range) and a personalized tier-selection cheat sheet"
64
+ - "Phase 5: appends to ~/.roast-me-history.json with date, both scores, grades, and focus_of_week; prints trend line if history exists"
65
+ - "Top 3 habits section names VAGUE openers and SCOPE_CREEP with concrete before/after rewrites and named techniques"
66
+ - "What You Do Well section is present and finds genuine positives from the issue-free prompts"
67
+ - "Focus of the Week names a single actionable change — likely a compute-related one given the large gap between quality and efficiency scores"
68
+
69
+ - scenario: "User runs '/roast-me --runtime xyz' where xyz is an unrecognised runtime. Show the correct degradation behaviour."
70
+ must_include:
71
+ - "Prints a clear message: 'Unknown runtime xyz. Known runtimes: claude, codex, gemini. Use auto to try all.'"
72
+ - "Exits 0 — no crash, no stack trace, no exception shown to the user"
73
+ - "Writes an empty result JSON and prints its path so downstream tooling does not break"
74
+ - "Does NOT invent records, does NOT guess data, does NOT proceed to the analysis phases"
75
+ - "Tells the user what to do next (try --runtime auto or check the available runtime IDs)"
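The two error rates in the first capability scenario above can be worked through with a short sketch. The helper name `error_rates` is hypothetical; the inputs are the scenario's own numbers (180 prompts, 12 unrecovered errors, 25 auto-recovered).

```python
def error_rates(total: int, unrecovered: int, recovered: int) -> tuple[float, float]:
    """Return (effective_error_rate, raw_error_rate) as fractions of `total`."""
    if total <= 0:
        raise ValueError("total must be positive")
    effective = unrecovered / total  # only errors the agent did not recover from
    raw = (unrecovered + recovered) / total  # every prompt followed by an error
    return effective, raw


effective, raw = error_rates(total=180, unrecovered=12, recovered=25)
print(f"effective={effective:.1%} raw={raw:.1%}")
# prints: effective=6.7% raw=20.6%
```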
@@ -0,0 +1,90 @@
1
+ # Prompt Quality Analysis
2
+
3
+ You are analysing user prompts extracted from AI assistant session logs. Your job is to find the prompts that — due to how they were written — caused measurable negative outcomes. You are NOT here to polish good prompts. You are here to find the ones that genuinely cost the user time or effort.
4
+
5
+ ## Input
6
+
7
+ You will receive a JSON array of prompt records. Each record has:
8
+
9
+ - `prompt_text` — the user's message (may be truncated)
10
+ - `prompt_length` — full length before truncation
11
+ - `prompt_position` — 1-based index in the conversation (1 = opening prompt)
12
+ - `total_prompts_in_session` — how many user prompts were in this session
13
+ - `has_xml_tags` — whether the prompt uses XML structural tags
14
+ - `has_file_paths` — whether explicit file paths are mentioned
15
+ - `has_code_blocks` — whether code blocks appear in the prompt
16
+ - `followed_by_error` — a tool error occurred after this prompt
17
+ - `error_was_recovered` — the agent recovered from the error without user help
18
+ - `followed_by_correction` — the user had to correct the agent's next action
19
+ - `correction_text` — what the correction said
20
+ - `error_tool` — which tool produced the error
21
+ - `error_text` — the error message
22
+ - `context_before` — the preceding assistant message (for context)
23
+
24
+ ## The Golden Rule: Impact First
25
+
26
+ **Only flag issues where the writing of the prompt was a meaningful cause of a bad outcome.** The user is a productive senior developer. Most short prompts, most exploration errors, and most terse follow-ups are completely normal and correct.
27
+
28
+ ### Do NOT flag (be generous):
29
+
30
+ - Short continuations: "yes", "ok", "commit", "looks good", "go ahead" — these are normal turn-taking, not bad prompts
31
+ - Deep-in-session brevity (high `prompt_position`) — context is established; terseness is appropriate
32
+ - Any prompt where `followed_by_error = true` AND `error_was_recovered = true` — the agent handled it; the prompt did its job
33
+ - Simple, clear requests that worked cleanly (no error, no correction)
34
+ - System or slash-command messages (content like `<command-message>` wrapper tags)
35
+ - File-not-found during exploration — this is normal agent behaviour
36
+ - Errors caused by environment issues, not by prompt ambiguity
37
+
38
+ ### DO flag (only these):
39
+
40
+ - Prompt was so vague that the agent went in a completely wrong direction and the user had to redirect
41
+ - Missing context (file, error message, expected behaviour) that **directly caused** an unrecovered error or wasted significant work (>10 tool calls)
42
+ - User had to correct the agent immediately after — meaning the prompt was genuinely misleading
43
+ - Multiple unrelated tasks in one prompt that caused one of them to fail
44
+ - Prompt that triggered a dangerous or irreversible action (mass deletion, dropping data, force-pushing to main)
45
+
46
+ ## Issue Categories
47
+
48
+ | Code | When to apply |
49
+ |------|---------------|
50
+ | `VAGUE` | "fix it", "make it work", "clean this up" — zero specifics, agent had no signal to start correctly |
51
+ | `NO_CONTEXT` | Missing file path, error message, or expected vs actual behaviour — would have changed where the agent looked |
52
+ | `NEGATIVE` | Specifies only what NOT to do without saying what TO do — agent guessed wrong as a result |
53
+ | `NO_CRITERIA` | No way to know when the task is done — agent produced something, user rejected it, but the prompt gave no target |
54
+ | `WALL_OF_TEXT` | Long unstructured paragraph for a complex multi-step task that should use headings, lists, or XML sections |
55
+ | `SCOPE_CREEP` | Multiple unrelated tasks crammed together — one failed because the other consumed all the agent's attention |
56
+ | `SELF_CONTRADICT` | The user's next message corrected the direction — meaning the prompt was internally inconsistent or misleading |
57
+ | `NO_STRUCTURE` | Complex multi-step or multi-file task with no structure at all — agent had to guess the sequencing |
58
+ | `CAUSED_FAILURE` | Prompt wording directly caused a tool error or a wrong action (dangerous, irreversible, or significantly wasteful) |
59
+
60
+ ## Severity Guide
61
+
62
+ - **high** — prompt directly caused wasted work (>20 tool calls), a dangerous action, or required significant correction effort
63
+ - **medium** — prompt caused moderate inefficiency (5–20 wasted tool calls) or noticeable misdirection
64
+ - **low** — minor improvement possible; the prompt mostly worked but a small change would have helped
65
+
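The numeric thresholds in the severity guide can be sketched as a small mapping. This covers only the countable part of the guide; "significant correction effort" and "noticeable misdirection" remain judgment calls. The name `severity_for` and the `dangerous` flag are hypothetical.

```python
def severity_for(wasted_tool_calls: int, dangerous: bool = False) -> str:
    """Map wasted tool calls to a severity label per the guide's thresholds.

    `dangerous` marks a dangerous or irreversible action, which the guide
    treats as high regardless of the call count.
    """
    if dangerous or wasted_tool_calls > 20:
        return "high"
    if wasted_tool_calls >= 5:
        return "medium"
    return "low"
```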
66
+ ## Output Format
67
+
68
+ Return a JSON array. Include one object per prompt that has real issues. Skip issue-free prompts — most records should produce no entry.
69
+
70
+ ```json
71
+ [
72
+ {
73
+ "index": 0,
74
+ "issues": ["VAGUE", "NO_CONTEXT"],
75
+ "severity": "high",
76
+ "impact": "Agent spent 28 tool calls scanning the wrong directory before the user provided the file path",
77
+ "explanation": "Opening prompt says 'fix the login bug' with no file, error, or expected behaviour — agent had to guess everything",
78
+ "technique": "The 3W Rule: What (the problem), Where (file/service), Why (expected vs actual). Front-load all three.",
79
+ "rewrite_suggestion": "Fix the auth middleware in src/middleware/auth.ts — valid JWT tokens are returning 401. Expected: token verified and request passed through. Error: [paste stack trace]",
80
+ "original_prompt_snippet": "first 200 chars of prompt_text"
81
+ }
82
+ ]
83
+ ```
84
+
85
+ Field notes:
86
+ - `impact` — be specific: wasted tool calls, dangerous action, correction loops. Not "unclear".
87
+ - `technique` — a named, reusable pattern the user can consciously apply next time.
88
+ - `rewrite_suggestion` — a concrete rewrite of THIS prompt applying the technique.
89
+
90
+ Be constructive. The goal is a skill lesson attached to a real example, not a catalogue of imperfections.
@@ -0,0 +1,100 @@
1
+ # Compute Efficiency Analysis
2
+
3
+ You are analysing a batch of user prompts to identify cases where a heavier (more expensive) AI model tier was used for work that a lighter tier handles equally well. Your output feeds the "Compute Efficiency" score and the model-selection cheat sheet in the roast report.
4
+
5
+ ## Tier Ladder (provider-neutral)
6
+
7
+ This skill uses three tier labels. The mapping to specific model names shifts over releases — the tiers are stable even when the names change.
8
+
9
+ | Tier | Description | Example models (2026) | Relative cost |
10
+ |------|-------------|----------------------|---------------|
11
+ | `light` | Fast, cheap, low reasoning | Haiku, GPT-3.5, Gemini Flash | 1× |
12
+ | `balanced` | General purpose, solid reasoning | Sonnet, GPT-4 mini, Gemini Pro | 3–5× |
13
+ | `heavy` | Frontier reasoning, long-horizon work | Opus, Fable/Mythos, GPT-4 full | 10–50× |
14
+
15
+ Overuse = using `heavy` where `balanced` would suffice, or using `balanced`/`heavy` where `light` would suffice.
16
+
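Read as a rule, the overuse definition compares the tier used against the tier that would have sufficed. A minimal sketch, assuming the three labels form a strict ordering and that any unrecognised label is never flagged (both choices of this sketch; `TIER_RANK` and `is_overuse` are hypothetical names):

```python
TIER_RANK = {"light": 0, "balanced": 1, "heavy": 2}


def is_overuse(model_tier: str, recommended_tier: str) -> bool:
    """True when the tier used ranks above the recommended tier.

    Unknown or unrecognised tiers are never flagged.
    """
    used = TIER_RANK.get(model_tier)
    wanted = TIER_RANK.get(recommended_tier)
    if used is None or wanted is None:
        return False
    return used > wanted
```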
17
+ ## Input
18
+
19
+ You will receive a JSON array of prompt records. Each record includes:
20
+
21
+ - `prompt_text` — the user's message
22
+ - `prompt_length` — character count
23
+ - `task_complexity` — heuristic classification: `simple`, `moderate`, `complex`
24
+ - `recommended_tier` — what the extractor heuristic suggests: `light`, `balanced`, `heavy`
25
+ - `model_tier` — the tier that was actually used (or `unknown`)
26
+ - `compute_was_overkill` — boolean from the extractor heuristic (sanity-check, not gospel)
27
+ - Standard prompting context fields (`has_xml_tags`, `followed_by_error`, `error_was_recovered`, etc.)
28
+
29
+ ## What counts as overuse
30
+
31
+ ### Definite overuse (flag with `high` confidence):
32
+ - `heavy` model for: a one-word confirmation ("yes", "ok", "commit"), a read-only lookup, a formatting/linting fix, a simple file rename
33
+ - `heavy` + extended reasoning for: anything classified `simple`
34
+ - `balanced` or `heavy` for: a bare file existence check, grepping a single pattern, listing directory contents
35
+
36
+ ### Probable overuse (flag with `medium` confidence):
37
+ - `heavy` model for: a single-file edit with clear, contained scope
38
+ - `heavy` + extended reasoning for: a short explanatory question with an obvious answer
39
+ - `balanced` for: a simple file read or "what does X mean?" question
40
+
41
+ ### Not overuse (do NOT flag):
42
+ - `heavy` for: multi-file refactors, architectural decisions, debugging complex concurrency or performance issues, long autonomous runs (the model needs to hold a lot in context)
43
+ - `heavy` + extended reasoning for: genuinely ambiguous debugging, performance root-cause analysis, security audit, or migration planning
44
+ - `unknown` tier — if the model cannot be identified, skip it; do not guess
45
+
46
+ ## Extended Reasoning Overuse
47
+
48
+ If a prompt shows signs that extended/chain-of-thought reasoning was used (the model took much longer than expected, or reasoning tokens appear in context), flag it as `thinking_overuse` if the task is clearly simple or moderate.
49
+
50
+ ## Output Format
51
+
52
+ Return a single JSON object:
53
+
54
+ ```json
55
+ {
56
+ "overuse_cases": [
57
+ {
58
+ "index": 2,
59
+ "task_type": "simple_confirmation",
60
+ "model_tier_used": "heavy",
61
+ "recommended_tier": "light",
62
+ "confidence": "high",
63
+ "reasoning": "Single-word confirmation 'yes' — model has no work to do; light tier is identical in outcome",
64
+ "example_prompt": "yes",
65
+ "estimated_tier_ratio": 10
66
+ }
67
+ ],
68
+ "thinking_overuse_cases": [
69
+ {
70
+ "index": 7,
71
+ "confidence": "medium",
72
+ "reasoning": "Simple file rename using heavy model with extended reasoning — no branching logic required"
73
+ }
74
+ ],
75
+ "correctly_used_heavy_model": [
76
+ {
77
+ "index": 14,
78
+ "reasoning": "Multi-file authentication refactor across 8 files with security implications — heavy tier justified"
79
+ }
80
+ ],
81
+ "total_overuse_count": 3,
82
+ "total_savings_estimate": "high",
83
+ "thinking_overuse_count": 1,
84
+ "worst_category": "simple_confirmation"
85
+ }
86
+ ```
87
+
88
+ `task_type` values (pick the closest):
89
+ - `simple_confirmation` — yes/no/ok/lgtm/commit/ship-it
90
+ - `read_only_lookup` — read/show/list/cat/find
91
+ - `style_fix` — format/lint/prettier/semicolons
92
+ - `single_file_edit` — one file, clear contained scope
93
+ - `simple_question` — what is X / explain Y (short answer expected)
94
+ - `multi_file_work` — spans multiple files (usually NOT overuse at heavy tier)
95
+ - `architectural_decision` — design/plan/migrate/strategy (usually NOT overuse)
96
+ - `debug_complex` — race conditions, memory leaks, performance (usually NOT overuse)
97
+
98
+ `estimated_tier_ratio`: approximate cost ratio (e.g. 10 means the heavy tier cost ~10× the recommended tier). Use rough multiples: light→balanced ≈ 3–5×; light→heavy ≈ 10–50×; balanced→heavy ≈ 3–10×.
99
+
100
+ Only return `high` and `medium` confidence cases. If you are uncertain, omit the record — false positives harm the user's trust in the report.
@@ -0,0 +1,181 @@
1
+ # Roast Report Generator
2
+
3
+ You are producing a prompting audit report that is **educational, funny, and actionable**. Comedy-roast energy: every joke should leave the user knowing something they did not know before. Never insult. Always teach.
4
+
5
+ ## Input
6
+
7
+ You will receive:
8
+
9
+ 1. **Issue summary** — aggregated prompt quality issues by category and severity, with counts.
10
+ 2. **Worst prompt examples** — up to 15 prompts with high/medium severity issues, including `impact`, `technique`, and `rewrite_suggestion` from the analysis phase.
11
+ 3. **Stats metadata** — extraction numbers: `effective_error_rate`, `correction_rate`, `avg_length`, `xml_usage_rate`, `file_path_rate`, `total_prompts`.
12
+ 4. **Good prompt examples** — up to 10 prompts with no issues (for "What You Do Well").
13
+ 5. **Compute stats** — `tier_distribution`, `heuristic_overuse_count`, total prompts, and any cost data if available.
14
+ 6. **Compute analysis** — overuse cases, thinking overuse, correctly-used heavy-model examples, and totals.
15
+
16
+ ## Fairness Rule
17
+
18
+ The stats include two error rates:
19
+
20
+ - `error_rate` — raw rate of prompts followed by any tool error (includes normal exploration)
21
+ - `effective_error_rate` — only errors that were NOT auto-recovered
22
+
23
+ **Use `effective_error_rate` as the primary metric.** High raw error rate with low effective rate = the agent is doing its job. Praise this, do not penalise it.
24
+
25
+ ## Scoring
26
+
27
+ Compute two independent scores from the data you receive.
28
+
29
+ ### Prompt Quality Score (0–100)
30
+
31
+ Start at 70 (baseline for an active user who already ships things).
32
+
33
+ Adjustments:
34
+ - -2 per high-severity issue (cap at -30 total from high-severity)
35
+ - -1 per medium-severity issue (cap at -15 total from medium-severity)
36
+ - +5 if xml_usage_rate > 0.20
37
+ - +5 if file_path_rate > 0.50
38
+ - +5 if effective_error_rate < 0.15
39
+ - +5 if correction_rate < 0.10
40
+ - -10 if any prompt caused a destructive/irreversible action (DROP, mass delete, force-push to main)
41
+
42
+ Clamp to 0–100.
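The adjustments above can be written out as a small function. This is a sketch under stated assumptions: the function name, the `stats` dict shape, and the `destructive_action` flag are illustrative, and only the numbers come from the rules above.

```python
def prompt_quality_score(high_issues, medium_issues, stats, destructive_action=False):
    """Apply the Prompt Quality adjustments to the 70-point baseline."""
    score = 70

    # Per-issue penalties, each with its own cap.
    score -= min(2 * high_issues, 30)
    score -= min(1 * medium_issues, 15)

    # Bonuses for good habits visible in the stats metadata.
    if stats["xml_usage_rate"] > 0.20:
        score += 5
    if stats["file_path_rate"] > 0.50:
        score += 5
    if stats["effective_error_rate"] < 0.15:
        score += 5
    if stats["correction_rate"] < 0.10:
        score += 5

    # Flat penalty if any prompt caused a destructive/irreversible action.
    if destructive_action:
        score -= 10

    # Clamp to 0-100.
    return max(0, min(100, score))


example_stats = {
    "xml_usage_rate": 0.25,
    "file_path_rate": 0.40,
    "effective_error_rate": 0.10,
    "correction_rate": 0.20,
}
example_score = prompt_quality_score(4, 3, example_stats)
```

With four high-severity and three medium-severity issues, the example works out to 70 - 8 - 3 + 5 + 5 = 69.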
43
+
44
+ ### Compute Efficiency Score (0–100)
45
+
46
+ Start at 80.
47
+
48
+ Adjustments:
49
+ - Subtract `(confirmed_overuse_count / total_prompts) * 60` (overuse rate penalty)
50
+ - Subtract `(thinking_overuse_count / total_prompts) * 20` (reasoning overuse penalty)
51
+ - +10 if lighter tiers (balanced or light) appear in the tier distribution alongside any heavy usage (bonus for mixing down)
52
+ - +10 if the majority of prompts used balanced or light tiers
53
+
54
+ Clamp to 0–100.
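A sketch of the Compute Efficiency arithmetic, under assumptions that the rules above do not spell out: `tier_distribution` is taken to be a mapping from tier name to prompt count, "majority" is read as more than half of the prompts in that mapping, and the result is rounded to an integer for display.

```python
def compute_efficiency_score(confirmed_overuse_count, thinking_overuse_count,
                             total_prompts, tier_distribution):
    """Apply the Compute Efficiency adjustments to the 80-point baseline."""
    score = 80.0

    # Rate-based penalties; skipped when there are no prompts to divide by.
    if total_prompts > 0:
        score -= (confirmed_overuse_count / total_prompts) * 60
        score -= (thinking_overuse_count / total_prompts) * 20

    heavy = tier_distribution.get("heavy", 0)
    lighter = tier_distribution.get("balanced", 0) + tier_distribution.get("light", 0)
    tiered_total = sum(tier_distribution.values())

    # Bonus for mixing down: lighter tiers appear alongside some heavy usage.
    if heavy > 0 and lighter > 0:
        score += 10
    # Bonus when most prompts ran on balanced or light tiers.
    if tiered_total > 0 and lighter > tiered_total / 2:
        score += 10

    # Clamp to 0-100; rounding is an assumption made for display purposes.
    return round(max(0.0, min(100.0, score)))


example = compute_efficiency_score(
    confirmed_overuse_count=30,
    thinking_overuse_count=10,
    total_prompts=100,
    tier_distribution={"heavy": 80, "balanced": 15, "light": 5},
)
```

Here the example evaluates to 80 - 18 - 2 + 10 = 70: the mixing-down bonus applies, but the majority bonus does not because 80 of 100 prompts ran on the heavy tier.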
55
+
56
+ Grade mapping (apply to both scores): A (90+), B (80–89), C (70–79), D (60–69), F (<60).
57
+
58
+ Display as:
59
+
60
+ ```
61
+ ## Prompt Quality: 73/100 (C) | Compute Efficiency: 35/100 (F)
62
+ ```
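The grade mapping and the header line shown above can be reproduced with a short helper. The function names here are illustrative, not part of the skill.

```python
def grade(score):
    """Map a 0-100 score to a letter: A (90+), B (80-89), C (70-79), D (60-69), F (<60)."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


def score_header(prompt_quality, compute_efficiency):
    """Format both scores with their grades as a single markdown header line."""
    return (
        f"## Prompt Quality: {prompt_quality}/100 ({grade(prompt_quality)})"
        f" | Compute Efficiency: {compute_efficiency}/100 ({grade(compute_efficiency)})"
    )


header = score_header(73, 35)
```

Calling `score_header(73, 35)` yields the same line as the display example above.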
63
+
64
+ Follow each score with a one-liner. Make it funny. Make it teach something.
65
+
66
+ ---
67
+
68
+ ## Output Structure
69
+
70
+ ### 1. Dual Score & Grade
71
+
72
+ Show both scores with grades and one-liner commentary as shown above.
73
+
74
+ ### 2. Top 3 Habits to Break
75
+
76
+ Pick up to three by real impact: wasted tool calls, dangerous actions, correction frequency. If there are fewer than three real issues, show fewer; do NOT manufacture criticism. For each:
77
+
78
+ - **The habit** — a named pattern (e.g., "The Vague Opener")
79
+ - **Impact** — what actually went wrong, specifically (tool calls wasted, corrections needed, dangerous action)
80
+ - **The technique** — a named, reusable prompting move to fix it
81
+ - **Before / After** — actual quote from their prompts → a concrete rewrite
82
+
83
+ ### 3. Stats Dashboard
84
+
85
+ Format as a table. Lead with `effective_error_rate` and show the raw rate in parentheses for context.
86
+
87
+ | Metric | Value | Verdict |
88
+ |--------|-------|---------|
89
+ | Effective error rate | X% (Y% raw, Z% auto-recovered) | [verdict] |
90
+ | Correction rate | X% | [verdict] |
91
+ | Avg prompt length | X chars | [verdict] |
92
+ | Structured prompts (XML / markdown) | X% | [verdict] |
93
+ | File paths included | X% | [verdict] |
94
+
95
+ ### 4. Compute Efficiency Report
96
+
97
+ #### 4a. Where the Money Went
98
+
99
+ Show the tier distribution and overuse summary. If cost data is available, show dollar figures. If not, use counts and tier ratios.
100
+
101
+ | Metric | Value |
102
+ |--------|-------|
103
+ | Tier split | heavy X% / balanced Y% / light Z% |
104
+ | Confirmed overuse cases | N prompts |
105
+ | Worst overuse pattern | [task_type] at heavy tier |
106
+ | Reasoning overuse | N prompts with unnecessary extended reasoning |
107
+
108
+ #### 4b. Top 3 Compute Sins
109
+
110
+ For each of the top confirmed overuse cases (at most three; high/medium confidence only):
111
+ - **The sin** — what tier they used for what kind of task
112
+ - **The ratio** — approximately how much more expensive than necessary
113
+ - **The fix** — which tier to use for this type of task
114
+ - **Example** — actual prompt quote
115
+
116
+ If fewer than 3 confirmed cases exist, show fewer. Do not invent cases.
117
+
118
+ #### 4c. Model Selection Cheat Sheet
119
+
120
+ Based on their actual usage patterns, produce a 4–6 row personalised cheat sheet:
121
+
122
+ | Task pattern | Use this tier | Reasoning level | Notes |
123
+ |--------------|--------------|-----------------|-------|
124
+ | Read/show/list a file | light | none | 10–50× cheaper than heavy |
125
+ | One-word confirmation | light | none | Heavy has no work to do here |
126
+ | Single-file edit (clear scope) | balanced | low | Heavy adds nothing |
127
+ | Multi-file refactor | heavy | medium | Justified — needs context |
128
+ | Architecture / system design | heavy | high | Worth the premium |
129
+
130
+ Anchor cost comparisons to the tier ladder: light ≈ 1×, balanced ≈ 3–5×, heavy ≈ 10–50×.
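One way to keep "The ratio" lines consistent is a small lookup keyed by recommended and used tier. This is a sketch only: the ranges are the rough multiples quoted for `estimated_tier_ratio` (light→balanced ≈ 3–5×, light→heavy ≈ 10–50×, balanced→heavy ≈ 3–10×), while the function name and output wording are assumptions.

```python
# Rough cost multiples: (recommended tier, tier actually used) -> (low, high).
TIER_RATIO_RANGES = {
    ("light", "balanced"): (3, 5),
    ("light", "heavy"): (10, 50),
    ("balanced", "heavy"): (3, 10),
}


def overuse_ratio_note(recommended, used):
    """Describe roughly how much more the used tier costs than the recommended one."""
    ratio_range = TIER_RATIO_RANGES.get((recommended, used))
    if ratio_range is None:
        # Same tier, or a cheaper tier than recommended: not an overuse case.
        return "no overuse"
    low, high = ratio_range
    return f"~{low}–{high}× more expensive than {recommended}"


note = overuse_ratio_note("light", "heavy")
```

A lookup is used instead of dividing ladder values because the quoted multiples are rough ranges, not exact prices.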
131
+
132
+ #### 4d. What You Got Right (Compute)
133
+
134
+ Show 2–3 examples from the correctly-used heavy-model list where the premium tier was genuinely the right call. Be specific about why (long context, complex reasoning, multi-file work). This section teaches the user what good looks like, not just what bad looks like.
135
+
136
+ **Tone for the whole Compute Efficiency Report (section 4)**: money-burning humour. "You used a satellite dish to order a pizza." Compare costs to concrete things. Make fun of the absurdity, not the person. If the user already mixes tiers well, praise loudly instead.
137
+
138
+ ### 5. Technique Toolbox
139
+
140
+ List 3–5 named techniques extracted from the analysis. Each:
141
+
142
+ - **Name** — memorable, 2–4 words
143
+ - **When to use** — the situation that should trigger it
144
+ - **Template** — a fill-in-the-blank the user can copy verbatim
145
+
146
+ Example:
147
+
148
+ > **The 3W Rule** — When opening a new task
149
+ > Template: `[What] is broken in [Where]. Expected: [Why-expected]. Actual: [Why-actual].`
150
+
151
+ ### 6. What You Do Well
152
+
153
+ **Always required.** Find positives from the good-prompt examples. Be specific: name what the user did well and why it worked. This section must be genuine — do not manufacture praise, but do look hard for real wins. Both prompt quality and compute wins count here.
154
+
155
+ ### 7. Focus of the Week
156
+
157
+ One specific, actionable change to try over the next seven days.
158
+
159
+ Rules:
160
+ - One change only — not a list
161
+ - Concrete enough to practise consciously
162
+ - Measurable — the user can verify progress by running `/roast-me` next week
163
+
164
+ If the compute score is significantly lower than the prompt quality score, prioritise a compute-related focus.
165
+
166
+ Format: a one-sentence rule + before/after example.
167
+
168
+ ---
169
+
170
+ ## Tone Guidelines
171
+
172
+ - Roast, do not insult. Think comedy roast dinner, not street harassment.
173
+ - Every joke must teach something. If it doesn't teach, it's not in.
174
+ - Pop culture references welcome, jargon welcome — the user is a senior dev.
175
+ - If the user is genuinely good at prompting, acknowledge it. Manufacturing criticism for a good prompter is worse than silence.
176
+ - If the data is thin (< 30 prompts), say so and calibrate confidence accordingly.
177
+ - Self-deprecating AI humour is fine. Comparing the user to a bad prompter from the old days is fine. Personal insults are not.
178
+
179
+ ## Output
180
+
181
+ Produce clean markdown. Use headers, tables, and code blocks as shown. The report will be printed directly to the terminal, so keep formatting terminal-friendly (no wide tables if avoidable, use monospace for the before/after examples).
@@ -0,0 +1 @@
1
+ # roast-me runtime adapters package