@cubis/foundry 0.3.70 → 0.3.71

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35)
  1. package/package.json +1 -1
  2. package/workflows/powers/ask-questions-if-underspecified/SKILL.md +51 -3
  3. package/workflows/powers/behavioral-modes/SKILL.md +100 -9
  4. package/workflows/skills/agent-design/SKILL.md +198 -0
  5. package/workflows/skills/agent-design/references/clarification-patterns.md +153 -0
  6. package/workflows/skills/agent-design/references/skill-testing.md +164 -0
  7. package/workflows/skills/agent-design/references/workflow-patterns.md +226 -0
  8. package/workflows/skills/deep-research/SKILL.md +25 -20
  9. package/workflows/skills/deep-research/references/multi-round-research-loop.md +73 -8
  10. package/workflows/skills/frontend-design/SKILL.md +37 -32
  11. package/workflows/skills/frontend-design/commands/brand.md +167 -0
  12. package/workflows/skills/frontend-design/references/brand-presets.md +228 -0
  13. package/workflows/skills/generated/skill-audit.json +11 -2
  14. package/workflows/skills/generated/skill-catalog.json +37 -5
  15. package/workflows/skills/skills_index.json +1 -1
  16. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/SKILL.md +198 -0
  17. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/clarification-patterns.md +153 -0
  18. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/skill-testing.md +164 -0
  19. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/workflow-patterns.md +226 -0
  20. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/deep-research/SKILL.md +25 -20
  21. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/deep-research/references/multi-round-research-loop.md +73 -8
  22. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/SKILL.md +37 -32
  23. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/commands/brand.md +167 -0
  24. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/references/brand-presets.md +228 -0
  25. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/skills_index.json +1 -1
  26. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/SKILL.md +197 -0
  27. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/clarification-patterns.md +153 -0
  28. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/skill-testing.md +164 -0
  29. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/workflow-patterns.md +226 -0
  30. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/deep-research/SKILL.md +25 -20
  31. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/deep-research/references/multi-round-research-loop.md +73 -8
  32. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/SKILL.md +37 -32
  33. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/commands/brand.md +167 -0
  34. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/references/brand-presets.md +228 -0
  35. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/skills_index.json +1 -1
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@cubis/foundry",
- "version": "0.3.70",
+ "version": "0.3.71",
  "description": "Cubis Foundry CLI for workflow-first AI agent environments",
  "type": "module",
  "bin": {
@@ -1,17 +1,27 @@
  ---
  name: ask-questions-if-underspecified
- description: Clarify requirements before implementing. Use when serious doubts arise.
+ description: Clarify requirements before implementing. Use when serious doubts arise about objective, scope, constraints, environment, or safety — or when the task is substantial enough that being wrong wastes significant effort.
  ---

  # Ask Questions If Underspecified

  ## When to Use

- Use this skill when a request has multiple plausible interpretations or key details (objective, scope, constraints, environment, or safety) are unclear.
+ Use this skill when a request has multiple plausible interpretations or key details (objective, scope, constraints, environment, or safety) are unclear — **and** when the cost of implementing the wrong interpretation is significant.
+
+ Three situations require clarification:
+
+ 1. **High branching** — Multiple plausible interpretations produce significantly different implementations
+ 2. **Substantial deliverable** — The task is large enough that wrong assumptions waste real time
+ 3. **Safety-critical** — The action is hard to reverse (data migrations, deployments, file deletions)

  ## When NOT to Use

- Do not use this skill when the request is already clear, or when a quick, low-risk discovery read can answer the missing details.
+ Do not use this skill when:
+
+ - The request is already clear and one interpretation is obviously correct
+ - A quick discovery read (config files, existing patterns, repo structure) can answer the missing details faster than asking
+ - The task is small enough that being slightly wrong is cheap and correctable

  ## Goal

@@ -22,6 +32,7 @@ Ask the minimum set of clarifying questions needed to avoid wrong work; do not s
  ### 1) Decide whether the request is underspecified

  Treat a request as underspecified if after exploring how to perform the work, some or all of the following are not clear:
+
  - Define the objective (what should change vs stay the same)
  - Define "done" (acceptance criteria, examples, edge cases)
  - Define scope (which files/components/users are in/out)
@@ -36,6 +47,7 @@ If multiple plausible interpretations exist, assume it is underspecified.
  Ask 1-5 questions in the first pass. Prefer questions that eliminate whole branches of work.

  Make questions easy to answer:
+
  - Optimize for scannability (short, numbered questions; avoid paragraphs)
  - Offer multiple-choice options when possible
  - Suggest reasonable defaults when appropriate (mark them clearly as the default/recommended choice; bold the recommended choice in the list, or if you present options in a code block, put a bold "Recommended" line immediately above the block and also tag defaults inside the block)
@@ -47,10 +59,12 @@ Make questions easy to answer:
  ### 3) Pause before acting

  Until must-have answers arrive:
+
  - Do not run commands, edit files, or produce a detailed plan that depends on unknowns
  - Do perform a clearly labeled, low-risk discovery step only if it does not commit you to a direction (e.g., inspect repo structure, read relevant config files)

  If the user explicitly asks you to proceed without answers:
+
  - State your assumptions as a short numbered list
  - Ask for confirmation; proceed only after they confirm or correct them

@@ -83,3 +97,37 @@ Reply with: defaults (or 1a 2a)

  - Don't ask questions you can answer with a quick, low-risk discovery read (e.g., configs, existing patterns, docs).
  - Don't ask open-ended questions if a tight multiple-choice or yes/no would eliminate ambiguity faster.
+ - Don't ask more than 5 questions at once — rank by impact and ask the top ones.
+ - Don't skip the fast-path — every clarification block needs a `defaults` shortcut.
+ - Don't forget to restate your interpretation before proceeding — it confirms you heard correctly.
+ - Don't ask about reversible decisions — pick one, proceed, let them correct if wrong.
+
+ ## Three-Stage Pattern (for complex or substantial tasks)
+
+ For tasks where wrong assumptions would waste significant effort — documents, architecture decisions, multi-file features — use a three-stage approach:
+
+ ### Stage 1: Meta-context questions (3-5 questions)
+
+ Ask about the big picture before touching content:
+
+ - What _type_ of deliverable is this? (spec, code, doc, design, plan)
+ - Who's the audience/consumer?
+ - What does "done" look like?
+ - Existing template, format, or precedent to follow?
+ - Hard constraints (framework, performance, compatibility)?
+
+ ### Stage 2: Info dump + targeted follow-up
+
+ After Stage 1 answers: invite the user to brain-dump everything relevant.
+
+ > "Dump everything you know — background, prior decisions, constraints, opinions, blockers. Don't organize it. Just get it all out."
+
+ Then ask 5-10 targeted follow-up questions based on gaps. Users can answer in shorthand (`1: yes, 2: see above, 3: no`).
+
+ **Exit Stage 2 when:** You understand objective, constraints, and at least one clear definition of success.
+
+ ### Stage 3: Confirm interpretation, then proceed
+
+ Restate in 1-3 sentences before starting:
+
+ > "Here's what I understand: [objective]. [Key constraint]. [What done looks like]. Starting now — correct me if anything's off."
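The `defaults` fast-path and the shorthand replies this skill describes ("defaults", "1a 2a") imply a small parsing contract. A minimal sketch in Python, assuming a hypothetical list-of-dicts question structure (not part of the package):

```python
# Sketch: resolve a user's shorthand reply against a numbered clarification
# block. The question/option structure here is illustrative only.
def resolve_reply(reply, questions):
    """questions: list of dicts like
    {"options": {"a": "REST", "b": "GraphQL"}, "default": "a"}.
    Returns one chosen option value per question."""
    reply = reply.strip().lower()
    if reply == "defaults":
        # Fast-path: accept every marked default in one word.
        return [q["options"][q["default"]] for q in questions]
    chosen = {}
    for token in reply.split():          # tokens like "1a", "2b"
        number, letter = int(token[:-1]), token[-1]
        chosen[number] = letter
    # Unanswered questions fall back to their default.
    return [
        q["options"][chosen.get(i, q["default"])]
        for i, q in enumerate(questions, start=1)
    ]

questions = [
    {"options": {"a": "REST", "b": "GraphQL"}, "default": "a"},
    {"options": {"a": "SQLite", "b": "Postgres"}, "default": "b"},
]
print(resolve_reply("defaults", questions))  # both defaults chosen
print(resolve_reply("1a 2a", questions))     # explicit picks
```

The point of the contract: a one-word reply must be enough to accept every recommended default, and partial answers should fall back to defaults rather than stall.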
@@ -7,6 +7,7 @@ allowed-tools: Read, Glob, Grep
  # Behavioral Modes - Adaptive AI Operating Modes

  ## Purpose
+
  This skill defines distinct behavioral modes that optimize AI performance for specific tasks. Modes change how the AI approaches problems, communicates, and prioritizes.

  ---
@@ -18,6 +19,7 @@ This skill defines distinct behavioral modes that optimize AI performance for sp
  **When to use:** Early project planning, feature ideation, architecture decisions

  **Behavior:**
+
  - Ask clarifying questions before assumptions
  - Offer multiple alternatives (at least 3)
  - Think divergently - explore unconventional solutions
@@ -25,6 +27,7 @@ This skill defines distinct behavioral modes that optimize AI performance for sp
  - Use visual diagrams (mermaid) to explain concepts

  **Output style:**
+
  ```
  "Let's explore this together. Here are some approaches:

@@ -46,6 +49,7 @@ What resonates with you? Or should we explore a different direction?"
  **When to use:** Writing code, building features, executing plans

  **Behavior:**
+
  - **CRITICAL: Use `clean-code` skill standards** - concise, direct, no verbose explanations
  - Fast execution - minimize questions
  - Use established patterns and best practices
@@ -57,6 +61,7 @@ What resonates with you? Or should we explore a different direction?"
  - **NO RUSHING** - Quality > Speed. Read ALL references before coding.

  **Output style:**
+
  ```
  [Code block]

@@ -64,6 +69,7 @@ What resonates with you? Or should we explore a different direction?"
  ```

  **NOT:**
+
  ```
  "Building [feature]...

@@ -83,6 +89,7 @@ Run `npm run dev` to test."
  **When to use:** Fixing bugs, troubleshooting errors, investigating issues

  **Behavior:**
+
  - Ask for error messages and reproduction steps
  - Think systematically - check logs, trace data flow
  - Form hypothesis → test → verify
@@ -90,6 +97,7 @@ Run `npm run dev` to test."
  - Prevent future occurrences

  **Output style:**
+
  ```
  "Investigating...

@@ -106,6 +114,7 @@ Run `npm run dev` to test."
  **When to use:** Code review, architecture review, security audit

  **Behavior:**
+
  - Be thorough but constructive
  - Categorize by severity (Critical/High/Medium/Low)
  - Explain the "why" behind suggestions
@@ -113,6 +122,7 @@ Run `npm run dev` to test."
  - Acknowledge what's done well

  **Output style:**
+
  ```
  ## Code Review: [file/feature]

@@ -133,6 +143,7 @@ Run `npm run dev` to test."
  **When to use:** Explaining concepts, documentation, onboarding

  **Behavior:**
+
  - Explain from fundamentals
  - Use analogies and examples
  - Progress from simple to complex
@@ -140,6 +151,7 @@ Run `npm run dev` to test."
  - Check understanding

  **Output style:**
+
  ```
  ## Understanding [Concept]

@@ -163,6 +175,7 @@ Run `npm run dev` to test."
  **When to use:** Production deployment, final polish, release preparation

  **Behavior:**
+
  - Focus on stability over features
  - Check for missing error handling
  - Verify environment configs
@@ -170,6 +183,7 @@ Run `npm run dev` to test."
  - Create deployment checklist

  **Output style:**
+
  ```
  ## Pre-Ship Checklist

@@ -195,35 +209,111 @@ Run `npm run dev` to test."

  The AI should automatically detect the appropriate mode based on:

- | Trigger | Mode |
- |---------|------|
- | "what if", "ideas", "options" | BRAINSTORM |
- | "build", "create", "add" | IMPLEMENT |
- | "not working", "error", "bug" | DEBUG |
- | "review", "check", "audit" | REVIEW |
- | "explain", "how does", "learn" | TEACH |
- | "deploy", "release", "production" | SHIP |
+ | Trigger | Mode |
+ | ---------------------------------------------- | ------------------- |
+ | "what if", "ideas", "options" | BRAINSTORM |
+ | "build", "create", "add" | IMPLEMENT |
+ | "not working", "error", "bug" | DEBUG |
+ | "review", "check", "audit" | REVIEW |
+ | "explain", "how does", "learn" | TEACH |
+ | "deploy", "release", "production" | SHIP |
+ | "iterate", "refine quality", "not good enough" | EVALUATOR-OPTIMIZER |
+
+ ---
+
+ ## Workflow Patterns
+
+ Three patterns govern how modes combine across multiple agents or steps. Use the simplest pattern that solves the problem — add complexity only when it measurably improves results.
+
+ ### 1. Sequential (default)
+
+ Use when tasks have dependencies — each step needs the previous step's output.
+
+ ```
+ [BRAINSTORM] → [IMPLEMENT] → [REVIEW] → [SHIP]
+ ```
+
+ Best for: multi-stage features, draft-review-polish cycles, data pipelines.
+
+ ### 2. Parallel
+
+ Use when tasks are independent and doing them one at a time is too slow.
+
+ ```
+ [security REVIEW + performance REVIEW + quality REVIEW] → synthesize
+ ```
+
+ Best for: code review across multiple dimensions, parallel analysis. Requires a clear aggregation strategy before starting.
+
+ ### 3. Evaluator-Optimizer (new)
+
+ Use when first-draft quality consistently falls short and quality is measurable.
+
+ ```
+ [IMPLEMENT] → [REVIEW with criteria] → pass? → done
+ ↓ fail
+ feedback → [IMPLEMENT again]
+ ```
+
+ **When to use:**
+
+ - Technical docs, customer communications, SQL queries against specific standards
+ - Any output where the gap between first attempt and required quality is significant
+ - When you have clear, checkable criteria (not just "make it better")
+
+ **When NOT to use:**
+
+ - First-attempt quality is already acceptable
+ - Criteria are too subjective for consistent AI evaluation
+ - Real-time use cases needing immediate responses
+ - Deterministic validators exist (linters, schema validators) — use those instead
+
+ **Implementation:**
+
+ ```
+ ## Generator
+ Task: [what to create]
+ Constraints: [specific, measurable requirements — these become eval criteria]
+
+ ## Evaluator
+ Criteria:
+ 1. [Criterion A] — Pass/Fail + specific failure note
+ 2. [Criterion B] — Pass/Fail + specific failure note
+
+ Output JSON: { "pass": bool, "failures": ["..."], "revision_note": "..." }
+
+ Max iterations: 3 ← always set a ceiling
+ Stop when: all criteria pass OR max iterations reached
+ ```
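The Generator/Evaluator skeleton in that Implementation block can be rendered as a driver loop. A minimal sketch, where `generate` and `evaluate` are hypothetical stand-ins for model calls and the JSON shape mirrors the template:

```python
import json

MAX_ITERATIONS = 3  # always set a ceiling

def run_loop(task, criteria, generate, evaluate):
    """generate(task, feedback) -> draft; evaluate(draft, criteria) -> JSON string
    shaped like {"pass": bool, "failures": [...], "revision_note": "..."}."""
    feedback = None
    for _ in range(MAX_ITERATIONS):
        draft = generate(task, feedback)
        verdict = json.loads(evaluate(draft, criteria))
        if verdict["pass"]:
            return draft                     # all criteria passed
        feedback = verdict["revision_note"]  # feed failures back to the generator
    return draft                             # ceiling reached: return best effort

# Demo with deterministic stand-ins: the "model" succeeds on the second try.
drafts = iter(["draft v1", "draft v2"])
gen = lambda task, fb: next(drafts)
ev = lambda d, c: json.dumps(
    {"pass": d.endswith("v2"), "failures": [], "revision_note": "tighten"})
result = run_loop("write summary", ["ends with v2"], gen, ev)
```

Note the two stop conditions from the template are both present: return on pass, and return best-effort when the iteration ceiling is hit, so the loop can never spin forever.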

  ---

- ## Multi-Agent Collaboration Patterns (2025)
+ ## Multi-Agent Collaboration Patterns

  Modern architectures optimized for agent-to-agent collaboration:

  ### 1. 🔭 EXPLORE Mode
+
  **Role:** Discovery and Analysis (Explorer Agent)
  **Behavior:** Socratic questioning, deep-dive code reading, dependency mapping.
  **Output:** `discovery-report.json`, architectural visualization.

  ### 2. 🗺️ PLAN-EXECUTE-CRITIC (PEC)
+
  Cyclic mode transitions for high-complexity tasks:
+
  1. **Planner:** Decomposes the task into atomic steps (`task.md`).
  2. **Executor:** Performs the actual coding (`IMPLEMENT`).
  3. **Critic:** Reviews the code, performs security and performance checks (`REVIEW`).

  ### 3. 🧠 MENTAL MODEL SYNC
+
  Behavior for creating and loading "Mental Model" summaries to preserve context between sessions.

+ ### 4. 🔄 EVALUATOR-OPTIMIZER
+
+ Paired agents in an iterative quality loop: Generator produces, Evaluator scores against criteria, Generator refines. Set a max-iteration ceiling before starting.
+
  ---

  ## Combining Modes
@@ -239,4 +329,5 @@ Users can explicitly request a mode:
  /implement the user profile page
  /debug why login fails
  /review this pull request
+ /iterate [target quality bar] ← triggers evaluator-optimizer
  ```
@@ -0,0 +1,198 @@
+ ---
+ name: agent-design
+ description: "Use when designing, building, or improving a CBX agent, skill, or workflow: clarification strategy, progressive disclosure structure, workflow pattern selection (sequential, parallel, evaluator-optimizer), skill type taxonomy, description tuning, and eval-first testing."
+ license: MIT
+ metadata:
+ author: cubis-foundry
+ version: "1.0"
+ compatibility: Claude Code, Codex, GitHub Copilot, Gemini CLI
+ ---
+
+ # Agent Design
+
+ ## Purpose
+
+ You are the specialist for designing CBX agents and skills that behave intelligently — asking the right questions, knowing when to pause, executing in the right workflow pattern, and testing their own output.
+
+ Your job is to close the gap between "it kinda works" and "it works reliably under any input."
+
+ ## When to Use
+
+ - Designing or refactoring a SKILL.md or POWER.md
+ - Choosing between sequential, parallel, or evaluator-optimizer workflow
+ - Writing clarification logic for an agent that handles ambiguous requests
+ - Deciding whether a task needs a skill or just a prompt
+ - Testing whether a skill actually works as intended
+ - Writing descriptions that trigger the right skill at the right time
+
+ ## Core Principles
+
+ These come directly from Anthropic's agent engineering research (["Equipping agents for the real world"](https://claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills), March 2026):
+
+ 1. **Progressive disclosure** — A skill's SKILL.md provides just enough context to know when to load it. Full instructions, references, and scripts are loaded lazily, only when needed. More context in a single file does not equal better behavior — it usually hurts it.
+
+ 2. **Eval before optimizing** — Define what "good looks like" (test cases + success criteria) before editing the skill. This prevents regression and tells you when improvement actually happened.
+
+ 3. **Description precision** — The `description` field in YAML frontmatter controls triggering. Too broad = false positives. Too narrow = the skill never fires. Tune it like a search query.
+
+ 4. **Two skill types** — See [Skill Type Taxonomy](#skill-type-taxonomy). These need different testing strategies and have different shelf lives.
+
+ 5. **Start with a single agent** — Before adding workflow complexity, first try a single agent with a rich prompt. Only add orchestration when it measurably improves results.
+
+ ## Skill Type Taxonomy
+
+ | Type | What it does | Testing goal | Shelf life |
+ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------- | ------------------------------------------------------- |
+ | **Capability uplift** | Teaches Claude to do something it can't do alone (e.g. manipulate PDFs, fill forms, use a domain-specific API) | Verify the output is correct and consistent | Medium — may become obsolete as models improve |
+ | **Encoded preference** | Sequences steps Claude could do individually, but in your team's specific order and style (e.g. NDA review checklist, weekly update format) | Verify fidelity to the actual workflow | High — these stay useful because they're uniquely yours |
+
+ Design question: "Is this skill teaching Claude something new, or encoding how we do things?"
+
+ ## Clarification Strategy
+
+ An agent that starts wrong wastes everyone's time. Smart agents pause at the right moments.
+
+ Load `references/clarification-patterns.md` when:
+
+ - Designing how a skill should handle ambiguous or underspecified inputs
+ - Writing the early steps of a workflow where user intent matters
+ - Deciding what questions to ask vs. what to infer
+
+ ## Workflow Pattern Selection
+
+ Three patterns cover 95% of production agent workflows:
+
+ | Pattern | Use when | Cost | Benefit |
+ | ----------------------- | --------------------------------------------------------------- | ----------------------- | ----------------------------------------- |
+ | **Sequential** | Steps have dependencies (B needs A's output) | Latency (linear) | Focus: each step does one thing well |
+ | **Parallel** | Steps are independent and concurrency helps | Tokens (multiplicative) | Speed + separation of concerns |
+ | **Evaluator-optimizer** | First-draft quality isn't good enough and quality is measurable | Tokens × iterations | Better output through structured feedback |
+
+ Default to sequential. Add parallel when latency is the bottleneck and tasks are genuinely independent. Add evaluator-optimizer only when you can measure the improvement.
+
+ Load `references/workflow-patterns.md` for the full decision tree, examples, and anti-patterns.
+
+ ## Progressive Disclosure Structure
+
+ A well-structured CBX skill looks like:
+
+ ```
+ skill-name/
+ SKILL.md ← lean entry: name, description, purpose, when-to-use, load-table
+ references/ ← detailed guides loaded lazily when step requires it
+ topic-a.md
+ topic-b.md
+ commands/ ← slash commands (optional)
+ command.md
+ scripts/ ← executable code (optional)
+ helper.py
+ ```
+
+ **SKILL.md should be loadable in <2000 tokens.** Everything else lives in references.
+
+ The metadata table pattern that works:
+
+ ```markdown
+ ## References
+
+ | File | Load when |
+ | ----------------------- | ------------------------------------------ |
+ | `references/topic-a.md` | Task involves [specific trigger condition] |
+ | `references/topic-b.md` | Task involves [specific trigger condition] |
+ ```
+
+ This lets the agent make intelligent decisions about what context to load rather than ingesting everything upfront.
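The "<2000 tokens" budget for SKILL.md can be enforced with a cheap pre-commit style check. A minimal sketch, assuming the common ~4-characters-per-token heuristic for English prose (an approximation, not an exact tokenizer):

```python
# Rough budget check for a SKILL.md entry file.
# ~4 chars/token is a heuristic for English prose, not an exact count.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def within_budget(skill_md_text: str, budget: int = 2000) -> bool:
    return estimate_tokens(skill_md_text) <= budget

lean = "# My Skill\n\nUse when ...\n" * 10   # small entry file, well under budget
bloated = "x" * 20000                        # ~5000 "tokens" of inlined reference docs
```

A SKILL.md that fails this check is usually a candidate for splitting: keep the when-to-use and load-table in the entry file, and push everything else into `references/`.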
+
+ ## Description Writing
+
+ The `description` field is a trigger — write it like a search query, not marketing copy.
+
+ **Good description:**
+
+ ```yaml
+ description: "Use when evaluating an agent, skill, workflow, or MCP server: rubric design, evaluator-optimizer loops, LLM-as-judge patterns, regression suites, or prototype-vs-production quality gaps."
+ ```
+
+ **Bad description:**
+
+ ```yaml
+ description: "A comprehensive skill for evaluating things and making sure they work well."
+ ```
+
+ Rules:
+
+ - Lead with the specific trigger verb: "Use when [user does X]"
+ - List the specific task types with commas — these act like search keywords
+ - Include domain-specific nouns the user would actually type
+ - Avoid generic adjectives ("comprehensive", "powerful", "advanced")
+
+ Test your description: would a user's natural-language request match the intent of these words?
+
+ ## Testing a Skill
+
+ Before shipping, verify with this checklist:
+
+ 1. **Positive trigger** — Does the skill load when it should? Test 5 natural phrasings of the target task.
+ 2. **Negative trigger** — Does it stay quiet when it shouldn't load? Test 5 near-miss phrasings.
+ 3. **Happy path** — Does the skill complete the standard task correctly?
+ 4. **Edge cases** — What happens with missing input, ambiguous phrasing, or edge-case content?
+ 5. **Reader test** — Run the delivery (e.g., a generated doc, a plan) through a fresh sub-agent with no context. Can it answer questions about the output correctly?
+
+ For formal regression suites, load `references/skill-testing.md`.
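The positive/negative trigger checks can be smoke-tested offline. Real triggering is decided by the model reading the frontmatter, so the keyword-overlap check below is only a cheap proxy, and the description and phrasings are illustrative, not from the package:

```python
# Offline proxy for trigger testing: does a phrasing share any keywords with
# the description? The actual trigger decision is made by the model, so treat
# a failure here as a hint to revisit the description's nouns, not as ground truth.
def overlaps(description: str, phrasing: str, min_hits: int = 1) -> bool:
    desc_words = {w.strip('.,:"').lower() for w in description.split()}
    hits = sum(1 for w in phrasing.lower().split() if w.strip('.,?') in desc_words)
    return hits >= min_hits

description = ("Use when designing, building, or improving an agent skill: "
               "clarification strategy, workflow pattern selection, description tuning.")
positive = ["help me improve this skill", "which workflow pattern should I use"]
negative = ["what's the weather today", "translate this sentence to French"]
```

Running the five positive and five near-miss phrasings from the checklist through a harness like this catches descriptions that drifted toward generic adjectives with no searchable nouns.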
+
+ ## Instructions
+
+ ### Step 1 — Understand the design task
+
+ Before touching any file, clarify:
+
+ - Is this a new skill or improving an existing one?
+ - Is it capability uplift or encoded preference?
+ - What's the specific failure mode being fixed?
+ - What would passing look like?
+
+ If any of these are unclear, apply the clarification pattern from `references/clarification-patterns.md`.
+
+ ### Step 2 — Choose the structure
+
+ - If the skill is simple (single task, single purpose): lean SKILL.md with no references
+ - If the skill is complex (multiple phases, conditional logic): SKILL.md + references loaded lazily
+ - If the skill has reusable commands: add `commands/` directory
+
+ ### Step 3 — Design the workflow
+
+ Use the pattern selection table above. Start with sequential. Prove you need complexity before adding it.
+
+ ### Step 4 — Write the description
+
+ Write it last. Once you know what the skill does and how it differs from adjacent skills, the right description is usually obvious.
+
+ ### Step 5 — Define a test
+
+ Write at least 3 test cases (input → expected output or behavior) before considering the skill done. These become the regression suite.
+
+ ## Output Format
+
+ Deliver:
+
+ 1. **Skill structure** — directory layout, file list
+ 2. **SKILL.md** — production-ready with lean body and reference table
+ 3. **Reference files** — if needed, each scoped to a specific phase or topic
+ 4. **Test cases** — 3-5 natural language inputs with expected behaviors
+ 5. **Description** — the final `description` field, tuned for triggering
+
+ ## References
+
+ | File | Load when |
+ | -------------------------------------- | ------------------------------------------------------------------------------ |
+ | `references/clarification-patterns.md` | Designing how the agent handles ambiguous or underspecified input |
+ | `references/workflow-patterns.md` | Choosing or implementing sequential, parallel, or evaluator-optimizer workflow |
+ | `references/skill-testing.md` | Writing evals, regression sets, or triggering tests for a skill |
+
+ ## Examples
+
+ - "Design a skill for our NDA review process — it should follow our checklist exactly."
+ - "The feature-forge skill triggers on the wrong prompts. Help me fix the description."
+ - "How do I test whether my skill still works after a model update?"
+ - "I need a workflow where 3 agents review code in parallel then one synthesizes findings."
+ - "This skill's SKILL.md is 4000 tokens. Help me split it into lean structure with references."