@cubis/foundry 0.3.70 → 0.3.71
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/workflows/powers/ask-questions-if-underspecified/SKILL.md +51 -3
- package/workflows/powers/behavioral-modes/SKILL.md +100 -9
- package/workflows/skills/agent-design/SKILL.md +198 -0
- package/workflows/skills/agent-design/references/clarification-patterns.md +153 -0
- package/workflows/skills/agent-design/references/skill-testing.md +164 -0
- package/workflows/skills/agent-design/references/workflow-patterns.md +226 -0
- package/workflows/skills/deep-research/SKILL.md +25 -20
- package/workflows/skills/deep-research/references/multi-round-research-loop.md +73 -8
- package/workflows/skills/frontend-design/SKILL.md +37 -32
- package/workflows/skills/frontend-design/commands/brand.md +167 -0
- package/workflows/skills/frontend-design/references/brand-presets.md +228 -0
- package/workflows/skills/generated/skill-audit.json +11 -2
- package/workflows/skills/generated/skill-catalog.json +37 -5
- package/workflows/skills/skills_index.json +1 -1
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/SKILL.md +198 -0
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/clarification-patterns.md +153 -0
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/skill-testing.md +164 -0
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/workflow-patterns.md +226 -0
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/deep-research/SKILL.md +25 -20
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/deep-research/references/multi-round-research-loop.md +73 -8
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/SKILL.md +37 -32
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/commands/brand.md +167 -0
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/references/brand-presets.md +228 -0
- package/workflows/workflows/agent-environment-setup/platforms/claude/skills/skills_index.json +1 -1
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/SKILL.md +197 -0
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/clarification-patterns.md +153 -0
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/skill-testing.md +164 -0
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/workflow-patterns.md +226 -0
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/deep-research/SKILL.md +25 -20
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/deep-research/references/multi-round-research-loop.md +73 -8
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/SKILL.md +37 -32
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/commands/brand.md +167 -0
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/references/brand-presets.md +228 -0
- package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/skills_index.json +1 -1
@@ -0,0 +1,153 @@

# Clarification Patterns Reference

Load this when designing how an agent handles ambiguous, underspecified, or multi-interpretation input.

Source: Anthropic doc-coauthoring skill pattern + CBX ask-questions-if-underspecified research (2026).

---

## When to Clarify vs. When to Infer

The wrong default is to ask everything. The right default is to ask what genuinely branches the work.

**Clarify** when:

- Multiple plausible interpretations produce significantly different implementations
- The wrong interpretation wastes significant time or produces the wrong output
- A key parameter (scope, audience, constraint) changes the entire approach

**Infer and state assumptions** when:

- A quick read (repo structure, config file, existing code) can answer the question
- The request is clear for 90%+ of the obvious interpretations
- The user explicitly asked you to proceed

**Proceed without asking** when:

- The task is clear and unambiguous
- Discovery is faster than asking
- The cost of being slightly wrong is low and reversible

---

## The 1-5 Question Rule

Ask at most **5 questions** in the first pass. Prefer questions that eliminate entire branches of work.

If more than 5 things are unclear, rank by impact and ask the highest-impact ones first. More questions can surface after the user's first answers.

---

## Fast-Path Design

Every clarification block should have a fast path. Users who know what they want shouldn't wade through 5 questions.

**Always include:**

- A compact reply format: `"Reply 1b 2a 3c to accept these options"`
- Default options explicitly labeled: `(default)` or _bolded_
- A fast-path shortcut: `"Reply 'defaults' to accept all recommended choices"`

**Example block:**

```
Before I start, a few quick questions:

1. **Scope?**
   a) Only the requested function **(default)**
   b) Refactor any touched code
   c) Not sure — use default

2. **Framework target?**
   a) Match existing project **(default)**
   b) Specify: ___

3. **Test coverage?**
   a) None needed **(default)**
   b) Unit tests alongside
   c) Full integration test

Reply with numbers and letters (e.g., `1a 2a 3b`) or `defaults` to proceed with all defaults.
```
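The compact reply format above is easy to resolve mechanically. A minimal sketch, not part of the skill itself: the question IDs, option letters, and the `resolve_reply` helper are illustrative assumptions.

```python
# Hypothetical helper: resolve a compact clarification reply ("1a 2b" or
# "defaults") against the options marked (default) for each question.
import re

DEFAULTS = {"1": "a", "2": "a", "3": "a"}  # option marked (default) per question

def resolve_reply(reply: str, defaults: dict[str, str] = DEFAULTS) -> dict[str, str]:
    """Return {question: option}, falling back to defaults for anything unstated."""
    choices = dict(defaults)
    if reply.strip().lower() == "defaults":
        return choices
    for num, letter in re.findall(r"(\d+)\s*([a-z])", reply.lower()):
        if num in choices:
            choices[num] = letter
    return choices
```

With this shape, `resolve_reply("1a 2b")` overrides question 2 while keeping the question-3 default, which is exactly the fast-path behavior the block promises.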

---

## Three-Stage Context Gathering (for complex tasks)

Use this when a task is substantial enough that getting it wrong means significant wasted work. Borrowed from Anthropic's doc-coauthoring skill.

### Stage 1: Initial Questions (meta-context)

Ask 3-5 questions about the big-picture framing before touching the content:

- What type of deliverable is this? (spec, code, doc, design, plan)
- Who's the audience / consumer of this output?
- What's the definition of done — what would make this clearly successful?
- Are there constraints (framework, format, performance bar, audience knowledge level)?
- Is there an existing template or precedent to follow?

Tell the user they can answer in shorthand. Offer: "Or just dump your context and I'll ask follow-ups."

### Stage 2: Info Dump + Follow-up

After initial answers, invite a full brain dump:

> "Dump everything you know about this — background, prior decisions, constraints, blockers, opinions. Don't organize it, just get it out."

Then ask targeted follow-up questions based on gaps in what they provided. Aim for 5-10 numbered follow-ups. Users can use shorthand (e.g., "1: yes, 2: see previous context, 3: no").

**Exit condition for Stage 2:** You understand the objective, the constraints, and at least one clear definition of success.

### Stage 3: Confirm Interpretation, Then Proceed

Restate the requirements in 1-3 sentences before starting work:

> "Here's my understanding: [objective in one sentence]. [Key constraint]. [What done looks like]. Starting now — let me know if anything's off."

---

## Reader Test (for deliverables)

When the deliverable is substantial (a plan, a document, a design decision), test it with a fresh context before handing it to the user.

**How:** Invoke a sub-agent or fresh prompt with only the deliverable (no conversation history) and ask:

- "What is this about?"
- "What are the key decisions made here?"
- "What's missing or unclear?"

If the fresh read surfaces gaps the user would have found, fix them first.

**When to use:** After generating complex plans, multi-section documents, architecture decisions, or any output that will be read by someone without conversation context.
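The "How" above can be sketched as a small harness. `ask_fresh_agent` is a hypothetical callable standing in for however your stack runs a prompt in a fresh context; only the prompt construction follows the skill text.

```python
# Sketch of the reader test. `ask_fresh_agent` is a hypothetical callable that
# runs a prompt with no conversation history and returns the answer text.
from typing import Callable

READER_QUESTIONS = [
    "What is this about?",
    "What are the key decisions made here?",
    "What's missing or unclear?",
]

def reader_test(deliverable: str, ask_fresh_agent: Callable[[str], str]) -> list[str]:
    """Ask each reader question against the deliverable alone; return the answers."""
    answers = []
    for question in READER_QUESTIONS:
        prompt = f"{deliverable}\n\n---\n\n{question}"
        answers.append(ask_fresh_agent(prompt))
    return answers
```

Review the returned answers for gaps a context-free reader would hit, and fix those before handing the deliverable over.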

---

## Clarification Anti-Patterns

Avoid these:

| Anti-pattern                         | Problem                                                      |
| ------------------------------------ | ------------------------------------------------------------ |
| Asking everything upfront            | Overwhelms users; many questions are answerable by inference |
| Asking about things you can discover | Read the file/repo before asking about it                    |
| No default options                   | Forces users to reason through every option                  |
| Open-ended questions without choices | High friction; users don't know the option space             |
| Not restating interpretation         | User doesn't know what you understood                        |
| Asking the same question twice       | Signals you didn't read the answer                           |
| Asking about reversible decisions    | Just pick one and move; it can be changed                    |

---

## Decision: Which Pattern to Use

```
Is the task clear and unambiguous?
→ YES: Proceed. State assumptions inline if any.
→ NO: Is missing info discoverable by reading files/code?
   → YES: Read first, then proceed or ask a single targeted question.
   → NO: Is this a quick task where wrong interpretation is cheap?
      → YES: Proceed with stated assumptions, invite correction.
      → NO: Use the 1-5 Question Rule or Three-Stage Context Gathering.
```

Use Three-Stage context gathering only for substantial deliverables (docs, plans, architecture, complex features). For code tasks, the 1-5 question rule is usually sufficient.

@@ -0,0 +1,164 @@

# Skill Testing Reference

Load this when writing evals, regression sets, or description-triggering tests for a CBX skill.

Source: Anthropic skill-creator research — [Improving skill-creator: Test, measure, and refine Agent Skills](https://claude.com/blog/improving-skill-creator-test-measure-and-refine-agent-skills) (March 2026).

---

## Two Reasons to Test

1. **Catch regressions** — As models and infrastructure evolve, skills that worked last month may behave differently. Evals give you an early signal before it impacts your team.
2. **Know when the skill is obsolete** — For _capability uplift_ skills: if the base model starts passing your evals without the skill loaded, the skill has been incorporated into model behavior and can be retired.

---

## Five Test Categories

Every skill should pass all five before shipping.

### 1. Trigger tests (description precision)

Does the skill load when it should — and stay quiet when it shouldn't?

**Method:**

- Write 5 natural-language prompts that _should_ trigger the skill
- Write 5 near-miss prompts that _should not_ trigger
- Load the skill and observe whether it activates

**Example for a frontend-design skill:**

```
Should trigger:
- "Build me a landing page for my SaaS product"
- "Make this dashboard look less generic"
- "I need a color system for a health app"

Should NOT trigger:
- "Fix this TypeScript error"
- "Review my API endpoint design"
- "Help me write tests"
```
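The method above reduces to two list comparisons. A sketch, assuming a hypothetical `activates(prompt) -> bool` predicate for however your stack observes that the skill loaded:

```python
# Sketch of a trigger-test run. `activates` is a hypothetical predicate that
# reports whether the skill activated for a given prompt.
from typing import Callable

def trigger_test(should: list[str], should_not: list[str],
                 activates: Callable[[str], bool]) -> dict[str, list[str]]:
    """Return the prompts the description got wrong, grouped by failure type."""
    return {
        "false_negatives": [p for p in should if not activates(p)],
        "false_positives": [p for p in should_not if activates(p)],
    }
```

Non-empty lists tell you which direction to tune the description: false positives mean it is too broad, false negatives mean it is too narrow.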

**Fix:** If false positives occur, make the description more specific. If false negatives, broaden or add domain keywords.

### 2. Happy path test

Does the skill complete its standard task correctly?

**Method:**

- Write the most common, straightforward version of the task the skill handles
- Run it and verify the output meets the expected criteria

### 3. Edge case tests

What happens under abnormal or missing input?

Examples:

- Missing required information (no brand color, no framework specified)
- Ambiguous phrasing
- Conflicting requirements
- Very large or very small input
- The user ignored the clarification questions and just said "do it"

### 4. Comparison test (A/B)

Does the skill actually improve output vs. no skill?

**Method:** Run the same prompt with and without the skill loaded. Judge which output is better — ideally with a fresh evaluator agent that doesn't know which is which.

If the no-skill output is equivalent, the skill adds no value (or the model has caught up to it).

### 5. Reader test

Can someone with no conversation context understand the skill's output?

**Method:**

- Take the skill's final output (plan, document, code, design)
- Open a fresh conversation or use a sub-agent with only the output, no history
- Ask: "What is this?", "What are the key decisions?", "What's unclear?"

If the fresh reader struggles, the output has context bleed issues. Fix them before shipping.

---

## Writing Eval Cases

Each eval case = one input + expected behavior description.

**Format:**

```
Input: [natural language prompt or file + prompt]
Expected:
- [Observable behavior 1]
- [Observable behavior 2]
- [Observable behavior 3 — what NOT to happen]
```

**Example for `ask-questions-if-underspecified`:**

```
Input: "Build me a feature."
Expected:
- Asks at least 1 clarifying question (scope, purpose, or constraints)
- Provides default options to choose from
- Does NOT immediately generate code
- Does NOT ask more than 5 questions
```

**Rules:**

- Evals should be independent (not dependent on previous evals)
- Expected behavior should be observable and binary (pass/fail, not subjective)
- Aim for 5-10 evals per skill before shipping; 15+ for critical skills
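Because expected behaviors are observable and binary, each one maps naturally onto a predicate over the agent's output. A sketch under that assumption; the check names and heuristics are illustrative, not part of the skill:

```python
# Sketch: one eval case = input plus named, binary checks on the agent's output.
from typing import Callable

def run_eval(output: str, checks: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Apply each observable-behavior check; every result is a hard pass/fail."""
    return {name: check(output) for name, check in checks.items()}

# Illustrative checks for the "Build me a feature." case above:
FENCE = chr(96) * 3  # markdown code-fence marker
CHECKS = {
    "asks_a_question": lambda out: "?" in out,
    "offers_defaults": lambda out: "(default)" in out,
    "no_immediate_code": lambda out: FENCE not in out,
    "at_most_5_questions": lambda out: out.count("?") <= 5,
}
```

`all(run_eval(transcript, CHECKS).values())` then gives the binary verdict for the case, which keeps results comparable across benchmark runs.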

---

## Benchmark Mode

Run all evals after a model update or after editing the skill:

1. Run each eval in a fresh context (sequentially or in parallel) to avoid context bleed
2. Record: pass rate, elapsed time per eval, token usage
3. Compare to baseline before the change

**Pass rate thresholds:**

- < 60%: Skill has serious issues. Do not ship.
- 60-80%: Acceptable for early versions. Target improvement.
- > 80%: Production-ready.
- > 90%: Reliable enough for critical workflows.
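The thresholds above are simple to encode so a benchmark run can gate shipping automatically. A sketch; the verdict labels are mine:

```python
# Sketch: map a benchmark pass rate (0.0-1.0) to a shipping verdict
# following the thresholds listed above.
def verdict(pass_rate: float) -> str:
    if pass_rate < 0.60:
        return "do-not-ship"      # serious issues
    if pass_rate <= 0.80:
        return "early-version"    # acceptable, target improvement
    if pass_rate <= 0.90:
        return "production-ready"
    return "critical-ready"       # reliable enough for critical workflows
```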

---

## Description Tuning Process

If triggering is unreliable:

1. List 10 prompts that should trigger the skill (write them as a user would)
2. List 5 prompts of similar tasks that should _not_ trigger
3. Find the distinguishing words/phrases between the two lists
4. Rewrite the description to include the distinguishing words and exclude the overlap

**Pattern:**

```yaml
description: "Use when [specific verb] [specific noun/domain]: [comma-separated task keywords]. NOT for [adjacent tasks that should not trigger]."
```

---

## When to Retire a Skill

A skill is ready to retire when:

- 90%+ of its evals pass without the skill loaded (for capability uplift skills)
- The skill's instructions are now standard model behavior
- Maintenance cost exceeds value

Retiring isn't failure — it means the skill did its job and the model caught up.
@@ -0,0 +1,226 @@

# Workflow Patterns Reference

Load this when choosing or implementing a workflow pattern for a CBX agent or skill.

Source: Anthropic engineering research — [Common workflow patterns for AI agents](https://claude.com/blog/common-workflow-patterns-for-ai-agents-and-when-to-use-them) (March 2026).

---

## The Core Insight

Workflows don't replace agent autonomy — they _shape where and how_ agents apply it.

A fully autonomous agent decides everything: tools, order, when to stop.
A workflow provides structure: overall flow, checkpoints, boundaries — but each step still uses full agent reasoning.

**Start with a single agent call.** If that meets the quality bar, you're done. Only add workflow complexity when you can measure the improvement.

---

## Pattern 1: Sequential Workflow

### What it is

Agents execute in a fixed order. Each stage processes its input, makes tool calls, then passes results to the next stage.

```
Input → [Agent A] → [Agent B] → [Agent C] → Output
```

### Use when

- Steps have explicit dependencies (B needs A's output before starting)
- Multi-stage transformation where each step adds specific value
- Draft-review-polish cycles
- Data extraction → validation → loading pipelines

### Avoid when

- A single agent can handle the whole task
- Agents need to collaborate rather than hand off linearly
- You're forcing sequential structure onto a task that doesn't naturally fit it

### Cost/benefit

- **Cost:** Latency is linear — step 2 waits for step 1
- **Benefit:** Each agent focuses on one thing; accuracy often improves

### CBX implementation

```markdown
## Workflow

1. **[Agent/Step A]** — [what it receives, what it does, what it produces]
2. **[Agent/Step B]** — [takes A's output, does X, produces Y]
3. **[Agent/Step C]** — [final synthesis/delivery]

Artifacts pass via [file path / variable / structured JSON / natural handoff instructions].
```

### Pro tip

First try the pipeline as a single agent where the steps are part of the prompt. If quality is good enough, you've solved the problem without complexity.
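The sequential pattern reduces to a fold over stage prompts. A sketch with a hypothetical `run_agent(prompt) -> str` call standing in for your agent runtime:

```python
# Sketch: sequential workflow as a fold. `run_agent` is a hypothetical
# single-agent call; each stage prompt receives the previous stage's output.
from typing import Callable

def sequential(stages: list[str], task: str,
               run_agent: Callable[[str], str]) -> str:
    """Run stage prompts in order, threading each output into the next."""
    artifact = task
    for stage_prompt in stages:
        artifact = run_agent(f"{stage_prompt}\n\nInput:\n{artifact}")
    return artifact
```

Note the linear-latency cost is visible in the shape itself: nothing starts until the previous call returns.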

---

## Pattern 2: Parallel Workflow

### What it is

Multiple agents run simultaneously on independent tasks. Results are merged or synthesized afterward.

```
        ┌→ [Agent A] →┐
Input → ├→ [Agent B] →├→ Synthesize → Output
        └→ [Agent C] →┘
```

### Use when

- Tasks are genuinely independent (no agent needs another's output to start)
- Speed matters and concurrent execution helps
- Multiple perspectives on the same input (e.g., code review from security + performance + quality)
- Separation of concerns — different engineers can own individual agents

### Avoid when

- Agents need cumulative context or must build on each other's work
- Resource constraints (API quotas) make concurrent calls inefficient
- Aggregation logic is unclear or produces contradictory results with no resolution strategy

### Cost/benefit

- **Cost:** Tokens multiply (N agents × tokens each); requires aggregation strategy
- **Benefit:** Faster completion; clean separation of concerns

### CBX implementation

```markdown
## Parallel Steps

Run these simultaneously:

- **[Agent A]** — [focused task, specific scope]
- **[Agent B]** — [focused task, different scope]
- **[Agent C]** — [focused task, different scope]

## Synthesis

After all agents complete:
[How to merge: majority vote / highest confidence / specialized agent defers to other / human review]
```

### Pro tip

Design your aggregation strategy _before_ implementing parallel agents. Without a clear merge plan, you collect conflicting outputs with no way to resolve them.
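The fan-out/fan-in shape can be sketched with `concurrent.futures`; `run_agent` and `synthesize` are hypothetical stand-ins for your runtime and your (pre-designed) merge strategy:

```python
# Sketch: parallel workflow. Each focused agent prompt runs concurrently on the
# same input; a synthesis callable merges the results. Decide the merge first.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def parallel(agent_prompts: dict[str, str], task: str,
             run_agent: Callable[[str], str],
             synthesize: Callable[[dict[str, str]], str]) -> str:
    with ThreadPoolExecutor() as pool:
        # Submit all agents before collecting any result, so they run together.
        futures = {name: pool.submit(run_agent, f"{prompt}\n\nInput:\n{task}")
                   for name, prompt in agent_prompts.items()}
        results = {name: f.result() for name, f in futures.items()}
    return synthesize(results)
```

Passing `synthesize` as a required argument enforces the pro tip: you cannot run the fan-out without having written the merge.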

---

## Pattern 3: Evaluator-Optimizer Workflow

### What it is

Two agents loop: one generates content, another evaluates it against criteria, and the generator refines based on feedback. Repeat until the quality threshold is met or max iterations are reached.

```
          ┌─────────────────────────────────────┐
          ↓                                     |
Input → [Generator] → Draft → [Evaluator] → Pass? → Output
                                    ↓ Fail
                        Feedback → [Generator]
```

### Use when

- First-draft quality consistently falls short of the required bar
- You have clear, measurable quality criteria an AI evaluator can apply consistently
- The gap between first-attempt and final quality justifies extra tokens and latency
- Examples: technical docs, customer communications, code against specific standards

### Avoid when

- First-attempt quality already meets requirements (unnecessary cost)
- Real-time applications needing immediate responses
- Evaluation criteria are too subjective for consistent AI evaluation
- Deterministic tools exist (linters for style, validators for schemas) — use those instead

### Cost/benefit

- **Cost:** Tokens × iterations; adds latency proportionally
- **Benefit:** Structured feedback loops produce measurably better outputs

### CBX implementation

```markdown
## Generator Prompt

Task: [what to create]
Constraints: [specific, measurable requirements]
Format: [exact output format]

## Evaluator Prompt

Review this output against these criteria:

1. [Criterion A] — Pass/Fail + specific failure note
2. [Criterion B] — Pass/Fail + specific failure note
3. [Criterion C] — Pass/Fail + specific failure note

Output JSON: { "pass": bool, "failures": ["..."], "revision_note": "..." }

## Loop Control

- Max iterations: [3-5]
- Stop when: all criteria pass OR max iterations reached
- On max with failures: surface remaining issues for human review
```

### Pro tip

Set stopping criteria _before_ iterating. Define max iterations and specific quality thresholds. Without guardrails, you enter expensive loops where the evaluator keeps finding minor issues while quality plateaus well before you stop.
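The loop control above can be sketched directly. `generate` and `evaluate` are hypothetical callables for the two agents, with `evaluate` returning the JSON shape from the evaluator prompt:

```python
# Sketch: evaluator-optimizer loop with hard stop criteria.
# `generate(task, feedback)` and `evaluate(draft)` are hypothetical agent calls;
# `evaluate` returns {"pass": bool, "failures": [...], "revision_note": "..."}.
from typing import Callable, Optional

def refine(task: str,
           generate: Callable[[str, Optional[str]], str],
           evaluate: Callable[[str], dict],
           max_iterations: int = 3) -> tuple[str, dict]:
    feedback = None
    for _ in range(max_iterations):
        draft = generate(task, feedback)
        report = evaluate(draft)
        if report["pass"]:
            return draft, report
        feedback = report["revision_note"]
    # Max iterations reached with failures: return the last draft and report
    # so remaining issues can be surfaced for human review.
    return draft, report
```

The `max_iterations` bound is the guardrail the pro tip asks for: the loop cannot run forever, and a failing final report is returned rather than silently discarded.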

---

## Decision Tree

```
Can a single agent handle this task effectively?
→ YES: Don't use workflows. Use a rich single-agent prompt.
→ NO: Continue...

Do steps have dependencies (B needs A's output)?
→ YES: Use Sequential
→ NO: Continue...

Can steps run independently, and would concurrency help?
→ YES: Use Parallel
→ NO: Continue...

Does quality improve meaningfully through iteration, and can you measure it?
→ YES: Use Evaluator-Optimizer
→ NO: Re-examine whether workflows help at all
```

---

## Combining Patterns

Patterns are building blocks, not mutually exclusive:

- A **sequential workflow** can include **parallel** steps at certain stages (e.g., three parallel reviewers before a final synthesis step)
- An **evaluator-optimizer** can use **parallel evaluation** where multiple evaluators assess different quality dimensions simultaneously
- A **sequential chain** can use **evaluator-optimizer** at the critical high-quality step

Only add the combination when each additional pattern measurably improves outcomes.

---

## Pattern Comparison

|                | Sequential                                   | Parallel                                | Evaluator-Optimizer                  |
| -------------- | -------------------------------------------- | --------------------------------------- | ------------------------------------ |
| **When**       | Dependencies between steps                   | Independent tasks                       | Quality below bar                    |
| **Examples**   | Extract → validate → load; Draft → translate | Code review (security + perf + quality) | Technical docs, comms, SQL           |
| **Latency**    | Linear (each waits for previous)             | Fast (concurrent)                       | Multiplied by iterations             |
| **Token cost** | Linear                                       | Multiplicative                          | Linear × iterations                  |
| **Key risk**   | Bottleneck at slow steps                     | Aggregation conflicts                   | Infinite loops without stop criteria |
```diff
@@ -1,10 +1,10 @@
 ---
 name: deep-research
-description: "Use when a task needs multi-round research rather than a quick lookup: iterative search, gap finding, corroboration across sources, contradiction handling,
+description: "Use when a task needs multi-round research rather than a quick lookup: iterative search, gap finding, corroboration across sources, contradiction handling, evidence-led synthesis before planning or implementation. Also use when the user asks for 'deep research', 'latest info', or 'how does X compare to Y publicly'."
 license: MIT
 metadata:
   author: cubis-foundry
-  version: "1.
+  version: "1.1"
 compatibility: Claude Code, Codex, GitHub Copilot
 ---
 
@@ -14,23 +14,25 @@ compatibility: Claude Code, Codex, GitHub Copilot
 
 You are the specialist for iterative evidence gathering and synthesis.
 
-Your job is to find what is missing, not just summarize the first page of results.
+Your job is to find what is missing, not just summarize the first page of results. Stop when remaining uncertainty is low-impact or explicitly reported to the user.
 
 ## When to Use
 
-- The task needs deep web or repo research before planning or implementation
-- The first-pass answer is incomplete, contradictory, or likely stale
-- The user explicitly asks for research, latest information, or public-repo comparison
+- The task needs deep web or repo research before planning or implementation
+- The first-pass answer is incomplete, contradictory, or likely stale
+- The user explicitly asks for research, latest information, or public-repo comparison
+- Claims are contested or the topic changes fast (AI tooling, frameworks, protocols)
 
 ## Instructions
 
 ### STANDARD OPERATING PROCEDURE (SOP)
 
-1. Define the question and what would count as enough evidence.
-2. Run a first pass and identify gaps
+1. Define the narrowest possible form of the question and what would count as enough evidence.
+2. Run a first pass and identify gaps, contradictions, and missing facts.
 3. Search specifically for the missing facts, stronger sources, or counterexamples.
-4. Rank sources by directness, recency, and authority.
-5. Separate sourced facts
+4. Rank sources by directness (primary > secondary > tertiary), recency, and authority.
+5. Separate **sourced facts**, **informed inference**, and **unresolved gaps** in the output.
+6. Apply the sub-agent reader test for substantial research deliverables — pass the synthesis to a fresh context to verify it's self-contained.
 
 ### Constraints
 
@@ -41,19 +43,22 @@ Your job is to find what is missing, not just summarize the first page of result
 
 ## Output Format
 
-
+Structure clearly as:
 
-
-
-
-
-| `references/multi-round-research-loop.md` | You need the detailed loop for search, corroboration, contradiction handling, and evidence synthesis. |
+- **Key findings** — the answer, directly stated
+- **Evidence** — sourced facts with citations ranked by confidence
+- **Inference** — what follows logically from the evidence (labeled as inference)
+- **Open questions** — what remains unresolved and why it matters
 
-##
+## References
 
-
+| File                                      | Load when                                                                                                                                                  |
+| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `references/multi-round-research-loop.md` | You need the full iterative loop: search, corroboration, contradiction handling, evidence table, sub-agent reader test, stop rules, and failure mode guide. |
 
 ## Examples
 
-- "
-- "
+- "Research how Anthropic structures their agent skills — compare to what CBX does"
+- "What's the latest on evaluator-optimizer patterns in production agent systems?"
+- "Deep research on OKLCH vs HSL for design systems — what do practitioners actually use?"
+- "Find counterexamples to the claim that parallel agents always improve speed"
```