@cubis/foundry 0.3.70 → 0.3.71

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35)
  1. package/package.json +1 -1
  2. package/workflows/powers/ask-questions-if-underspecified/SKILL.md +51 -3
  3. package/workflows/powers/behavioral-modes/SKILL.md +100 -9
  4. package/workflows/skills/agent-design/SKILL.md +198 -0
  5. package/workflows/skills/agent-design/references/clarification-patterns.md +153 -0
  6. package/workflows/skills/agent-design/references/skill-testing.md +164 -0
  7. package/workflows/skills/agent-design/references/workflow-patterns.md +226 -0
  8. package/workflows/skills/deep-research/SKILL.md +25 -20
  9. package/workflows/skills/deep-research/references/multi-round-research-loop.md +73 -8
  10. package/workflows/skills/frontend-design/SKILL.md +37 -32
  11. package/workflows/skills/frontend-design/commands/brand.md +167 -0
  12. package/workflows/skills/frontend-design/references/brand-presets.md +228 -0
  13. package/workflows/skills/generated/skill-audit.json +11 -2
  14. package/workflows/skills/generated/skill-catalog.json +37 -5
  15. package/workflows/skills/skills_index.json +1 -1
  16. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/SKILL.md +198 -0
  17. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/clarification-patterns.md +153 -0
  18. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/skill-testing.md +164 -0
  19. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/agent-design/references/workflow-patterns.md +226 -0
  20. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/deep-research/SKILL.md +25 -20
  21. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/deep-research/references/multi-round-research-loop.md +73 -8
  22. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/SKILL.md +37 -32
  23. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/commands/brand.md +167 -0
  24. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/frontend-design/references/brand-presets.md +228 -0
  25. package/workflows/workflows/agent-environment-setup/platforms/claude/skills/skills_index.json +1 -1
  26. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/SKILL.md +197 -0
  27. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/clarification-patterns.md +153 -0
  28. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/skill-testing.md +164 -0
  29. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/agent-design/references/workflow-patterns.md +226 -0
  30. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/deep-research/SKILL.md +25 -20
  31. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/deep-research/references/multi-round-research-loop.md +73 -8
  32. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/SKILL.md +37 -32
  33. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/commands/brand.md +167 -0
  34. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/frontend-design/references/brand-presets.md +228 -0
  35. package/workflows/workflows/agent-environment-setup/platforms/copilot/skills/skills_index.json +1 -1
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@cubis/foundry",
- "version": "0.3.70",
+ "version": "0.3.71",
  "description": "Cubis Foundry CLI for workflow-first AI agent environments",
  "type": "module",
  "bin": {
@@ -1,17 +1,27 @@
  ---
  name: ask-questions-if-underspecified
- description: Clarify requirements before implementing. Use when serious doubts arise.
+ description: Clarify requirements before implementing. Use when serious doubts arise about objective, scope, constraints, environment, or safety — or when the task is substantial enough that being wrong wastes significant effort.
  ---

  # Ask Questions If Underspecified

  ## When to Use

- Use this skill when a request has multiple plausible interpretations or key details (objective, scope, constraints, environment, or safety) are unclear.
+ Use this skill when a request has multiple plausible interpretations or key details (objective, scope, constraints, environment, or safety) are unclear — **and** when the cost of implementing the wrong interpretation is significant.
+
+ Three situations require clarification:
+
+ 1. **High branching** — Multiple plausible interpretations produce significantly different implementations
+ 2. **Substantial deliverable** — The task is large enough that wrong assumptions waste real time
+ 3. **Safety-critical** — The action is hard to reverse (data migrations, deployments, file deletions)

  ## When NOT to Use

- Do not use this skill when the request is already clear, or when a quick, low-risk discovery read can answer the missing details.
+ Do not use this skill when:
+
+ - The request is already clear and one interpretation is obviously correct
+ - A quick discovery read (config files, existing patterns, repo structure) can answer the missing details faster than asking
+ - The task is small enough that being slightly wrong is cheap and correctable

  ## Goal

@@ -22,6 +32,7 @@ Ask the minimum set of clarifying questions needed to avoid wrong work; do not s
  ### 1) Decide whether the request is underspecified

  Treat a request as underspecified if after exploring how to perform the work, some or all of the following are not clear:
+
  - Define the objective (what should change vs stay the same)
  - Define "done" (acceptance criteria, examples, edge cases)
  - Define scope (which files/components/users are in/out)
@@ -36,6 +47,7 @@ If multiple plausible interpretations exist, assume it is underspecified.
  Ask 1-5 questions in the first pass. Prefer questions that eliminate whole branches of work.

  Make questions easy to answer:
+
  - Optimize for scannability (short, numbered questions; avoid paragraphs)
  - Offer multiple-choice options when possible
  - Suggest reasonable defaults when appropriate (mark them clearly as the default/recommended choice; bold the recommended choice in the list, or if you present options in a code block, put a bold "Recommended" line immediately above the block and also tag defaults inside the block)
@@ -47,10 +59,12 @@ Make questions easy to answer:
  ### 3) Pause before acting

  Until must-have answers arrive:
+
  - Do not run commands, edit files, or produce a detailed plan that depends on unknowns
  - Do perform a clearly labeled, low-risk discovery step only if it does not commit you to a direction (e.g., inspect repo structure, read relevant config files)

  If the user explicitly asks you to proceed without answers:
+
  - State your assumptions as a short numbered list
  - Ask for confirmation; proceed only after they confirm or correct them

@@ -83,3 +97,37 @@ Reply with: defaults (or 1a 2a)

  - Don't ask questions you can answer with a quick, low-risk discovery read (e.g., configs, existing patterns, docs).
  - Don't ask open-ended questions if a tight multiple-choice or yes/no would eliminate ambiguity faster.
+ - Don't ask more than 5 questions at once — rank by impact and ask the top ones.
+ - Don't skip the fast-path — every clarification block needs a `defaults` shortcut.
+ - Don't forget to restate your interpretation before proceeding — it confirms you heard correctly.
+ - Don't ask about reversible decisions — pick one, proceed, let them correct if wrong.
+
+ ## Three-Stage Pattern (for complex or substantial tasks)
+
+ For tasks where wrong assumptions would waste significant effort — documents, architecture decisions, multi-file features — use a three-stage approach:
+
+ ### Stage 1: Meta-context questions (3-5 questions)
+
+ Ask about the big picture before touching content:
+
+ - What _type_ of deliverable is this? (spec, code, doc, design, plan)
+ - Who's the audience/consumer?
+ - What does "done" look like?
+ - Existing template, format, or precedent to follow?
+ - Hard constraints (framework, performance, compatibility)?
+
+ ### Stage 2: Info dump + targeted follow-up
+
+ After Stage 1 answers: invite the user to brain-dump everything relevant.
+
+ > "Dump everything you know — background, prior decisions, constraints, opinions, blockers. Don't organize it. Just get it all out."
+
+ Then ask 5-10 targeted follow-up questions based on gaps. Users can answer in shorthand (`1: yes, 2: see above, 3: no`).
+
+ **Exit Stage 2 when:** You understand objective, constraints, and at least one clear definition of success.
+
+ ### Stage 3: Confirm interpretation, then proceed
+
+ Restate in 1-3 sentences before starting:
+
+ > "Here's what I understand: [objective]. [Key constraint]. [What done looks like]. Starting now — correct me if anything's off."
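The `defaults` fast-path and the shorthand replies this skill describes ("defaults", "1a 2a") imply a small parsing contract. A minimal sketch in Python, assuming a hypothetical list-of-dicts question structure (not part of the package):

```python
# Sketch: resolve a user's shorthand reply against a numbered clarification
# block. The question/option structure here is illustrative only.
def resolve_reply(reply, questions):
    """questions: list of dicts like
    {"options": {"a": "REST", "b": "GraphQL"}, "default": "a"}.
    Returns one chosen option value per question."""
    reply = reply.strip().lower()
    if reply == "defaults":
        # Fast-path: accept every marked default in one word.
        return [q["options"][q["default"]] for q in questions]
    chosen = {}
    for token in reply.split():          # tokens like "1a", "2b"
        number, letter = int(token[:-1]), token[-1]
        chosen[number] = letter
    # Unanswered questions fall back to their default.
    return [
        q["options"][chosen.get(i, q["default"])]
        for i, q in enumerate(questions, start=1)
    ]

questions = [
    {"options": {"a": "REST", "b": "GraphQL"}, "default": "a"},
    {"options": {"a": "SQLite", "b": "Postgres"}, "default": "b"},
]
print(resolve_reply("defaults", questions))  # both defaults chosen
print(resolve_reply("1a 2a", questions))     # explicit picks
```

The point of the contract: a one-word reply must be enough to accept every recommended default, and partial answers should fall back to defaults rather than stall.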
@@ -7,6 +7,7 @@ allowed-tools: Read, Glob, Grep
  # Behavioral Modes - Adaptive AI Operating Modes

  ## Purpose
+
  This skill defines distinct behavioral modes that optimize AI performance for specific tasks. Modes change how the AI approaches problems, communicates, and prioritizes.

  ---
@@ -18,6 +19,7 @@ This skill defines distinct behavioral modes that optimize AI performance for sp
  **When to use:** Early project planning, feature ideation, architecture decisions

  **Behavior:**
+
  - Ask clarifying questions before assumptions
  - Offer multiple alternatives (at least 3)
  - Think divergently - explore unconventional solutions
@@ -25,6 +27,7 @@ This skill defines distinct behavioral modes that optimize AI performance for sp
  - Use visual diagrams (mermaid) to explain concepts

  **Output style:**
+
  ```
  "Let's explore this together. Here are some approaches:

@@ -46,6 +49,7 @@ What resonates with you? Or should we explore a different direction?"
  **When to use:** Writing code, building features, executing plans

  **Behavior:**
+
  - **CRITICAL: Use `clean-code` skill standards** - concise, direct, no verbose explanations
  - Fast execution - minimize questions
  - Use established patterns and best practices
@@ -57,6 +61,7 @@ What resonates with you? Or should we explore a different direction?"
  - **NO RUSHING** - Quality > Speed. Read ALL references before coding.

  **Output style:**
+
  ```
  [Code block]

@@ -64,6 +69,7 @@ What resonates with you? Or should we explore a different direction?"
  ```

  **NOT:**
+
  ```
  "Building [feature]...

@@ -83,6 +89,7 @@ Run `npm run dev` to test."
  **When to use:** Fixing bugs, troubleshooting errors, investigating issues

  **Behavior:**
+
  - Ask for error messages and reproduction steps
  - Think systematically - check logs, trace data flow
  - Form hypothesis → test → verify
@@ -90,6 +97,7 @@ Run `npm run dev` to test."
  - Prevent future occurrences

  **Output style:**
+
  ```
  "Investigating...

@@ -106,6 +114,7 @@ Run `npm run dev` to test."
  **When to use:** Code review, architecture review, security audit

  **Behavior:**
+
  - Be thorough but constructive
  - Categorize by severity (Critical/High/Medium/Low)
  - Explain the "why" behind suggestions
@@ -113,6 +122,7 @@ Run `npm run dev` to test."
  - Acknowledge what's done well

  **Output style:**
+
  ```
  ## Code Review: [file/feature]

@@ -133,6 +143,7 @@ Run `npm run dev` to test."
  **When to use:** Explaining concepts, documentation, onboarding

  **Behavior:**
+
  - Explain from fundamentals
  - Use analogies and examples
  - Progress from simple to complex
@@ -140,6 +151,7 @@ Run `npm run dev` to test."
  - Check understanding

  **Output style:**
+
  ```
  ## Understanding [Concept]

@@ -163,6 +175,7 @@ Run `npm run dev` to test."
  **When to use:** Production deployment, final polish, release preparation

  **Behavior:**
+
  - Focus on stability over features
  - Check for missing error handling
  - Verify environment configs
@@ -170,6 +183,7 @@ Run `npm run dev` to test."
  - Create deployment checklist

  **Output style:**
+
  ```
  ## Pre-Ship Checklist

@@ -195,35 +209,111 @@ Run `npm run dev` to test."

  The AI should automatically detect the appropriate mode based on:

- | Trigger | Mode |
- |---------|------|
- | "what if", "ideas", "options" | BRAINSTORM |
- | "build", "create", "add" | IMPLEMENT |
- | "not working", "error", "bug" | DEBUG |
- | "review", "check", "audit" | REVIEW |
- | "explain", "how does", "learn" | TEACH |
- | "deploy", "release", "production" | SHIP |
+ | Trigger | Mode |
+ | ---------------------------------------------- | ------------------- |
+ | "what if", "ideas", "options" | BRAINSTORM |
+ | "build", "create", "add" | IMPLEMENT |
+ | "not working", "error", "bug" | DEBUG |
+ | "review", "check", "audit" | REVIEW |
+ | "explain", "how does", "learn" | TEACH |
+ | "deploy", "release", "production" | SHIP |
+ | "iterate", "refine quality", "not good enough" | EVALUATOR-OPTIMIZER |
+
+ ---
+
+ ## Workflow Patterns
+
+ Three patterns govern how modes combine across multiple agents or steps. Use the simplest pattern that solves the problem — add complexity only when it measurably improves results.
+
+ ### 1. Sequential (default)
+
+ Use when tasks have dependencies — each step needs the previous step's output.
+
+ ```
+ [BRAINSTORM] → [IMPLEMENT] → [REVIEW] → [SHIP]
+ ```
+
+ Best for: multi-stage features, draft-review-polish cycles, data pipelines.
+
+ ### 2. Parallel
+
+ Use when tasks are independent and doing them one at a time is too slow.
+
+ ```
+ [security REVIEW + performance REVIEW + quality REVIEW] → synthesize
+ ```
+
+ Best for: code review across multiple dimensions, parallel analysis. Requires a clear aggregation strategy before starting.
+
+ ### 3. Evaluator-Optimizer (new)
+
+ Use when first-draft quality consistently falls short and quality is measurable.
+
+ ```
+ [IMPLEMENT] → [REVIEW with criteria] → pass? → done
+ ↓ fail
+ feedback → [IMPLEMENT again]
+ ```
+
+ **When to use:**
+
+ - Technical docs, customer communications, SQL queries against specific standards
+ - Any output where the gap between first attempt and required quality is significant
+ - When you have clear, checkable criteria (not just "make it better")
+
+ **When NOT to use:**
+
+ - First-attempt quality is already acceptable
+ - Criteria are too subjective for consistent AI evaluation
+ - Real-time use cases needing immediate responses
+ - Deterministic validators exist (linters, schema validators) — use those instead
+
+ **Implementation:**
+
+ ```
+ ## Generator
+ Task: [what to create]
+ Constraints: [specific, measurable requirements — these become eval criteria]
+
+ ## Evaluator
+ Criteria:
+ 1. [Criterion A] — Pass/Fail + specific failure note
+ 2. [Criterion B] — Pass/Fail + specific failure note
+
+ Output JSON: { "pass": bool, "failures": ["..."], "revision_note": "..." }
+
+ Max iterations: 3 ← always set a ceiling
+ Stop when: all criteria pass OR max iterations reached
+ ```
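The Generator/Evaluator skeleton in that Implementation block can be rendered as a driver loop. A minimal sketch, where `generate` and `evaluate` are hypothetical stand-ins for model calls and the JSON shape mirrors the template:

```python
import json

MAX_ITERATIONS = 3  # always set a ceiling

def run_loop(task, criteria, generate, evaluate):
    """generate(task, feedback) -> draft; evaluate(draft, criteria) -> JSON string
    shaped like {"pass": bool, "failures": [...], "revision_note": "..."}."""
    feedback = None
    for _ in range(MAX_ITERATIONS):
        draft = generate(task, feedback)
        verdict = json.loads(evaluate(draft, criteria))
        if verdict["pass"]:
            return draft                     # all criteria passed
        feedback = verdict["revision_note"]  # feed failures back to the generator
    return draft                             # ceiling reached: return best effort

# Demo with deterministic stand-ins: the "model" succeeds on the second try.
drafts = iter(["draft v1", "draft v2"])
gen = lambda task, fb: next(drafts)
ev = lambda d, c: json.dumps(
    {"pass": d.endswith("v2"), "failures": [], "revision_note": "tighten"})
result = run_loop("write summary", ["ends with v2"], gen, ev)
```

Note the two stop conditions from the template are both present: return on pass, and return best-effort when the iteration ceiling is hit, so the loop can never spin forever.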

  ---

- ## Multi-Agent Collaboration Patterns (2025)
+ ## Multi-Agent Collaboration Patterns

  Modern architectures optimized for agent-to-agent collaboration:

  ### 1. 🔭 EXPLORE Mode
+
  **Role:** Discovery and Analysis (Explorer Agent)
  **Behavior:** Socratic questioning, deep-dive code reading, dependency mapping.
  **Output:** `discovery-report.json`, architectural visualization.

  ### 2. 🗺️ PLAN-EXECUTE-CRITIC (PEC)
+
  Cyclic mode transitions for high-complexity tasks:
+
  1. **Planner:** Decomposes the task into atomic steps (`task.md`).
  2. **Executor:** Performs the actual coding (`IMPLEMENT`).
  3. **Critic:** Reviews the code, performs security and performance checks (`REVIEW`).

  ### 3. 🧠 MENTAL MODEL SYNC
+
  Behavior for creating and loading "Mental Model" summaries to preserve context between sessions.

+ ### 4. 🔄 EVALUATOR-OPTIMIZER
+
+ Paired agents in an iterative quality loop: Generator produces, Evaluator scores against criteria, Generator refines. Set a max-iteration ceiling before starting.
+
  ---

  ## Combining Modes
@@ -239,4 +329,5 @@ Users can explicitly request a mode:
  /implement the user profile page
  /debug why login fails
  /review this pull request
+ /iterate [target quality bar] ← triggers evaluator-optimizer
  ```
@@ -0,0 +1,198 @@
+ ---
+ name: agent-design
+ description: "Use when designing, building, or improving a CBX agent, skill, or workflow: clarification strategy, progressive disclosure structure, workflow pattern selection (sequential, parallel, evaluator-optimizer), skill type taxonomy, description tuning, and eval-first testing."
+ license: MIT
+ metadata:
+ author: cubis-foundry
+ version: "1.0"
+ compatibility: Claude Code, Codex, GitHub Copilot, Gemini CLI
+ ---
+
+ # Agent Design
+
+ ## Purpose
+
+ You are the specialist for designing CBX agents and skills that behave intelligently — asking the right questions, knowing when to pause, executing in the right workflow pattern, and testing their own output.
+
+ Your job is to close the gap between "it kinda works" and "it works reliably under any input."
+
+ ## When to Use
+
+ - Designing or refactoring a SKILL.md or POWER.md
+ - Choosing between sequential, parallel, or evaluator-optimizer workflow
+ - Writing clarification logic for an agent that handles ambiguous requests
+ - Deciding whether a task needs a skill or just a prompt
+ - Testing whether a skill actually works as intended
+ - Writing descriptions that trigger the right skill at the right time
+
+ ## Core Principles
+
+ These come directly from Anthropic's agent engineering research (["Equipping agents for the real world"](https://claude.com/blog/equipping-agents-for-the-real-world-with-agent-skills), March 2026):
+
+ 1. **Progressive disclosure** — A skill's SKILL.md provides just enough context to know when to load it. Full instructions, references, and scripts are loaded lazily, only when needed. More context in a single file does not equal better behavior — it usually hurts it.
+
+ 2. **Eval before optimizing** — Define what "good looks like" (test cases + success criteria) before editing the skill. This prevents regression and tells you when improvement actually happened.
+
+ 3. **Description precision** — The `description` field in YAML frontmatter controls triggering. Too broad = false positives. Too narrow = the skill never fires. Tune it like a search query.
+
+ 4. **Two skill types** — See [Skill Type Taxonomy](#skill-type-taxonomy). These need different testing strategies and have different shelf lives.
+
+ 5. **Start with a single agent** — Before adding workflow complexity, first try a single agent with a rich prompt. Only add orchestration when it measurably improves results.
+
+ ## Skill Type Taxonomy
+
+ | Type | What it does | Testing goal | Shelf life |
+ | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------- | ------------------------------------------------------- |
+ | **Capability uplift** | Teaches Claude to do something it can't do alone (e.g. manipulate PDFs, fill forms, use a domain-specific API) | Verify the output is correct and consistent | Medium — may become obsolete as models improve |
+ | **Encoded preference** | Sequences steps Claude could do individually, but in your team's specific order and style (e.g. NDA review checklist, weekly update format) | Verify fidelity to the actual workflow | High — these stay useful because they're uniquely yours |
+
+ Design question: "Is this skill teaching Claude something new, or encoding how we do things?"
+
+ ## Clarification Strategy
+
+ An agent that starts wrong wastes everyone's time. Smart agents pause at the right moments.
+
+ Load `references/clarification-patterns.md` when:
+
+ - Designing how a skill should handle ambiguous or underspecified inputs
+ - Writing the early steps of a workflow where user intent matters
+ - Deciding what questions to ask vs. what to infer
+
+ ## Workflow Pattern Selection
+
+ Three patterns cover 95% of production agent workflows:
+
+ | Pattern | Use when | Cost | Benefit |
+ | ----------------------- | --------------------------------------------------------------- | ----------------------- | ----------------------------------------- |
+ | **Sequential** | Steps have dependencies (B needs A's output) | Latency (linear) | Focus: each step does one thing well |
+ | **Parallel** | Steps are independent and concurrency helps | Tokens (multiplicative) | Speed + separation of concerns |
+ | **Evaluator-optimizer** | First-draft quality isn't good enough and quality is measurable | Tokens × iterations | Better output through structured feedback |
+
+ Default to sequential. Add parallel when latency is the bottleneck and tasks are genuinely independent. Add evaluator-optimizer only when you can measure the improvement.
+
+ Load `references/workflow-patterns.md` for the full decision tree, examples, and anti-patterns.
+
+ ## Progressive Disclosure Structure
+
+ A well-structured CBX skill looks like:
+
+ ```
+ skill-name/
+ SKILL.md ← lean entry: name, description, purpose, when-to-use, load-table
+ references/ ← detailed guides loaded lazily when step requires it
+ topic-a.md
+ topic-b.md
+ commands/ ← slash commands (optional)
+ command.md
+ scripts/ ← executable code (optional)
+ helper.py
+ ```
+
+ **SKILL.md should be loadable in <2000 tokens.** Everything else lives in references.
+
+ The metadata table pattern that works:
+
+ ```markdown
+ ## References
+
+ | File | Load when |
+ | ----------------------- | ------------------------------------------ |
+ | `references/topic-a.md` | Task involves [specific trigger condition] |
+ | `references/topic-b.md` | Task involves [specific trigger condition] |
+ ```
+
+ This lets the agent make intelligent decisions about what context to load rather than ingesting everything upfront.
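The "<2000 tokens" budget for SKILL.md can be enforced with a cheap pre-commit style check. A minimal sketch, assuming the common ~4-characters-per-token heuristic for English prose (an approximation, not an exact tokenizer):

```python
# Rough budget check for a SKILL.md entry file.
# ~4 chars/token is a heuristic for English prose, not an exact count.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def within_budget(skill_md_text: str, budget: int = 2000) -> bool:
    return estimate_tokens(skill_md_text) <= budget

lean = "# My Skill\n\nUse when ...\n" * 10   # small entry file, well under budget
bloated = "x" * 20000                        # ~5000 "tokens" of inlined reference docs
```

A SKILL.md that fails this check is usually a candidate for splitting: keep the when-to-use and load-table in the entry file, and push everything else into `references/`.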
+
+ ## Description Writing
+
+ The `description` field is a trigger — write it like a search query, not marketing copy.
+
+ **Good description:**
+
+ ```yaml
+ description: "Use when evaluating an agent, skill, workflow, or MCP server: rubric design, evaluator-optimizer loops, LLM-as-judge patterns, regression suites, or prototype-vs-production quality gaps."
+ ```
+
+ **Bad description:**
+
+ ```yaml
+ description: "A comprehensive skill for evaluating things and making sure they work well."
+ ```
+
+ Rules:
+
+ - Lead with the specific trigger verb: "Use when [user does X]"
+ - List the specific task types with commas — these act like search keywords
+ - Include domain-specific nouns the user would actually type
+ - Avoid generic adjectives ("comprehensive", "powerful", "advanced")
+
+ Test your description: would a user's natural-language request match the intent of these words?
+
+ ## Testing a Skill
+
+ Before shipping, verify with this checklist:
+
+ 1. **Positive trigger** — Does the skill load when it should? Test 5 natural phrasings of the target task.
+ 2. **Negative trigger** — Does it stay quiet when it shouldn't load? Test 5 near-miss phrasings.
+ 3. **Happy path** — Does the skill complete the standard task correctly?
+ 4. **Edge cases** — What happens with missing input, ambiguous phrasing, or edge-case content?
+ 5. **Reader test** — Run the delivery (e.g., a generated doc, a plan) through a fresh sub-agent with no context. Can it answer questions about the output correctly?
+
+ For formal regression suites, load `references/skill-testing.md`.
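The positive/negative trigger checks can be smoke-tested offline. Real triggering is decided by the model reading the frontmatter, so the keyword-overlap check below is only a cheap proxy, and the description and phrasings are illustrative, not from the package:

```python
# Offline proxy for trigger testing: does a phrasing share any keywords with
# the description? The actual trigger decision is made by the model, so treat
# a failure here as a hint to revisit the description's nouns, not as ground truth.
def overlaps(description: str, phrasing: str, min_hits: int = 1) -> bool:
    desc_words = {w.strip('.,:"').lower() for w in description.split()}
    hits = sum(1 for w in phrasing.lower().split() if w.strip('.,?') in desc_words)
    return hits >= min_hits

description = ("Use when designing, building, or improving an agent skill: "
               "clarification strategy, workflow pattern selection, description tuning.")
positive = ["help me improve this skill", "which workflow pattern should I use"]
negative = ["what's the weather today", "translate this sentence to French"]
```

Running the five positive and five near-miss phrasings from the checklist through a harness like this catches descriptions that drifted toward generic adjectives with no searchable nouns.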
+
+ ## Instructions
+
+ ### Step 1 — Understand the design task
+
+ Before touching any file, clarify:
+
+ - Is this a new skill or improving an existing one?
+ - Is it capability uplift or encoded preference?
+ - What's the specific failure mode being fixed?
+ - What would passing look like?
+
+ If any of these are unclear, apply the clarification pattern from `references/clarification-patterns.md`.
+
+ ### Step 2 — Choose the structure
+
+ - If the skill is simple (single task, single purpose): lean SKILL.md with no references
+ - If the skill is complex (multiple phases, conditional logic): SKILL.md + references loaded lazily
+ - If the skill has reusable commands: add `commands/` directory
+
+ ### Step 3 — Design the workflow
+
+ Use the pattern selection table above. Start with sequential. Prove you need complexity before adding it.
+
+ ### Step 4 — Write the description
+
+ Write it last. Once you know what the skill does and how it differs from adjacent skills, the right description is usually obvious.
+
+ ### Step 5 — Define a test
+
+ Write at least 3 test cases (input → expected output or behavior) before considering the skill done. These become the regression suite.
+
+ ## Output Format
+
+ Deliver:
+
+ 1. **Skill structure** — directory layout, file list
+ 2. **SKILL.md** — production-ready with lean body and reference table
+ 3. **Reference files** — if needed, each scoped to a specific phase or topic
+ 4. **Test cases** — 3-5 natural language inputs with expected behaviors
+ 5. **Description** — the final `description` field, tuned for triggering
+
+ ## References
+
+ | File | Load when |
+ | -------------------------------------- | ------------------------------------------------------------------------------ |
+ | `references/clarification-patterns.md` | Designing how the agent handles ambiguous or underspecified input |
+ | `references/workflow-patterns.md` | Choosing or implementing sequential, parallel, or evaluator-optimizer workflow |
+ | `references/skill-testing.md` | Writing evals, regression sets, or triggering tests for a skill |
+
+ ## Examples
+
+ - "Design a skill for our NDA review process — it should follow our checklist exactly."
+ - "The feature-forge skill triggers on the wrong prompts. Help me fix the description."
+ - "How do I test whether my skill still works after a model update?"
+ - "I need a workflow where 3 agents review code in parallel then one synthesizes findings."
+ - "This skill's SKILL.md is 4000 tokens. Help me split it into lean structure with references."