npm - guild-agents - Versions diffs - 1.4.0 → 2.0.0 - Mend

guild-agents 1.4.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (58)

package/README.md +69 -68
package/bin/guild.js +4 -85
package/package.json +2 -2
package/src/commands/doctor.js +11 -33
package/src/commands/init.js +1 -1
package/src/templates/agents/advisor.md +0 -1
package/src/templates/agents/developer.md +2 -2
package/src/templates/agents/qa.md +1 -1
package/src/templates/agents/tech-lead.md +2 -2
package/src/templates/skills/build-feature/SKILL.md +59 -117
package/src/templates/skills/build-feature/evals/evals.json +3 -4
package/src/templates/skills/council/SKILL.md +6 -16
package/src/templates/skills/council/evals/evals.json +3 -13
package/src/templates/skills/create-pr/SKILL.md +2 -5
package/src/templates/skills/guild-specialize/SKILL.md +2 -9
package/src/templates/skills/qa-cycle/SKILL.md +0 -7
package/src/templates/skills/re-specialize/SKILL.md +0 -3
package/src/templates/skills/session-end/SKILL.md +77 -30
package/src/templates/skills/session-start/SKILL.md +51 -20
package/src/utils/eval-runner.js +2 -8
package/src/utils/generators.js +3 -4
package/src/utils/skill-parser.js +83 -0
package/src/utils/trigger-runner.js +1 -1
package/src/commands/logs.js +0 -63
package/src/commands/reset-learnings.js +0 -44
package/src/commands/run.js +0 -105
package/src/commands/stats.js +0 -147
package/src/templates/agents/db-migration.md +0 -51
package/src/templates/agents/learnings-extractor.md +0 -49
package/src/templates/agents/platform-expert.md +0 -92
package/src/templates/agents/product-owner.md +0 -52
package/src/templates/skills/dev-flow/SKILL.md +0 -83
package/src/templates/skills/dev-flow/evals/evals.json +0 -36
package/src/templates/skills/dev-flow/evals/triggers.json +0 -16
package/src/templates/skills/new-feature/SKILL.md +0 -119
package/src/templates/skills/new-feature/evals/evals.json +0 -41
package/src/templates/skills/new-feature/evals/triggers.json +0 -16
package/src/templates/skills/review/SKILL.md +0 -97
package/src/templates/skills/review/evals/evals.json +0 -43
package/src/templates/skills/review/evals/triggers.json +0 -16
package/src/templates/skills/status/SKILL.md +0 -100
package/src/templates/skills/status/evals/evals.json +0 -40
package/src/templates/skills/status/evals/triggers.json +0 -16
package/src/templates/skills/verify/SKILL.md +0 -114
package/src/templates/skills/verify/evals/triggers.json +0 -16
package/src/utils/accounting.js +0 -139
package/src/utils/dispatch-protocol.js +0 -74
package/src/utils/dispatch.js +0 -172
package/src/utils/executor.js +0 -183
package/src/utils/learnings-io.js +0 -76
package/src/utils/learnings.js +0 -204
package/src/utils/orchestrator-io.js +0 -356
package/src/utils/orchestrator.js +0 -590
package/src/utils/pricing.js +0 -28
package/src/utils/providers/claude-code.js +0 -43
package/src/utils/skill-loader.js +0 -83
package/src/utils/trace.js +0 -400
package/src/utils/workflow-parser.js +0 -225

package/README.md CHANGED

@@ -7,7 +7,7 @@
 **Guild makes Claude Code think before it builds.**
-Guild is a spec-driven development CLI for Claude Code. It installs structured design and development workflows as `.claude/` markdown files in any project. Before code is written, features are evaluated, debated by independent AI perspectives, and specified in a design doc. Everything is markdown, tracked by git, works offline, zero infrastructure.
+Without Guild, Claude Code writes code immediately. No evaluation, no design, no review. With Guild, every feature goes through structured phases — evaluated by an advisor, designed by a tech lead, reviewed by a code reviewer, validated by QA — before anything ships. Everything is markdown in `.claude/`, tracked by git, works offline, zero infrastructure.
 ## The Problem
@@ -20,10 +20,11 @@ Without structure, Claude Code:
 ## How Guild Solves It
-- **Spec before code**: every feature starts with a design doc
-- **Structured deliberation**: `/council` runs parallel independent analysis -- multiple perspectives evaluate independently, then synthesize
-- **Decisions that persist**: design docs, session state, and project context live in git-tracked markdown
-- **Zero infrastructure**: no servers, no APIs, just markdown files and Claude Code
+- **Spec before code**: `/build-feature` enforces evaluation, design, and review phases — code comes after the design doc
+- **Independent perspectives**: `/council` spawns parallel agents that each analyze your idea independently, then synthesize into a decision
+- **Session continuity**: `/session-start` and `/session-end` combine SESSION.md with Claude Code's memory system — you never lose context between sessions
+- **Behavioral discipline**: `/tdd` and `/debug` prevent the most common LLM anti-patterns: code before tests, fixes before root cause analysis
+- **Quality you can measure**: `guild eval` validates skill structure, trigger accuracy, and description quality with automated benchmarks
 ## Quick Start
@@ -43,46 +44,56 @@ Then use skills as slash commands in Claude Code:
 ## The Pipeline
 ```text
-You ──> /council "Add JWT auth"
+You ──> /build-feature "Add JWT auth"
          │
          ▼
-    ┌──────────┐     ┌──────────────┐     ┌──────────┐
-    │ Evaluate │────>│  Design Doc  │────>│  Build   │
-    │ debate   │     │  spec        │     │ implement│
-    └──────────┘     └──────────────┘     └────┬─────┘
-                                               │
-                                         ┌─────┴─────┐
-                                         ▼           ▼
-                                   ┌──────────┐┌──────────┐
-                                   │  Review  ││    QA    │
-                                   └──────────┘└──────────┘
+    ┌──────────┐     ┌──────────┐     ┌──────────┐
+    │ Evaluate │────>│  Design  │────>│  Build   │
+    │ advisor  │     │ tech-lead│     │developer │
+    └──────────┘     └──────────┘     └────┬─────┘
+                                           │
+                                     ┌─────┴─────┐
+                                     ▼           ▼
+                               ┌──────────┐┌──────────┐
+                               │  Review  ││    QA    │
+                               └──────────┘└──────────┘
 ```
-Six phases: **evaluate**, **specify**, **plan**, **implement**, **review**, **validate**. Phases 1-3 happen before any code is written.
+Five phases: **evaluate**, **design**, **implement**, **review**, **validate**. Phases 1-2 happen before any code is written.
-## Skills Reference
+## Skills
-All 15 skills, grouped by function:
+10 skills, available as slash commands in Claude Code:
-| Skill | Group | Description |
-| --- | --- | --- |
-| `/build-feature` | Pipeline | Full pipeline: evaluate, spec, implement, review, QA |
-| `/new-feature` | Pipeline | Create branch and scaffold for a new feature |
-| `/create-pr` | Pipeline | Create a structured pull request from current branch |
-| `/council` | Decision | Multi-perspective deliberation on a decision or feature |
-| `/review` | Quality | Code review on the current diff |
-| `/qa-cycle` | Quality | QA and bugfix loop until clean |
-| `/tdd` | Discipline | TDD red-green-refactor cycle |
-| `/debug` | Discipline | Systematic 4-phase debugging |
-| `/verify` | Discipline | Evidence-before-claims verification |
-| `/guild-specialize` | Context | Explore codebase, enrich CLAUDE.md with real conventions |
-| `/re-specialize` | Context | Incremental update of auto-generated CLAUDE.md zones |
-| `/session-start` | Context | Load context and resume work |
-| `/session-end` | Context | Save state to SESSION.md |
-| `/status` | Context | Project and session state overview |
-| `/dev-flow` | Context | Show current pipeline phase and next step |
+| Skill | What it does |
+| --- | --- |
+| `/build-feature` | Full pipeline: evaluate, design, implement, review, QA |
+| `/council` | Multi-perspective deliberation — 3 agents debate independently, then synthesize |
+| `/create-pr` | Structured pull request from current branch |
+| `/qa-cycle` | QA and bugfix loop until clean |
+| `/tdd` | TDD red-green-refactor — no code without a failing test |
+| `/debug` | Systematic 4-phase debugging — no fixes without root cause |
+| `/guild-specialize` | Explore your codebase, enrich CLAUDE.md with real conventions |
+| `/re-specialize` | Incremental update of CLAUDE.md when your stack changes |
+| `/session-start` | Resume work from SESSION.md + Claude Code memory |
+| `/session-end` | Save state to SESSION.md + durable learnings to memory |
+## Agents
+6 specialized roles that give Claude Code distinct perspectives:
+| Agent | Role |
+| --- | --- |
+| advisor | Evaluates ideas and provides strategic direction. First gate before any work begins |
+| tech-lead | Breaks features into tasks. Defines technical approach and architecture |
+| developer | Implements features following project conventions. Writes tests, makes atomic commits |
+| code-reviewer | Reviews quality, patterns, and technical debt |
+| qa | Testing, edge cases, regression. Validates the implementation meets acceptance criteria |
+| bugfix | Diagnosis and bug resolution. Isolates root causes and applies targeted fixes |
-## CLI Commands
+Each agent is a flat `.md` file with identity, responsibilities, and boundaries. Claude Code reads them via its native Agent tool and assumes the role.
+## CLI
 ```bash
 guild init              # Interactive project onboarding
@@ -90,48 +101,38 @@ guild new-agent <name>  # Create a custom agent
 guild status            # Show project status
 guild doctor            # Diagnose setup
 guild list              # List agents and skills
-guild run <skill>       # Preview a skill's execution plan (dry-run)
-guild logs              # View execution traces
-guild logs clean        # Remove old traces (--days N, --all)
-guild stats             # Token usage and cost estimates
 guild eval              # Run structural skill evaluations
-guild eval --triggers   # Run trigger accuracy tests (keyword matcher)
-guild eval --semantic   # Run trigger tests with LLM semantic matcher
-guild eval --suggest    # Show description improvement suggestions
-guild workspace init <name> <members...>  # Create a workspace
-guild workspace add <path>                # Add a member repo
-guild workspace status                    # Show workspace state
+guild eval --triggers   # Run trigger accuracy tests
+guild eval --semantic   # LLM-based trigger tests (requires ANTHROPIC_API_KEY)
+guild eval --suggest    # Description improvement suggestions
+guild workspace init    # Create a multi-repo workspace
 ```
 ## Skill Evaluations
-Guild includes a built-in evaluation framework for validating skill quality:
+Guild includes a built-in framework for measuring skill quality:
-- **Structural evals** (`guild eval`) -- assert workflow structure: steps exist, roles are correct, gates are present
-- **Trigger tests** (`guild eval --triggers`) -- verify that user prompts route to the correct skill using keyword overlap scoring
-- **Semantic matcher** (`guild eval --semantic`) -- optional LLM-based scoring via Anthropic Haiku for higher-fidelity trigger testing (requires `ANTHROPIC_API_KEY`)
-- **Description suggestions** (`guild eval --suggest`) -- analyzes keyword gaps in skill descriptions based on failed triggers
+- **Structural evals** -- assert workflow structure: steps exist, roles are correct, gates are present
+- **Trigger tests** -- verify that user prompts route to the correct skill
+- **Semantic matcher** -- LLM-based scoring for higher-fidelity trigger testing
+- **Benchmarks** -- rolling history with per-skill accuracy, precision, recall, and regression detection
-Every trigger run automatically records results to `benchmarks/benchmark.json` (rolling 30-entry history) and generates `benchmarks/benchmark.md` with per-skill accuracy, precision, recall, and delta vs previous run. Regressions (>5% accuracy drop with 2+ tests flipped) are flagged automatically.
+## How It Works
-## Under the Hood
+Guild installs agent definitions and skill workflows as markdown files in your project's `.claude/` directory. Claude Code discovers and executes them natively — no custom runtime, no extra process, no API calls. When you type `/build-feature`, Claude Code reads the skill, follows the phases, and spawns agents using its own Agent tool.
-Guild coordinates 10 specialized agents through the pipeline. Each agent handles one phase.
+Guild defines **what** happens. Claude Code decides **how** to execute it.
-| Agent | Role |
-| --- | --- |
-| advisor | Evaluates ideas and provides strategic direction |
-| product-owner | Turns approved ideas into concrete tasks |
-| tech-lead | Defines technical approach and architecture |
-| developer | Implements features following project conventions |
-| code-reviewer | Reviews quality, patterns, and technical debt |
-| qa | Testing, edge cases, regression validation |
-| bugfix | Bug diagnosis and resolution |
-| db-migration | Schema changes and safe migrations |
-| platform-expert | Diagnoses Claude Code integration issues |
-| learnings-extractor | Extracts compound learnings from pipeline executions |
+## Session Continuity
+Claude Code's native memory system remembers who you are, lessons learned, and project context — knowledge that lasts months. But it explicitly does not store ephemeral work state: what you were building, which branch, what phase, what's next. That's the gap Guild fills.
+`/session-end` writes to **both layers**:
+- **SESSION.md** — where you stopped: task, branch, phase, next steps (overwritten each session)
+- **Claude Code memory** — what you learned: decisions, lessons, references (persists across sessions)
-Agents are flat `.md` files with identity and expertise. Skills orchestrate agents through structured pipelines. Everything lives in `.claude/`, readable by humans, tracked by git.
+`/session-start` reads from **both** and presents a unified summary. You resume exactly where you left off, with full context of what you know and what you were doing.
 ## Guild Builds Itself
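
The README hunks above mention per-skill accuracy, precision, recall, and regression detection, and the 1.4.0 wording gives the rule as ">5% accuracy drop with 2+ tests flipped". A minimal sketch of that rule follows. It is not the package's implementation: the `{ id, expected, actual }` record shape and the names `score` and `isRegression` are hypothetical, the 5% is read as an absolute drop in accuracy, and "flipped" is read as a test that passed in the previous run and fails in the current one.

```javascript
// Confusion-matrix metrics for one skill's trigger tests.
// Each result is a hypothetical { id, expected, actual } record: `expected` says
// whether the prompt should trigger the skill, `actual` whether it did.
function score(results) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const r of results) {
    if (r.expected && r.actual) tp++;
    else if (!r.expected && r.actual) fp++;
    else if (r.expected && !r.actual) fn++;
    else tn++;
  }
  const total = results.length;
  return {
    accuracy: total === 0 ? 0 : (tp + tn) / total,
    precision: tp + fp === 0 ? 0 : tp / (tp + fp),
    recall: tp + fn === 0 ? 0 : tp / (tp + fn),
  };
}

// Assumed reading of the README rule: accuracy fell by more than 0.05 (absolute)
// AND at least two tests that passed before now fail.
function isRegression(previous, current) {
  const drop = score(previous).accuracy - score(current).accuracy;
  const passedBefore = new Set(
    previous.filter((r) => r.expected === r.actual).map((r) => r.id)
  );
  const flipped = current.filter(
    (r) => passedBefore.has(r.id) && r.expected !== r.actual
  ).length;
  return drop > 0.05 && flipped >= 2;
}

// Demo data: ten passing tests, then runs with two and one of them broken.
const baseline = Array.from({ length: 10 }, (_, i) => ({
  id: `t${i}`,
  expected: i % 2 === 0,
  actual: i % 2 === 0,
}));
const flip = (results, ids) =>
  results.map((r) => (ids.includes(r.id) ? { ...r, actual: !r.actual } : r));
const twoBroken = flip(baseline, ['t0', 't1']);
const oneBroken = flip(baseline, ['t0']);

console.log(score(twoBroken));
console.log(isRegression(baseline, twoBroken)); // true
console.log(isRegression(baseline, oneBroken)); // false
```

Requiring both conditions keeps a single flaky test in a small suite from being reported as a regression, even though one failure out of ten already moves accuracy by more than 5%.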

package/bin/guild.js CHANGED

@@ -1,14 +1,15 @@
 #!/usr/bin/env node
 /**
- * Guild v1 — CLI entry point
+ * Guild v2 — CLI entry point
  * Usage:
- *   guild init           — interactive onboarding v1
+ *   guild init           — interactive onboarding
  *   guild new-agent      — create a new agent
  *   guild status         — view project status
  *   guild doctor         — verify setup and report issues
  *   guild list           — list installed agents and skills
- *   guild stats          — view token usage and cost stats
+ *   guild eval           — run skill structural evaluations
+ *   guild workspace      — manage multi-repo workspaces
  */
 import { program } from 'commander';
@@ -106,69 +107,6 @@ program
     }
   });
-// guild reset-learnings
-program
-  .command('reset-learnings')
-  .description('Reset the compound learnings file')
-  .option('-f, --force', 'Skip confirmation prompt')
-  .action(async (options) => {
-    try {
-      const { runResetLearnings } = await import('../src/commands/reset-learnings.js');
-      await runResetLearnings(options);
-    } catch (err) {
-      console.error(err.message);
-      process.exit(1);
-    }
-  });
-// guild run
-program
-  .command('run')
-  .description('Execute a skill workflow')
-  .argument('<skill>', 'Skill name to run')
-  .argument('[input]', 'Input text for the skill', '')
-  .option('--profile <profile>', 'Model profile (max, pro)', 'max')
-  .option('--dry-run', 'Display the execution plan without running it')
-  .action(async (skill, input, options) => {
-    try {
-      const { runRun } = await import('../src/commands/run.js');
-      await runRun(skill, input, options);
-    } catch (err) {
-      console.error(err.message);
-      process.exit(1);
-    }
-  });
-// guild logs (list traces)
-const logsCmd = program
-  .command('logs')
-  .description('View and manage execution traces')
-  .action(async (options) => {
-    try {
-      const { runLogs } = await import('../src/commands/logs.js');
-      await runLogs('list', options);
-    } catch (err) {
-      console.error(err.message);
-      process.exit(1);
-    }
-  });
-// guild logs clean
-logsCmd
-  .command('clean')
-  .description('Remove old execution traces')
-  .option('--days <days>', 'Remove traces older than N days', '30')
-  .option('--all', 'Remove all traces')
-  .action(async (options) => {
-    try {
-      const { runLogs } = await import('../src/commands/logs.js');
-      await runLogs('clean', options);
-    } catch (err) {
-      console.error(err.message);
-      process.exit(1);
-    }
-  });
 // guild eval
 program
   .command('eval')
@@ -195,25 +133,6 @@ program
     }
   });
-// guild stats
-program
-  .command('stats')
-  .description('View token usage stats and cost estimates')
-  .option('--period <period>', 'Filter by period: today, week, month, all', 'month')
-  .option('--compare', 'Compare cost across model profiles')
-  .option('--reset', 'Delete all usage history')
-  .option('-f, --force', 'Skip confirmation prompt (for --reset)')
-  .option('--export <format>', 'Export data (csv)')
-  .action(async (options) => {
-    try {
-      const { runStats } = await import('../src/commands/stats.js');
-      await runStats(options);
-    } catch (err) {
-      console.error(err.message);
-      process.exit(1);
-    }
-  });
 // guild workspace
 const workspaceCmd = program
   .command('workspace')

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "guild-agents",
-  "version": "1.4.0",
+  "version": "2.0.0",
   "description": "Specification-driven development CLI for Claude Code — think before you build",
   "type": "module",
   "files": [
@@ -73,7 +73,7 @@
     "@eslint/js": "^10.0.1",
     "@vitest/coverage-v8": "^4.0.18",
     "eslint": "^10.0.1",
-    "markdownlint-cli2": "^0.21.0",
+    "markdownlint-cli2": "^0.22.1",
     "vitest": "^4.0.18"
   }
 }

package/src/commands/doctor.js CHANGED

@@ -4,10 +4,10 @@
 import * as p from '@clack/prompts';
 import chalk from 'chalk';
-import { existsSync, readdirSync } from 'fs';
+import { existsSync, readdirSync, readFileSync } from 'fs';
 import { join } from 'path';
 import { resolveProjectRoot } from '../utils/files.js';
-import { loadAllSkills } from '../utils/skill-loader.js';
+import { parseSkill } from '../utils/skill-parser.js';
 export async function runDoctor() {
   const root = resolveProjectRoot();
@@ -86,27 +86,26 @@ export async function runDoctor() {
   // Check workflow validation in skills
   if (existsSync(skillsDir)) {
-    const skills = loadAllSkills(skillsDir);
+    const skillDirs = readdirSync(skillsDir, { withFileTypes: true })
+      .filter(d => d.isDirectory())
+      .filter(d => existsSync(join(skillsDir, d.name, 'SKILL.md')));
     let workflowCount = 0;
     let workflowErrors = 0;
     const errorDetails = [];
-    for (const [name, skill] of skills) {
+    for (const dir of skillDirs) {
+      const content = readFileSync(join(skillsDir, dir.name, 'SKILL.md'), 'utf8');
+      const skill = parseSkill(content);
       if (skill.workflow) {
         workflowCount++;
-        if (skill.errors.length > 0) {
-          workflowErrors++;
-          errorDetails.push(`${name}: ${skill.errors.join('; ')}`);
-        }
-      }
-      // Check that agent references exist
-      if (skill.workflow) {
         for (const step of skill.workflow.steps) {
           if (step.role !== 'system' && step.role !== 'dynamic') {
             const agentPath = join(agentsDir, `${step.role}.md`);
             if (!existsSync(agentPath)) {
-              errorDetails.push(`${name}: step "${step.id}" references agent "${step.role}" — agent not found`);
+              errorDetails.push(`${dir.name}: step "${step.id}" references agent "${step.role}" — agent not found`);
               workflowErrors++;
             }
           }
@@ -124,27 +123,6 @@ export async function runDoctor() {
       });
       healthy = false;
     }
-    // If workflowCount === 0, don't add a check (no workflows to validate)
-    // Check for dual-format skills (workflow frontmatter + body step/phase headings)
-    // Matches "### Step 1", "## Phase 2", etc. — requires digit after Step/Phase
-    const STEP_PHASE_RE = /^#{1,3}\s+(Step|Phase)\s+\d/im;
-    const dualFormatWarnings = [];
-    for (const [name, skill] of skills) {
-      if (skill.workflow && skill.body && STEP_PHASE_RE.test(skill.body)) {
-        dualFormatWarnings.push(name);
-      }
-    }
-    if (dualFormatWarnings.length > 0) {
-      checks.push({
-        name: `Dual-format skills (${dualFormatWarnings.length} warning(s))`,
-        pass: true,
-        warn: true,
-        detail: `Skills with both workflow frontmatter and body step/phase headings: ${dualFormatWarnings.join(', ')}. Workflow steps take precedence — consider removing prose steps from body.`,
-      });
-    }
   }
   // Display results
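
The doctor.js hunks above replace `loadAllSkills` with a direct directory scan followed by a per-step check that each referenced agent file exists. The sketch below isolates those two steps so they can be run on their own. It is written under assumptions: it uses CommonJS `require` so it runs with plain `node` (the package itself uses ES modules), the helper names `findSkillDirs` and `findMissingAgents` are hypothetical, and the workflow object is passed in already parsed because the contents of `skill-parser.js` are not shown in this diff.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// List skill directories that contain a SKILL.md, mirroring the scan in the hunk.
function findSkillDirs(skillsDir) {
  return fs
    .readdirSync(skillsDir, { withFileTypes: true })
    .filter((d) => d.isDirectory())
    .filter((d) => fs.existsSync(path.join(skillsDir, d.name, 'SKILL.md')))
    .map((d) => d.name);
}

// Report workflow steps whose role has no matching <role>.md agent file.
// `workflow` is a hypothetical pre-parsed object of shape { steps: [{ id, role }] }.
function findMissingAgents(skillName, workflow, agentsDir) {
  const errors = [];
  for (const step of workflow.steps) {
    // 'system' and 'dynamic' roles are exempt, as in the original check.
    if (step.role === 'system' || step.role === 'dynamic') continue;
    if (!fs.existsSync(path.join(agentsDir, `${step.role}.md`))) {
      errors.push(`${skillName}: step "${step.id}" references missing agent "${step.role}"`);
    }
  }
  return errors;
}

// Demo against a throwaway directory tree.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'guild-sketch-'));
const skillsDir = path.join(root, 'skills');
const agentsDir = path.join(root, 'agents');
fs.mkdirSync(path.join(skillsDir, 'council'), { recursive: true });
fs.mkdirSync(path.join(skillsDir, 'empty-dir'), { recursive: true }); // no SKILL.md, so skipped
fs.mkdirSync(agentsDir, { recursive: true });
fs.writeFileSync(path.join(skillsDir, 'council', 'SKILL.md'), '# council\n');
fs.writeFileSync(path.join(agentsDir, 'advisor.md'), '# Advisor\n');

const skillNames = findSkillDirs(skillsDir);
const missing = findMissingAgents(
  'council',
  {
    steps: [
      { id: 'evaluate', role: 'advisor' },
      { id: 'design', role: 'tech-lead' },
      { id: 'gate', role: 'system' },
    ],
  },
  agentsDir
);

console.log(skillNames);
console.log(missing);

// Clean up the temporary tree.
fs.rmSync(root, { recursive: true, force: true });
```

In the demo only `council` is reported as a skill, and only the `design` step is flagged, because `advisor.md` exists and the `system` role is skipped.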

package/src/commands/init.js CHANGED

@@ -135,7 +135,7 @@ export async function runInit() {
   const relevantSkills = projectData.hasExistingCode
     ? ['/guild-specialize', '/council', '/build-feature']
-    : ['/council', '/build-feature', '/new-feature'];
+    : ['/council', '/build-feature'];
   p.log.info(`Start with: ${relevantSkills.join('  ')}`);
   const quickStart = projectData.hasExistingCode

package/src/templates/agents/advisor.md CHANGED

@@ -21,7 +21,6 @@ You are the domain guardian of [PROJECT]. Your job is to evaluate ideas and prop
 ## What you do NOT do
 - You do not define architecture or technical approach -- that is the Tech Lead's role
-- You do not prioritize the backlog or write acceptance criteria -- that is the Product Owner's role
 - You do not review code -- that is the Code Reviewer's role
 - You do not implement anything -- that is the Developer's role

package/src/templates/agents/developer.md CHANGED

@@ -8,7 +8,7 @@ default-tier: execution
 # Developer
-You are the Developer for [PROJECT]. Your job is to implement features and changes following the project conventions, the approach defined by the Tech Lead, and the acceptance criteria from the Product Owner.
+You are the Developer for [PROJECT]. Your job is to implement features and changes following the project conventions, the approach defined by the Tech Lead, and the acceptance criteria.
 ## Responsibilities
@@ -22,7 +22,7 @@ You are the Developer for [PROJECT]. Your job is to implement features and chang
 - You do not define architecture or technical approach -- that is the Tech Lead's role
 - You do not validate the result functionally -- that is QA's role
-- You do not prioritize or decide what to implement -- that is the Product Owner's role
+- You do not prioritize or decide what to implement -- that is the Advisor's role
 - You do not investigate production bugs -- that is Bugfix's role
 ## Process

package/src/templates/agents/qa.md CHANGED

@@ -22,7 +22,7 @@ You are QA for [PROJECT]. Your job is to functionally validate that the implemen
 - You do not fix bugs -- that is Bugfix's role
 - You do not write unit tests -- that is the Developer's role
-- You do not define acceptance criteria -- that is the Product Owner's role
+- You do not define acceptance criteria -- that is the Tech Lead's role
 - You do not implement features -- that is the Developer's role
 ## Process

package/src/templates/agents/tech-lead.md CHANGED

@@ -15,7 +15,7 @@ You are the Tech Lead for [PROJECT]. Your job is to ensure the technical coheren
 - Define the technical approach for each task before implementation
 - Establish patterns, interfaces, and contracts between components
 - Identify technical risks and propose mitigations
-- Enrich Product Owner tasks with concrete technical direction
+- Break features into concrete tasks with verifiable acceptance criteria
 - Maintain the project's architectural coherence over time
 ## What you do NOT do
@@ -23,7 +23,7 @@ You are the Tech Lead for [PROJECT]. Your job is to ensure the technical coheren
 - You do not implement code -- that is the Developer's role
 - You do not validate functional behavior -- that is QA's role
 - You do not evaluate business coherence -- that is the Advisor's role
-- You do not prioritize the backlog -- that is the Product Owner's role
+- You do not evaluate business coherence or prioritize the backlog -- that is the Advisor's role
 ## Process