devlyn-cli 0.7.1 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (23)
  1. package/CLAUDE.md +31 -0
  2. package/README.md +34 -13
  3. package/agents-config/evaluator.md +37 -13
  4. package/bin/devlyn.js +33 -4
  5. package/config/skills/devlyn:auto-resolve/SKILL.md +244 -0
  6. package/config/skills/{devlyn-evaluate → devlyn:evaluate}/SKILL.md +106 -4
  7. package/config/skills/{devlyn-team-resolve → devlyn:team-resolve}/SKILL.md +36 -8
  8. package/package.json +1 -1
  9. package/config/skills/{devlyn-clean → devlyn:clean}/SKILL.md +0 -0
  10. package/config/skills/{devlyn-design-system → devlyn:design-system}/SKILL.md +0 -0
  11. package/config/skills/{devlyn-design-ui → devlyn:design-ui}/SKILL.md +0 -0
  12. package/config/skills/{devlyn-discover-product → devlyn:discover-product}/SKILL.md +0 -0
  13. package/config/skills/{devlyn-feature-spec → devlyn:feature-spec}/SKILL.md +0 -0
  14. package/config/skills/{devlyn-implement-ui → devlyn:implement-ui}/SKILL.md +0 -0
  15. package/config/skills/{devlyn-product-spec → devlyn:product-spec}/SKILL.md +0 -0
  16. package/config/skills/{devlyn-recommend-features → devlyn:recommend-features}/SKILL.md +0 -0
  17. package/config/skills/{devlyn-resolve → devlyn:resolve}/SKILL.md +0 -0
  18. package/config/skills/{devlyn-review → devlyn:review}/SKILL.md +0 -0
  19. package/config/skills/{devlyn-team-design-ui → devlyn:team-design-ui}/SKILL.md +0 -0
  20. package/config/skills/{devlyn-team-review → devlyn:team-review}/SKILL.md +0 -0
  21. package/config/skills/{devlyn-update-docs → devlyn:update-docs}/SKILL.md +0 -0
  22. package/optional-skills/{devlyn-pencil-pull → devlyn:pencil-pull}/SKILL.md +0 -0
  23. package/optional-skills/{devlyn-pencil-push → devlyn:pencil-push}/SKILL.md +0 -0
package/CLAUDE.md CHANGED
@@ -48,6 +48,36 @@ The full design-to-implementation pipeline:
 
 For complex features, use the Plan agent to design the approach before implementation.
 
+ ## Automated Pipeline (Recommended Starting Point)
+
+ For hands-free build-evaluate-polish cycles — works for bugs, features, refactors, and chores:
+
+ ```
+ /devlyn:auto-resolve [task description]
+ ```
+
+ This runs the full pipeline automatically: **Build → Evaluate → Fix Loop → Simplify → Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.claude/done-criteria.md`, `.claude/EVAL-FINDINGS.md`).
+
+ Optional flags:
+ - `--max-rounds 3` — increase max evaluate-fix iterations (default: 2)
+ - `--skip-review` — skip team-review phase
+ - `--skip-clean` — skip clean phase
+ - `--skip-docs` — skip update-docs phase
+
+ ## Manual Pipeline (Step-by-Step Control)
+
+ When you want to run each step yourself with review between phases:
+
+ 1. `/devlyn:team-resolve [issue]` → Investigate + implement (writes `.claude/done-criteria.md`)
+ 2. `/devlyn:evaluate` → Grade against done-criteria (writes `.claude/EVAL-FINDINGS.md`)
+ 3. If findings exist: `/devlyn:team-resolve "Fix issues in .claude/EVAL-FINDINGS.md"` → Fix loop
+ 4. `/simplify` → Quick cleanup pass
+ 5. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
+ 6. `/devlyn:clean` → Codebase hygiene
+ 7. `/devlyn:update-docs` → Keep docs in sync
+
+ Steps 5-7 are optional depending on scope.
+
 ## Vibe Coding Workflow
 
 The recommended sequence after writing code:
@@ -72,6 +102,7 @@ Steps 4-6 are optional depending on the scope of changes. `/simplify` should alw
 
 - **Simple bugs**: Use `/devlyn:resolve` for systematic bug fixing with test-driven validation
 - **Complex bugs**: Use `/devlyn:team-resolve` for multi-perspective investigation with a full agent team
+ - **Hands-free**: Use `/devlyn:auto-resolve` for a fully automated resolve → evaluate → fix → polish pipeline
 - **Post-fix review**: Use `/devlyn:team-review` for thorough multi-reviewer validation
 
 ## Maintenance Workflow
package/README.md CHANGED
@@ -26,7 +26,7 @@
 
 devlyn-cli solves this by installing a curated `.claude/` configuration into any project:
 
- - **14 slash commands** for debugging, code review, UI design, documentation, and more
+ - **16 slash commands** for debugging, code review, UI design, documentation, and more
 - **5 core skills** that activate automatically based on conversation context
 - **Agent team workflows** that spawn specialized AI teammates for complex tasks
 - **Product & feature spec templates** for structured planning
@@ -58,7 +58,7 @@ npx devlyn-cli list
 ```
 your-project/
 ├── .claude/
- │ ├── commands/ # 14 slash commands
+ │ ├── commands/ # 16 slash commands
 │ ├── skills/ # 5 core skills + any optional addons
 │ ├── templates/ # Product spec, feature spec, prompt templates
 │ ├── commit-conventions.md # Commit message standards
@@ -76,6 +76,7 @@ Slash commands are invoked directly in Claude Code conversations (e.g., type `/d
 |---|---|
 | `/devlyn:resolve` | Systematic bug fixing with root-cause analysis and test-driven validation |
 | `/devlyn:team-resolve` | Spawns a full agent team — root cause analyst, test engineer, security auditor — to investigate complex issues |
+ | `/devlyn:auto-resolve` | Fully automated pipeline for any task — bugs, features, refactors, chores. Build → evaluate → fix loop → simplify → review → clean → docs. One command, zero human intervention |
 
 ### Code Review & Quality
 
@@ -83,6 +84,7 @@ Slash commands are invoked directly in Claude Code conversations (e.g., type `/d
 |---|---|
 | `/devlyn:review` | Post-implementation review — security, quality, best practices checklist |
 | `/devlyn:team-review` | Multi-perspective team review with specialized reviewers (security, quality, testing, performance, product) |
+ | `/devlyn:evaluate` | Independent quality evaluation — assembles evaluator team to grade work against done criteria with calibrated, skeptical grading |
 | `/devlyn:clean` | Detect and remove dead code, unused dependencies, complexity hotspots, and tech debt |
 
 ### UI Design & Implementation
@@ -125,24 +127,43 @@ Skills are **not invoked manually** — they activate automatically when Claude
 
 Commands are designed to compose. Pick the right tool based on scope, then chain them together.
 
- ### Recommended Workflow
+ ### Automated Pipeline (Recommended)
 
- The full fix → polish → review → maintain cycle:
+ One command runs the full cycle — no human intervention needed:
+
+ ```bash
+ /devlyn:auto-resolve fix the auth bug where users see blank screen on 401
+ ```
+
+ | Phase | What Happens |
+ |---|---|
+ | **Build** | `team-resolve` investigates and implements, writes testable done criteria |
+ | **Evaluate** | Independent evaluator grades against done criteria with calibrated skepticism |
+ | **Fix Loop** | If evaluation fails, fixes findings and re-evaluates (up to N rounds) |
+ | **Simplify** | Quick cleanup pass for reuse and efficiency |
+ | **Review** | Multi-perspective team review |
+ | **Clean** | Remove dead code and unused dependencies |
+ | **Docs** | Sync documentation with changes |
+
+ Each phase runs as a separate subagent (fresh context), communicates via files, and commits a git checkpoint for rollback safety. Skip phases with flags: `--skip-review`, `--skip-clean`, `--skip-docs`, `--max-rounds 3`.
+
+ ### Manual Workflow
+
+ For step-by-step control between phases:
 
 | Step | Command | What It Does |
 |---|---|---|
 | 1. **Resolve** | `/devlyn:resolve` or `/devlyn:team-resolve` | Fix the issue — solo for focused bugs (1-2 modules), team for complex issues (3+ modules) |
- | 2. **Simplify** | `/simplify` | Quick cleanup pass for reuse, quality, and efficiency *(built-in Claude Code command)* |
- | 3. **Review** | `/devlyn:review` or `/devlyn:team-review` | Audit the changes — solo for small PRs (< 10 files), team for large PRs (10+ files) |
- | | | *If the review finds issues, loop back to step 1* |
- | 4. **Clean** | `/devlyn:clean` | Remove dead code, unused dependencies, and complexity hotspots |
- | 5. **Document** | `/devlyn:update-docs` | Sync project documentation with the current codebase |
-
- Steps 4-5 are optional — run them periodically rather than on every PR. Steps 1-3 are the core loop.
+ | 2. **Evaluate** | `/devlyn:evaluate` | Independent quality evaluation — grades against done criteria written in step 1 |
+ | | | *If the evaluation finds issues: `/devlyn:team-resolve "Fix issues in .claude/EVAL-FINDINGS.md"`* |
+ | 3. **Simplify** | `/simplify` | Quick cleanup pass for reuse, quality, and efficiency *(built-in Claude Code command)* |
+ | 4. **Review** | `/devlyn:review` or `/devlyn:team-review` | Audit the changes — solo for small PRs (< 10 files), team for large PRs (10+ files) |
+ | 5. **Clean** | `/devlyn:clean` | Remove dead code, unused dependencies, and complexity hotspots |
+ | 6. **Document** | `/devlyn:update-docs` | Sync project documentation with the current codebase |
 
- > **Tip:** Consider running `/devlyn:review` once more after steps 4-5. `/devlyn:clean` removes code and `/devlyn:update-docs` changes docs — a final review pass catches accidental regressions from cleanup.
+ Steps 5-6 are optional — run them periodically rather than on every PR.
 
- > **Scope matching matters.** For a simple one-file bug, `/devlyn:resolve` + `/devlyn:review` (solo) is fast. For a multi-module feature, `/devlyn:team-resolve` + `/devlyn:team-review` (team) gives you parallel specialist perspectives. Don't over-tool simple changes.
+ > **Scope matching matters.** For a simple one-file bug, `/devlyn:resolve` + `/devlyn:review` (solo) is fast. For a multi-module feature, `/devlyn:auto-resolve` handles everything. Don't over-tool simple changes.
 
 ### UI Design Pipeline
 
package/agents-config/evaluator.md CHANGED
@@ -2,15 +2,30 @@
 
 You are a code quality evaluator. Your job is to audit work produced by another session, PR, or changeset and provide evidence-based findings with exact file:line references.
 
+ ## Before You Start
+
+ 1. **Check for done criteria**: Read `.claude/done-criteria.md` if it exists. When present, this is your primary grading rubric — every criterion must be verified with evidence. When absent, fall back to the checklists below.
+
+ ## Calibration
+
+ You will be too lenient by default. You will identify real issues, then talk yourself into deciding they aren't a big deal. Fight this tendency.
+
+ **Rule**: When in doubt, score DOWN. A false negative ships broken code. A false positive costs minutes of review. The cost is asymmetric.
+
+ - A catch block that logs but doesn't surface error to user = HIGH (not MEDIUM). Logging is not error handling.
+ - A `let` that could be `const` = LOW note only. Linters catch this.
+ - "The error handling is generally quite good" = WRONG. Count the instances. Name the files.
+
 ## Evaluation Process
 
 1. **Discover scope**: Read the changeset (git diff, PR diff, or specified files)
 2. **Assess correctness**: Find bugs, logic errors, silent failures, missing error handling
 3. **Check architecture**: Verify patterns match existing codebase, no type duplication, proper wiring
- 4. **Verify spec compliance**: If a spec exists (HANDOFF.md, RFC, issue), compare requirements vs implementation
+ 4. **Verify spec compliance**: If a spec exists (HANDOFF.md, RFC, issue, done-criteria.md), compare requirements vs implementation
 5. **Check error handling**: Every async operation needs loading, error, and empty states in UI. No silent catches.
 6. **Review API contracts**: New endpoints must follow existing conventions for naming, validation, error envelopes
 7. **Assess test coverage**: New modules need tests. Run the test suite and report results.
+ 8. **Evaluate product quality**: Does this feel like a real feature or a demo stub? Are workflows complete end-to-end? Is the UI coherent?
 
 ## Rules
 
@@ -19,22 +34,31 @@ You are a code quality evaluator. Your job is to audit work produced by another
 - Call out what's done well, not just problems
 - Look for cross-cutting patterns (e.g., same mistake repeated in multiple files)
 
- ## Output Format
+ ## Output
 
- ```
- ### Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+ Write findings to `.claude/EVAL-FINDINGS.md` for downstream consumption:
 
- **Findings by Severity:**
+ ```markdown
+ # Evaluation Findings
 
- CRITICAL:
- - [domain] `file:line` - description
+ ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
 
- HIGH:
- - [domain] `file:line` - description
+ ## Done Criteria Results (if done-criteria.md existed)
+ - [x] [criterion] — VERIFIED: [evidence]
+ - [ ] [criterion] — FAILED: [what's wrong, file:line]
 
- **What's Good:**
- - [positive observations]
+ ## Findings Requiring Action
+ ### CRITICAL
+ - `file:line` — [description] — Fix: [suggested approach]
+
+ ### HIGH
+ - `file:line` — [description] — Fix: [suggested approach]
 
- **Recommendation:**
- [next action]
+ ## Cross-Cutting Patterns
+ - [pattern description]
+
+ ## What's Good
+ - [positive observations]
 ```
+
+ Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the orchestrator or user is responsible for cleanup.
package/bin/devlyn.js CHANGED
@@ -63,8 +63,29 @@ const DEPRECATED_FILES = [
 'commands/devlyn.team-resolve.md',
 'commands/devlyn.team-review.md',
 'commands/devlyn.update-docs.md',
- 'commands/devlyn.pencil-pull.md', // migrated to skills/devlyn-pencil-pull
- 'commands/devlyn.pencil-push.md', // migrated to skills/devlyn-pencil-push
+ 'commands/devlyn.pencil-pull.md', // migrated to skills/devlyn:pencil-pull
+ 'commands/devlyn.pencil-push.md', // migrated to skills/devlyn:pencil-push
+ ];
+
+ // Skill directories renamed from devlyn-* to devlyn:* in v0.7.x
+ const DEPRECATED_DIRS = [
+ 'skills/devlyn-clean',
+ 'skills/devlyn-design-system',
+ 'skills/devlyn-design-ui',
+ 'skills/devlyn-discover-product',
+ 'skills/devlyn-evaluate',
+ 'skills/devlyn-feature-spec',
+ 'skills/devlyn-implement-ui',
+ 'skills/devlyn-product-spec',
+ 'skills/devlyn-recommend-features',
+ 'skills/devlyn-resolve',
+ 'skills/devlyn-review',
+ 'skills/devlyn-team-design-ui',
+ 'skills/devlyn-team-resolve',
+ 'skills/devlyn-team-review',
+ 'skills/devlyn-update-docs',
+ 'skills/devlyn-pencil-pull',
+ 'skills/devlyn-pencil-push',
 ];
 
 function getTargetDir() {
@@ -123,8 +144,8 @@ const OPTIONAL_ADDONS = [
 { name: 'better-auth-setup', desc: 'Production-ready Better Auth + Hono + Drizzle + PostgreSQL auth setup', type: 'local' },
 { name: 'pyx-scan', desc: 'Check whether an AI agent skill is safe before installing', type: 'local' },
 { name: 'dokkit', desc: 'Document template filling for DOCX/HWPX — ingest, fill, review, export', type: 'local' },
- { name: 'devlyn-pencil-pull', desc: 'Pull Pencil designs into code with exact visual fidelity', type: 'local' },
- { name: 'devlyn-pencil-push', desc: 'Push codebase UI to Pencil canvas for design sync', type: 'local' },
+ { name: 'devlyn:pencil-pull', desc: 'Pull Pencil designs into code with exact visual fidelity', type: 'local' },
+ { name: 'devlyn:pencil-push', desc: 'Push codebase UI to Pencil canvas for design sync', type: 'local' },
 // External skill packs (installed via npx skills add)
 { name: 'vercel-labs/agent-skills', desc: 'React, Next.js, React Native best practices', type: 'external' },
 { name: 'supabase/agent-skills', desc: 'Supabase integration patterns', type: 'external' },
@@ -232,6 +253,14 @@ function cleanupDeprecated(targetDir) {
 removed++;
 }
 }
+ for (const relPath of DEPRECATED_DIRS) {
+ const fullPath = path.join(targetDir, relPath);
+ if (fs.existsSync(fullPath)) {
+ fs.rmSync(fullPath, { recursive: true });
+ log(` ✕ ${relPath}/ (renamed)`, 'dim');
+ removed++;
+ }
+ }
 return removed;
 }
 
package/config/skills/devlyn:auto-resolve/SKILL.md ADDED
@@ -0,0 +1,244 @@
+ ---
+ name: devlyn:auto-resolve
+ description: Fully automated build-evaluate-polish pipeline for any task type — bug fixes, new features, refactors, chores, and more. Use this as the default starting point when the user wants hands-free implementation with zero human intervention. Runs the full cycle — build, evaluate, fix loop, simplify, review, clean, docs — as a single command. Use when the user says "auto resolve", "build this", "implement this feature", "fix this", "run the full pipeline", "refactor this", or wants to walk away and come back to finished work.
+ ---
+
+ Fully automated resolve-evaluate-polish pipeline. One command, zero human intervention. Spawns a subagent for each phase, uses file-based handoff between phases, and loops on evaluation feedback until the work passes or max rounds are reached.
+
+ <pipeline_config>
+ $ARGUMENTS
+ </pipeline_config>
+
+ <pipeline_workflow>
+
+ ## PHASE 0: PARSE INPUT
+
+ 1. Extract the task/issue description from `<pipeline_config>`.
+ 2. Determine optional flags from the input (defaults in parentheses):
+ - `--max-rounds N` (2) — max evaluate-fix loops before stopping with a report
+ - `--skip-review` (false) — skip team-review phase
+ - `--skip-clean` (false) — skip clean phase
+ - `--skip-docs` (false) — skip update-docs phase
+
+ Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
+ If no flags are present, use defaults.
+
+ 3. Announce the pipeline plan:
+ ```
+ Auto-resolve pipeline starting
+ Task: [extracted task description]
+ Phases: Build → Evaluate → [Fix loop if needed] → Simplify → [Review] → [Clean] → [Docs]
+ Max evaluation rounds: [N]
+ ```
+
+ ## PHASE 1: BUILD
+
+ Spawn a subagent using the Agent tool to investigate and implement the fix. The subagent does NOT have access to skills, so include all necessary instructions inline.
+
+ Agent prompt — pass this to the Agent tool:
+
+ Investigate and implement the following task. Work through these phases in order:
+
+ **Phase A — Understand the task**: Read the task description carefully. Classify the task type:
+ - **Bug fix**: trace from symptom to root cause. Read error logs and affected code paths.
+ - **Feature**: explore the codebase to find existing patterns, integration points, and relevant modules.
+ - **Refactor/Chore**: understand current implementation, identify what needs to change and why.
+ - **UI/UX**: review existing components, design system, and user flows.
+ Read relevant files in parallel. Build a clear picture of what exists and what needs to change.
+
+ **Phase B — Define done criteria**: Before writing any code, create `.claude/done-criteria.md` with testable success criteria. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section. This file is required — downstream evaluation depends on it.
+
+ **Phase C — Assemble a team**: Use TeamCreate to create a team. Select teammates based on task type:
+ - Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
+ - Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
+ - Refactor: architecture-reviewer + test-engineer
+ - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
+ Each teammate investigates from their perspective and sends findings back.
+
+ **Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
+
+ **Phase E — Update done criteria**: Mark each criterion in `.claude/done-criteria.md` as satisfied. Run the full test suite.
+
+ **Phase F — Cleanup**: Shut down all teammates and delete the team.
+
+ The task is: [paste the task description here]
+
+ **After the agent completes**:
+ 1. Verify `.claude/done-criteria.md` exists — if missing, create a basic one from the agent's output summary
+ 2. Run `git diff --stat` to confirm code was actually changed
+ 3. If no changes were made, report failure and stop
+ 4. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"` to create a rollback point
+
+ ## PHASE 2: EVALUATE
+
+ Spawn a subagent using the Agent tool to evaluate the work. Include all evaluation instructions inline.
+
+ Agent prompt — pass this to the Agent tool:
+
+ You are an independent evaluator. Your job is to grade work produced by another agent, not to praise it. You will be too lenient by default — fight this tendency. When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs minutes of review. The cost is asymmetric.
+
+ **Step 1 — Read the done criteria**: Read `.claude/done-criteria.md`. This is your primary grading rubric. Every criterion must be verified with evidence.
+
+ **Step 2 — Discover changes**: Run `git diff HEAD~1` and `git status` to see what changed. Read all changed/new files in parallel.
+
+ **Step 3 — Evaluate**: For each changed file, check:
+ - Correctness: logic errors, silent failures, null access, incorrect API contracts
+ - Architecture: pattern violations, duplication, missing integration
+ - Security (if auth/secrets/user-data touched): injection, hardcoded credentials, missing validation
+ - Frontend (if UI changed): missing error/loading/empty states, React anti-patterns, server/client boundaries
+ - Test coverage: untested modules, missing edge cases
+
+ **Step 4 — Grade against done criteria**: For each criterion in done-criteria.md, mark VERIFIED (with evidence) or FAILED (with file:line and what's wrong).
+
+ **Step 5 — Write findings**: Write `.claude/EVAL-FINDINGS.md` with this exact structure:
+
+ ```
+ # Evaluation Findings
+ ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+ ## Done Criteria Results
+ - [x] criterion — VERIFIED: evidence
+ - [ ] criterion — FAILED: what's wrong, file:line
+ ## Findings Requiring Action
+ ### CRITICAL
+ - `file:line` — description — Fix: suggested approach
+ ### HIGH
+ - `file:line` — description — Fix: suggested approach
+ ## Cross-Cutting Patterns
+ - pattern description
+ ```
+
+ Verdict rules: BLOCKED = any CRITICAL issues. NEEDS WORK = HIGH issues that should be fixed. PASS WITH ISSUES = only MEDIUM/LOW. PASS = clean.
+
+ Calibration examples to guide your judgment:
+ - A catch block that logs but doesn't surface error to user = HIGH (not MEDIUM). Logging is not error handling.
+ - A `let` that could be `const` = LOW note only. Linters catch this.
+ - "The error handling is generally quite good" = WRONG. Count the instances. Name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line..."
+
+ Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the orchestrator needs them.
+
+ **After the agent completes**:
+ 1. Read `.claude/EVAL-FINDINGS.md`
+ 2. Extract the verdict
+ 3. Branch on verdict:
+ - `PASS` → skip to PHASE 3
+ - `PASS WITH ISSUES` → skip to PHASE 3 (issues are shippable)
+ - `NEEDS WORK` → go to PHASE 2.5 (fix loop)
+ - `BLOCKED` → go to PHASE 2.5 (fix loop)
+ 4. If `.claude/EVAL-FINDINGS.md` was not created, treat as PASS WITH ISSUES and log a warning
+
+ ## PHASE 2.5: FIX LOOP (conditional)
+
+ Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
+
+ Spawn a subagent using the Agent tool to fix the evaluation findings.
+
+ Agent prompt — pass this to the Agent tool:
+
+ Read `.claude/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every CRITICAL and HIGH finding. Address MEDIUM findings if straightforward.
+
+ The original done criteria are in `.claude/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.
+
+ For each finding: read the referenced file:line, understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.claude/done-criteria.md` to mark fixed items.
+
+ **After the agent completes**:
+ 1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): fix round [N] complete"` to preserve the fix
+ 2. Increment round counter
+ 3. Go back to PHASE 2 (re-evaluate)
+
+ ## PHASE 3: SIMPLIFY
+
+ Spawn a subagent using the Agent tool for a quick cleanup pass.
+
+ Agent prompt — pass this to the Agent tool:
+
+ Review the recently changed files (use `git diff HEAD~1` to see what changed). Look for: code that could reuse existing utilities instead of reimplementing, quality issues (unclear naming, unnecessary complexity), and efficiency improvements (redundant operations, missing early returns). Fix any issues found. Keep changes minimal — this is a polish pass, not a rewrite.
+
+ **After the agent completes**:
+ 1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): simplify pass complete"` if there are changes
+
+ ## PHASE 4: REVIEW (skippable)
+
+ Skip if `--skip-review` was set.
+
+ Spawn a subagent using the Agent tool for a multi-perspective review.
+
+ Agent prompt — pass this to the Agent tool:
+
+ Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes.
+
+ Each reviewer evaluates from their perspective, sends findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW). After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
+
+ Clean up the team after completion.
+
+ **After the agent completes**:
+ 1. If CRITICAL issues remain unfixed, log a warning in the final report
+ 2. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): review fixes complete"` if there are changes
+
+ ## PHASE 5: CLEAN (skippable)
+
+ Skip if `--skip-clean` was set.
+
+ Spawn a subagent using the Agent tool.
+
+ Agent prompt — pass this to the Agent tool:
+
+ Scan the codebase for dead code, unused dependencies, and code hygiene issues in recently changed files. Focus on: unused imports, unreachable code paths, unused variables, dependencies in package.json that are no longer imported. Keep the scope tight — only clean what's related to recent work. Remove what's confirmed dead, leave anything ambiguous.
+
+ **After the agent completes**:
+ 1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): cleanup complete"` if there are changes
+
+ ## PHASE 6: DOCS (skippable)
+
+ Skip if `--skip-docs` was set.
+
+ Spawn a subagent using the Agent tool.
+
+ Agent prompt — pass this to the Agent tool:
+
+ Synchronize documentation with recent code changes. Use `git log --oneline -20` and `git diff main` to understand what changed. Update any docs that reference changed APIs, features, or behaviors. Do not create new documentation files unless the changes introduced entirely new features with no existing docs. Preserve all forward-looking content: roadmaps, future plans, visions, open questions.
+
+ **After the agent completes**:
+ 1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): docs updated"` if there are changes
+
+ ## PHASE 7: FINAL REPORT
+
+ After all phases complete:
+
+ 1. Clean up temporary files:
+ - Delete `.claude/done-criteria.md`
+ - Delete `.claude/EVAL-FINDINGS.md`
+
+ 2. Run `git log --oneline -10` to show commits made during the pipeline
+
+ 3. Present the report:
+
+ ```
+ ### Auto-Resolve Pipeline Complete
+
+ **Task**: [original task description]
+
+ **Pipeline Summary**:
+ | Phase | Status | Notes |
+ |-------|--------|-------|
+ | Build (team-resolve) | [completed] | [brief summary] |
+ | Evaluate | [PASS/NEEDS WORK after N rounds] | [verdict + key findings] |
+ | Fix rounds | [N rounds / skipped] | [what was fixed] |
+ | Simplify | [completed / skipped] | [changes made] |
+ | Review (team-review) | [completed / skipped] | [findings summary] |
+ | Clean | [completed / skipped] | [items cleaned] |
+ | Docs (update-docs) | [completed / skipped] | [docs updated] |
+
+ **Evaluation Rounds**: [N] of [max-rounds] used
+ **Final Verdict**: [last evaluation verdict]
+
+ **Commits created**:
+ [git log output]
+
+ **What to do next**:
+ - Review the changes: `git diff main`
+ - If satisfied, squash pipeline commits: `git rebase -i main` (combine the chore commits into meaningful ones)
+ - If not satisfied, run specific fixes: `/devlyn:team-resolve [specific issue]`
+ - For a final human review: `/devlyn:team-review`
+ ```
+
+ </pipeline_workflow>
@@ -1,3 +1,8 @@
1
+ ---
2
+ name: devlyn:evaluate
3
+ description: Independent evaluation of work quality by assembling a specialized evaluator team. Use this to grade work produced by another session, PR, branch, or changeset. Evaluators audit correctness, architecture, security, frontend quality, spec compliance, and test coverage. Use when the user says "evaluate this", "check the quality", "grade this work", "review the changes", or wants an independent quality assessment of recent implementation work.
4
+ ---
5
+
1
6
  Evaluate work produced by another session, PR, or changeset by assembling a specialized Agent Team. Each evaluator audits the work from a different quality dimension — correctness, architecture, error handling, type safety, and spec compliance — providing evidence-based findings with file:line references.
2
7
 
3
8
  <evaluation_target>
@@ -18,14 +23,16 @@ Before spawning any evaluators, understand what you're evaluating:
18
23
  - **"recent changes"** or no argument: Use `git diff HEAD` for unstaged changes, `git status` for new files
19
24
  - **Running session / live monitoring**: Take a baseline snapshot with `git status --short | wc -l`, then poll every 30-45 seconds for new changes using `git status` and `find . -newer <reference-file> -type f`. Report findings incrementally as changes appear.
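A minimal sketch of that live-monitoring loop, assuming a POSIX shell; the reference-file path is illustrative, not prescribed by this skill:

```shell
# Sketch of the live-monitoring poll described above (path is illustrative)
poll_for_changes() {
  touch /tmp/eval-baseline-ref          # reference timestamp for find -newer
  while true; do
    sleep 30                            # poll every 30-45 seconds
    git status --short                  # tracked changes since the baseline
    find . -newer /tmp/eval-baseline-ref -type f -not -path './.git/*'
    touch /tmp/eval-baseline-ref        # advance the reference point
  done
}
```

The loop only surfaces changed paths; the evaluator still reads the files and reports findings incrementally as they appear.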
20
25
 
21
- 2. Build the evaluation baseline:
26
+ 2. **Check for done criteria**: Read `.claude/done-criteria.md` if it exists. This file contains testable success criteria written by the generator (e.g., `/devlyn:team-resolve` Phase 1.5). When present, it is the primary grading rubric — every criterion in it must be verified. When absent, fall back to the evaluation checklists below.
27
+
28
+ 3. Build the evaluation baseline:
22
29
  - Run `git status --short` to see all changed and new files
23
30
  - Run `git diff --stat` for a change summary
24
31
  - Read all changed/new files in parallel (use parallel tool calls)
25
32
  - If a spec file exists (HANDOFF.md, RFC, issue), read it to understand intent
26
33
 
27
- 3. Classify the work using the evaluation matrix below
28
- 4. Decide which evaluators to spawn (minimum viable team)
34
+ 4. Classify the work using the evaluation matrix below
35
+ 5. Decide which evaluators to spawn (minimum viable team)
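The step-2 check can be sketched as follows; the file name comes from the text above, while counting criteria by their checkbox markers is an assumption about the file's format:

```shell
# Count verified vs. outstanding criteria in .claude/done-criteria.md
f=.claude/done-criteria.md
if [ -f "$f" ]; then
  done_n=$(grep -c '^- \[x\]' "$f")
  open_n=$(grep -c '^- \[ \]' "$f")
  echo "done criteria found: $done_n verified, $open_n outstanding"
else
  echo "no done-criteria.md; falling back to evaluation checklists"
fi
```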
29
36
 
30
37
  <evaluation_classification>
31
38
  Classify the work and select evaluators:
@@ -53,6 +60,68 @@ Classify the work and select evaluators:
53
60
  - Add: performance-evaluator
54
61
  </evaluation_classification>
55
62
 
63
+ <evaluator_calibration>
64
+ **CRITICAL — Read before grading.** Out of the box, you will be too lenient. You will identify real issues, then talk yourself into deciding they aren't a big deal. Fight this tendency.
65
+
66
+ **Calibration rule**: When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs a few minutes of review. The cost is asymmetric — always err toward strictness.
67
+
68
+ **Example: Borderline issue that IS a real problem**
69
+ ```javascript
70
+ // Evaluator found: catch block logs but doesn't surface error to user
71
+ try {
72
+ const data = await fetchUserProfile(id);
73
+ setProfile(data);
74
+ } catch (error) {
75
+ console.error('Failed to fetch profile:', error);
76
+ }
77
+ ```
78
+ **Wrong evaluation**: "MEDIUM — error is logged, which is acceptable for debugging."
79
+ **Correct evaluation**: "HIGH — user sees no feedback when profile fails to load. The UI stays in loading state forever. Must show error state with retry option. file:line evidence: `ProfilePage.tsx:42`"
80
+
81
+ **Why**: Logging is not error handling. The user's experience is broken. This is the #1 pattern evaluators incorrectly downgrade.
82
+
83
+ **Example: Borderline issue that is NOT a real problem**
84
+ ```javascript
85
+ // Evaluator found: variable could be const instead of let
86
+ let userName = getUserName(session);
87
+ return <Header name={userName} />;
88
+ ```
89
+ **Wrong evaluation**: "MEDIUM — should use const for immutable bindings."
90
+ **Correct evaluation**: "LOW (note only) — stylistic preference, linter will catch this. Not worth a finding."
91
+
92
+ **Why**: Don't waste evaluation cycles on linter-catchable style issues. Focus on behavior, not aesthetics.
93
+
94
+ **Example: Self-praise to avoid**
95
+ **Wrong evaluation**: "The error handling throughout this codebase is generally quite good, with most paths properly covered."
96
+ **Correct evaluation**: Evaluate each path individually. "3 of 7 async operations have proper error states. 4 are missing: `file:line`, `file:line`, `file:line`, `file:line`."
97
+
98
+ **Why**: Generalized praise hides specific gaps. Count the instances. Name the files.
99
+ </evaluator_calibration>
100
+
101
+ <product_quality_criteria>
102
+ In addition to technical checklists, evaluate these product quality dimensions. These catch issues that pass all technical checks but still produce mediocre software.
103
+
104
+ **Product Depth** (weight: HIGH):
105
+ Does this feel like a real product feature or a demo stub? Are the workflows complete end-to-end, or do they dead-end? Can a user actually accomplish their goal without workarounds?
106
+ - GOOD: User can create, edit, delete, and search — full CRUD with proper empty/error/loading states
107
+ - BAD: User can create but editing shows a form that doesn't save, search is hardcoded, delete has no confirmation
108
+
109
+ **Design Quality** (weight: MEDIUM — only when UI changes present):
110
+ Does the UI have a coherent visual identity? Do colors, typography, spacing, and layout work together as a system? Or is it generic defaults and mismatched components?
111
+ - GOOD: Consistent spacing scale, intentional color palette, clear visual hierarchy
112
+ - BAD: Mixed spacing values, default component library with no customization, no visual rhythm
113
+
114
+ **Craft** (weight: LOW — usually handled by baseline):
115
+ Technical execution of the UI — typography hierarchy, contrast ratios, alignment, responsive behavior. Most competent implementations pass here.
116
+
117
+ **Functionality** (weight: HIGH):
118
+ Can users understand what the interface does, find primary actions, and complete tasks without guessing? Are affordances clear? Is feedback immediate?
119
+ - GOOD: Primary action is visually prominent, form validation is inline, success/error feedback is instant
120
+ - BAD: Multiple equal-weight buttons with unclear labels, validation only on submit, no loading indicators
121
+
122
+ Include a **Product Quality Score** in the evaluation report: each dimension rated 1-5 with a one-line justification.
123
+ </product_quality_criteria>
124
+
56
125
  Announce to the user:
57
126
  ```
58
127
  Evaluation team assembling for: [summary of what's being evaluated]
@@ -228,6 +297,14 @@ LOW (note):
228
297
  4. For each catch block: is the error surfaced to the user or silently swallowed?
229
298
  5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
230
299
  6. Compare against existing components for pattern consistency
300
+ 7. **Live app testing** (when browser tools are available): If `mcp__claude-in-chrome__*` tools are available, test the running application directly:
301
+ - Navigate to the affected pages
302
+ - Click through the user flow end-to-end
303
+ - Test interactive elements (forms, buttons, modals, navigation)
304
+ - Verify loading, error, and empty states render correctly
305
+ - Screenshot any visual issues as evidence
306
+ - Test responsive behavior at mobile/tablet/desktop widths
307
+ If browser tools are NOT available, skip this step and note "Live testing skipped — no browser tools" in your deliverable.
231
308
 
232
309
  **Your deliverable**: Send a message to the team lead with:
233
310
  1. Component quality assessment for each new/changed component
@@ -235,6 +312,7 @@ LOW (note):
235
312
  3. Silent failure points that violate error handling policy
236
313
  4. React anti-patterns found
237
314
  5. Pattern consistency with existing components
315
+ 6. Live testing results (if browser tools were available): screenshots, interaction bugs, visual regressions
238
316
 
239
317
  Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Coordinate with api-contract-evaluator about client-server type alignment via SendMessage.
240
318
  </frontend_evaluator_prompt>
@@ -405,7 +483,31 @@ After receiving all evaluator findings:
405
483
 
406
484
  ## Phase 5: REPORT
407
485
 
408
- Present the evaluation report to the user.
486
+ 1. Present the evaluation report to the user (format below).
487
+
488
+ 2. **Write findings to `.claude/EVAL-FINDINGS.md`** for downstream consumption by other agents (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`). This file enables the feedback loop — the generator can read it and fix the issues without human relay.
489
+
490
+ ```markdown
491
+ # Evaluation Findings
492
+
493
+ ## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
494
+
495
+ ## Done Criteria Results (if done-criteria.md existed)
496
+ - [x] [criterion] — VERIFIED: [evidence]
497
+ - [ ] [criterion] — FAILED: [what's wrong, file:line]
498
+
499
+ ## Findings Requiring Action
500
+ ### CRITICAL
501
+ - `file:line` — [description] — Fix: [suggested approach]
502
+
503
+ ### HIGH
504
+ - `file:line` — [description] — Fix: [suggested approach]
505
+
506
+ ## Cross-Cutting Patterns
507
+ - [pattern description]
508
+ ```
509
+
510
+ 3. Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md`: the downstream consumers named in step 2 may still need to read them. The orchestrator or user is responsible for cleanup.
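As a sketch of how a downstream consumer might read the file, here is one way to branch on the verdict line; the verdict strings come from the template above, but the parsing itself is an assumption, not part of this skill's contract:

```shell
# Read the verdict line from .claude/EVAL-FINDINGS.md and branch on it
verdict=$(grep -m1 '^## Verdict:' .claude/EVAL-FINDINGS.md 2>/dev/null \
  | sed 's/^## Verdict: *//')
case "$verdict" in
  PASS) echo "pipeline may proceed" ;;
  *)    echo "fix loop required: $verdict" ;;
esac
```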
409
511
 
410
512
  ## Phase 6: CLEANUP
411
513
 
@@ -1,3 +1,8 @@
1
+ ---
2
+ name: devlyn:team-resolve
3
+ description: Multi-perspective issue resolution using a specialized agent team. Use this for complex bugs spanning multiple modules, feature implementations requiring diverse expertise, or any issue where a single perspective is insufficient. Assembles root-cause analysts, test engineers, security auditors, and other specialists as needed. Use when the user says "fix this bug", "resolve this issue", "team resolve", or describes a problem that needs investigation.
4
+ ---
5
+
1
6
  Resolve the following issue by assembling a specialized Agent Team to investigate, analyze, and fix it. Each teammate brings a different engineering perspective — like a real team tackling a hard problem together.
2
7
 
3
8
  <issue>
@@ -84,6 +89,34 @@ Issue type: [classification]
84
89
  Teammates: [list of roles being spawned and why each was chosen]
85
90
  ```
86
91
 
92
+ ## Phase 1.5: DEFINITION OF DONE (Sprint Contract)
93
+
94
+ Before any code is written, define what "done" looks like. This prevents self-evaluation bias and gives external evaluators (like `/devlyn:evaluate`) concrete criteria to grade against.
95
+
96
+ 1. Based on your Phase 1 investigation, write testable success criteria to `.claude/done-criteria.md`:
97
+
98
+ ```markdown
99
+ # Done Criteria: [issue summary]
100
+
101
+ ## Success Criteria
102
+ - [ ] [Specific, verifiable criterion — e.g., "User sees error toast when API returns 401, not blank screen"]
103
+ - [ ] [Each criterion must be testable: runnable test, observable behavior, or measurable metric]
104
+ - [ ] [Include edge cases discovered during investigation]
105
+
106
+ ## Out of Scope
107
+ - [Explicitly list what this fix does NOT address]
108
+
109
+ ## Verification Method
110
+ - [How to verify: test command, manual steps, or expected UI behavior]
111
+ ```
112
+
113
+ 2. Each criterion must be:
114
+ - **Verifiable** — a test can assert it, or a human can observe it in under 30 seconds
115
+ - **Specific** — "handles errors correctly" is too vague; "returns 400 with `{error: 'missing_field', field: 'email'}` when email is omitted" is specific
116
+ - **Scoped** — tied to THIS issue, not aspirational improvements
117
+
118
+ 3. This file serves as the contract between the generator (you) and any external evaluator. Do not skip it.
119
+
87
120
  ## Phase 2: TEAM ASSEMBLY
88
121
 
89
122
  Use the Agent Teams infrastructure:
@@ -517,13 +550,7 @@ Implementation order:
517
550
  3. Incorporate security constraints from the Security Auditor (if present)
518
551
  4. Respect architectural patterns flagged by the Architecture Reviewer (if present)
519
552
  5. Apply UX requirements from the UX Designer and Accessibility Auditor (if present)
520
- 6. **Quality gate** — before running tests, review your own code against `<code_quality_standards>`:
521
- - Is error handling graceful and user-facing (not silent, not raw)?
522
- - Are edge cases handled (nulls, empty, concurrent, partial data)?
523
- - Is the solution performant at scale (no O(n²), no unbounded loops)?
524
- - Does the code follow existing codebase patterns and idioms?
525
- - Are interfaces clean and types explicit (no `any`, no leaky abstractions)?
526
- - If any check fails, refactor BEFORE proceeding to tests
553
+ 6. **Update done-criteria.md** — mark each criterion you believe is satisfied. Do NOT self-evaluate quality — that is the evaluator's job. Your role is to implement, not to judge your own work.
527
554
  7. Run the failing test — if it still fails, revert and re-analyze (never layer fixes)
528
555
  8. Run the full test suite for regressions
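The criterion-marking in step 6 can be sketched as below, assuming the checkbox format from Phase 1.5; the helper name and the criterion text are illustrative:

```shell
# Flip a satisfied criterion from unchecked to checked in done-criteria.md
mark_done() {  # usage: mark_done "criterion text"
  sed -i.bak "s/^- \[ \] $1/- [x] $1/" .claude/done-criteria.md
}
if [ -f .claude/done-criteria.md ]; then
  mark_done "User sees error toast when API returns 401, not blank screen"
fi
```

Marking is bookkeeping only; whether each checked criterion actually holds is verified later by the independent evaluator.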
529
556
 
@@ -578,7 +605,8 @@ Present findings in this format:
578
605
  - [ ] Manual verification (if applicable)
579
606
 
580
607
  ### Recommendation
581
- Run `/devlyn:team-review` to validate the fix meets all quality standards with a full multi-perspective review.
608
+ - Run `/devlyn:evaluate` to grade this work against the done criteria with an independent evaluator team
609
+ - Or run `/devlyn:auto-resolve` next time for the fully automated pipeline (build → evaluate → fix loop → simplify → review → clean → docs)
582
610
 
583
611
  </team_resolution>
584
612
  </output_format>
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "devlyn-cli",
3
- "version": "0.7.1",
3
+ "version": "1.0.0",
4
4
  "description": "Claude Code configuration toolkit for teams",
5
5
  "bin": {
6
6
  "devlyn": "bin/devlyn.js"