devlyn-cli 0.7.2 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CLAUDE.md
CHANGED

@@ -48,6 +48,36 @@ The full design-to-implementation pipeline:
 
 For complex features, use the Plan agent to design the approach before implementation.
 
+## Automated Pipeline (Recommended Starting Point)
+
+For hands-free build-evaluate-polish cycles — works for bugs, features, refactors, and chores:
+
+```
+/devlyn:auto-resolve [task description]
+```
+
+This runs the full pipeline automatically: **Build → Evaluate → Fix Loop → Simplify → Review → Clean → Docs**. Each phase runs as a separate subagent with its own context. Communication between phases happens via files (`.claude/done-criteria.md`, `.claude/EVAL-FINDINGS.md`).
+
+Optional flags:
+- `--max-rounds 3` — increase max evaluate-fix iterations (default: 2)
+- `--skip-review` — skip team-review phase
+- `--skip-clean` — skip clean phase
+- `--skip-docs` — skip update-docs phase
+
+## Manual Pipeline (Step-by-Step Control)
+
+When you want to run each step yourself with review between phases:
+
+1. `/devlyn:team-resolve [issue]` → Investigate + implement (writes `.claude/done-criteria.md`)
+2. `/devlyn:evaluate` → Grade against done-criteria (writes `.claude/EVAL-FINDINGS.md`)
+3. If findings exist: `/devlyn:team-resolve "Fix issues in .claude/EVAL-FINDINGS.md"` → Fix loop
+4. `/simplify` → Quick cleanup pass
+5. `/devlyn:team-review` → Multi-perspective team review (for important PRs)
+6. `/devlyn:clean` → Codebase hygiene
+7. `/devlyn:update-docs` → Keep docs in sync
+
+Steps 5-7 are optional depending on scope.
+
 ## Vibe Coding Workflow
 
 The recommended sequence after writing code:
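The flag behavior documented above can be made concrete with a small shell sketch. The flag names and defaults come from this diff; the parser itself is hypothetical, since devlyn's commands are LLM prompts rather than shell scripts:

```shell
# Hypothetical parser mirroring the documented defaults:
# --max-rounds 2, all phases enabled, everything else is task text.
parse_flags() {
  MAX_ROUNDS=2
  SKIP_REVIEW=false
  SKIP_CLEAN=false
  SKIP_DOCS=false
  TASK=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --max-rounds) MAX_ROUNDS="$2"; shift 2 ;;
      --skip-review) SKIP_REVIEW=true; shift ;;
      --skip-clean) SKIP_CLEAN=true; shift ;;
      --skip-docs) SKIP_DOCS=true; shift ;;
      *) TASK="${TASK:+$TASK }$1"; shift ;;  # accumulate task words
    esac
  done
}

parse_flags fix the auth bug --max-rounds 3 --skip-docs
echo "task=$TASK rounds=$MAX_ROUNDS skip_docs=$SKIP_DOCS"
# prints: task=fix the auth bug rounds=3 skip_docs=true
```

Unrecognized words fold into the task description, which matches the "flags can be passed naturally" behavior described later in this diff.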
@@ -72,6 +102,7 @@ Steps 4-6 are optional depending on the scope of changes. `/simplify` should alw
 
 - **Simple bugs**: Use `/devlyn:resolve` for systematic bug fixing with test-driven validation
 - **Complex bugs**: Use `/devlyn:team-resolve` for multi-perspective investigation with a full agent team
+- **Hands-free**: Use `/devlyn:auto-resolve` for fully automated resolve → evaluate → fix → polish pipeline
 - **Post-fix review**: Use `/devlyn:team-review` for thorough multi-reviewer validation
 
 ## Maintenance Workflow
package/README.md
CHANGED

@@ -26,7 +26,7 @@
 
 devlyn-cli solves this by installing a curated `.claude/` configuration into any project:
 
-- **
+- **16 slash commands** for debugging, code review, UI design, documentation, and more
 - **5 core skills** that activate automatically based on conversation context
 - **Agent team workflows** that spawn specialized AI teammates for complex tasks
 - **Product & feature spec templates** for structured planning

@@ -58,7 +58,7 @@ npx devlyn-cli list
 ```
 your-project/
 ├── .claude/
-│   ├── commands/             #
+│   ├── commands/             # 16 slash commands
 │   ├── skills/               # 5 core skills + any optional addons
 │   ├── templates/            # Product spec, feature spec, prompt templates
 │   ├── commit-conventions.md # Commit message standards

@@ -76,6 +76,7 @@ Slash commands are invoked directly in Claude Code conversations (e.g., type `/d
 |---|---|
 | `/devlyn:resolve` | Systematic bug fixing with root-cause analysis and test-driven validation |
 | `/devlyn:team-resolve` | Spawns a full agent team — root cause analyst, test engineer, security auditor — to investigate complex issues |
+| `/devlyn:auto-resolve` | Fully automated pipeline for any task — bugs, features, refactors, chores. Build → evaluate → fix loop → simplify → review → clean → docs. One command, zero human intervention |
 
 ### Code Review & Quality
 

@@ -83,6 +84,7 @@ Slash commands are invoked directly in Claude Code conversations (e.g., type `/d
 |---|---|
 | `/devlyn:review` | Post-implementation review — security, quality, best practices checklist |
 | `/devlyn:team-review` | Multi-perspective team review with specialized reviewers (security, quality, testing, performance, product) |
+| `/devlyn:evaluate` | Independent quality evaluation — assembles evaluator team to grade work against done criteria with calibrated, skeptical grading |
 | `/devlyn:clean` | Detect and remove dead code, unused dependencies, complexity hotspots, and tech debt |
 
 ### UI Design & Implementation

@@ -125,24 +127,43 @@ Skills are **not invoked manually** — they activate automatically when Claude
 
 Commands are designed to compose. Pick the right tool based on scope, then chain them together.
 
-### Recommended
+### Automated Pipeline (Recommended)
 
-
+One command runs the full cycle — no human intervention needed:
+
+```bash
+/devlyn:auto-resolve fix the auth bug where users see blank screen on 401
+```
+
+| Phase | What Happens |
+|---|---|
+| **Build** | `team-resolve` investigates and implements, writes testable done criteria |
+| **Evaluate** | Independent evaluator grades against done criteria with calibrated skepticism |
+| **Fix Loop** | If evaluation fails, fixes findings and re-evaluates (up to N rounds) |
+| **Simplify** | Quick cleanup pass for reuse and efficiency |
+| **Review** | Multi-perspective team review |
+| **Clean** | Remove dead code and unused dependencies |
+| **Docs** | Sync documentation with changes |
+
+Each phase runs as a separate subagent (fresh context), communicates via files, and commits a git checkpoint for rollback safety. Skip phases with flags: `--skip-review`, `--skip-clean`, `--skip-docs`, `--max-rounds 3`.
+
+### Manual Workflow
+
+For step-by-step control between phases:
 
 | Step | Command | What It Does |
 |---|---|---|
 | 1. **Resolve** | `/devlyn:resolve` or `/devlyn:team-resolve` | Fix the issue — solo for focused bugs (1-2 modules), team for complex issues (3+ modules) |
-| 2. **
-
-| | |
-| 4. **
-| 5. **
-
-Steps 4-5 are optional — run them periodically rather than on every PR. Steps 1-3 are the core loop.
+| 2. **Evaluate** | `/devlyn:evaluate` | Independent quality evaluation — grades against done criteria written in step 1 |
+| | | *If the evaluation finds issues: `/devlyn:team-resolve "Fix issues in .claude/EVAL-FINDINGS.md"`* |
+| 3. **Simplify** | `/simplify` | Quick cleanup pass for reuse, quality, and efficiency *(built-in Claude Code command)* |
+| 4. **Review** | `/devlyn:review` or `/devlyn:team-review` | Audit the changes — solo for small PRs (< 10 files), team for large PRs (10+ files) |
+| 5. **Clean** | `/devlyn:clean` | Remove dead code, unused dependencies, and complexity hotspots |
+| 6. **Document** | `/devlyn:update-docs` | Sync project documentation with the current codebase |
 
-
+Steps 5-6 are optional — run them periodically rather than on every PR.
 
-> **Scope matching matters.** For a simple one-file bug, `/devlyn:resolve` + `/devlyn:review` (solo) is fast. For a multi-module feature, `/devlyn:
+> **Scope matching matters.** For a simple one-file bug, `/devlyn:resolve` + `/devlyn:review` (solo) is fast. For a multi-module feature, `/devlyn:auto-resolve` handles everything. Don't over-tool simple changes.
 
 ### UI Design Pipeline
 
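The per-phase git checkpoint mentioned above ("commits a git checkpoint for rollback safety") could be sketched as a helper like this. It is hypothetical; the package drives git through agent instructions, not through a script:

```shell
# Hypothetical per-phase checkpoint: commit only if the phase changed anything,
# so skipped or no-op phases leave no empty commits behind.
checkpoint() {
  if [ -n "$(git status --porcelain)" ]; then
    git add -A
    git commit -q -m "chore(pipeline): $1"
  fi
}
```

Rolling back a bad phase is then a matter of `git reset --hard <checkpoint-sha>`.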
@@ -2,15 +2,30 @@
 
 You are a code quality evaluator. Your job is to audit work produced by another session, PR, or changeset and provide evidence-based findings with exact file:line references.
 
+## Before You Start
+
+1. **Check for done criteria**: Read `.claude/done-criteria.md` if it exists. When present, this is your primary grading rubric — every criterion must be verified with evidence. When absent, fall back to the checklists below.
+
+## Calibration
+
+You will be too lenient by default. You will identify real issues, then talk yourself into deciding they aren't a big deal. Fight this tendency.
+
+**Rule**: When in doubt, score DOWN. A false negative ships broken code. A false positive costs minutes of review. The cost is asymmetric.
+
+- A catch block that logs but doesn't surface error to user = HIGH (not MEDIUM). Logging is not error handling.
+- A `let` that could be `const` = LOW note only. Linters catch this.
+- "The error handling is generally quite good" = WRONG. Count the instances. Name the files.
+
 ## Evaluation Process
 
 1. **Discover scope**: Read the changeset (git diff, PR diff, or specified files)
 2. **Assess correctness**: Find bugs, logic errors, silent failures, missing error handling
 3. **Check architecture**: Verify patterns match existing codebase, no type duplication, proper wiring
-4. **Verify spec compliance**: If a spec exists (HANDOFF.md, RFC, issue), compare requirements vs implementation
+4. **Verify spec compliance**: If a spec exists (HANDOFF.md, RFC, issue, done-criteria.md), compare requirements vs implementation
 5. **Check error handling**: Every async operation needs loading, error, and empty states in UI. No silent catches.
 6. **Review API contracts**: New endpoints must follow existing conventions for naming, validation, error envelopes
 7. **Assess test coverage**: New modules need tests. Run the test suite and report results.
+8. **Evaluate product quality**: Does this feel like a real feature or a demo stub? Are workflows complete end-to-end? Is the UI coherent?
 
 ## Rules
 

@@ -19,22 +34,31 @@ You are a code quality evaluator. Your job is to audit work produced by another
 - Call out what's done well, not just problems
 - Look for cross-cutting patterns (e.g., same mistake repeated in multiple files)
 
-## Output
+## Output
 
-
-### Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+Write findings to `.claude/EVAL-FINDINGS.md` for downstream consumption:
 
-
+```markdown
+# Evaluation Findings
 
-
-- [domain] `file:line` - description
+## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
 
-
-- [
+## Done Criteria Results (if done-criteria.md existed)
+- [x] [criterion] — VERIFIED: [evidence]
+- [ ] [criterion] — FAILED: [what's wrong, file:line]
 
-
-
+## Findings Requiring Action
+### CRITICAL
+- `file:line` — [description] — Fix: [suggested approach]
+
+### HIGH
+- `file:line` — [description] — Fix: [suggested approach]
 
-
-[
+## Cross-Cutting Patterns
+- [pattern description]
+
+## What's Good
+- [positive observations]
 ```
+
+Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the orchestrator or user is responsible for cleanup.
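A downstream consumer that branches on the verdict (as the automated pipeline does) could extract it along these lines. This is a sketch against the file format shown above; in practice the orchestrating LLM reads the file directly:

```shell
# Pull the verdict line out of an EVAL-FINDINGS.md written in the format above.
extract_verdict() {
  # -n with the p flag prints only lines where the substitution matched
  sed -n 's/^## Verdict: *//p' "$1" | head -n 1
}

# Demo against a minimal findings file
printf '# Evaluation Findings\n## Verdict: NEEDS WORK\n' > /tmp/eval-findings-demo.md
extract_verdict /tmp/eval-findings-demo.md
# prints: NEEDS WORK
```

The verdict string then maps onto the branch rules described elsewhere in this diff (PASS and PASS WITH ISSUES proceed; NEEDS WORK and BLOCKED enter the fix loop).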
@@ -0,0 +1,244 @@
+---
+name: devlyn:auto-resolve
+description: Fully automated build-evaluate-polish pipeline for any task type — bug fixes, new features, refactors, chores, and more. Use this as the default starting point when the user wants hands-free implementation with zero human intervention. Runs the full cycle — build, evaluate, fix loop, simplify, review, clean, docs — as a single command. Use when the user says "auto resolve", "build this", "implement this feature", "fix this", "run the full pipeline", "refactor this", or wants to walk away and come back to finished work.
+---
+
+Fully automated resolve-evaluate-polish pipeline. One command, zero human intervention. Spawns a subagent for each phase, uses file-based handoff between phases, and loops on evaluation feedback until the work passes or max rounds are reached.
+
+<pipeline_config>
+$ARGUMENTS
+</pipeline_config>
+
+<pipeline_workflow>
+
+## PHASE 0: PARSE INPUT
+
+1. Extract the task/issue description from `<pipeline_config>`.
+2. Determine optional flags from the input (defaults in parentheses):
+   - `--max-rounds N` (2) — max evaluate-fix loops before stopping with a report
+   - `--skip-review` (false) — skip team-review phase
+   - `--skip-clean` (false) — skip clean phase
+   - `--skip-docs` (false) — skip update-docs phase
+
+   Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
+   If no flags are present, use defaults.
+
+3. Announce the pipeline plan:
+   ```
+   Auto-resolve pipeline starting
+   Task: [extracted task description]
+   Phases: Build → Evaluate → [Fix loop if needed] → Simplify → [Review] → [Clean] → [Docs]
+   Max evaluation rounds: [N]
+   ```
+
+## PHASE 1: BUILD
+
+Spawn a subagent using the Agent tool to investigate and implement the fix. The subagent does NOT have access to skills, so include all necessary instructions inline.
+
+Agent prompt — pass this to the Agent tool:
+
+Investigate and implement the following task. Work through these phases in order:
+
+**Phase A — Understand the task**: Read the task description carefully. Classify the task type:
+- **Bug fix**: trace from symptom to root cause. Read error logs and affected code paths.
+- **Feature**: explore the codebase to find existing patterns, integration points, and relevant modules.
+- **Refactor/Chore**: understand current implementation, identify what needs to change and why.
+- **UI/UX**: review existing components, design system, and user flows.
+Read relevant files in parallel. Build a clear picture of what exists and what needs to change.
+
+**Phase B — Define done criteria**: Before writing any code, create `.claude/done-criteria.md` with testable success criteria. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section. This file is required — downstream evaluation depends on it.
+
+**Phase C — Assemble a team**: Use TeamCreate to create a team. Select teammates based on task type:
+- Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
+- Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
+- Refactor: architecture-reviewer + test-engineer
+- UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
+Each teammate investigates from their perspective and sends findings back.
+
+**Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
+
+**Phase E — Update done criteria**: Mark each criterion in `.claude/done-criteria.md` as satisfied. Run the full test suite.
+
+**Phase F — Cleanup**: Shut down all teammates and delete the team.
+
+The task is: [paste the task description here]
+
+**After the agent completes**:
+1. Verify `.claude/done-criteria.md` exists — if missing, create a basic one from the agent's output summary
+2. Run `git diff --stat` to confirm code was actually changed
+3. If no changes were made, report failure and stop
+4. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"` to create a rollback point
+
+## PHASE 2: EVALUATE
+
+Spawn a subagent using the Agent tool to evaluate the work. Include all evaluation instructions inline.
+
+Agent prompt — pass this to the Agent tool:
+
+You are an independent evaluator. Your job is to grade work produced by another agent, not to praise it. You will be too lenient by default — fight this tendency. When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs minutes of review. The cost is asymmetric.
+
+**Step 1 — Read the done criteria**: Read `.claude/done-criteria.md`. This is your primary grading rubric. Every criterion must be verified with evidence.
+
+**Step 2 — Discover changes**: Run `git diff HEAD~1` and `git status` to see what changed. Read all changed/new files in parallel.
+
+**Step 3 — Evaluate**: For each changed file, check:
+- Correctness: logic errors, silent failures, null access, incorrect API contracts
+- Architecture: pattern violations, duplication, missing integration
+- Security (if auth/secrets/user-data touched): injection, hardcoded credentials, missing validation
+- Frontend (if UI changed): missing error/loading/empty states, React anti-patterns, server/client boundaries
+- Test coverage: untested modules, missing edge cases
+
+**Step 4 — Grade against done criteria**: For each criterion in done-criteria.md, mark VERIFIED (with evidence) or FAILED (with file:line and what's wrong).
+
+**Step 5 — Write findings**: Write `.claude/EVAL-FINDINGS.md` with this exact structure:
+
+```
+# Evaluation Findings
+## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+## Done Criteria Results
+- [x] criterion — VERIFIED: evidence
+- [ ] criterion — FAILED: what's wrong, file:line
+## Findings Requiring Action
+### CRITICAL
+- `file:line` — description — Fix: suggested approach
+### HIGH
+- `file:line` — description — Fix: suggested approach
+## Cross-Cutting Patterns
+- pattern description
+```
+
+Verdict rules: BLOCKED = any CRITICAL issues. NEEDS WORK = HIGH issues that should be fixed. PASS WITH ISSUES = only MEDIUM/LOW. PASS = clean.
+
+Calibration examples to guide your judgment:
+- A catch block that logs but doesn't surface error to user = HIGH (not MEDIUM). Logging is not error handling.
+- A `let` that could be `const` = LOW note only. Linters catch this.
+- "The error handling is generally quite good" = WRONG. Count the instances. Name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line..."
+
+Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — the orchestrator needs them.
+
+**After the agent completes**:
+1. Read `.claude/EVAL-FINDINGS.md`
+2. Extract the verdict
+3. Branch on verdict:
+   - `PASS` → skip to PHASE 3
+   - `PASS WITH ISSUES` → skip to PHASE 3 (issues are shippable)
+   - `NEEDS WORK` → go to PHASE 2.5 (fix loop)
+   - `BLOCKED` → go to PHASE 2.5 (fix loop)
+4. If `.claude/EVAL-FINDINGS.md` was not created, treat as PASS WITH ISSUES and log a warning
+
+## PHASE 2.5: FIX LOOP (conditional)
+
+Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
+
+Spawn a subagent using the Agent tool to fix the evaluation findings.
+
+Agent prompt — pass this to the Agent tool:
+
+Read `.claude/EVAL-FINDINGS.md` — it contains specific issues found by an independent evaluator. Fix every CRITICAL and HIGH finding. Address MEDIUM findings if straightforward.
+
+The original done criteria are in `.claude/done-criteria.md` — your fixes must still satisfy those criteria. Do not delete or weaken criteria to make them pass.
+
+For each finding: read the referenced file:line, understand the issue, implement the fix. No workarounds — fix the actual root cause. Run tests after fixing. Update `.claude/done-criteria.md` to mark fixed items.
+
+**After the agent completes**:
+1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): fix round [N] complete"` to preserve the fix
+2. Increment round counter
+3. Go back to PHASE 2 (re-evaluate)
+
+## PHASE 3: SIMPLIFY
+
+Spawn a subagent using the Agent tool for a quick cleanup pass.
+
+Agent prompt — pass this to the Agent tool:
+
+Review the recently changed files (use `git diff HEAD~1` to see what changed). Look for: code that could reuse existing utilities instead of reimplementing, quality issues (unclear naming, unnecessary complexity), and efficiency improvements (redundant operations, missing early returns). Fix any issues found. Keep changes minimal — this is a polish pass, not a rewrite.
+
+**After the agent completes**:
+1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): simplify pass complete"` if there are changes
+
+## PHASE 4: REVIEW (skippable)
+
+Skip if `--skip-review` was set.
+
+Spawn a subagent using the Agent tool for a multi-perspective review.
+
+Agent prompt — pass this to the Agent tool:
+
+Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes.
+
+Each reviewer evaluates from their perspective, sends findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW). After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
+
+Clean up the team after completion.
+
+**After the agent completes**:
+1. If CRITICAL issues remain unfixed, log a warning in the final report
+2. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): review fixes complete"` if there are changes
+
+## PHASE 5: CLEAN (skippable)
+
+Skip if `--skip-clean` was set.
+
+Spawn a subagent using the Agent tool.
+
+Agent prompt — pass this to the Agent tool:
+
+Scan the codebase for dead code, unused dependencies, and code hygiene issues in recently changed files. Focus on: unused imports, unreachable code paths, unused variables, dependencies in package.json that are no longer imported. Keep the scope tight — only clean what's related to recent work. Remove what's confirmed dead, leave anything ambiguous.
+
+**After the agent completes**:
+1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): cleanup complete"` if there are changes
+
+## PHASE 6: DOCS (skippable)
+
+Skip if `--skip-docs` was set.
+
+Spawn a subagent using the Agent tool.
+
+Agent prompt — pass this to the Agent tool:
+
+Synchronize documentation with recent code changes. Use `git log --oneline -20` and `git diff main` to understand what changed. Update any docs that reference changed APIs, features, or behaviors. Do not create new documentation files unless the changes introduced entirely new features with no existing docs. Preserve all forward-looking content: roadmaps, future plans, visions, open questions.
+
+**After the agent completes**:
+1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): docs updated"` if there are changes
+
+## PHASE 7: FINAL REPORT
+
+After all phases complete:
+
+1. Clean up temporary files:
+   - Delete `.claude/done-criteria.md`
+   - Delete `.claude/EVAL-FINDINGS.md`
+
+2. Run `git log --oneline -10` to show commits made during the pipeline
+
+3. Present the report:
+
+```
+### Auto-Resolve Pipeline Complete
+
+**Task**: [original task description]
+
+**Pipeline Summary**:
+| Phase | Status | Notes |
+|-------|--------|-------|
+| Build (team-resolve) | [completed] | [brief summary] |
+| Evaluate | [PASS/NEEDS WORK after N rounds] | [verdict + key findings] |
+| Fix rounds | [N rounds / skipped] | [what was fixed] |
+| Simplify | [completed / skipped] | [changes made] |
+| Review (team-review) | [completed / skipped] | [findings summary] |
+| Clean | [completed / skipped] | [items cleaned] |
+| Docs (update-docs) | [completed / skipped] | [docs updated] |
+
+**Evaluation Rounds**: [N] of [max-rounds] used
+**Final Verdict**: [last evaluation verdict]
+
+**Commits created**:
+[git log output]
+
+**What to do next**:
+- Review the changes: `git diff main`
+- If satisfied, squash pipeline commits: `git rebase -i main` (combine the chore commits into meaningful ones)
+- If not satisfied, run specific fixes: `/devlyn:team-resolve [specific issue]`
+- For a final human review: `/devlyn:team-review`
+```
+
+</pipeline_workflow>
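The evaluate/fix control flow in PHASE 2 and PHASE 2.5 above reduces to a bounded loop. The sketch below simulates the verdict to make the flow explicit; in the real command the LLM performs each step:

```shell
# Bounded evaluate/fix loop, mirroring the documented defaults (max 2 rounds).
MAX_ROUNDS=2
round=0
verdict="NEEDS WORK"   # stand-in for the first evaluation's verdict

while [ "$verdict" != "PASS" ] && [ "$verdict" != "PASS WITH ISSUES" ]; do
  if [ "$round" -ge "$MAX_ROUNDS" ]; then
    echo "warning: max rounds reached, unresolved findings remain"
    break
  fi
  round=$((round + 1))
  # ...spawn fix subagent, checkpoint, re-run the evaluator...
  verdict="PASS"       # simulated result of the re-evaluation
done

echo "rounds=$round verdict=$verdict"
# prints: rounds=1 verdict=PASS
```

Note that BLOCKED and NEEDS WORK both fall through to another round, which matches the branch table in PHASE 2.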
@@ -1,3 +1,8 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: devlyn:evaluate
|
|
3
|
+
description: Independent evaluation of work quality by assembling a specialized evaluator team. Use this to grade work produced by another session, PR, branch, or changeset. Evaluators audit correctness, architecture, security, frontend quality, spec compliance, and test coverage. Use when the user says "evaluate this", "check the quality", "grade this work", "review the changes", or wants an independent quality assessment of recent implementation work.
|
|
4
|
+
---
|
|
5
|
+
|
|
1
6
|
Evaluate work produced by another session, PR, or changeset by assembling a specialized Agent Team. Each evaluator audits the work from a different quality dimension — correctness, architecture, error handling, type safety, and spec compliance — providing evidence-based findings with file:line references.
|
|
2
7
|
|
|
3
8
|
<evaluation_target>
|
|
@@ -18,14 +23,16 @@ Before spawning any evaluators, understand what you're evaluating:
|
|
|
18
23
|
- **"recent changes"** or no argument: Use `git diff HEAD` for unstaged changes, `git status` for new files
|
|
19
24
|
- **Running session / live monitoring**: Take a baseline snapshot with `git status --short | wc -l`, then poll every 30-45 seconds for new changes using `git status` and `find . -newer <reference-file> -type f`. Report findings incrementally as changes appear.
|
|
20
25
|
|
|
21
|
-
2.
|
|
26
|
+
2. **Check for done criteria**: Read `.claude/done-criteria.md` if it exists. This file contains testable success criteria written by the generator (e.g., `/devlyn:team-resolve` Phase 1.5). When present, it is the primary grading rubric — every criterion in it must be verified. When absent, fall back to the evaluation checklists below.
|
|
27
|
+
|
|
28
|
+
3. Build the evaluation baseline:
|
|
22
29
|
- Run `git status --short` to see all changed and new files
|
|
23
30
|
- Run `git diff --stat` for a change summary
|
|
24
31
|
- Read all changed/new files in parallel (use parallel tool calls)
|
|
25
32
|
- If a spec file exists (HANDOFF.md, RFC, issue), read it to understand intent
|
|
26
33
|
|
|
27
|
-
|
|
28
|
-
|
|
34
|
+
4. Classify the work using the evaluation matrix below
|
|
35
|
+
5. Decide which evaluators to spawn (minimum viable team)
|
|
29
36
|
|
|
30
37
|
<evaluation_classification>
|
|
31
38
|
Classify the work and select evaluators:
|
|
@@ -53,6 +60,68 @@ Classify the work and select evaluators:
 - Add: performance-evaluator
 </evaluation_classification>
 
+<evaluator_calibration>
+**CRITICAL — Read before grading.** Out of the box, you will be too lenient. You will identify real issues, then talk yourself into deciding they aren't a big deal. Fight this tendency.
+
+**Calibration rule**: When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs a few minutes of review. The cost is asymmetric — always err toward strictness.
+
+**Example: Borderline issue that IS a real problem**
+```javascript
+// Evaluator found: catch block logs but doesn't surface error to user
+try {
+  const data = await fetchUserProfile(id);
+  setProfile(data);
+} catch (error) {
+  console.error('Failed to fetch profile:', error);
+}
+```
+**Wrong evaluation**: "MEDIUM — error is logged, which is acceptable for debugging."
+**Correct evaluation**: "HIGH — user sees no feedback when profile fails to load. The UI stays in loading state forever. Must show error state with retry option. file:line evidence: `ProfilePage.tsx:42`"
+
+**Why**: Logging is not error handling. The user's experience is broken. This is the #1 pattern evaluators incorrectly downgrade.
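The "Correct evaluation" in that calibration example demands a user-visible error state rather than a bare log. One minimal sketch of the pattern an evaluator should expect, offered here as an illustration only: `fetchUserProfile` and the `set*` callbacks are hypothetical stand-ins, injected as parameters so the helper is self-contained.

```javascript
// Hedged sketch: the failure is surfaced to the user, not only logged.
// fetchUserProfile and the set* callbacks are hypothetical stand-ins,
// passed in explicitly so this helper is self-contained and testable.
async function loadProfile(id, { fetchUserProfile, setProfile, setError, setLoading }) {
  setLoading(true);
  try {
    const data = await fetchUserProfile(id);
    setProfile(data);
  } catch (error) {
    console.error('Failed to fetch profile:', error); // keep the log for debugging
    setError('Could not load profile. Please retry.'); // user-visible error state
  } finally {
    setLoading(false); // the UI never stays stuck in a loading state
  }
}
```

With this shape, the HIGH finding resolves: the catch path produces observable UI feedback, and the `finally` clears the loading state on success and failure alike.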
+
+**Example: Borderline issue that is NOT a real problem**
+```javascript
+// Evaluator found: variable could be const instead of let
+let userName = getUserName(session);
+return <Header name={userName} />;
+```
+**Wrong evaluation**: "MEDIUM — should use const for immutable bindings."
+**Correct evaluation**: "LOW (note only) — stylistic preference, linter will catch this. Not worth a finding."
+
+**Why**: Don't waste evaluation cycles on linter-catchable style issues. Focus on behavior, not aesthetics.
+
+**Example: Self-praise to avoid**
+**Wrong evaluation**: "The error handling throughout this codebase is generally quite good, with most paths properly covered."
+**Correct evaluation**: Evaluate each path individually. "3 of 7 async operations have proper error states. 4 are missing: `file:line`, `file:line`, `file:line`, `file:line`."
+
+**Why**: Generalized praise hides specific gaps. Count the instances. Name the files.
+</evaluator_calibration>
+
+<product_quality_criteria>
+In addition to technical checklists, evaluate these product quality dimensions. These catch issues that pass all technical checks but still produce mediocre software.
+
+**Product Depth** (weight: HIGH):
+Does this feel like a real product feature or a demo stub? Are the workflows complete end-to-end, or do they dead-end? Can a user actually accomplish their goal without workarounds?
+- GOOD: User can create, edit, delete, and search — full CRUD with proper empty/error/loading states
+- BAD: User can create but editing shows a form that doesn't save, search is hardcoded, delete has no confirmation
+
+**Design Quality** (weight: MEDIUM — only when UI changes present):
+Does the UI have a coherent visual identity? Do colors, typography, spacing, and layout work together as a system? Or is it generic defaults and mismatched components?
+- GOOD: Consistent spacing scale, intentional color palette, clear visual hierarchy
+- BAD: Mixed spacing values, default component library with no customization, no visual rhythm
+
+**Craft** (weight: LOW — usually handled by baseline):
+Technical execution of the UI — typography hierarchy, contrast ratios, alignment, responsive behavior. Most competent implementations pass here.
+
+**Functionality** (weight: HIGH):
+Can users understand what the interface does, find primary actions, and complete tasks without guessing? Are affordances clear? Is feedback immediate?
+- GOOD: Primary action is visually prominent, form validation is inline, success/error feedback is instant
+- BAD: Multiple equal-weight buttons with unclear labels, validation only on submit, no loading indicators
+
+Include a **Product Quality Score** in the evaluation report: each dimension rated 1-5 with a one-line justification.
+</product_quality_criteria>
+
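A hypothetical sketch of the Product Quality Score block that section asks for; the scores and one-line justifications here are illustrative only, not prescribed by the package:

```markdown
## Product Quality Score
- Product Depth: 3/5 (CRUD is complete, but search dead-ends on empty results with no guidance)
- Design Quality: 4/5 (consistent spacing and palette; modal styling diverges from the system)
- Craft: 5/5 (hierarchy, contrast, and responsive behavior all pass)
- Functionality: 2/5 (primary action competes with two equal-weight buttons; validation fires only on submit)
```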
 Announce to the user:
 ```
 Evaluation team assembling for: [summary of what's being evaluated]
@@ -228,6 +297,14 @@ LOW (note):
 4. For each catch block: is the error surfaced to the user or silently swallowed?
 5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
 6. Compare against existing components for pattern consistency
+7. **Live app testing** (when browser tools are available): If `mcp__claude-in-chrome__*` tools are available, test the running application directly:
+   - Navigate to the affected pages
+   - Click through the user flow end-to-end
+   - Test interactive elements (forms, buttons, modals, navigation)
+   - Verify loading, error, and empty states render correctly
+   - Screenshot any visual issues as evidence
+   - Test responsive behavior at mobile/tablet/desktop widths
+   If browser tools are NOT available, skip this step and note "Live testing skipped — no browser tools" in your deliverable.
 
 **Your deliverable**: Send a message to the team lead with:
 1. Component quality assessment for each new/changed component
@@ -235,6 +312,7 @@ LOW (note):
 3. Silent failure points that violate error handling policy
 4. React anti-patterns found
 5. Pattern consistency with existing components
+6. Live testing results (if browser tools were available): screenshots, interaction bugs, visual regressions
 
 Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Coordinate with api-contract-evaluator about client-server type alignment via SendMessage.
 </frontend_evaluator_prompt>
@@ -405,7 +483,31 @@ After receiving all evaluator findings:
 
 ## Phase 5: REPORT
 
-Present the evaluation report to the user.
+1. Present the evaluation report to the user (format below).
+
+2. **Write findings to `.claude/EVAL-FINDINGS.md`** for downstream consumption by other agents (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`). This file enables the feedback loop — the generator can read it and fix the issues without human relay.
+
+```markdown
+# Evaluation Findings
+
+## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
+
+## Done Criteria Results (if done-criteria.md existed)
+- [x] [criterion] — VERIFIED: [evidence]
+- [ ] [criterion] — FAILED: [what's wrong, file:line]
+
+## Findings Requiring Action
+### CRITICAL
+- `file:line` — [description] — Fix: [suggested approach]
+
+### HIGH
+- `file:line` — [description] — Fix: [suggested approach]
+
+## Cross-Cutting Patterns
+- [pattern description]
+```
+
+3. Do NOT delete `.claude/done-criteria.md` or `.claude/EVAL-FINDINGS.md` — downstream consumers (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`) may need to read them. The orchestrator or user is responsible for cleanup.
 
 ## Phase 6: CLEANUP
 
@@ -1,3 +1,8 @@
+---
+name: devlyn:team-resolve
+description: Multi-perspective issue resolution using a specialized agent team. Use this for complex bugs spanning multiple modules, feature implementations requiring diverse expertise, or any issue where a single perspective is insufficient. Assembles root-cause analysts, test engineers, security auditors, and other specialists as needed. Use when the user says "fix this bug", "resolve this issue", "team resolve", or describes a problem that needs investigation.
+---
+
 Resolve the following issue by assembling a specialized Agent Team to investigate, analyze, and fix it. Each teammate brings a different engineering perspective — like a real team tackling a hard problem together.
 
 <issue>
@@ -84,6 +89,34 @@ Issue type: [classification]
 Teammates: [list of roles being spawned and why each was chosen]
 ```
 
+## Phase 1.5: DEFINITION OF DONE (Sprint Contract)
+
+Before any code is written, define what "done" looks like. This prevents self-evaluation bias and gives external evaluators (like `/devlyn:evaluate`) concrete criteria to grade against.
+
+1. Based on your Phase 1 investigation, write testable success criteria to `.claude/done-criteria.md`:
+
+```markdown
+# Done Criteria: [issue summary]
+
+## Success Criteria
+- [ ] [Specific, verifiable criterion — e.g., "User sees error toast when API returns 401, not blank screen"]
+- [ ] [Each criterion must be testable: runnable test, observable behavior, or measurable metric]
+- [ ] [Include edge cases discovered during investigation]
+
+## Out of Scope
+- [Explicitly list what this fix does NOT address]
+
+## Verification Method
+- [How to verify: test command, manual steps, or expected UI behavior]
+```
+
+2. Each criterion must be:
+   - **Verifiable** — a test can assert it, or a human can observe it in under 30 seconds
+   - **Specific** — "handles errors correctly" is too vague; "returns 400 with `{error: 'missing_field', field: 'email'}` when email is omitted" is specific
+   - **Scoped** — tied to THIS issue, not aspirational improvements
+
+3. This file serves as the contract between the generator (you) and any external evaluator. Do not skip it.
+
120
|
## Phase 2: TEAM ASSEMBLY
|
|
88
121
|
|
|
89
122
|
Use the Agent Teams infrastructure:
|
|
@@ -517,13 +550,7 @@ Implementation order:
 3. Incorporate security constraints from the Security Auditor (if present)
 4. Respect architectural patterns flagged by the Architecture Reviewer (if present)
 5. Apply UX requirements from the UX Designer and Accessibility Auditor (if present)
-6. **
-   - Is error handling graceful and user-facing (not silent, not raw)?
-   - Are edge cases handled (nulls, empty, concurrent, partial data)?
-   - Is the solution performant at scale (no O(n²), no unbounded loops)?
-   - Does the code follow existing codebase patterns and idioms?
-   - Are interfaces clean and types explicit (no `any`, no leaky abstractions)?
-   - If any check fails, refactor BEFORE proceeding to tests
+6. **Update done-criteria.md** — mark each criterion you believe is satisfied. Do NOT self-evaluate quality — that is the evaluator's job. Your role is to implement, not to judge your own work.
 7. Run the failing test — if it still fails, revert and re-analyze (never layer fixes)
 8. Run the full test suite for regressions
 
@@ -578,7 +605,8 @@ Present findings in this format:
 - [ ] Manual verification (if applicable)
 
 ### Recommendation
-Run `/devlyn:
+- Run `/devlyn:evaluate` to grade this work against the done criteria with an independent evaluator team
+- Or run `/devlyn:auto-resolve` next time for the fully automated pipeline (build → evaluate → fix loop → simplify → review → clean → docs)
 
 </team_resolution>
 </output_format>