devlyn-cli 1.15.0 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/AGENTS.md +104 -0
- package/CLAUDE.md +135 -21
- package/README.md +43 -125
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
- package/benchmark/auto-resolve/README.md +114 -0
- package/benchmark/auto-resolve/RUBRIC.md +162 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
- package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
- package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
- package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
- package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
- package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
- package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
- package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
- package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
- package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
- package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
- package/benchmark/auto-resolve/scripts/judge.sh +359 -0
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
- package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
- package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
- package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
- package/bin/devlyn.js +175 -17
- package/config/skills/_shared/adapters/README.md +64 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
- package/config/skills/_shared/adapters/opus-4-7.md +29 -0
- package/config/skills/{devlyn:auto-resolve/scripts → _shared}/archive_run.py +26 -0
- package/config/skills/_shared/codex-config.md +54 -0
- package/config/skills/_shared/codex-monitored.sh +141 -0
- package/config/skills/_shared/engine-preflight.md +35 -0
- package/config/skills/_shared/expected.schema.json +93 -0
- package/config/skills/_shared/pair-plan-schema.md +298 -0
- package/config/skills/_shared/runtime-principles.md +110 -0
- package/config/skills/_shared/spec-verify-check.py +519 -0
- package/config/skills/devlyn:ideate/SKILL.md +99 -429
- package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
- package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
- package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
- package/config/skills/devlyn:resolve/SKILL.md +172 -184
- package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
- package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
- package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
- package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
- package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
- package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
- package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
- package/{config/skills → optional-skills}/devlyn:reap/SKILL.md +1 -0
- package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
- package/package.json +12 -2
- package/scripts/lint-skills.sh +431 -0
- package/config/skills/devlyn:auto-resolve/SKILL.md +0 -252
- package/config/skills/devlyn:auto-resolve/evals/evals.json +0 -21
- package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +0 -42
- package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -130
- package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -82
- package/config/skills/devlyn:auto-resolve/references/findings-schema.md +0 -103
- package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +0 -54
- package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +0 -45
- package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +0 -84
- package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +0 -114
- package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +0 -201
- package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +0 -96
- package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
- package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
- package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
- package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
- package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
- package/config/skills/devlyn:clean/SKILL.md +0 -285
- package/config/skills/devlyn:design-ui/SKILL.md +0 -351
- package/config/skills/devlyn:discover-product/SKILL.md +0 -124
- package/config/skills/devlyn:evaluate/SKILL.md +0 -564
- package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
- package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
- package/config/skills/devlyn:ideate/references/codex-critic-template.md +0 -42
- package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
- package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
- package/config/skills/devlyn:preflight/SKILL.md +0 -355
- package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
- package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -86
- package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
- package/config/skills/devlyn:product-spec/SKILL.md +0 -603
- package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
- package/config/skills/devlyn:review/SKILL.md +0 -161
- package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
- package/config/skills/devlyn:team-review/SKILL.md +0 -493
- package/config/skills/devlyn:update-docs/SKILL.md +0 -463
- package/config/skills/workflow-routing/SKILL.md +0 -73
- package/{config/skills → optional-skills}/devlyn:reap/scripts/reap.sh +0 -0
- package/{config/skills → optional-skills}/devlyn:reap/scripts/scan.sh +0 -0
--- a/package/config/skills/devlyn:discover-product/SKILL.md
+++ /dev/null
@@ -1,124 +0,0 @@
-<role>
-You are a Product Analyst specializing in codebase archaeology. You read implementations to understand what a product actually does — not what it claims to do — and translate that into clear, user-oriented documentation.
-</role>
-
-Scan the codebase to generate a feature-oriented product document.
-
-<procedure>
-1. Read project metadata files in parallel: package.json, README.md, CLAUDE.md, any config files
-2. Scan directory structure to understand architecture: `ls -la` on root, src/, app/, components/, pages/, api/
-3. Identify features by analyzing:
-   - Route definitions (pages, API endpoints)
-   - Major components and their purposes
-   - State management (stores, contexts)
-   - External integrations (APIs, services, databases)
-4. For each feature, trace through the code to understand its scope
-5. Generate the feature document using the output format below
-</procedure>
-
-<investigate_thoroughly>
-Read actual code files, not just file names. Understand what each feature DOES by examining implementations. Do not guess features from names alone.
-</investigate_thoroughly>
-
-<use_parallel_tool_calls>
-Read multiple files in parallel whenever possible. When scanning a directory with 5 modules, read all 5 simultaneously. Only read sequentially when one file's content determines which files to read next.
-</use_parallel_tool_calls>
-
-<feature_identification>
-
-## Where to Look for Features
-
-- `/app` or `/pages` → User-facing routes and pages
-- `/components` → UI features and reusable functionality
-- `/api` or `/server` → Backend capabilities
-- `/hooks` or `/lib` → Core functionality and utilities
-- `/store` or `/context` → State-managed features
-- Config files → Integrations and external services
-
-## What Qualifies as a Feature
-
-A feature is user-facing functionality or a distinct capability:
-
-- ✓ "Real-time transcription" → feature
-- ✓ "User authentication" → feature
-- ✓ "Export to PDF" → feature
-- ✗ "Button component" → implementation detail
-- ✗ "API wrapper" → implementation detail
-
-## Feature Attributes to Capture
-
-For each feature identify:
-
-- Name — clear, user-oriented label
-- Description — what it does in 1-2 sentences
-- Status — [Implemented / Partial / Planned] based on code evidence
-- Key files — main files that implement this feature
-- Dependencies — external services, APIs, or libraries required
-
-</feature_identification>
-
-<output_format>
-Generate a markdown document structured as follows:
-
-```markdown
-# [Project Name] — Feature Documentation
-
-> Auto-generated from codebase scan on [date]
-
-## Overview
-
-[2-3 sentences: what this product is and its primary purpose]
-
-## Tech Stack
-
-- **Framework**: [e.g., Next.js 15, React 19]
-- **Language**: [e.g., TypeScript 5.x]
-- **Database**: [e.g., Supabase, PostgreSQL]
-- **Key Libraries**: [list major dependencies]
-
----
-
-## Features
-
-### 1. [Feature Name]
-
-**Status**: Implemented | Partial | Planned
-
-[1-2 sentence description of what this feature does for the user]
-
-**Key Files**:
-
-- `src/components/FeatureComponent.tsx` — main UI
-- `src/hooks/useFeature.ts` — logic
-- `src/api/feature.ts` — backend
-
-**Dependencies**: [External services, APIs]
-
----
-
-### 2. [Feature Name]
-
-...
-
----
-
-## Architecture Notes
-
-[Brief description of how features connect: data flow, state management patterns, API structure]
-
-## Integrations
-
-| Service | Purpose | Config Location |
-| ---------------- | ----------------------- | ------------------ |
-| [e.g., Supabase] | [e.g., Auth + Database] | [e.g., .env.local] |
-
-## Not Yet Implemented
-
-[Features found in comments, TODOs, or partial code that aren't complete]
-```
-
-</output_format>
-
-<task>
-Scan this codebase now. Generate the feature document and output it in a code block. Be thorough — read actual implementations to understand features, not just file names.
-</task>
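The removed `discover-product` skill's feature-vs-implementation-detail test can be sketched as a tiny predicate. This is a hedged illustration only: `classifyCandidate` and its pattern list are hypothetical, not part of the package.

```javascript
// Hypothetical sketch of the "What Qualifies as a Feature" heuristic from the
// removed skill. The pattern list is an assumption for illustration only.
function classifyCandidate(name) {
  const implementationDetailPatterns = [/\bbutton\b/i, /\bwrapper\b/i, /\bcomponent\b/i];
  return implementationDetailPatterns.some((re) => re.test(name))
    ? 'implementation detail'
    : 'feature';
}

console.log(classifyCandidate('Real-time transcription')); // → feature
console.log(classifyCandidate('Button component'));        // → implementation detail
```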
--- a/package/config/skills/devlyn:evaluate/SKILL.md
+++ /dev/null
@@ -1,564 +0,0 @@
----
-name: devlyn:evaluate
-description: Independent evaluation of work quality by assembling a specialized evaluator team. Use this to grade work produced by another session, PR, branch, or changeset. Evaluators audit correctness, architecture, security, frontend quality, spec compliance, and test coverage. Use when the user says "evaluate this", "check the quality", "grade this work", "review the changes", or wants an independent quality assessment of recent implementation work.
----
-
-Evaluate work produced by another session, PR, or changeset by assembling a specialized Agent Team. Each evaluator audits the work from a different quality dimension — correctness, architecture, error handling, type safety, and spec compliance — providing evidence-based findings with file:line references.
-
-<evaluation_target>
-$ARGUMENTS
-</evaluation_target>
-
-<team_workflow>
-
-## Phase 1: SCOPE DISCOVERY (You are the Evaluation Lead — work solo first)
-
-Before spawning any evaluators, understand what you're evaluating:
-
-1. Identify the evaluation target from `<evaluation_target>`:
-   - **HANDOFF.md or spec file**: Read it to understand what was supposed to be built, then discover what actually changed
-   - **PR number**: Use `gh pr diff <number>` and `gh pr view <number>` to get the changeset
-   - **Branch name**: Use `git diff main...<branch>` to get the changeset
-   - **Directory or file paths**: Read the specified files directly
-   - **"recent changes"** or no argument: Use `git diff HEAD` for unstaged changes, `git status` for new files
-   - **Running session / live monitoring**: Take a baseline snapshot with `git status --short | wc -l`, then poll every 30-45 seconds for new changes using `git status` and `find . -newer <reference-file> -type f`. Report findings incrementally as changes appear.
-
-2. **Check for done criteria**: Read `.devlyn/done-criteria.md` if it exists. This file contains testable success criteria written by the generator (e.g., `/devlyn:team-resolve` Phase 1.5). When present, it is the primary grading rubric — every criterion in it must be verified. When absent, fall back to the evaluation checklists below.
-
-3. Build the evaluation baseline:
-   - Run `git status --short` to see all changed and new files
-   - Run `git diff --stat` for a change summary
-   - Read all changed/new files in parallel (use parallel tool calls)
-   - If a spec file exists (HANDOFF.md, RFC, issue), read it to understand intent
-
-4. Classify the work using the evaluation matrix below
-5. Decide which evaluators to spawn (minimum viable team)
-
-<evaluation_classification>
-Classify the work and select evaluators:
-
-**Always spawn** (every evaluation):
-- correctness-evaluator
-- architecture-evaluator
-
-**New REST endpoints or API changes**:
-- Add: api-contract-evaluator
-
-**New UI components, pages, or frontend changes**:
-- Add: frontend-evaluator
-
-**Work driven by a spec (HANDOFF.md, RFC, issue, ticket)**:
-- Add: spec-compliance-evaluator
-
-**Changes touching auth, secrets, user data, or input handling**:
-- Add: security-evaluator
-
-**Changes with test files or test-worthy logic**:
-- Add: test-coverage-evaluator
-
-**Performance-sensitive changes (queries, loops, polling, rendering)**:
-- Add: performance-evaluator
-</evaluation_classification>
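The removed skill's classification matrix maps change characteristics to evaluator roles; a minimal sketch of that mapping as code. The evaluator names come from the skill itself, but the `change` flag names are assumptions introduced here for illustration:

```javascript
// Sketch of the evaluator-selection matrix from the removed devlyn:evaluate
// skill. The boolean flags on `change` are hypothetical input names.
function selectEvaluators(change) {
  const team = ['correctness-evaluator', 'architecture-evaluator']; // always spawned
  if (change.apiChanges) team.push('api-contract-evaluator');
  if (change.uiChanges) team.push('frontend-evaluator');
  if (change.hasSpec) team.push('spec-compliance-evaluator');
  if (change.touchesAuthOrInput) team.push('security-evaluator');
  if (change.hasTestWorthyLogic) team.push('test-coverage-evaluator');
  if (change.performanceSensitive) team.push('performance-evaluator');
  return team; // minimum viable team for this changeset
}
```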
-
-<evaluator_calibration>
-**CRITICAL — Read before grading.** Out of the box, you will be too lenient. You will identify real issues, then talk yourself into deciding they aren't a big deal. Fight this tendency.
-
-**Calibration rule**: When in doubt, score DOWN, not up. A false negative (missing a bug) ships broken code. A false positive (flagging a non-issue) costs a few minutes of review. The cost is asymmetric — always err toward strictness.
-
-**Example: Borderline issue that IS a real problem**
-```javascript
-// Evaluator found: catch block logs but doesn't surface error to user
-try {
-  const data = await fetchUserProfile(id);
-  setProfile(data);
-} catch (error) {
-  console.error('Failed to fetch profile:', error);
-}
-```
-**Wrong evaluation**: "MEDIUM — error is logged, which is acceptable for debugging."
-**Correct evaluation**: "HIGH — user sees no feedback when profile fails to load. The UI stays in loading state forever. Must show error state with retry option. file:line evidence: `ProfilePage.tsx:42`"
-
-**Why**: Logging is not error handling. The user's experience is broken. This is the #1 pattern evaluators incorrectly downgrade.
-
-**Example: Borderline issue that is NOT a real problem**
-```javascript
-// Evaluator found: variable could be const instead of let
-let userName = getUserName(session);
-return <Header name={userName} />;
-```
-**Wrong evaluation**: "MEDIUM — should use const for immutable bindings."
-**Correct evaluation**: "LOW (note only) — stylistic preference, linter will catch this. Not worth a finding."
-
-**Why**: Don't waste evaluation cycles on linter-catchable style issues. Focus on behavior, not aesthetics.
-
-**Example: Self-praise to avoid**
-**Wrong evaluation**: "The error handling throughout this codebase is generally quite good, with most paths properly covered."
-**Correct evaluation**: Evaluate each path individually. "3 of 7 async operations have proper error states. 4 are missing: `file:line`, `file:line`, `file:line`, `file:line`."
-
-**Why**: Generalized praise hides specific gaps. Count the instances. Name the files.
-</evaluator_calibration>
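For contrast with the silent catch flagged in the calibration example, here is one way the "correct" handling might look — a hedged sketch, not the package's actual fix. `setProfile`/`setError` stand in for React-style state setters, and the error message is invented:

```javascript
// Hypothetical fix for the calibration example's silent catch: surface the
// failure to the UI instead of only logging it, so the loading state ends
// and the user can retry. All names here are illustrative stand-ins.
async function loadProfile(id, fetchUserProfile, setProfile, setError) {
  try {
    const data = await fetchUserProfile(id);
    setProfile(data);
  } catch (error) {
    console.error('Failed to fetch profile:', error); // keep the debug log
    setError('Could not load profile. Please retry.'); // user-visible error state
  }
}
```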
100
|
-
|
|
101
|
-
<product_quality_criteria>
|
|
102
|
-
In addition to technical checklists, evaluate these product quality dimensions. These catch issues that pass all technical checks but still produce mediocre software.
|
|
103
|
-
|
|
104
|
-
**Product Depth** (weight: HIGH):
|
|
105
|
-
Does this feel like a real product feature or a demo stub? Are the workflows complete end-to-end, or do they dead-end? Can a user actually accomplish their goal without workarounds?
|
|
106
|
-
- GOOD: User can create, edit, delete, and search — full CRUD with proper empty/error/loading states
|
|
107
|
-
- BAD: User can create but editing shows a form that doesn't save, search is hardcoded, delete has no confirmation
|
|
108
|
-
|
|
109
|
-
**Design Quality** (weight: MEDIUM — only when UI changes present):
|
|
110
|
-
Does the UI have a coherent visual identity? Do colors, typography, spacing, and layout work together as a system? Or is it generic defaults and mismatched components?
|
|
111
|
-
- GOOD: Consistent spacing scale, intentional color palette, clear visual hierarchy
|
|
112
|
-
- BAD: Mixed spacing values, default component library with no customization, no visual rhythm
|
|
113
|
-
|
|
114
|
-
**Craft** (weight: LOW — usually handled by baseline):
|
|
115
|
-
Technical execution of the UI — typography hierarchy, contrast ratios, alignment, responsive behavior. Most competent implementations pass here.
|
|
116
|
-
|
|
117
|
-
**Functionality** (weight: HIGH):
|
|
118
|
-
Can users understand what the interface does, find primary actions, and complete tasks without guessing? Are affordances clear? Is feedback immediate?
|
|
119
|
-
- GOOD: Primary action is visually prominent, form validation is inline, success/error feedback is instant
|
|
120
|
-
- BAD: Multiple equal-weight buttons with unclear labels, validation only on submit, no loading indicators
|
|
121
|
-
|
|
122
|
-
Include a **Product Quality Score** in the evaluation report: each dimension rated 1-5 with a one-line justification.
|
|
123
|
-
</product_quality_criteria>
|
|
124
|
-
|
|
125
|
-
Announce to the user:
|
|
126
|
-
```
|
|
127
|
-
Evaluation team assembling for: [summary of what's being evaluated]
|
|
128
|
-
Scope: [N] changed files, [N] new files
|
|
129
|
-
Evaluators: [list of roles being spawned and why each was chosen]
|
|
130
|
-
```
|
|
131
|
-
|
|
132
|
-
## Phase 2: TEAM ASSEMBLY
|
|
133
|
-
|
|
134
|
-
Use the Agent Teams infrastructure:
|
|
135
|
-
|
|
136
|
-
1. **TeamCreate** with name `eval-{short-slug}` (e.g., `eval-dashboard-ui`, `eval-pr-142`)
|
|
137
|
-
2. **Spawn evaluators** using the `Task` tool with `team_name` and `name` parameters. Each evaluator is a separate Claude instance with its own context.
|
|
138
|
-
3. **TaskCreate** evaluation tasks for each evaluator — include the changed file list, spec context, and their specific mandate.
|
|
139
|
-
4. **Assign tasks** using TaskUpdate with `owner` set to the evaluator name.
|
|
140
|
-
|
|
141
|
-
**IMPORTANT**: Do NOT hardcode a model. All evaluators inherit the user's active model automatically.
|
|
142
|
-
|
|
143
|
-
**IMPORTANT**: When spawning evaluators, replace `{team-name}` in each prompt below with the actual team name you chose. Include the specific changed file paths in each evaluator's spawn prompt.
|
|
144
|
-
|
|
145
|
-
### Evaluator Prompts
|
|
146
|
-
|
|
147
|
-
When spawning each evaluator via the Task tool, use these prompts:
|
|
148
|
-
|
|
149
|
-
<correctness_evaluator_prompt>
|
|
150
|
-
You are the **Correctness Evaluator** on an Agent Team evaluating work quality.
|
|
151
|
-
|
|
152
|
-
**Your perspective**: Senior engineer verifying implementation correctness
|
|
153
|
-
**Your mandate**: Find bugs, logic errors, silent failures, and incorrect behavior. Every finding must have file:line evidence.
|
|
154
|
-
|
|
155
|
-
**Your checklist**:
|
|
156
|
-
CRITICAL (must fix before shipping):
|
|
157
|
-
- Logic errors: wrong conditionals, off-by-one, incorrect comparisons
|
|
158
|
-
- Silent failures: empty catch blocks, swallowed errors, missing error states
|
|
159
|
-
- Data loss: mutations without persistence, race conditions, stale state
|
|
160
|
-
- Null/undefined access: unguarded property access on nullable values
|
|
161
|
-
- Incorrect API contracts: response shape doesn't match what client expects
|
|
162
|
-
|
|
163
|
-
HIGH (should fix):
|
|
164
|
-
- Missing input validation at system boundaries
|
|
165
|
-
- Hardcoded values that should be configurable or derived
|
|
166
|
-
- State management bugs: stale closures, missing dependency arrays, uncontrolled inputs
|
|
167
|
-
- Resource leaks: intervals not cleared, listeners not removed, connections not closed
|
|
168
|
-
|
|
169
|
-
MEDIUM (fix or justify):
|
|
170
|
-
- Dead code paths: unreachable branches, unused variables
|
|
171
|
-
- Inconsistent error handling: some paths show errors, others swallow them
|
|
172
|
-
- Type assertion abuse: `as any`, `as unknown as T` without justification
|
|
173
|
-
|
|
174
|
-
**Your process**:
|
|
175
|
-
1. Read every changed file thoroughly — line by line
|
|
176
|
-
2. For each file, trace the data flow from input to output
|
|
177
|
-
3. Check every error handling path: what happens when things fail?
|
|
178
|
-
4. Verify that types match actual runtime behavior
|
|
179
|
-
5. Cross-reference: if file A calls file B, verify B's API matches A's expectations
|
|
180
|
-
|
|
181
|
-
**Your deliverable**: Send a message to the team lead with:
|
|
182
|
-
1. Issues found grouped by severity (CRITICAL, HIGH, MEDIUM) with exact file:line
|
|
183
|
-
2. For each issue: what's wrong, what the correct behavior should be, and suggested fix
|
|
184
|
-
3. "CLEAN" sections if specific areas pass inspection
|
|
185
|
-
4. Cross-cutting patterns (e.g., "silent catches appear in 4 places")
|
|
186
|
-
|
|
187
|
-
Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about issues that cross their domain via SendMessage.
|
|
188
|
-
</correctness_evaluator_prompt>
|
|
189
|
-
|
|
190
|
-
<architecture_evaluator_prompt>
|
|
191
|
-
You are the **Architecture Evaluator** on an Agent Team evaluating work quality.
|
|
192
|
-
|
|
193
|
-
**Your perspective**: System architect reviewing structural decisions
|
|
194
|
-
**Your mandate**: Evaluate whether the implementation follows codebase patterns, avoids duplication, uses correct abstractions, and integrates cleanly. Evidence-based only.
|
|
195
|
-
|
|
196
|
-
**Your checklist**:
|
|
197
|
-
HIGH (blocks approval):
|
|
198
|
-
- Pattern violations: new code contradicts established patterns in the codebase
|
|
199
|
-
- Type duplication: same interface/type defined in multiple files instead of shared
|
|
200
|
-
- Layering violations: UI directly calling stores, routes bypassing middleware
|
|
201
|
-
- Missing integration: new modules created but not wired into the system
|
|
202
|
-
|
|
203
|
-
MEDIUM (fix or justify):
|
|
204
|
-
- Inconsistent naming: new code uses different conventions than existing code
|
|
205
|
-
- Over-engineering: abstractions that only serve one use case
|
|
206
|
-
- Under-engineering: copy-paste where a shared utility exists
|
|
207
|
-
- Missing re-exports: new public API not exported from package index
|
|
208
|
-
|
|
209
|
-
LOW (note for awareness):
|
|
210
|
-
- File organization: new files placed in unexpected locations
|
|
211
|
-
- Import style inconsistencies
|
|
212
|
-
|
|
213
|
-
**Your process**:
|
|
214
|
-
1. Read all changed files
|
|
215
|
-
2. For each new module, find 2-3 existing modules that serve a similar purpose
|
|
216
|
-
3. Compare: does the new code follow the same patterns?
|
|
217
|
-
4. Check that new code is properly wired (imported, registered, exported)
|
|
218
|
-
5. Look for duplication: are new types/interfaces already defined elsewhere?
|
|
219
|
-
6. Verify the dependency direction is correct (no circular deps, no upward deps)
|
|
220
|
-
|
|
221
|
-
**Your deliverable**: Send a message to the team lead with:
1. Pattern compliance assessment (what follows patterns, what deviates)
2. Duplication found (with file:line references to both the duplicate and the original)
3. Integration gaps (modules not wired, exports missing)
4. Structural recommendations with references to existing patterns to follow

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share architectural concerns with other evaluators via SendMessage.
</architecture_evaluator_prompt>

<api_contract_evaluator_prompt>
You are the **API Contract Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: API design specialist
**Your mandate**: Verify that new endpoints follow existing API conventions, validate input correctly, return consistent response envelopes, and handle errors properly.

**Your checklist**:
HIGH (blocks approval):
- Missing input validation: endpoint accepts unvalidated user input
- Inconsistent response format: new endpoints use a different envelope than existing ones
- Missing error handling: endpoints that can throw unhandled exceptions
- Wrong HTTP semantics: GET with side effects, POST for idempotent reads
- Route not registered: handler exists but isn't mounted in the router

MEDIUM (fix or justify):
- Missing route tests: new endpoints without test coverage
- Inconsistent naming: endpoint naming doesn't match existing URL patterns
- Missing query parameter validation: invalid params silently ignored
- Hardcoded values in handlers that should come from request context

**Your process**:
1. Read all new/changed route files
2. Read 2-3 existing route files to understand the API conventions
3. Compare: do the new routes follow the same patterns?
4. Check that routes are registered in the server entry point
5. Verify input validation on every endpoint
6. Check that error responses match the existing error envelope format
7. Verify that response shapes match what the client-side API functions expect

**Your deliverable**: Send a message to the team lead with:
1. Contract compliance assessment for each new endpoint
2. Convention violations with references to existing endpoints that do it right
3. Client-server mismatches (API client types vs actual response shapes)
4. Missing validation or error handling with file:line

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert correctness-evaluator about contract issues that could cause runtime bugs via SendMessage.
</api_contract_evaluator_prompt>

<frontend_evaluator_prompt>
You are the **Frontend Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Frontend engineer reviewing React/Next.js implementation
**Your mandate**: Evaluate component architecture, server/client boundaries, state management, error handling, and UI completeness.

**Your checklist**:
HIGH (blocks approval):
- Missing error states: async operations without error UI
- Silent failures: catch blocks that swallow errors without user feedback
- React anti-patterns: direct DOM manipulation bypassing React state, missing keys, unstable references
- Server/client boundary errors: using hooks in server components, fetching client-side when server-side is possible
- Missing loading states for async operations

MEDIUM (fix or justify):
- Inconsistent patterns: new components don't follow existing component patterns
- Missing empty states for lists/collections
- Client-side fetching where server-side initial data + client polling would be better
- Accessibility gaps: missing labels, keyboard navigation, focus management
- Hardcoded strings that should come from props or context

LOW (note):
- Variable naming that shadows globals
- Missing TypeScript strictness (implicit any)

**Your process**:
1. Read all new/changed components and pages
2. Check server/client component boundaries — is `'use client'` used correctly and minimally?
3. For each async operation: is there a loading state, error state, and empty state?
4. For each catch block: is the error surfaced to the user or silently swallowed?
5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
6. Compare against existing components for pattern consistency
7. **Browser evidence** (when available): Read `.devlyn/BROWSER-RESULTS.md` if it exists — it contains pre-collected smoke test results, flow test results, console errors, network failures, and screenshots from the `devlyn:browser-validate` skill. Use this as additional evidence in your evaluation. Do not re-run smoke tests that are already covered.
If the dev server is still running and you need deeper investigation on a specific interaction, use browser tools directly (check if `mcp__claude-in-chrome__*` tools are available, or fall back to Playwright). Focus on verifying specific findings, not duplicating the full smoke/flow suite.
If neither `.devlyn/BROWSER-RESULTS.md` exists nor browser tools are available, note "Live testing skipped — no browser validation available" in your deliverable.

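One way to make step 3's check concrete is a discriminated union that names every required UI state, so a component rendering from it cannot forget one without the compiler noticing. This is a sketch of the pattern, not code from the project:

```typescript
// Sketch: the four UI states step 3 checks for, as an explicit union.
type AsyncView<T> =
  | { kind: "loading" }
  | { kind: "error"; message: string }
  | { kind: "empty" }
  | { kind: "ready"; items: T[] };

// Hypothetical mapper from a fetch result (null while in flight) to a view state.
function toView<T>(result: { error?: string; items?: T[] } | null): AsyncView<T> {
  if (result === null) return { kind: "loading" };
  if (result.error) return { kind: "error", message: result.error };
  if (!result.items || result.items.length === 0) return { kind: "empty" };
  return { kind: "ready", items: result.items };
}
```

A component switching on `view.kind` with no `default` branch gets an exhaustiveness error if a state goes unrendered, which is exactly the gap this evaluator hunts for by hand.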
**Your deliverable**: Send a message to the team lead with:
1. Component quality assessment for each new/changed component
2. Missing UI states (loading, error, empty) with file:line
3. Silent failure points that violate error handling policy
4. React anti-patterns found
5. Pattern consistency with existing components
6. Browser validation results (from BROWSER-RESULTS.md or live testing): screenshots, interaction bugs, runtime errors, visual regressions

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Coordinate with api-contract-evaluator about client-server type alignment via SendMessage.
</frontend_evaluator_prompt>

<spec_compliance_evaluator_prompt>
You are the **Spec Compliance Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: QA lead checking implementation against requirements
**Your mandate**: Compare what was specified (in HANDOFF.md, RFC, issue, or ticket) against what was actually built. Find gaps, deviations, and incomplete implementations. Evidence-based only.

**Your checklist**:
CRITICAL (blocks approval):
- Missing features: spec says to build X, but X is not implemented
- Wrong behavior: implementation contradicts the spec
- Incomplete integration: backend built but not wired, UI built but not navigable

HIGH (should fix):
- Partial implementation: feature started but not finished (e.g., route exists but no UI)
- Missing real-time features: spec requires WebSocket but only HTTP implemented
- Missing tests: spec mentions test requirements that aren't met

MEDIUM (fix or justify):
- Deferred items not documented: work skipped without explanation
- Spec ambiguity exploited: implementation chose the easier interpretation

**Your process**:
1. Read the spec document (HANDOFF.md, RFC, issue) thoroughly
2. Create a checklist of every requirement mentioned
3. For each requirement: search the codebase for the implementation
4. Score each: COMPLETE, PARTIAL (with % and what's missing), or MISSING
5. Check for requirements that are implemented differently than specified

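The scoring in step 4 feeds the completeness score in the deliverable. A minimal sketch (types and field names are illustrative, not a devlyn format):

```typescript
// Sketch: tally step 4's per-requirement statuses into the X/Y score
// the deliverable asks for. Only COMPLETE counts as met.
type Status = "COMPLETE" | "PARTIAL" | "MISSING";

interface Requirement {
  feature: string;
  status: Status;
  evidence: string; // file:line, or "-" when nothing was found
}

function completenessScore(reqs: Requirement[]): string {
  const met = reqs.filter((r) => r.status === "COMPLETE").length;
  return `${met}/${reqs.length} requirements met`;
}
```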
**Your deliverable**: Send a message to the team lead with:
1. Feature-by-feature compliance matrix:
   | Feature | Spec Says | Implementation Status | Evidence |
   |---------|-----------|----------------------|----------|
   | Feature name | What was required | COMPLETE/PARTIAL/MISSING | file:line |
2. Gap analysis: what's missing and how critical each gap is
3. Deviation analysis: where implementation differs from spec
4. Completeness score: X/Y requirements met

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share compliance findings with architecture-evaluator to flag structural gaps via SendMessage.
</spec_compliance_evaluator_prompt>

<security_evaluator_prompt>
You are the **Security Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Security engineer
**Your mandate**: OWASP-focused audit of new code. Find injection vectors, auth gaps, data exposure, and unsafe patterns.

**Your checklist** (CRITICAL severity):
- Hardcoded credentials, API keys, tokens, or secrets
- SQL injection: unsanitized input in queries
- XSS: unescaped user input rendered in HTML/JSX
- Missing input validation at API boundaries
- Path traversal: unsanitized file paths from user input
- Improper auth or authorization checks on new endpoints
- Sensitive data in logs, error messages, or client responses
- CSRF: state-changing operations without CSRF protection

**Tools available**: Read, Grep, Glob, Bash (npm audit, secret pattern scanning)

**Your process**:
1. Read all changed files, focusing on input handling and data flow
2. Trace user input from entry point to storage/output
3. Check for secrets patterns: grep for API_KEY, SECRET, TOKEN, PASSWORD, PRIVATE_KEY
4. Run `npm audit` if dependencies changed
5. Check new endpoints for proper authentication/authorization

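Step 3's grep can be expressed as a pattern scan over file contents. This sketch is a starting point, not a complete secret detector; every hit needs manual review, since env var names are fine while inline literal values are not:

```typescript
// Sketch: flag the step-3 keywords only when they are assigned an
// inline quoted value (a likely hardcoded secret), not when they
// merely appear as an identifier like process.env.API_KEY.
const SECRET_RE =
  /(API_KEY|SECRET|TOKEN|PASSWORD|PRIVATE_KEY)\s*[:=]\s*["'][^"']+["']/g;

function scanForSecrets(source: string): string[] {
  return [...source.matchAll(SECRET_RE)].map((m) => m[0]);
}
```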
**Your deliverable**: Send a message to the team lead with:
1. Security issues found (severity, file:line, description, OWASP category)
2. "CLEAN" if no issues found
3. Security constraints for any recommended fixes

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about security issues that affect their domain via SendMessage.
</security_evaluator_prompt>

<test_coverage_evaluator_prompt>
You are the **Test Coverage Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: QA specialist
**Your mandate**: Assess test coverage for new code. Identify untested paths, missing edge cases, and test quality issues. Run the test suite.

**Your checklist**:
HIGH:
- New modules with zero test coverage
- New endpoints with no route-level tests
- Business logic without unit tests
- Error paths not tested (what happens when things fail?)

MEDIUM:
- Missing edge case tests: null input, empty collections, boundary values, concurrent access
- Assertion quality: tests that pass but don't actually verify behavior
- Mock correctness: mocks that don't reflect real behavior

**Tools available**: Read, Grep, Glob, Bash (including running tests and linting)

**Your process**:
1. List all new/changed source files
2. For each, find corresponding test files (or note their absence)
3. Read existing tests to assess what's covered
4. Run the full test suite and report results
5. Run the linter if available and report results
6. Identify the highest-value missing tests

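The MEDIUM edge cases (null input, empty collections, boundary values) look like this in practice. The `paginate` helper here is hypothetical; the point is the shape of the cases, not this specific function:

```typescript
// Hypothetical helper under test: returns one page of items.
function paginate<T>(items: T[] | null, page: number, size: number): T[] {
  if (!items || size <= 0) return [];
  return items.slice(page * size, page * size + size);
}

// One assertion per edge-case category from the checklist:
console.assert(paginate(null, 0, 10).length === 0);     // null input
console.assert(paginate([], 0, 10).length === 0);       // empty collection
console.assert(paginate([1, 2, 3], 0, 3).length === 3); // exact boundary
console.assert(paginate([1, 2, 3], 1, 3).length === 0); // past the end
console.assert(paginate([1, 2, 3], 0, 0).length === 0); // zero page size
```

Tests like these also address the "assertion quality" item: each one checks a concrete value rather than merely exercising the code.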
**Your deliverable**: Send a message to the team lead with:
1. Test suite results: PASS or FAIL (with failure details)
2. Coverage matrix: source file -> test file -> coverage assessment
3. Missing tests ranked by risk (what's most likely to break in production)
4. Edge cases that should be tested

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share test results with other evaluators via SendMessage.
</test_coverage_evaluator_prompt>

<performance_evaluator_prompt>
You are the **Performance Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Performance engineer
**Your mandate**: Find polling overhead, memory leaks, unnecessary re-renders, N+1 patterns, and unbounded operations.

**Your checklist** (HIGH severity):
- Polling without backoff or cleanup (setInterval without clearInterval)
- N+1 patterns: database or API calls inside loops
- Unbounded data: missing pagination, limits, or streaming
- Memory leaks: event listeners, subscriptions, timers not cleaned up
- React: missing memo, unstable references causing re-renders, inline objects in render
- O(n^2) or worse where O(n) is feasible
- Large synchronous operations blocking the event loop

**Tools available**: Read, Grep, Glob, Bash

**Your process**:
1. Read all changed files focusing on data flow and lifecycle
2. Check every useEffect for proper cleanup
3. Check every setInterval/setTimeout for cleanup on unmount
4. Look for loops that make async calls
5. Check for unbounded data fetching patterns

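For the first checklist item, a sketch of what "polling done right" looks like (names are illustrative): the delay grows per attempt, is capped, and the returned stop function clears the pending timer so nothing leaks on unmount.

```typescript
// Sketch: exponential backoff, capped so polling never runs hot.
function nextDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Sketch: polling loop whose caller owns the lifecycle. The returned
// function is the cleanup the evaluator looks for in useEffect.
function startPolling(tick: () => void): () => void {
  let attempt = 0;
  let timer: ReturnType<typeof setTimeout>;
  const loop = () => {
    tick();
    timer = setTimeout(loop, nextDelay(attempt++));
  };
  timer = setTimeout(loop, nextDelay(attempt++));
  return () => clearTimeout(timer); // call on unmount
}
```

A bare `setInterval(tick, 1000)` with no `clearInterval` fails both halves of the check: no backoff and no cleanup.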
**Your deliverable**: Send a message to the team lead with:
1. Performance issues found (severity, file:line, description, estimated impact)
2. Resource lifecycle assessment (are all timers/listeners/subscriptions cleaned up?)
3. Optimization recommendations

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about performance issues via SendMessage.
</performance_evaluator_prompt>

## Phase 3: PARALLEL EVALUATION

All evaluators work simultaneously. They will:
- Evaluate from their unique perspective using their checklist
- Message each other about cross-cutting concerns
- Send their final findings to you (Evaluation Lead)

Wait for all evaluators to report back. If an evaluator goes idle after sending findings, that's normal — they're done with their evaluation.

## Phase 4: SYNTHESIS (You, Evaluation Lead)

After receiving all evaluator findings:

1. Read all findings carefully
2. Deduplicate: if multiple evaluators flagged the same file:line, merge into one finding at the highest severity
3. Cross-reference findings: do issues from one evaluator explain findings from another?
4. Classify each finding with evidence quality:
   - **CONFIRMED**: evaluator provided file:line evidence and the issue is verifiable
   - **LIKELY**: evaluator's reasoning is sound but evidence is circumstantial
   - **SPECULATIVE**: remove these — the mandate is evidence-based only
5. Group findings by severity, then by file

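The merge rule in step 2 can be sketched as a pure function (types and names are illustrative, not a devlyn API): findings sharing a file:line key collapse to one, keeping the highest severity.

```typescript
// Sketch of synthesis step 2. Severity order mirrors the report:
// CRITICAL > HIGH > MEDIUM > LOW.
const RANK = { CRITICAL: 3, HIGH: 2, MEDIUM: 1, LOW: 0 } as const;
type Severity = keyof typeof RANK;

interface Finding {
  location: string; // "file:line"
  severity: Severity;
  description: string;
}

function dedupe(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const existing = byLocation.get(f.location);
    // Keep the finding with the higher severity for each location.
    if (!existing || RANK[f.severity] > RANK[existing.severity]) {
      byLocation.set(f.location, f);
    }
  }
  return [...byLocation.values()];
}
```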
## Phase 5: REPORT

1. Present the evaluation report to the user (format below).

2. **Write findings to `.devlyn/EVAL-FINDINGS.md`** for downstream consumption by other agents (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`). This file enables the feedback loop — the generator can read it and fix the issues without human relay.

```markdown
# Evaluation Findings

## Verdict: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]

## Done Criteria Results (if done-criteria.md existed)
- [x] [criterion] — VERIFIED: [evidence]
- [ ] [criterion] — FAILED: [what's wrong, file:line]

## Findings Requiring Action
### CRITICAL
- `file:line` — [description] — Fix: [suggested approach]

### HIGH
- `file:line` — [description] — Fix: [suggested approach]

## Cross-Cutting Patterns
- [pattern description]
```

3. Do NOT delete `.devlyn/done-criteria.md` or `.devlyn/EVAL-FINDINGS.md` — downstream consumers (e.g., `/devlyn:auto-resolve` orchestrator or a follow-up `/devlyn:team-resolve`) may need to read them. The orchestrator or user is responsible for cleanup.

## Phase 6: CLEANUP

After evaluation is complete:
1. Send `shutdown_request` to all evaluators via SendMessage
2. Wait for shutdown confirmations
3. Call TeamDelete to clean up the team

</team_workflow>

<output_format>
Present the evaluation in this format:

<evaluation_report>

### Evaluation Complete

**Verdict**: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
- BLOCKED: any CRITICAL issues remain
- NEEDS WORK: HIGH issues that should be fixed before merging
- PASS WITH ISSUES: MEDIUM/LOW issues noted but shippable
- PASS: clean across all evaluators
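The verdict mapping above can be sketched as a function over the worst remaining finding severity (illustrative only; `undefined` means no findings at all):

```typescript
// Sketch: derive the report verdict from the worst surviving severity.
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

function verdict(worst: Severity | undefined): string {
  switch (worst) {
    case "CRITICAL":
      return "BLOCKED";
    case "HIGH":
      return "NEEDS WORK";
    case "MEDIUM":
    case "LOW":
      return "PASS WITH ISSUES";
    default:
      return "PASS";
  }
}
```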

**Team Composition**: [N] evaluators
- **Correctness**: [N issues / Clean]
- **Architecture**: [N issues / Clean]
- **[Conditional evaluators]**: [summary]

**Spec Compliance** (if applicable):
- [X/Y] requirements fully implemented
- [list any PARTIAL or MISSING items]

### Findings by Severity

**CRITICAL** (must fix):
- [severity/domain] `file:line` — [description] — Evidence: [what proves this is an issue]

**HIGH** (should fix):
- [severity/domain] `file:line` — [description]

**MEDIUM** (fix or justify):
- [severity/domain] `file:line` — [description]

**LOW** (note):
- [severity/domain] `file:line` — [description]

### Cross-Cutting Patterns
- [Patterns that appeared across multiple evaluators, e.g., "silent error handling in 5 files"]

### What's Good
- [Explicitly call out things done well — balanced feedback prevents over-correction]

### Recommendation
[Next action — e.g., "Fix the 3 CRITICAL issues, then run `/devlyn:team-review` for a full review" or "Ship it"]

</evaluation_report>
</output_format>
</content>
</invoke>