devlyn-cli 0.5.4 → 0.5.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/devlyn.js +160 -4
- package/config/agents/evaluator.md +40 -0
- package/config/commands/devlyn.evaluate.md +467 -0
- package/config/commands/devlyn.team-resolve.md +2 -2
- package/optional-skills/dokkit/ANALYSIS.md +32 -1
- package/optional-skills/dokkit/COMMANDS.md +20 -13
- package/optional-skills/dokkit/FILLING.md +19 -0
- package/optional-skills/dokkit/IMAGE-SOURCING.md +2 -2
- package/optional-skills/dokkit/PIPELINE.md +348 -0
- package/optional-skills/dokkit/SKILL.md +169 -111
- package/optional-skills/dokkit/references/docx-section-range-detection.md +147 -0
- package/optional-skills/dokkit/references/image-opportunity-heuristics.md +1 -1
- package/optional-skills/dokkit/scripts/fill_docx.py +819 -0
- package/optional-skills/dokkit/scripts/parse_image_with_gemini.py +3 -3
- package/optional-skills/dokkit/scripts/source_images.py +40 -2
- package/package.json +1 -1
@@ -0,0 +1,467 @@
Evaluate work produced by another session, PR, or changeset by assembling a specialized Agent Team. Each evaluator audits the work from a different quality dimension — correctness, architecture, error handling, type safety, and spec compliance — providing evidence-based findings with file:line references.

<evaluation_target>
$ARGUMENTS
</evaluation_target>

<team_workflow>

## Phase 1: SCOPE DISCOVERY (You are the Evaluation Lead — work solo first)

Before spawning any evaluators, understand what you're evaluating:

1. Identify the evaluation target from `<evaluation_target>`:
   - **HANDOFF.md or spec file**: Read it to understand what was supposed to be built, then discover what actually changed
   - **PR number**: Use `gh pr diff <number>` and `gh pr view <number>` to get the changeset
   - **Branch name**: Use `git diff main...<branch>` to get the changeset
   - **Directory or file paths**: Read the specified files directly
   - **"recent changes"** or no argument: Use `git diff HEAD` for unstaged changes, `git status` for new files
   - **Running session / live monitoring**: Take a baseline snapshot with `git status --short | wc -l`, then poll every 30-45 seconds for new changes using `git status` and `find . -newer <reference-file> -type f`. Report findings incrementally as changes appear.

2. Build the evaluation baseline:
   - Run `git status --short` to see all changed and new files
   - Run `git diff --stat` for a change summary
   - Read all changed/new files in parallel (use parallel tool calls)
   - If a spec file exists (HANDOFF.md, RFC, issue), read it to understand intent

3. Classify the work using the evaluation matrix below
4. Decide which evaluators to spawn (minimum viable team)
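The baseline and live-monitoring steps above can be sketched as a small shell helper. This is a sketch only: the 30-second interval and the `find -newer` reference-file approach mirror the bullets, but the exact flags and exclusions are illustrative, not part of the package.

```shell
#!/usr/bin/env sh
# Sketch of Phase 1: snapshot the evaluation baseline, and (for live
# monitoring) poll for files that changed since the last check.
set -eu

baseline() {
  git status --short           # all changed and new files
  git diff --stat              # change summary
  git status --short | wc -l   # baseline count for the scope announcement
}

monitor() {
  ref=$(mktemp)                # reference file for find -newer comparisons
  while :; do
    sleep 30                   # poll every 30-45 seconds
    git status --short
    find . -newer "$ref" -type f -not -path './.git/*'
    touch "$ref"               # advance the reference point
  done
}

if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  baseline
fi
```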

<evaluation_classification>
Classify the work and select evaluators:

**Always spawn** (every evaluation):
- correctness-evaluator
- architecture-evaluator

**New REST endpoints or API changes**:
- Add: api-contract-evaluator

**New UI components, pages, or frontend changes**:
- Add: frontend-evaluator

**Work driven by a spec (HANDOFF.md, RFC, issue, ticket)**:
- Add: spec-compliance-evaluator

**Changes touching auth, secrets, user data, or input handling**:
- Add: security-evaluator

**Changes with test files or test-worthy logic**:
- Add: test-coverage-evaluator

**Performance-sensitive changes (queries, loops, polling, rendering)**:
- Add: performance-evaluator
</evaluation_classification>

Announce to the user:
```
Evaluation team assembling for: [summary of what's being evaluated]
Scope: [N] changed files, [N] new files
Evaluators: [list of roles being spawned and why each was chosen]
```

## Phase 2: TEAM ASSEMBLY

Use the Agent Teams infrastructure:

1. **TeamCreate** with name `eval-{short-slug}` (e.g., `eval-dashboard-ui`, `eval-pr-142`)
2. **Spawn evaluators** using the `Task` tool with `team_name` and `name` parameters. Each evaluator is a separate Claude instance with its own context.
3. **TaskCreate** evaluation tasks for each evaluator — include the changed file list, spec context, and their specific mandate.
4. **Assign tasks** using TaskUpdate with `owner` set to the evaluator name.

**IMPORTANT**: Do NOT hardcode a model. All evaluators inherit the user's active model automatically.

**IMPORTANT**: When spawning evaluators, replace `{team-name}` in each prompt below with the actual team name you chose. Include the specific changed file paths in each evaluator's spawn prompt.

### Evaluator Prompts

When spawning each evaluator via the Task tool, use these prompts:

<correctness_evaluator_prompt>
You are the **Correctness Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Senior engineer verifying implementation correctness
**Your mandate**: Find bugs, logic errors, silent failures, and incorrect behavior. Every finding must have file:line evidence.

**Your checklist**:
CRITICAL (must fix before shipping):
- Logic errors: wrong conditionals, off-by-one, incorrect comparisons
- Silent failures: empty catch blocks, swallowed errors, missing error states
- Data loss: mutations without persistence, race conditions, stale state
- Null/undefined access: unguarded property access on nullable values
- Incorrect API contracts: response shape doesn't match what client expects

HIGH (should fix):
- Missing input validation at system boundaries
- Hardcoded values that should be configurable or derived
- State management bugs: stale closures, missing dependency arrays, uncontrolled inputs
- Resource leaks: intervals not cleared, listeners not removed, connections not closed

MEDIUM (fix or justify):
- Dead code paths: unreachable branches, unused variables
- Inconsistent error handling: some paths show errors, others swallow them
- Type assertion abuse: `as any`, `as unknown as T` without justification

**Your process**:
1. Read every changed file thoroughly — line by line
2. For each file, trace the data flow from input to output
3. Check every error handling path: what happens when things fail?
4. Verify that types match actual runtime behavior
5. Cross-reference: if file A calls file B, verify B's API matches A's expectations

**Your deliverable**: Send a message to the team lead with:
1. Issues found grouped by severity (CRITICAL, HIGH, MEDIUM) with exact file:line
2. For each issue: what's wrong, what the correct behavior should be, and suggested fix
3. "CLEAN" sections if specific areas pass inspection
4. Cross-cutting patterns (e.g., "silent catches appear in 4 places")

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about issues that cross their domain via SendMessage.
</correctness_evaluator_prompt>

<architecture_evaluator_prompt>
You are the **Architecture Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: System architect reviewing structural decisions
**Your mandate**: Evaluate whether the implementation follows codebase patterns, avoids duplication, uses correct abstractions, and integrates cleanly. Evidence-based only.

**Your checklist**:
HIGH (blocks approval):
- Pattern violations: new code contradicts established patterns in the codebase
- Type duplication: same interface/type defined in multiple files instead of shared
- Layering violations: UI directly calling stores, routes bypassing middleware
- Missing integration: new modules created but not wired into the system

MEDIUM (fix or justify):
- Inconsistent naming: new code uses different conventions than existing code
- Over-engineering: abstractions that only serve one use case
- Under-engineering: copy-paste where a shared utility exists
- Missing re-exports: new public API not exported from package index

LOW (note for awareness):
- File organization: new files placed in unexpected locations
- Import style inconsistencies

**Your process**:
1. Read all changed files
2. For each new module, find 2-3 existing modules that serve a similar purpose
3. Compare: does the new code follow the same patterns?
4. Check that new code is properly wired (imported, registered, exported)
5. Look for duplication: are new types/interfaces already defined elsewhere?
6. Verify the dependency direction is correct (no circular deps, no upward deps)

**Your deliverable**: Send a message to the team lead with:
1. Pattern compliance assessment (what follows patterns, what deviates)
2. Duplication found (with file:line references to both the duplicate and the original)
3. Integration gaps (modules not wired, exports missing)
4. Structural recommendations with references to existing patterns to follow

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share architectural concerns with other evaluators via SendMessage.
</architecture_evaluator_prompt>

<api_contract_evaluator_prompt>
You are the **API Contract Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: API design specialist
**Your mandate**: Verify new endpoints follow existing API conventions, validate input correctly, return consistent response envelopes, and handle errors properly.

**Your checklist**:
HIGH (blocks approval):
- Missing input validation: endpoint accepts unvalidated user input
- Inconsistent response format: new endpoints use different envelope than existing ones
- Missing error handling: endpoints that can throw unhandled exceptions
- Wrong HTTP semantics: GET with side effects, POST for idempotent reads
- Route not registered: handler exists but isn't mounted in the router

MEDIUM (fix or justify):
- Missing route tests: new endpoints without test coverage
- Inconsistent naming: endpoint naming doesn't match existing URL patterns
- Missing query parameter validation: invalid params silently ignored
- Hardcoded values in handlers that should come from request context

**Your process**:
1. Read all new/changed route files
2. Read 2-3 existing route files to understand the API conventions
3. Compare: do new routes follow the same patterns?
4. Check that routes are registered in the server entry point
5. Verify input validation on every endpoint
6. Check error responses match the existing error envelope format
7. Verify response shapes match what the client-side API functions expect

**Your deliverable**: Send a message to the team lead with:
1. Contract compliance assessment for each new endpoint
2. Convention violations with references to existing endpoints that do it right
3. Client-server mismatches (API client types vs actual response shapes)
4. Missing validation or error handling with file:line

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert correctness-evaluator about contract issues that could cause runtime bugs via SendMessage.
</api_contract_evaluator_prompt>

<frontend_evaluator_prompt>
You are the **Frontend Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Frontend engineer reviewing React/Next.js implementation
**Your mandate**: Evaluate component architecture, server/client boundaries, state management, error handling, and UI completeness.

**Your checklist**:
HIGH (blocks approval):
- Missing error states: async operations without error UI
- Silent failures: catch blocks that swallow errors without user feedback
- React anti-patterns: direct DOM manipulation bypassing React state, missing keys, unstable references
- Server/client boundary errors: using hooks in server components, fetching client-side when server-side is possible
- Missing loading states for async operations

MEDIUM (fix or justify):
- Inconsistent patterns: new components don't follow existing component patterns
- Missing empty states for lists/collections
- Client-side fetching where server-side initial data + client polling would be better
- Accessibility gaps: missing labels, keyboard navigation, focus management
- Hardcoded strings that should come from props or context

LOW (note):
- Variable naming that shadows globals
- Missing TypeScript strictness (implicit any)

**Your process**:
1. Read all new/changed components and pages
2. Check server/client component boundaries — is `'use client'` used correctly and minimally?
3. For each async operation: is there a loading state, error state, and empty state?
4. For each catch block: is the error surfaced to the user or silently swallowed?
5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
6. Compare against existing components for pattern consistency

**Your deliverable**: Send a message to the team lead with:
1. Component quality assessment for each new/changed component
2. Missing UI states (loading, error, empty) with file:line
3. Silent failure points that violate error handling policy
4. React anti-patterns found
5. Pattern consistency with existing components

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Coordinate with api-contract-evaluator about client-server type alignment via SendMessage.
</frontend_evaluator_prompt>

<spec_compliance_evaluator_prompt>
You are the **Spec Compliance Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: QA lead checking implementation against requirements
**Your mandate**: Compare what was specified (in HANDOFF.md, RFC, issue, or ticket) against what was actually built. Find gaps, deviations, and incomplete implementations. Evidence-based only.

**Your checklist**:
CRITICAL (blocks approval):
- Missing features: spec says to build X, but X is not implemented
- Wrong behavior: implementation contradicts the spec
- Incomplete integration: backend built but not wired, UI built but not navigable

HIGH (should fix):
- Partial implementation: feature started but not finished (e.g., route exists but no UI)
- Missing real-time features: spec requires WebSocket but only HTTP implemented
- Missing tests: spec mentions test requirements that aren't met

MEDIUM (fix or justify):
- Deferred items not documented: work skipped without explanation
- Spec ambiguity exploited: implementation chose the easier interpretation

**Your process**:
1. Read the spec document (HANDOFF.md, RFC, issue) thoroughly
2. Create a checklist of every requirement mentioned
3. For each requirement: search the codebase for the implementation
4. Score each: COMPLETE, PARTIAL (with % and what's missing), or MISSING
5. Check for requirements that are implemented differently than specified

**Your deliverable**: Send a message to the team lead with:
1. Feature-by-feature compliance matrix:
   | Feature | Spec Says | Implementation Status | Evidence |
   |---------|-----------|----------------------|----------|
   | Feature name | What was required | COMPLETE/PARTIAL/MISSING | file:line |
2. Gap analysis: what's missing and how critical each gap is
3. Deviation analysis: where implementation differs from spec
4. Completeness score: X/Y requirements met

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share compliance findings with architecture-evaluator to flag structural gaps via SendMessage.
</spec_compliance_evaluator_prompt>

<security_evaluator_prompt>
You are the **Security Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Security engineer
**Your mandate**: OWASP-focused audit of new code. Find injection vectors, auth gaps, data exposure, and unsafe patterns.

**Your checklist** (CRITICAL severity):
- Hardcoded credentials, API keys, tokens, or secrets
- SQL injection: unsanitized input in queries
- XSS: unescaped user input rendered in HTML/JSX
- Missing input validation at API boundaries
- Path traversal: unsanitized file paths from user input
- Improper auth or authorization checks on new endpoints
- Sensitive data in logs, error messages, or client responses
- CSRF: state-changing operations without CSRF protection

**Tools available**: Read, Grep, Glob, Bash (npm audit, secret pattern scanning)

**Your process**:
1. Read all changed files, focusing on input handling and data flow
2. Trace user input from entry point to storage/output
3. Check for secrets patterns: grep for API_KEY, SECRET, TOKEN, PASSWORD, PRIVATE_KEY
4. Run `npm audit` if dependencies changed
5. Check new endpoints for proper authentication/authorization

**Your deliverable**: Send a message to the team lead with:
1. Security issues found (severity, file:line, description, OWASP category)
2. "CLEAN" if no issues found
3. Security constraints for any recommended fixes

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about security issues that affect their domain via SendMessage.
</security_evaluator_prompt>
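Step 3 of the security process above can be sketched as a small grep helper. The pattern list comes straight from the checklist; the exclusion list and the use of `git diff --name-only` to enumerate files are illustrative assumptions, not the evaluator's exact command.

```shell
#!/usr/bin/env sh
# Sketch of the security evaluator's secret scan: grep candidate files
# for common credential patterns, skipping lockfiles and build output.
set -u

PATTERNS='API_KEY|SECRET|TOKEN|PASSWORD|PRIVATE_KEY'

scan_secrets() {
  # $@ = files to scan; prints file:line:match for each hit
  grep -nE "$PATTERNS" "$@" 2>/dev/null \
    | grep -vE 'node_modules|\.lock|dist/' || true
}

# Illustrative driver: scan whatever git reports as changed
git diff --name-only 2>/dev/null | while read -r f; do
  [ -f "$f" ] && scan_secrets "$f"
done
```

Any hit is a lead to read in context, not automatically a finding; the evaluator still confirms whether the match is a real credential.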

<test_coverage_evaluator_prompt>
You are the **Test Coverage Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: QA specialist
**Your mandate**: Assess test coverage for new code. Identify untested paths, missing edge cases, and test quality issues. Run the test suite.

**Your checklist**:
HIGH:
- New modules with zero test coverage
- New endpoints with no route-level tests
- Business logic without unit tests
- Error paths not tested (what happens when things fail?)

MEDIUM:
- Missing edge case tests: null input, empty collections, boundary values, concurrent access
- Assertion quality: tests that pass but don't actually verify behavior
- Mock correctness: mocks that don't reflect real behavior

**Tools available**: Read, Grep, Glob, Bash (including running tests and linting)

**Your process**:
1. List all new/changed source files
2. For each, find corresponding test files (or note their absence)
3. Read existing tests to assess what's covered
4. Run the full test suite and report results
5. Run the linter if available and report results
6. Identify the highest-value missing tests

**Your deliverable**: Send a message to the team lead with:
1. Test suite results: PASS or FAIL (with failure details)
2. Coverage matrix: source file -> test file -> coverage assessment
3. Missing tests ranked by risk (what's most likely to break in production)
4. Edge cases that should be tested

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share test results with other evaluators via SendMessage.
</test_coverage_evaluator_prompt>
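The "run the suite and report results" steps can be sketched as a tiny PASS/FAIL recorder. This is a sketch: in a real repository the recorded commands would be the project's own scripts (e.g. `npm test`, `npm run lint`); here `true` and `false` stand in so the pattern is self-contained.

```shell
#!/usr/bin/env sh
# Sketch: run a named check and record a PASS/FAIL line for the report.
set -u

run_step() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: PASS"
  else
    echo "$name: FAIL"
  fi
}

# Illustrative invocations (stand-ins for the project's real scripts):
run_step unit-tests true    # prints "unit-tests: PASS"
run_step lint false         # prints "lint: FAIL"
```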

<performance_evaluator_prompt>
You are the **Performance Evaluator** on an Agent Team evaluating work quality.

**Your perspective**: Performance engineer
**Your mandate**: Find polling overhead, memory leaks, unnecessary re-renders, N+1 patterns, and unbounded operations.

**Your checklist** (HIGH severity):
- Polling without backoff or cleanup (setInterval without clearInterval)
- N+1 patterns: database or API calls inside loops
- Unbounded data: missing pagination, limits, or streaming
- Memory leaks: event listeners, subscriptions, timers not cleaned up
- React: missing memo, unstable references causing re-renders, inline objects in render
- O(n^2) or worse where O(n) is feasible
- Large synchronous operations blocking the event loop

**Tools available**: Read, Grep, Glob, Bash

**Your process**:
1. Read all changed files focusing on data flow and lifecycle
2. Check every useEffect for proper cleanup
3. Check every setInterval/setTimeout for cleanup on unmount
4. Look for loops that make async calls
5. Check for unbounded data fetching patterns

**Your deliverable**: Send a message to the team lead with:
1. Performance issues found (severity, file:line, description, estimated impact)
2. Resource lifecycle assessment (are all timers/listeners/subscriptions cleaned up?)
3. Optimization recommendations

Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about performance issues via SendMessage.
</performance_evaluator_prompt>

## Phase 3: PARALLEL EVALUATION

All evaluators work simultaneously. They will:
- Evaluate from their unique perspective using their checklist
- Message each other about cross-cutting concerns
- Send their final findings to you (Evaluation Lead)

Wait for all evaluators to report back. If an evaluator goes idle after sending findings, that's normal — they're done with their evaluation.

## Phase 4: SYNTHESIS (You, Evaluation Lead)

After receiving all evaluator findings:

1. Read all findings carefully
2. Deduplicate: if multiple evaluators flagged the same file:line, merge into one finding at the highest severity
3. Cross-reference findings: do issues from one evaluator explain findings from another?
4. Classify each finding with evidence quality:
   - **CONFIRMED**: evaluator provided file:line evidence and the issue is verifiable
   - **LIKELY**: evaluator's reasoning is sound but evidence is circumstantial
   - **SPECULATIVE**: remove these — the mandate is evidence-based only
5. Group findings by severity, then by file

## Phase 5: REPORT

Present the evaluation report to the user.

## Phase 6: CLEANUP

After evaluation is complete:
1. Send `shutdown_request` to all evaluators via SendMessage
2. Wait for shutdown confirmations
3. Call TeamDelete to clean up the team

</team_workflow>

<output_format>
Present the evaluation in this format:

<evaluation_report>

### Evaluation Complete

**Verdict**: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
- BLOCKED: any CRITICAL issues remain
- NEEDS WORK: HIGH issues that should be fixed before merging
- PASS WITH ISSUES: MEDIUM/LOW issues noted but shippable
- PASS: clean across all evaluators

**Team Composition**: [N] evaluators
- **Correctness**: [N issues / Clean]
- **Architecture**: [N issues / Clean]
- **[Conditional evaluators]**: [summary]

**Spec Compliance** (if applicable):
- [X/Y] requirements fully implemented
- [list any PARTIAL or MISSING items]

### Findings by Severity

**CRITICAL** (must fix):
- [severity/domain] `file:line` — [description] — Evidence: [what proves this is an issue]

**HIGH** (should fix):
- [severity/domain] `file:line` — [description]

**MEDIUM** (fix or justify):
- [severity/domain] `file:line` — [description]

**LOW** (note):
- [severity/domain] `file:line` — [description]

### Cross-Cutting Patterns
- [Patterns that appeared across multiple evaluators, e.g., "silent error handling in 5 files"]

### What's Good
- [Explicitly call out things done well — balanced feedback prevents over-correction]

### Recommendation
[Next action — e.g., "Fix the 3 CRITICAL issues, then run `/devlyn.team-review` for a full review" or "Ship it"]

</evaluation_report>
</output_format>
|
@@ -471,7 +471,7 @@ After receiving all teammate findings:
 1. Read all findings carefully
 2. If teammates disagree on root cause → re-examine the contested evidence yourself by reading the specific files and lines they reference
 3. Compile a unified root cause analysis
-4. If the fix is complex (multiple files, architectural change) → enter plan mode and present to user for approval
+4. If the fix is complex (multiple files, architectural change) → call the `EnterPlanMode` tool to enter plan mode and present the implementation plan to the user for approval before writing any code
 5. If the fix is simple and all teammates agree → proceed directly
 
 Present the synthesis to the user before implementing.
@@ -492,7 +492,7 @@ Workaround indicators (if you catch yourself doing any of these, STOP):
 
 If the true fix requires significant refactoring:
 1. Document why in the root cause analysis
-2.
+2. Call the `EnterPlanMode` tool to present the scope to the user and get approval before proceeding
 3. Get approval before proceeding
 4. Never ship a workaround "for now"
 </no_workarounds>
@@ -67,6 +67,36 @@ Detect as `field_type: "tip_box"` with `action: "delete"`.
 
 **`has_formatting` flag**: For mapped fields where `mapped_value` is >100 chars and contains markdown syntax (`**bold**`, `## heading`, `- bullet`, `1. numbered`), set `has_formatting: true`.
 
+### 8. Korean Template Placeholder Patterns
+These patterns indicate unfilled fields that MUST be replaced with real values:
+- `OO` (double O) — placeholder for names, organizations, fields of study (e.g., "OO학", "OO기업", "OO전자")
+- `00.00` — placeholder for dates (e.g., "00.00 ~ 00.00" means "MM.YY ~ MM.YY")
+- `00명` / `00개` / `00년` — placeholder for counts/durations
+- `000원` / `0,000,000` — placeholder for amounts
+- `'00.00` — placeholder for dates in parenthetical context (e.g., "완료('00.00)")
+
+These are NOT empty cells — they contain placeholder text that looks like data. The analyzer MUST detect them and map real values from source context.
+
+## Section Content Quality Standards
+
+For `section_content` fields (the narrative body of each numbered section), the `mapped_value` must be a **complete, professional-quality narrative** — NOT raw data extracts.
+
+### What "good" section content looks like:
+- 500+ characters per section minimum
+- Specific statistics from survey/research data (e.g., "83%가 사용 의향", "월 9,900원")
+- Named organizations (e.g., "한국난독증협회", "웅진씽크빅")
+- Concrete implementation plans with phases
+- Market analysis with TAM/SAM/SOM numbers
+- Sub-sections following the template's `◦` heading structure
+- Evidence-based claims with source attribution
+
+### What "bad" section content looks like:
+- Just the template headings without substance
+- Raw bullet points from source data without synthesis
+- Generic descriptions without specific numbers
+- Placeholder text remaining (OO, 00.00)
+- Less than 200 characters
+
 ## Section Detection
 
 Group fields into logical sections:
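The placeholder patterns added in the hunk above are mechanically detectable. A minimal grep sketch follows; the combined pattern is an illustration of the idea, not the analyzer's actual implementation (which works on parsed DOCX/HWPX fields, not raw text).

```shell
#!/usr/bin/env sh
# Sketch: flag Korean-template placeholder text (OO names, 00.00 dates,
# 00명/00개/00년 counts, 000원 amounts) that survived filling.
set -u

find_placeholders() {
  # prints file:line:text for every suspected unfilled placeholder
  grep -nE "OO|00\.00|00명|00개|00년|000원|0,000,000" "$@" || true
}
```

Run over the filled document's extracted text, any hit marks a field that should have been mapped to a real value.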
@@ -181,7 +211,8 @@ Write to `.dokkit/analysis.json`:
     "image_fields": 2,
     "image_fields_sourced": 1,
     "image_fields_pending": 1,
-    "tip_boxes": 3
+    "tip_boxes": 3,
+    "section_image_opportunities": 6
   }
 }
 ```
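These counters can be read back for progress reporting. A sketch — the enclosing `"stats"` key is an assumption inferred from the snippet's nesting, so adjust it to the actual `analysis.json` layout:

```python
import json

def sourcing_progress(analysis_path: str = ".dokkit/analysis.json") -> str:
    """Summarize image-sourcing progress from the analysis stats block.

    Assumes the counters sit under a top-level "stats" key (hypothetical).
    """
    with open(analysis_path, encoding="utf-8") as f:
        stats = json.load(f)["stats"]
    return (f'Sourced {stats["image_fields_sourced"]}/{stats["image_fields"]} '
            f'cell images, {stats["section_image_opportunities"]} section opportunities')
```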
@@ -187,12 +187,12 @@ File path to the template document (DOCX or HWPX).
 
 **Phase 3 — Source Images**:
 7. **Cell-level images**: For each `field_type: "image"` with `image_file: null` and `image_type: "figure"`:
-   - Run: `python scripts/source_images.py generate --prompt "<prompt>" --preset technical_illustration --output-dir .dokkit/images/ --project-dir . --lang ko`
+   - Run: `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "<prompt>" --preset technical_illustration --output-dir .dokkit/images/ --project-dir . --lang ko`
    - Parse `__RESULT__` JSON, update `analysis.json`
    - Skip photo/signature types (require user-provided files)
    - Default `--lang ko` (Korean only). Override with user instruction if needed.
 8. **Section content images**: For each `image_opportunities` entry with `status: "pending"`:
-   - Run: `python scripts/source_images.py generate --prompt "<generation_prompt>" --preset <preset> --output-dir .dokkit/images/ --project-dir . --lang ko`
+   - Run: `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "<generation_prompt>" --preset <preset> --output-dir .dokkit/images/ --project-dir . --lang ko`
    - On failure: set `status: "skipped"`, log reason
    - Use `--lang ko+en` if the content contains technical terms that benefit from English (e.g., architecture diagrams with API names).
 9. Report: "Sourced X/Y images"
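The `__RESULT__` handshake in step 7 can be wired up as below. This is a sketch: it assumes the script prints its JSON payload on the same line as the `__RESULT__` marker, and the `source_image` wrapper merely mirrors the documented CLI invocation:

```python
import json
import subprocess

RESULT_MARKER = "__RESULT__"

def parse_result(stdout: str) -> dict:
    """Extract the JSON payload from a __RESULT__ line of script output."""
    for line in stdout.splitlines():
        if line.startswith(RESULT_MARKER):
            return json.loads(line[len(RESULT_MARKER):].strip())
    raise ValueError("no __RESULT__ line found in script output")

def source_image(prompt: str, preset: str = "technical_illustration") -> dict:
    """Run the generator with the documented flags and parse its result."""
    proc = subprocess.run(
        ["python", ".claude/skills/dokkit/scripts/source_images.py", "generate",
         "--prompt", prompt, "--preset", preset,
         "--output-dir", ".dokkit/images/", "--project-dir", ".", "--lang", "ko"],
        capture_output=True, text=True, check=True,
    )
    return parse_result(proc.stdout)
```

On success the parsed dict is what gets merged back into `analysis.json`.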
@@ -200,14 +200,21 @@ File path to the template document (DOCX or HWPX).
 **Phase 4 — Fill**:
 10. Spawn the **dokkit-filler** agent in fill mode
 
-**Phase 5 —
-11.
-
-  -
-  -
-  -
-
-
+**Phase 5 — Quality Gates and Auto-Fix Loop**:
+11. Run quality gates on the filled document (check in the output DOCX XML):
+    - **QG1**: Total text character count ≥ 6,000 (production target: 7,500+)
+    - **QG2**: Zero remaining `00.00` date placeholders (search for `00.00` in text)
+    - **QG3**: Zero remaining `OO` name placeholders (search for `OO학`, `OO기업`, `OO전자` patterns)
+    - **QG4**: Zero remaining `이미지 영역` ("image area") text
+    - **QG5**: Each section_content field has ≥ 300 chars filled
+    - **QG6**: Image count ≥ 10 (drawings in the document)
+12. **Auto-fix**: For each failed quality gate, spawn **dokkit-filler** in modify mode:
+    - QG1/QG5 fail: "Enrich section content with more detail from source data. Sections need specific statistics, market analysis, product descriptions."
+    - QG2/QG3 fail: "Replace remaining placeholders: [list of 00.00 and OO locations] with values derived from source context."
+    - QG4 fail: "Remove remaining '이미지 영역' placeholder text at [locations]."
+    - QG6 fail: Image generation may have failed — log but don't block export.
+13. If auto-fix made changes, re-run quality gates. Maximum 2 iterations.
+14. Present the **final review** table (section-by-section with confidence) and quality gate results
 
 **Phase 6 — Export**:
 15. Export in same format as input template via **dokkit-exporter** agent
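The XML-level gates (QG1–QG4 and QG6) can be checked directly against the output DOCX. A minimal sketch with the thresholds from step 11 — the function name and return shape are illustrative, and QG5 is omitted because it needs per-field ranges from `analysis.json`:

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def run_quality_gates(docx_path: str) -> dict:
    """Check QG1-QG4 and QG6 on a filled DOCX; True means the gate passes."""
    with zipfile.ZipFile(docx_path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    # Concatenate all visible run text from w:t elements.
    text = "".join(t.text or "" for t in root.iter(f"{W}t"))
    drawings = sum(1 for _ in root.iter(f"{W}drawing"))
    return {
        "QG1": len(text) >= 6000,
        "QG2": "00.00" not in text,
        "QG3": not re.search(r"OO(?:학|기업|전자)", text),
        "QG4": "이미지 영역" not in text,
        "QG6": drawings >= 10,
    }
```

Any `False` entry feeds the corresponding auto-fix prompt in step 12.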
@@ -219,7 +226,7 @@ File path to the template document (DOCX or HWPX).
 ### Delegation
 
 **Agent 1 — Analyzer** (dokkit-analyzer):
-> "Analyze the template at `<path>`. Detect
+> "Analyze the template at `<path>`. Detect ALL fillable fields INCLUDING image fields, Korean placeholder patterns (OO, 00.00), and section content areas. For section_content fields: SYNTHESIZE rich, detailed narrative content from ALL available sources — do NOT just extract raw data. Each section must have 500+ chars with specific statistics, named organizations, concrete plans. For table_content fields: generate specific values (real dates, amounts, names) from source context. Write `analysis.json`."
 
 **Agent 2 — Filler** (dokkit-filler, fill mode):
 > "Fill the template using `analysis.json`. Mode: fill. Insert images where `image_file` is populated. Interleave section content images at anchor points."
@@ -265,8 +272,8 @@ File path to the template document (DOCX or HWPX).
 > "Analyze the template at `<path>`. Detect all fillable fields INCLUDING image fields. Map to sources. Write `analysis.json`."
 
 **Image sourcing** (inline, between agents):
-- **Pass A — Cell-level**: For `field_type: "image"` with `image_file: null` and `image_type: "figure"`, run `python scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
-- **Pass B — Section content**: For `image_opportunities` with `status: "pending"`, run `python scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
+- **Pass A — Cell-level**: For `field_type: "image"` with `image_file: null` and `image_type: "figure"`, run `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
+- **Pass B — Section content**: For `image_opportunities` with `status: "pending"`, run `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
 - Default language is `ko` (Korean only). Use `--lang ko+en` for mixed content, or `--lang en` for English-only.
 
 **Then**: Spawn the dokkit-filler agent in fill mode:
@@ -369,9 +369,28 @@ Always call after row insertion/deletion. Duplicate rowAddr causes Polaris to si
 - `hp:pos`: `flowWithText="0"` `horzRelTo="COLUMN"`
 - Sequential IDs: find max existing `id` in section XML + 1
 
+### Rule 10: Section Content Table Preservation (DOCX + HWPX)
+
+When filling `section_content` fields, the content range often contains embedded `<w:tbl>` (DOCX) or `<hp:tbl>` (HWPX) elements — schedule tables, budget tables, team rosters. These are handled separately as `table_content` fields.
+
+**NEVER remove or replace table elements during section content filling.** Only operate on paragraph elements (`<w:p>` / `<hp:p>`).
+
+```python
+# DOCX: Only remove paragraphs within range, skip tables
+W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
+children = list(body)
+for i in range(start_idx, end_idx + 1):
+    child = children[i]
+    tag = child.tag.split('}')[-1]
+    if tag == 'p':
+        body.remove(child)  # Replace with new content
+    # else: skip — tables, bookmarks, sectPr are preserved
+```
+
 ## References
 
 See `references/field-detection-patterns.md` for advanced detection heuristics.
 See `references/section-range-detection.md` for dynamic section content range detection (HWPX).
+See `references/docx-section-range-detection.md` for dynamic section content range detection (DOCX).
 See `references/section-image-interleaving.md` for image interleaving algorithm in section content.
 See `references/image-xml-patterns.md` for complete image element structures and `build_hwpx_pic_element()`.
@@ -25,7 +25,7 @@ Via `/dokkit modify "use <file>"`:
 
 ### 3. AI Generation
 ```bash
-python scripts/source_images.py generate \
+python .claude/skills/dokkit/scripts/source_images.py generate \
   --prompt "인포그래픽: AI 감정 케어 플랫폼 4단계 로드맵" \
   --preset infographic \
   --output-dir .dokkit/images/ \
@@ -61,7 +61,7 @@ Use `--aspect-ratio 16:9` to override. Use `--no-enhance` to skip preset style i
 
 ### 4. Web Search
 ```bash
-python scripts/source_images.py search \
+python .claude/skills/dokkit/scripts/source_images.py search \
   --query "company logo example" \
   --output-dir .dokkit/images/
 ```