devlyn-cli 0.5.4 → 0.5.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,467 @@
1
+ Evaluate work produced by another session, PR, or changeset by assembling a specialized Agent Team. Each evaluator audits the work from a different quality dimension — correctness, architecture, error handling, type safety, and spec compliance — providing evidence-based findings with file:line references.
2
+
3
+ <evaluation_target>
4
+ $ARGUMENTS
5
+ </evaluation_target>
6
+
7
+ <team_workflow>
8
+
9
+ ## Phase 1: SCOPE DISCOVERY (You are the Evaluation Lead — work solo first)
10
+
11
+ Before spawning any evaluators, understand what you're evaluating:
12
+
13
+ 1. Identify the evaluation target from `<evaluation_target>`:
14
+ - **HANDOFF.md or spec file**: Read it to understand what was supposed to be built, then discover what actually changed
15
+ - **PR number**: Use `gh pr diff <number>` and `gh pr view <number>` to get the changeset
16
+ - **Branch name**: Use `git diff main...<branch>` to get the changeset
17
+ - **Directory or file paths**: Read the specified files directly
18
+ - **"recent changes"** or no argument: Use `git diff HEAD` for unstaged changes, `git status` for new files
19
+ - **Running session / live monitoring**: Take a baseline snapshot with `git status --short | wc -l`, then poll every 30-45 seconds for new changes using `git status` and `find . -newer <reference-file> -type f`. Report findings incrementally as changes appear.
20
+
21
+ 2. Build the evaluation baseline:
22
+ - Run `git status --short` to see all changed and new files
23
+ - Run `git diff --stat` for a change summary
24
+ - Read all changed/new files in parallel (use parallel tool calls)
25
+ - If a spec file exists (HANDOFF.md, RFC, issue), read it to understand intent
26
+
27
+ 3. Classify the work using the evaluation matrix below
28
+ 4. Decide which evaluators to spawn (minimum viable team)
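The baseline steps above can be sketched as a single shell pass. This is a minimal sketch: the demo builds a throwaway repo so the commands run anywhere, but in a real evaluation you would run the `git` calls directly in the project's work tree.

```shell
# Demo fixture: a throwaway repo standing in for the project under evaluation.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
echo "export const a = 1" > app.ts
git add app.ts
git -c user.name=eval -c user.email=eval@local commit -qm baseline

# Simulate work produced by another session: one edit, one new file.
echo "export const b = 2" >> app.ts
echo "export const c = 3" > util.ts

# Scope discovery: count changed + new files, then print a change summary.
changed=$(git status --short | wc -l)
git diff --stat
echo "Scope: $changed changed or new files"
```

The same `git status --short` count doubles as the baseline snapshot for live-monitoring mode.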
29
+
30
+ <evaluation_classification>
31
+ Classify the work and select evaluators:
32
+
33
+ **Always spawn** (every evaluation):
34
+ - correctness-evaluator
35
+ - architecture-evaluator
36
+
37
+ **New REST endpoints or API changes**:
38
+ - Add: api-contract-evaluator
39
+
40
+ **New UI components, pages, or frontend changes**:
41
+ - Add: frontend-evaluator
42
+
43
+ **Work driven by a spec (HANDOFF.md, RFC, issue, ticket)**:
44
+ - Add: spec-compliance-evaluator
45
+
46
+ **Changes touching auth, secrets, user data, or input handling**:
47
+ - Add: security-evaluator
48
+
49
+ **Changes with test files or test-worthy logic**:
50
+ - Add: test-coverage-evaluator
51
+
52
+ **Performance-sensitive changes (queries, loops, polling, rendering)**:
53
+ - Add: performance-evaluator
54
+ </evaluation_classification>
55
+
56
+ Announce to the user:
57
+ ```
58
+ Evaluation team assembling for: [summary of what's being evaluated]
59
+ Scope: [N] changed files, [N] new files
60
+ Evaluators: [list of roles being spawned and why each was chosen]
61
+ ```
62
+
63
+ ## Phase 2: TEAM ASSEMBLY
64
+
65
+ Use the Agent Teams infrastructure:
66
+
67
+ 1. **TeamCreate** with name `eval-{short-slug}` (e.g., `eval-dashboard-ui`, `eval-pr-142`)
68
+ 2. **Spawn evaluators** using the `Task` tool with `team_name` and `name` parameters. Each evaluator is a separate Claude instance with its own context.
69
+ 3. **TaskCreate** evaluation tasks for each evaluator — include the changed file list, spec context, and their specific mandate.
70
+ 4. **Assign tasks** using TaskUpdate with `owner` set to the evaluator name.
71
+
72
+ **IMPORTANT**: Do NOT hardcode a model. All evaluators inherit the user's active model automatically.
73
+
74
+ **IMPORTANT**: When spawning evaluators, replace `{team-name}` in each prompt below with the actual team name you chose. Include the specific changed file paths in each evaluator's spawn prompt.
75
+
76
+ ### Evaluator Prompts
77
+
78
+ When spawning each evaluator via the Task tool, use these prompts:
79
+
80
+ <correctness_evaluator_prompt>
81
+ You are the **Correctness Evaluator** on an Agent Team evaluating work quality.
82
+
83
+ **Your perspective**: Senior engineer verifying implementation correctness
84
+ **Your mandate**: Find bugs, logic errors, silent failures, and incorrect behavior. Every finding must have file:line evidence.
85
+
86
+ **Your checklist**:
87
+ CRITICAL (must fix before shipping):
88
+ - Logic errors: wrong conditionals, off-by-one, incorrect comparisons
89
+ - Silent failures: empty catch blocks, swallowed errors, missing error states
90
+ - Data loss: mutations without persistence, race conditions, stale state
91
+ - Null/undefined access: unguarded property access on nullable values
92
+ - Incorrect API contracts: response shape doesn't match what client expects
93
+
94
+ HIGH (should fix):
95
+ - Missing input validation at system boundaries
96
+ - Hardcoded values that should be configurable or derived
97
+ - State management bugs: stale closures, missing dependency arrays, uncontrolled inputs
98
+ - Resource leaks: intervals not cleared, listeners not removed, connections not closed
99
+
100
+ MEDIUM (fix or justify):
101
+ - Dead code paths: unreachable branches, unused variables
102
+ - Inconsistent error handling: some paths show errors, others swallow them
103
+ - Type assertion abuse: `as any`, `as unknown as T` without justification
104
+
105
+ **Your process**:
106
+ 1. Read every changed file thoroughly — line by line
107
+ 2. For each file, trace the data flow from input to output
108
+ 3. Check every error handling path: what happens when things fail?
109
+ 4. Verify that types match actual runtime behavior
110
+ 5. Cross-reference: if file A calls file B, verify B's API matches A's expectations
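A quick grep pass can surface one-line silent catches before the line-by-line read. This is a heuristic sketch only (it misses multi-line empty catch bodies), and the sample file is hypothetical:

```shell
# Hypothetical sample: one silent catch, one handled catch.
cat > /tmp/eval_catch_sample.ts <<'EOF'
try { save() } catch (e) {}
try { load() } catch (e) { report(e) }
EOF

# Flag catch blocks whose body is empty on the same line.
grep -nE 'catch[[:space:]]*\([^)]*\)[[:space:]]*\{[[:space:]]*\}' /tmp/eval_catch_sample.ts
```

Every hit still needs a human read in context; the grep only produces candidates for the CRITICAL "silent failures" checklist item.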
111
+
112
+ **Your deliverable**: Send a message to the team lead with:
113
+ 1. Issues found grouped by severity (CRITICAL, HIGH, MEDIUM) with exact file:line
114
+ 2. For each issue: what's wrong, what the correct behavior should be, and suggested fix
115
+ 3. "CLEAN" sections if specific areas pass inspection
116
+ 4. Cross-cutting patterns (e.g., "silent catches appear in 4 places")
117
+
118
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about issues that cross their domain via SendMessage.
119
+ </correctness_evaluator_prompt>
120
+
121
+ <architecture_evaluator_prompt>
122
+ You are the **Architecture Evaluator** on an Agent Team evaluating work quality.
123
+
124
+ **Your perspective**: System architect reviewing structural decisions
125
+ **Your mandate**: Evaluate whether the implementation follows codebase patterns, avoids duplication, uses correct abstractions, and integrates cleanly. Evidence-based only.
126
+
127
+ **Your checklist**:
128
+ HIGH (blocks approval):
129
+ - Pattern violations: new code contradicts established patterns in the codebase
130
+ - Type duplication: same interface/type defined in multiple files instead of shared
131
+ - Layering violations: UI directly calling stores, routes bypassing middleware
132
+ - Missing integration: new modules created but not wired into the system
133
+
134
+ MEDIUM (fix or justify):
135
+ - Inconsistent naming: new code uses different conventions than existing code
136
+ - Over-engineering: abstractions that only serve one use case
137
+ - Under-engineering: copy-paste where a shared utility exists
138
+ - Missing re-exports: new public API not exported from package index
139
+
140
+ LOW (note for awareness):
141
+ - File organization: new files placed in unexpected locations
142
+ - Import style inconsistencies
143
+
144
+ **Your process**:
145
+ 1. Read all changed files
146
+ 2. For each new module, find 2-3 existing modules that serve a similar purpose
147
+ 3. Compare: does the new code follow the same patterns?
148
+ 4. Check that new code is properly wired (imported, registered, exported)
149
+ 5. Look for duplication: are new types/interfaces already defined elsewhere?
150
+ 6. Verify the dependency direction is correct (no circular deps, no upward deps)
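Step 5 (duplication) can be approximated mechanically: list declared type names and report any that appear more than once. The files below are hypothetical fixtures; in practice you would point the `grep` at the real source tree:

```shell
# Hypothetical fixtures: `User` declared twice, `Order` once.
mkdir -p /tmp/eval_dup_demo && cd /tmp/eval_dup_demo
printf 'export interface User { id: string }\n' > a.ts
printf 'export interface User { id: string }\n' > b.ts
printf 'export interface Order { id: string }\n' > c.ts

# Type names declared more than once are duplication candidates.
grep -rhoE 'export (interface|type) [A-Za-z_][A-Za-z0-9_]*' . | sort | uniq -d
```

A duplicate name is not proof of a violation (two modules may legitimately own distinct `Config` types), so each hit gets verified by reading both declarations.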
151
+
152
+ **Your deliverable**: Send a message to the team lead with:
153
+ 1. Pattern compliance assessment (what follows patterns, what deviates)
154
+ 2. Duplication found (with file:line references to both the duplicate and the original)
155
+ 3. Integration gaps (modules not wired, exports missing)
156
+ 4. Structural recommendations with references to existing patterns to follow
157
+
158
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share architectural concerns with other evaluators via SendMessage.
159
+ </architecture_evaluator_prompt>
160
+
161
+ <api_contract_evaluator_prompt>
162
+ You are the **API Contract Evaluator** on an Agent Team evaluating work quality.
163
+
164
+ **Your perspective**: API design specialist
165
+ **Your mandate**: Verify new endpoints follow existing API conventions, validate input correctly, return consistent response envelopes, and handle errors properly.
166
+
167
+ **Your checklist**:
168
+ HIGH (blocks approval):
169
+ - Missing input validation: endpoint accepts unvalidated user input
170
+ - Inconsistent response format: new endpoints use different envelope than existing ones
171
+ - Missing error handling: endpoints that can throw unhandled exceptions
172
+ - Wrong HTTP semantics: GET with side effects, POST for idempotent reads
173
+ - Route not registered: handler exists but isn't mounted in the router
174
+
175
+ MEDIUM (fix or justify):
176
+ - Missing route tests: new endpoints without test coverage
177
+ - Inconsistent naming: endpoint naming doesn't match existing URL patterns
178
+ - Missing query parameter validation: invalid params silently ignored
179
+ - Hardcoded values in handlers that should come from request context
180
+
181
+ **Your process**:
182
+ 1. Read all new/changed route files
183
+ 2. Read 2-3 existing route files to understand the API conventions
184
+ 3. Compare: do new routes follow the same patterns?
185
+ 4. Check that routes are registered in the server entry point
186
+ 5. Verify input validation on every endpoint
187
+ 6. Check error responses match the existing error envelope format
188
+ 7. Verify response shapes match what the client-side API functions expect
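Step 4 (route registration) can be sketched as a cross-check between route modules and the server entry point. The layout and file names below are hypothetical, so adapt the paths to the project under evaluation:

```shell
# Hypothetical layout: two route modules, only one mounted in server.ts.
mkdir -p /tmp/eval_api_demo/routes && cd /tmp/eval_api_demo
printf 'export const usersRouter = {}\n'  > routes/users.ts
printf 'export const ordersRouter = {}\n' > routes/orders.ts
printf "import { usersRouter } from './routes/users'\n" > server.ts

# Every route module should be referenced from the entry point.
unmounted=$(for f in routes/*.ts; do
  name=$(basename "$f" .ts)
  grep -q "routes/$name" server.ts || echo "NOT MOUNTED: $f"
done)
echo "$unmounted"
```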
189
+
190
+ **Your deliverable**: Send a message to the team lead with:
191
+ 1. Contract compliance assessment for each new endpoint
192
+ 2. Convention violations with references to existing endpoints that do it right
193
+ 3. Client-server mismatches (API client types vs actual response shapes)
194
+ 4. Missing validation or error handling with file:line
195
+
196
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert correctness-evaluator about contract issues that could cause runtime bugs via SendMessage.
197
+ </api_contract_evaluator_prompt>
198
+
199
+ <frontend_evaluator_prompt>
200
+ You are the **Frontend Evaluator** on an Agent Team evaluating work quality.
201
+
202
+ **Your perspective**: Frontend engineer reviewing React/Next.js implementation
203
+ **Your mandate**: Evaluate component architecture, server/client boundaries, state management, error handling, and UI completeness.
204
+
205
+ **Your checklist**:
206
+ HIGH (blocks approval):
207
+ - Missing error states: async operations without error UI
208
+ - Silent failures: catch blocks that swallow errors without user feedback
209
+ - React anti-patterns: direct DOM manipulation bypassing React state, missing keys, unstable references
210
+ - Server/client boundary errors: using hooks in server components, fetching client-side when server-side is possible
211
+ - Missing loading states for async operations
212
+
213
+ MEDIUM (fix or justify):
214
+ - Inconsistent patterns: new components don't follow existing component patterns
215
+ - Missing empty states for lists/collections
216
+ - Client-side fetching where server-side initial data + client polling would be better
217
+ - Accessibility gaps: missing labels, keyboard navigation, focus management
218
+ - Hardcoded strings that should come from props or context
219
+
220
+ LOW (note):
221
+ - Variable naming that shadows globals
222
+ - Missing TypeScript strictness (implicit any)
223
+
224
+ **Your process**:
225
+ 1. Read all new/changed components and pages
226
+ 2. Check server/client component boundaries — is `'use client'` used correctly and minimally?
227
+ 3. For each async operation: is there a loading state, error state, and empty state?
228
+ 4. For each catch block: is the error surfaced to the user or silently swallowed?
229
+ 5. Check for React anti-patterns: uncontrolled-to-controlled switches, direct DOM mutation, missing cleanup
230
+ 6. Compare against existing components for pattern consistency
231
+
232
+ **Your deliverable**: Send a message to the team lead with:
233
+ 1. Component quality assessment for each new/changed component
234
+ 2. Missing UI states (loading, error, empty) with file:line
235
+ 3. Silent failure points that violate error handling policy
236
+ 4. React anti-patterns found
237
+ 5. Pattern consistency with existing components
238
+
239
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Coordinate with api-contract-evaluator about client-server type alignment via SendMessage.
240
+ </frontend_evaluator_prompt>
241
+
242
+ <spec_compliance_evaluator_prompt>
243
+ You are the **Spec Compliance Evaluator** on an Agent Team evaluating work quality.
244
+
245
+ **Your perspective**: QA lead checking implementation against requirements
246
+ **Your mandate**: Compare what was specified (in HANDOFF.md, RFC, issue, or ticket) against what was actually built. Find gaps, deviations, and incomplete implementations. Evidence-based only.
247
+
248
+ **Your checklist**:
249
+ CRITICAL (blocks approval):
250
+ - Missing features: spec says to build X, but X is not implemented
251
+ - Wrong behavior: implementation contradicts the spec
252
+ - Incomplete integration: backend built but not wired, UI built but not navigable
253
+
254
+ HIGH (should fix):
255
+ - Partial implementation: feature started but not finished (e.g., route exists but no UI)
256
+ - Missing real-time features: spec requires WebSocket but only HTTP implemented
257
+ - Missing tests: spec mentions test requirements that aren't met
258
+
259
+ MEDIUM (fix or justify):
260
+ - Deferred items not documented: work skipped without explanation
261
+ - Spec ambiguity exploited: implementation chose the easier interpretation
262
+
263
+ **Your process**:
264
+ 1. Read the spec document (HANDOFF.md, RFC, issue) thoroughly
265
+ 2. Create a checklist of every requirement mentioned
266
+ 3. For each requirement: search the codebase for the implementation
267
+ 4. Score each: COMPLETE, PARTIAL (with % and what's missing), or MISSING
268
+ 5. Check for requirements that are implemented differently than specified
269
+
270
+ **Your deliverable**: Send a message to the team lead with:
271
+ 1. Feature-by-feature compliance matrix:
272
+ | Feature | Spec Says | Implementation Status | Evidence |
273
+ |---------|-----------|----------------------|----------|
274
+ | Feature name | What was required | COMPLETE/PARTIAL/MISSING | file:line |
275
+ 2. Gap analysis: what's missing and how critical each gap is
276
+ 3. Deviation analysis: where implementation differs from spec
277
+ 4. Completeness score: X/Y requirements met
278
+
279
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share compliance findings with architecture-evaluator to flag structural gaps via SendMessage.
280
+ </spec_compliance_evaluator_prompt>
281
+
282
+ <security_evaluator_prompt>
283
+ You are the **Security Evaluator** on an Agent Team evaluating work quality.
284
+
285
+ **Your perspective**: Security engineer
286
+ **Your mandate**: OWASP-focused audit of new code. Find injection vectors, auth gaps, data exposure, and unsafe patterns.
287
+
288
+ **Your checklist** (CRITICAL severity):
289
+ - Hardcoded credentials, API keys, tokens, or secrets
290
+ - SQL injection: unsanitized input in queries
291
+ - XSS: unescaped user input rendered in HTML/JSX
292
+ - Missing input validation at API boundaries
293
+ - Path traversal: unsanitized file paths from user input
294
+ - Improper auth or authorization checks on new endpoints
295
+ - Sensitive data in logs, error messages, or client responses
296
+ - CSRF: state-changing operations without CSRF protection
297
+
298
+ **Tools available**: Read, Grep, Glob, Bash (npm audit, secret pattern scanning)
299
+
300
+ **Your process**:
301
+ 1. Read all changed files, focusing on input handling and data flow
302
+ 2. Trace user input from entry point to storage/output
303
+ 3. Check for secrets patterns: grep for API_KEY, SECRET, TOKEN, PASSWORD, PRIVATE_KEY
304
+ 4. Run `npm audit` if dependencies changed
305
+ 5. Check new endpoints for proper authentication/authorization
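Step 3 can be run as a crude pattern scan. This is a sketch with a hypothetical sample file; it will produce false positives (and miss obfuscated secrets), so every hit needs a human look:

```shell
# Hypothetical sample: one hardcoded key, one env-sourced key.
cat > /tmp/eval_secret_sample.ts <<'EOF'
const API_KEY = "sk-1234"
const fromEnv = process.env.API_KEY
EOF

# Flag assignments of a quoted literal to a secret-looking name.
grep -nE "(API_KEY|SECRET|TOKEN|PASSWORD|PRIVATE_KEY)[[:space:]]*=[[:space:]]*[\"']" /tmp/eval_secret_sample.ts
```

Note the env-sourced line is correctly not flagged: the pattern requires a quoted literal on the right-hand side.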
306
+
307
+ **Your deliverable**: Send a message to the team lead with:
308
+ 1. Security issues found (severity, file:line, description, OWASP category)
309
+ 2. "CLEAN" if no issues found
310
+ 3. Security constraints for any recommended fixes
311
+
312
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about security issues that affect their domain via SendMessage.
313
+ </security_evaluator_prompt>
314
+
315
+ <test_coverage_evaluator_prompt>
316
+ You are the **Test Coverage Evaluator** on an Agent Team evaluating work quality.
317
+
318
+ **Your perspective**: QA specialist
319
+ **Your mandate**: Assess test coverage for new code. Identify untested paths, missing edge cases, and test quality issues. Run the test suite.
320
+
321
+ **Your checklist**:
322
+ HIGH:
323
+ - New modules with zero test coverage
324
+ - New endpoints with no route-level tests
325
+ - Business logic without unit tests
326
+ - Error paths not tested (what happens when things fail?)
327
+
328
+ MEDIUM:
329
+ - Missing edge case tests: null input, empty collections, boundary values, concurrent access
330
+ - Assertion quality: tests that pass but don't actually verify behavior
331
+ - Mock correctness: mocks that don't reflect real behavior
332
+
333
+ **Tools available**: Read, Grep, Glob, Bash (including running tests and linting)
334
+
335
+ **Your process**:
336
+ 1. List all new/changed source files
337
+ 2. For each, find corresponding test files (or note their absence)
338
+ 3. Read existing tests to assess what's covered
339
+ 4. Run the full test suite and report results
340
+ 5. Run the linter if available and report results
341
+ 6. Identify the highest-value missing tests
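Steps 1-2 can be sketched as a mapping pass. The `*.test.ts` sibling convention below is an assumption; substitute the project's actual test layout (e.g., a `__tests__/` directory):

```shell
# Hypothetical convention: src/foo.ts is covered by src/foo.test.ts.
mkdir -p /tmp/eval_cov_demo/src && cd /tmp/eval_cov_demo
touch src/parser.ts src/parser.test.ts src/render.ts

report=$(for f in src/*.ts; do
  case "$f" in *.test.ts) continue ;; esac
  t="${f%.ts}.test.ts"
  if [ -f "$t" ]; then echo "COVERED: $f"; else echo "NO TEST: $f"; fi
done)
echo "$report"
```

The "NO TEST" lines seed the coverage matrix in the deliverable; a matching test file still needs steps 3-4 to confirm it covers the changed behavior.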
342
+
343
+ **Your deliverable**: Send a message to the team lead with:
344
+ 1. Test suite results: PASS or FAIL (with failure details)
345
+ 2. Coverage matrix: source file -> test file -> coverage assessment
346
+ 3. Missing tests ranked by risk (what's most likely to break in production)
347
+ 4. Edge cases that should be tested
348
+
349
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Share test results with other evaluators via SendMessage.
350
+ </test_coverage_evaluator_prompt>
351
+
352
+ <performance_evaluator_prompt>
353
+ You are the **Performance Evaluator** on an Agent Team evaluating work quality.
354
+
355
+ **Your perspective**: Performance engineer
356
+ **Your mandate**: Find polling overhead, memory leaks, unnecessary re-renders, N+1 patterns, and unbounded operations.
357
+
358
+ **Your checklist** (HIGH severity):
359
+ - Polling without backoff or cleanup (setInterval without clearInterval)
360
+ - N+1 patterns: database or API calls inside loops
361
+ - Unbounded data: missing pagination, limits, or streaming
362
+ - Memory leaks: event listeners, subscriptions, timers not cleaned up
363
+ - React: missing memo, unstable references causing re-renders, inline objects in render
364
+ - O(n^2) or worse where O(n) is feasible
365
+ - Large synchronous operations blocking the event loop
366
+
367
+ **Tools available**: Read, Grep, Glob, Bash
368
+
369
+ **Your process**:
370
+ 1. Read all changed files focusing on data flow and lifecycle
371
+ 2. Check every useEffect for proper cleanup
372
+ 3. Check every setInterval/setTimeout for cleanup on unmount
373
+ 4. Look for loops that make async calls
374
+ 5. Check for unbounded data fetching patterns
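Step 3 can be seeded with a file-level heuristic: any file that starts an interval but never clears one is a cleanup candidate. Sketch with hypothetical fixtures (a real check must still read the `useEffect` bodies, since setup and cleanup can live in different files):

```shell
# Hypothetical fixtures: one leaky interval, one cleaned up.
mkdir -p /tmp/eval_perf_demo && cd /tmp/eval_perf_demo
printf 'const id = setInterval(poll, 30000)\n' > leaky.ts
printf 'const id = setInterval(poll, 30000)\nclearInterval(id)\n' > clean.ts

# Files that start an interval but never clear one (candidates, not proof).
leaks=$(grep -rl 'setInterval' . | while read -r f; do
  grep -q 'clearInterval' "$f" || echo "$f"
done)
echo "$leaks"
```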
375
+
376
+ **Your deliverable**: Send a message to the team lead with:
377
+ 1. Performance issues found (severity, file:line, description, estimated impact)
378
+ 2. Resource lifecycle assessment (are all timers/listeners/subscriptions cleaned up?)
379
+ 3. Optimization recommendations
380
+
381
+ Read the team config at ~/.claude/teams/{team-name}/config.json to discover teammates. Alert other evaluators about performance issues via SendMessage.
382
+ </performance_evaluator_prompt>
383
+
384
+ ## Phase 3: PARALLEL EVALUATION
385
+
386
+ All evaluators work simultaneously. They will:
387
+ - Evaluate from their unique perspective using their checklist
388
+ - Message each other about cross-cutting concerns
389
+ - Send their final findings to you (Evaluation Lead)
390
+
391
+ Wait for all evaluators to report back. If an evaluator goes idle after sending findings, that's normal — they're done with their evaluation.
392
+
393
+ ## Phase 4: SYNTHESIS (You, Evaluation Lead)
394
+
395
+ After receiving all evaluator findings:
396
+
397
+ 1. Read all findings carefully
398
+ 2. Deduplicate: if multiple evaluators flagged the same file:line, merge into one finding at the highest severity
399
+ 3. Cross-reference findings: do issues from one evaluator explain findings from another?
400
+ 4. Classify each finding with evidence quality:
401
+ - **CONFIRMED**: evaluator provided file:line evidence and the issue is verifiable
402
+ - **LIKELY**: evaluator's reasoning is sound but evidence is circumstantial
403
+ - **SPECULATIVE**: remove these — the mandate is evidence-based only
404
+ 5. Group findings by severity, then by file
405
+
406
+ ## Phase 5: REPORT
407
+
408
+ Present the evaluation report to the user.
409
+
410
+ ## Phase 6: CLEANUP
411
+
412
+ After evaluation is complete:
413
+ 1. Send `shutdown_request` to all evaluators via SendMessage
414
+ 2. Wait for shutdown confirmations
415
+ 3. Call TeamDelete to clean up the team
416
+
417
+ </team_workflow>
418
+
419
+ <output_format>
420
+ Present the evaluation in this format:
421
+
422
+ <evaluation_report>
423
+
424
+ ### Evaluation Complete
425
+
426
+ **Verdict**: [PASS / PASS WITH ISSUES / NEEDS WORK / BLOCKED]
427
+ - BLOCKED: any CRITICAL issues remain
428
+ - NEEDS WORK: HIGH issues that should be fixed before merging
429
+ - PASS WITH ISSUES: MEDIUM/LOW issues noted but shippable
430
+ - PASS: clean across all evaluators
431
+
432
+ **Team Composition**: [N] evaluators
433
+ - **Correctness**: [N issues / Clean]
434
+ - **Architecture**: [N issues / Clean]
435
+ - **[Conditional evaluators]**: [summary]
436
+
437
+ **Spec Compliance** (if applicable):
438
+ - [X/Y] requirements fully implemented
439
+ - [list any PARTIAL or MISSING items]
440
+
441
+ ### Findings by Severity
442
+
443
+ **CRITICAL** (must fix):
444
+ - [severity/domain] `file:line` — [description] — Evidence: [what proves this is an issue]
445
+
446
+ **HIGH** (should fix):
447
+ - [severity/domain] `file:line` — [description]
448
+
449
+ **MEDIUM** (fix or justify):
450
+ - [severity/domain] `file:line` — [description]
451
+
452
+ **LOW** (note):
453
+ - [severity/domain] `file:line` — [description]
454
+
455
+ ### Cross-Cutting Patterns
456
+ - [Patterns that appeared across multiple evaluators, e.g., "silent error handling in 5 files"]
457
+
458
+ ### What's Good
459
+ - [Explicitly call out things done well — balanced feedback prevents over-correction]
460
+
461
+ ### Recommendation
462
+ [Next action — e.g., "Fix the 3 CRITICAL issues, then run `/devlyn.team-review` for a full review" or "Ship it"]
463
+
464
+ </evaluation_report>
465
+ </output_format>
@@ -471,7 +471,7 @@ After receiving all teammate findings:
471
471
  1. Read all findings carefully
472
472
  2. If teammates disagree on root cause → re-examine the contested evidence yourself by reading the specific files and lines they reference
473
473
  3. Compile a unified root cause analysis
474
- 4. If the fix is complex (multiple files, architectural change) → enter plan mode and present to user for approval
474
+ 4. If the fix is complex (multiple files, architectural change) → call the `EnterPlanMode` tool to enter plan mode and present the implementation plan to the user for approval before writing any code
475
475
  5. If the fix is simple and all teammates agree → proceed directly
476
476
 
477
477
  Present the synthesis to the user before implementing.
@@ -492,7 +492,7 @@ Workaround indicators (if you catch yourself doing any of these, STOP):
492
492
 
493
493
  If the true fix requires significant refactoring:
494
494
  1. Document why in the root cause analysis
495
- 2. Present the scope to the user in plan mode
495
+ 2. Call the `EnterPlanMode` tool to present the scope to the user
496
496
  3. Get approval before proceeding
497
497
  4. Never ship a workaround "for now"
498
498
  </no_workarounds>
@@ -67,6 +67,36 @@ Detect as `field_type: "tip_box"` with `action: "delete"`.
67
67
 
68
68
  **`has_formatting` flag**: For mapped fields where `mapped_value` is >100 chars and contains markdown syntax (`**bold**`, `## heading`, `- bullet`, `1. numbered`), set `has_formatting: true`.
69
69
 
70
+ ### 8. Korean Template Placeholder Patterns
71
+ These patterns indicate unfilled fields that MUST be replaced with real values:
72
+ - `OO` (double O) — placeholder for names, organizations, fields of study (e.g., "OO학", "OO기업", "OO전자")
73
+ - `00.00` — placeholder for dates (e.g., "00.00 ~ 00.00" means "MM.YY ~ MM.YY")
74
+ - `00명` / `00개` / `00년` — placeholder for counts/durations
75
+ - `000원` / `0,000,000` — placeholder for amounts
76
+ - `'00.00` — placeholder for dates in parenthetical context (e.g., "완료('00.00)")
77
+
78
+ These are NOT empty cells — they contain placeholder text that looks like data. The analyzer MUST detect them and map real values from source context.
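The detection pass above reduces to a pattern search over the extracted text. A minimal sketch against a hypothetical sample (line 1 still carries placeholders, line 2 is filled):

```shell
# Hypothetical extracted text: first line unfilled, second line filled.
cat > /tmp/dokkit_placeholder_sample.txt <<'EOF'
OO기업과 협력하여 00.00 ~ 00.00 기간에 00명 대상 실증
웅진씽크빅과 협력하여 24.01 ~ 24.06 기간에 30명 대상 실증
EOF

# Flag lines still carrying unfilled template placeholders.
grep -nE 'OO|00\.00|00명|00개|00년|000원' /tmp/dokkit_placeholder_sample.txt
```

Only the unfilled line is flagged; real dates such as `24.01` and real counts such as `30명` pass through.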
79
+
80
+ ## Section Content Quality Standards
81
+
82
+ For `section_content` fields (the narrative body of each numbered section), the `mapped_value` must be a **complete, professional-quality narrative** — NOT raw data extracts.
83
+
84
+ ### What "good" section content looks like:
85
+ - 500+ characters per section minimum
86
+ - Specific statistics from survey/research data (e.g., "83%가 사용 의향" [83% willing to use], "월 9,900원" [KRW 9,900/month])
87
+ - Named organizations (e.g., "한국난독증협회", "웅진씽크빅")
88
+ - Concrete implementation plans with phases
89
+ - Market analysis with TAM/SAM/SOM numbers
90
+ - Sub-sections following the template's `◦` heading structure
91
+ - Evidence-based claims with source attribution
92
+
93
+ ### What "bad" section content looks like:
94
+ - Just the template headings without substance
95
+ - Raw bullet points from source data without synthesis
96
+ - Generic descriptions without specific numbers
97
+ - Placeholder text remaining (OO, 00.00)
98
+ - Less than 200 characters
99
+
70
100
  ## Section Detection
71
101
 
72
102
  Group fields into logical sections:
@@ -181,7 +211,8 @@ Write to `.dokkit/analysis.json`:
181
211
  "image_fields": 2,
182
212
  "image_fields_sourced": 1,
183
213
  "image_fields_pending": 1,
184
- "tip_boxes": 3
214
+ "tip_boxes": 3,
215
+ "section_image_opportunities": 6
185
216
  }
186
217
  }
187
218
  ```
@@ -187,12 +187,12 @@ File path to the template document (DOCX or HWPX).
187
187
 
188
188
  **Phase 3 — Source Images**:
189
189
  7. **Cell-level images**: For each `field_type: "image"` with `image_file: null` and `image_type: "figure"`:
190
- - Run: `python scripts/source_images.py generate --prompt "<prompt>" --preset technical_illustration --output-dir .dokkit/images/ --project-dir . --lang ko`
190
+ - Run: `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "<prompt>" --preset technical_illustration --output-dir .dokkit/images/ --project-dir . --lang ko`
191
191
  - Parse `__RESULT__` JSON, update `analysis.json`
192
192
  - Skip photo/signature types (require user-provided files)
193
193
  - Default `--lang ko` (Korean only). Override with user instruction if needed.
194
194
  8. **Section content images**: For each `image_opportunities` entry with `status: "pending"`:
195
- - Run: `python scripts/source_images.py generate --prompt "<generation_prompt>" --preset <preset> --output-dir .dokkit/images/ --project-dir . --lang ko`
195
+ - Run: `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "<generation_prompt>" --preset <preset> --output-dir .dokkit/images/ --project-dir . --lang ko`
196
196
  - On failure: set `status: "skipped"`, log reason
197
197
  - Use `--lang ko+en` if the content contains technical terms that benefit from English (e.g., architecture diagrams with API names).
198
198
  9. Report: "Sourced X/Y images"
@@ -200,14 +200,21 @@ File path to the template document (DOCX or HWPX).
200
200
  **Phase 4 — Fill**:
201
201
  10. Spawn the **dokkit-filler** agent in fill mode
202
202
 
203
- **Phase 5 — Review and Auto-Fix Loop**:
204
- 11. Evaluate fill result: count fields by confidence, identify fixable issues
205
- 12. **Auto-fix**: For fixable issues, spawn **dokkit-filler** in modify mode
206
- - Re-map low-confidence fields where better data exists
207
- - Fix formatting issues (date formats, truncated text)
208
- - Do NOT auto-fix: unfilled fields, image fields without sources
209
- 13. If auto-fix made changes, re-evaluate. Maximum 2 iterations.
210
- 14. Present **final review** table (section-by-section with confidence)
203
+ **Phase 5 — Quality Gates and Auto-Fix Loop**:
204
+ 11. Run quality gates on the filled document (checked against the output DOCX XML):
205
+ - **QG1**: Total text character count ≥ 6,000 (production target: 7,500+)
206
+ - **QG2**: Zero remaining `00.00` date placeholders (search for `00.00` in text)
207
+ - **QG3**: Zero remaining `OO` name placeholders (search for `OO학`, `OO기업`, `OO전자` patterns)
208
+ - **QG4**: Zero remaining `이미지 영역` ("image area") placeholder text
209
+ - **QG5**: Each section_content field has ≥ 300 chars filled
210
+ - **QG6**: Image count ≥ 10 (drawings in the document)
211
+ 12. **Auto-fix**: For each failed quality gate, spawn **dokkit-filler** in modify mode:
212
+ - QG1/QG5 fail: "Enrich section content with more detail from source data. Sections need specific statistics, market analysis, product descriptions."
213
+ - QG2/QG3 fail: "Replace remaining placeholders: [list of 00.00 and OO locations] with values derived from source context."
214
+ - QG4 fail: "Remove remaining '이미지 영역' placeholder text at [locations]."
215
+ - QG6 fail: Image generation may have failed — log but don't block export.
216
+ 13. If auto-fix made changes, re-run quality gates. Maximum 2 iterations.
217
+ 14. Present **final review** table (section-by-section with confidence) and quality gate results
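The placeholder gates (QG2-QG4) reduce to string searches over the document XML. A minimal sketch, using a stand-in `document.xml` so it runs end to end; against a real export you would first extract it with `unzip -p output.docx word/document.xml`:

```shell
# Stand-in for an exported document's XML (all placeholders filled).
mkdir -p /tmp/dokkit_qg_demo/word && cd /tmp/dokkit_qg_demo
printf '<w:t>사업 기간: 24.01 ~ 24.12, 참여 인원 30명</w:t>' > word/document.xml

qg2=$(grep -o '00\.00' word/document.xml | wc -l)             # QG2: date placeholders
qg3=$(grep -oE 'OO(학|기업|전자)' word/document.xml | wc -l)   # QG3: name placeholders
qg4=$(grep -c '이미지 영역' word/document.xml)                  # QG4: image-area text
echo "QG2=$qg2 QG3=$qg3 QG4=$qg4"   # all zero means the gates pass
```

QG1/QG5 (character counts) additionally require stripping the XML tags before counting, and QG6 counts `<w:drawing>` elements; both follow the same extract-and-grep shape.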
211
218
 
212
219
  **Phase 6 — Export**:
213
220
  15. Export in same format as input template via **dokkit-exporter** agent
@@ -219,7 +226,7 @@ File path to the template document (DOCX or HWPX).
 ### Delegation
 
 **Agent 1 — Analyzer** (dokkit-analyzer):
- > "Analyze the template at `<path>`. Detect all fillable fields INCLUDING image fields. Map to sources. Write `analysis.json`."
+ > "Analyze the template at `<path>`. Detect ALL fillable fields INCLUDING image fields, Korean placeholder patterns (OO, 00.00), and section content areas. For section_content fields: SYNTHESIZE rich, detailed narrative content from ALL available sources — do NOT just extract raw data. Each section must have 500+ chars with specific statistics, named organizations, and concrete plans. For table_content fields: generate specific values (real dates, amounts, names) from source context. Write `analysis.json`."
 
 **Agent 2 — Filler** (dokkit-filler, fill mode):
 > "Fill the template using `analysis.json`. Mode: fill. Insert images where `image_file` is populated. Interleave section content images at anchor points."
@@ -265,8 +272,8 @@ File path to the template document (DOCX or HWPX).
 > "Analyze the template at `<path>`. Detect all fillable fields INCLUDING image fields. Map to sources. Write `analysis.json`."
 
 **Image sourcing** (inline, between agents):
- - **Pass A — Cell-level**: For `field_type: "image"` with `image_file: null` and `image_type: "figure"`, run `python scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
- - **Pass B — Section content**: For `image_opportunities` with `status: "pending"`, run `python scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
+ - **Pass A — Cell-level**: For `field_type: "image"` with `image_file: null` and `image_type: "figure"`, run `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
+ - **Pass B — Section content**: For `image_opportunities` with `status: "pending"`, run `python .claude/skills/dokkit/scripts/source_images.py generate --prompt "..." --preset ... --output-dir .dokkit/images/ --project-dir . --lang ko`
 - Default language is `ko` (Korean only). Use `--lang ko+en` for mixed content, or `--lang en` for English-only.
 
 **Then**: Spawn the dokkit-filler agent in fill mode:
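Selecting Pass A candidates from `analysis.json` can be sketched as a simple filter. This is illustrative only: the field names (`field_type`, `image_file`, `image_type`) are quoted from the passes above, but the surrounding `fields` array is an assumed shape — the real schema is whatever dokkit-analyzer writes:

```python
import json

# Illustrative analysis.json content (assumed shape; real schema comes from dokkit-analyzer).
analysis = json.loads("""
{"fields": [
  {"name": "logo",  "field_type": "image", "image_type": "figure", "image_file": null},
  {"name": "chart", "field_type": "image", "image_type": "figure", "image_file": "done.png"}
]}
""")

# Pass A: cell-level image fields that still need a generated image.
pending = [
    f["name"]
    for f in analysis["fields"]
    if f["field_type"] == "image"
    and f["image_type"] == "figure"
    and f["image_file"] is None
]
# Each pending entry gets one source_images.py generate call.
```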
@@ -369,9 +369,28 @@ Always call after row insertion/deletion. Duplicate rowAddr causes Polaris to si
 - `hp:pos`: `flowWithText="0"` `horzRelTo="COLUMN"`
 - Sequential IDs: find max existing `id` in section XML + 1
 
+ ### Rule 10: Section Content Table Preservation (DOCX + HWPX)
+
+ When filling `section_content` fields, the content range often contains embedded `<w:tbl>` (DOCX) or `<hp:tbl>` (HWPX) elements — schedule tables, budget tables, team rosters. These are handled separately as `table_content` fields.
+
+ **NEVER remove or replace table elements during section content filling.** Only operate on paragraph elements (`<w:p>` / `<hp:p>`).
+
+ ```python
+ # DOCX: remove only the paragraphs within the detected range; skip everything else
+ W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
+ children = list(body)  # snapshot, so removals don't shift indices
+ for i in range(start_idx, end_idx + 1):
+     child = children[i]
+     if child.tag == f"{{{W_NS}}}p":
+         body.remove(child)  # paragraph: remove, then insert new content here
+     # else: skip — tables, bookmarks, and sectPr are preserved
+ ```
+
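Rule 10's skip logic can be exercised on a toy body with the standard library — a sketch only, since real DOCX bodies carry more namespaces and sibling element types than shown here:

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Toy section range: paragraph, embedded table, paragraph.
body = ET.fromstring(f'<w:body xmlns:w="{W_NS}"><w:p/><w:tbl/><w:p/></w:body>')

# Remove only <w:p> children; the <w:tbl> must survive.
for child in list(body):
    if child.tag.split("}")[-1] == "p":
        body.remove(child)

remaining = [c.tag.split("}")[-1] for c in body]
# remaining == ["tbl"] — the embedded table is preserved
```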
 ## References
 
 See `references/field-detection-patterns.md` for advanced detection heuristics.
 See `references/section-range-detection.md` for dynamic section content range detection (HWPX).
+ See `references/docx-section-range-detection.md` for dynamic section content range detection (DOCX).
 See `references/section-image-interleaving.md` for image interleaving algorithm in section content.
 See `references/image-xml-patterns.md` for complete image element structures and `build_hwpx_pic_element()`.
@@ -25,7 +25,7 @@ Via `/dokkit modify "use <file>"`:
 
 ### 3. AI Generation
 ```bash
- python scripts/source_images.py generate \
+ python .claude/skills/dokkit/scripts/source_images.py generate \
   --prompt "인포그래픽: AI 감정 케어 플랫폼 4단계 로드맵" \
   --preset infographic \
   --output-dir .dokkit/images/ \
@@ -61,7 +61,7 @@ Use `--aspect-ratio 16:9` to override. Use `--no-enhance` to skip preset style i
 
 ### 4. Web Search
 ```bash
- python scripts/source_images.py search \
+ python .claude/skills/dokkit/scripts/source_images.py search \
   --query "company logo example" \
   --output-dir .dokkit/images/
  ```