codex-workflows 0.4.1 → 0.4.3

@@ -49,12 +49,14 @@ Sources: OWASP Top 10:2025, DryRun Agentic Coding Security Report (2026-03)
49
49
  - Endpoints or route handlers defined without authentication middleware
50
50
  - Resource access operations (read, update, delete) without authorization verification
51
51
  - Administrative or destructive operations accessible without elevated permissions
52
- - Recent research indicates this pattern appears at elevated rates in AI-generated code; treat it as a high-priority review target
52
+ - AI-generated code frequently omits authentication middleware and authorization checks; treat every route handler and resource access operation as an explicit verification target during review
53
+ - Detection approach: search for route or endpoint handlers without authentication middleware, and resource operations (read, update, delete) without authorization checks in the call chain
53
54
 
54
55
  ### Mishandling of Exceptional Conditions (OWASP A10:2025)
55
56
  - Error handlers that expose internal system details (stack traces, database errors, file paths) in responses
56
- - Error handlers that fail open (grant access or skip validation on error)
57
+ - Error handlers that grant access, skip authentication, or bypass authorization when an exception occurs
57
58
  - Missing error handling on security-critical operations (authentication, authorization, cryptographic operations)
59
+ - Detection approach: search for catch or error-handler blocks that return stack traces, database errors, or file paths in responses, and for handlers that continue with success-path behavior without re-validating security state
58
60
 
59
61
  ### Software Supply Chain Patterns (OWASP A03:2025)
60
62
  - Dependencies imported without version pinning
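The reworded A10:2025 bullet in the hunk above targets handlers that grant access when an exception occurs. A minimal sketch of the opposite, fail-closed behavior is shown here, assuming a hypothetical `authorize` helper and `PermissionCheck` callback; neither name comes from codex-workflows, and the generic log message is only illustrative.

```typescript
// Hypothetical types and names for illustration; not part of codex-workflows.
type Decision = { allowed: boolean; reason: string };

// A permission lookup that may throw (for example, when a backing store is down).
type PermissionCheck = (userId: string, resource: string) => boolean;

function authorize(check: PermissionCheck, userId: string, resource: string): Decision {
  try {
    return check(userId, resource)
      ? { allowed: true, reason: "permission granted" }
      : { allowed: false, reason: "permission denied" };
  } catch {
    // Fail closed: an exception never turns into access.
    // Log a generic message and keep internal details out of the returned value.
    console.error("authorization check failed");
    return { allowed: false, reason: "authorization unavailable" };
  }
}

// Simulated failure whose message contains internal details.
const failingCheck: PermissionCheck = () => {
  throw new Error("db connection refused at 10.0.0.5");
};
const denied = authorize(failingCheck, "user-1", "report-42");
```

During review, the pattern to flag is the inverse: a `catch` block that returns the success-path value or echoes the caught error's message into the response.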
@@ -7,13 +7,13 @@
7
7
 
8
8
  ## Comment Writing Rules
9
9
  - **Function Description Focus**: Describe what the code "does"
10
- - **No Historical Information**: Do not record development history
10
+ - **History in Version Control**: Record development history in commits and PRs instead of code comments
11
11
  - **Timeless**: Write only content that remains valid whenever read
12
12
  - **Conciseness**: Keep explanations to necessary minimum
13
13
 
14
14
  ## Type Safety
15
15
 
16
- **Absolute Rule**: any type is completely prohibited. It disables type checking and becomes a source of runtime errors.
16
+ **Absolute Rule**: Use `unknown`, generics, unions, intersections, or validated assertions instead of `any`. `any` disables type checking and becomes a source of runtime errors.
17
17
 
18
18
  **any Type Alternatives (Priority Order)**
19
19
  1. **unknown Type + Type Guards**: Use for validating external input (API responses, localStorage, URL parameters)
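The first alternative listed above, `unknown` plus type guards for external input, could look roughly like the following. This is a sketch under assumptions: the `User` shape and the `isUser` and `parseUser` helpers are hypothetical and exist only to illustrate the rule.

```typescript
// Hypothetical shape for illustration.
interface User {
  id: number;
  name: string;
}

// Type guard: narrows `unknown` to `User` only after checking each field at runtime.
function isUser(value: unknown): value is User {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["id"] === "number" && typeof record["name"] === "string";
}

// Parses external input while keeping `any` out of the caller's types.
function parseUser(json: string): User | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(json);
  } catch {
    // Malformed JSON is reported to the caller as "no valid user".
    return null;
  }
  return isUser(parsed) ? parsed : null;
}

const valid = parseUser('{"id": 1, "name": "Ada"}');
const invalid = parseUser('{"id": "1"}');
```

Because `parsed` is typed `unknown`, the compiler rejects any property access until `isUser` has narrowed it, which is the checking that `any` would have disabled.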
@@ -91,7 +91,7 @@ setUsers(users)
91
91
 
92
92
  **Props Design (Props-driven Approach)**
93
93
  - Props are the interface: Define all necessary information as props
94
- - Avoid implicit dependencies: Do not depend on global state or context without necessity
94
+ - Declare dependencies explicitly through props, hooks, or injected modules instead of relying on ambient global state
95
95
  - Type-safe: Always define Props type explicitly
96
96
 
97
97
  **Environment Variables**
@@ -146,7 +146,7 @@ const response = await fetch('/api/data') // Backend handles API key authenticat
146
146
 
147
147
  ## Error Handling
148
148
 
149
- **Absolute Rule**: Error suppression prohibited. All errors must have log output and appropriate handling.
149
+ **Absolute Rule**: Handle every error explicitly with log output, recovery logic, or escalation appropriate to the failure mode.
150
150
 
151
151
  **Fail-Fast Principle**: Fail quickly on errors to prevent continued processing in invalid states
152
152
  ```typescript
@@ -53,23 +53,19 @@ description: "Documentation creation criteria for PRD, ADR, Design Doc, UI Spec,
53
53
 
54
54
  ### PRD (Product Requirements Document)
55
55
  **Purpose**: Define business requirements and user value
56
- **Includes**: Business requirements, success metrics, user stories, MoSCoW prioritization, MVP/Future phase separation, user journey diagram, scope boundary diagram
57
- **Excludes**: Technical implementation details, technical selection rationale, implementation phases, task breakdown
56
+ **Scope**: Business requirements, user value, success metrics, user stories, MoSCoW prioritization, MVP/Future phase separation, user journey diagram, and scope boundary diagram only. Technical implementation details belong in Design Doc, technical decision rationale in ADR, and implementation phases or task breakdown belong in Work Plan.
58
57
 
59
58
  ### ADR (Architecture Decision Record)
60
59
  **Purpose**: Record technical decision rationale and background
61
- **Includes**: Decision, rationale, option comparison (minimum 3 options), architecture impact, principled implementation guidelines
62
- **Excludes**: Implementation schedule, detailed procedures, specific code examples, resource assignments
60
+ **Scope**: Decision, rationale, option comparison (minimum 3 options), architecture impact, and principled implementation guidance only. Implementation procedures and code examples belong in Design Doc, while schedule and resource assignments belong in Work Plan.
63
61
 
64
62
  ### UI Specification
65
63
  **Purpose**: Define UI structure, screen transitions, component decomposition, and interaction design
66
- **Includes**: Screen list and transitions, component state x display matrix, interaction definitions, AC traceability, existing component reuse map, accessibility requirements
67
- **Excludes**: Technical implementation details, API contracts, test implementation (generated by acceptance-test-generator), implementation schedule
64
+ **Scope**: Screen list and transitions, component state x display matrix, component decomposition, interaction definitions, AC traceability, existing component reuse map, visual acceptance criteria, and accessibility requirements only. Technical implementation and API contracts belong in Design Doc, test implementation belongs in generated test skeletons, and schedule belongs in Work Plan.
68
65
 
69
66
  ### Design Document
70
67
  **Purpose**: Define technical implementation methods in detail
71
- **Includes**: Existing codebase analysis, technical approach, dependencies and constraints, interface/contract definitions, data flow, acceptance criteria, change impact map, code inspection evidence
72
- **Excludes**: Why that technology was chosen (reference ADR), when/who to implement (reference Work Plan), detailed test strategy and test case selection (generated by acceptance-test-generator from acceptance criteria)
68
+ **Scope**: Existing codebase analysis, technical approach, dependencies and constraints, interface and contract definitions, data flow, acceptance criteria, change impact map, code inspection evidence, and verification strategy only. Technology selection rationale belongs in ADR, schedule and assignments belong in Work Plan, and detailed test strategy or case selection belongs in generated test skeletons.
73
69
 
74
70
  **Required Structural Elements**:
75
71
  - Existing codebase analysis and code inspection evidence
@@ -85,8 +81,7 @@ description: "Documentation creation criteria for PRD, ADR, Design Doc, UI Spec,
85
81
 
86
82
  ### Work Plan
87
83
  **Purpose**: Implementation task management and progress tracking
88
- **Includes**: Task breakdown, schedule estimates, test skeleton file paths, Verification Strategy summaries from each Design Doc, final Quality Assurance phase (required), progress records
89
- **Excludes**: Technical rationale, design details
84
+ **Scope**: Task breakdown, dependencies, schedule estimates, test skeleton file paths, Verification Strategy summaries from each Design Doc, final Quality Assurance phase, and progress tracking only. Technical rationale belongs in ADR and design details belong in Design Doc.
90
85
 
91
86
  **Phase Division Criteria**:
92
87
 
@@ -103,7 +98,7 @@ description: "Documentation creation criteria for PRD, ADR, Design Doc, UI Spec,
103
98
 
104
99
  **When Hybrid is selected**:
105
100
  - Combine vertical and horizontal phase structures as defined in the Design Doc
106
- - Final phase is always Quality Assurance
101
+ - Final phase is always Quality Assurance with acceptance criteria verification, all tests passing, and quality checks complete
107
102
 
108
103
  ## Creation Process [MANDATORY]
109
104
 
@@ -2,7 +2,7 @@
2
2
 
3
3
  ## Status
4
4
 
5
- [Proposed | Accepted | Deprecated | Superseded]
5
+ [Proposed | Accepted | Deprecated | Superseded | Rejected]
6
6
 
7
7
  ## Context
8
8
 
@@ -54,6 +54,10 @@
54
54
 
55
55
  - [List changes that are neither good nor bad]
56
56
 
57
+ ## Architecture Impact
58
+
59
+ [Describe how this decision affects existing architecture: components changed, dependencies introduced or removed, and new architectural constraints]
60
+
57
61
  ## Implementation Guidance
58
62
 
59
63
  [Principled direction only. Implementation procedures go to Design Doc]
@@ -277,11 +277,16 @@ System Invariants:
277
277
 
278
278
  ### Error Handling
279
279
 
280
- [Types of errors and how to handle them]
280
+ | Error Category | Example | Detection | Recovery Strategy | User Impact |
281
+ |---------------|---------|-----------|-------------------|-------------|
282
+ | [Validation / External / Infrastructure / Business logic] | [Specific error] | [How detected] | [Retry / Fallback / Propagate / Log-and-continue] | [User-facing message or silent handling] |
281
283
 
282
284
  ### Logging and Monitoring
283
285
 
284
- [What to record in logs and how to monitor]
286
+ - **Log events**: [Key events to log: state transitions, external calls, error occurrences, performance thresholds]
287
+ - **Log levels**: [Which events use DEBUG / INFO / WARN / ERROR]
288
+ - **Sensitive data**: [Fields to mask or exclude; align with Security Considerations]
289
+ - **Monitoring**: [Metrics to track, alert thresholds, dashboard requirements]
285
290
 
286
291
  ## Implementation Plan
287
292
 
@@ -301,12 +306,6 @@ System Invariants:
301
306
  - Technical Reason: [Technical necessity to implement after A]
302
307
  - Prerequisites: [Required pre-implementations]
303
308
 
304
- ### Integration Points
305
-
306
- **Integration Point 1: [Name]**
307
- - Components: [Component A] to [Component B]
308
- - Contract: [Interface/API contract between components]
309
-
310
309
  ### Migration Strategy
311
310
 
312
311
  [Technical migration approach, ensuring backward compatibility]
@@ -323,7 +322,9 @@ Mark items as N/A with brief rationale when the feature has no relevant trust bo
323
322
 
324
323
  ## Future Extensibility
325
324
 
326
- [Considerations for future feature additions or changes]
325
+ - **Extension points**: [Interfaces, hooks, or plugin mechanisms designed for future use]
326
+ - **Known future requirements**: [Planned features that influenced current design decisions]
327
+ - **Intentional limitations**: [What was deliberately kept simple and why]
327
328
 
328
329
  ## Alternative Solutions
329
330
 
@@ -98,14 +98,24 @@ ENFORCEMENT: Sub-agent prompts missing the constraint suffix MUST be re-issued w
98
98
 
99
99
  VERIFY approval status before proceeding. Once confirmed, INITIATE autonomous execution mode.
100
100
 
101
- ## Security Review (After All Tasks Complete)
102
-
103
- After all task cycles finish, collect all `filesModified` from every task-executor response (deduplicated), then invoke security-reviewer before the completion report:
104
- 1. Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [collected filesModified list]. Review security compliance."
105
- 2. Check response:
106
- - `approved` or `approved_with_notes` -> Proceed to completion report (include notes if present)
107
- - `needs_revision` -> Spawn task-executor with `requiredFixes`, then quality-fixer, then re-invoke security-reviewer
108
- - `blocked` -> Escalate to user
101
+ ## Post-Implementation Verification (After All Tasks Complete)
102
+
103
+ After all task cycles finish, collect all `filesModified` from every task-executor response (deduplicated), then run both verification agents before the completion report:
104
+ 1. Spawn code-verifier agent: "Verify implementation consistency against the Design Doc. `doc_type: design-doc`. `document_path`: [path]. `code_paths`: [collected filesModified list]."
105
+ 2. Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [collected filesModified list]. Review security compliance."
106
+ 3. Consolidate results:
107
+ - code-verifier passes when `summary.status` is `consistent` or `mostly_consistent`
108
+ - code-verifier fails when `summary.status` is `needs_review` or `inconsistent`
109
+ - security-reviewer passes when `status` is `approved` or `approved_with_notes`
110
+ - security-reviewer fails when `status` is `needs_revision`
111
+ - security-reviewer `blocked` -> Escalate to user
112
+ 4. If either verifier fails:
113
+ - Create a single fix task covering verifier discrepancies and security requiredFixes
114
+ - Spawn task-executor with that consolidated task
115
+ - Spawn quality-fixer
116
+ - Re-run only the verifier(s) that failed
117
+ - Maximum retry count is 1 verification fix cycle; if any failed verifier still fails after re-run, escalate to the user
118
+ 5. If both verifiers pass -> Proceed to completion report
109
119
 
110
120
  **[STOP — BLOCKING]** Upon detecting ANY requirement changes, halt execution immediately.
111
121
  **CANNOT proceed until user explicitly confirms the change scope.**
@@ -106,14 +106,24 @@ ENFORCEMENT: Sub-agent prompts missing the constraint suffix MUST be re-issued w
106
106
 
107
107
  VERIFY approval status before proceeding. Once confirmed, INITIATE autonomous execution mode.
108
108
 
109
- ## Security Review (After All Tasks Complete)
110
-
111
- After all task cycles finish, collect all `filesModified` from every task-executor-frontend response (deduplicated), then invoke security-reviewer before the completion report:
112
- 1. Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [collected filesModified list]. Review security compliance."
113
- 2. Check response:
114
- - `approved` or `approved_with_notes` -> Proceed to completion report (include notes if present)
115
- - `needs_revision` -> Spawn task-executor-frontend with `requiredFixes`, then quality-fixer-frontend, then re-invoke security-reviewer
116
- - `blocked` -> Escalate to user
109
+ ## Post-Implementation Verification (After All Tasks Complete)
110
+
111
+ After all task cycles finish, collect all `filesModified` from every task-executor-frontend response (deduplicated), then run both verification agents before the completion report:
112
+ 1. Spawn code-verifier agent: "Verify implementation consistency against the Design Doc. `doc_type: design-doc`. `document_path`: [path]. `code_paths`: [collected filesModified list]."
113
+ 2. Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [collected filesModified list]. Review security compliance."
114
+ 3. Consolidate results:
115
+ - code-verifier passes when `summary.status` is `consistent` or `mostly_consistent`
116
+ - code-verifier fails when `summary.status` is `needs_review` or `inconsistent`
117
+ - security-reviewer passes when `status` is `approved` or `approved_with_notes`
118
+ - security-reviewer fails when `status` is `needs_revision`
119
+ - security-reviewer `blocked` -> Escalate to user
120
+ 4. If either verifier fails:
121
+ - Create a single fix task covering verifier discrepancies and security requiredFixes
122
+ - Spawn task-executor-frontend with that consolidated task
123
+ - Spawn quality-fixer-frontend
124
+ - Re-run only the verifier(s) that failed
125
+ - Maximum retry count is 1 verification fix cycle; if any failed verifier still fails after re-run, escalate to the user
126
+ 5. If both verifiers pass -> Proceed to completion report
117
127
 
118
128
  **[STOP -- BLOCKING]** Upon detecting ANY requirement changes, halt execution immediately.
119
129
  **CANNOT proceed until user explicitly confirms the change scope.**
@@ -33,7 +33,7 @@ Identify the Design Doc in docs/design/ and check implementation files changed f
33
33
  **CANNOT proceed without both a Design Doc and implementation files.**
34
34
 
35
35
  ### 2. Execute code-reviewer
36
- Spawn code-reviewer agent: "Validate Design Doc compliance for [design-doc-path]. Implementation files: [git diff file list]. Review mode: full. Return structured JSON report with complianceRate, verdict, acceptanceCriteria, and qualityIssues."
36
+ Spawn code-reviewer agent: "Validate Design Doc compliance for [design-doc-path]. Implementation files: [git diff file list]. Review mode: full. Return structured JSON report per your Output Format specification."
37
37
 
38
38
  **Store output as**: `$STEP_2_OUTPUT`
39
39
 
@@ -59,10 +59,16 @@ Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [file
59
59
  ```
60
60
  Code Compliance: [complianceRate from code-reviewer]
61
61
  Verdict: [verdict from code-reviewer]
62
+ Identifier Match Rate: [identifierMatchRate from code-reviewer]
62
63
  Acceptance Criteria:
63
- - [fulfilled] [item]
64
+ - [fulfilled] [item] (confidence: [high/medium/low])
64
65
  - [partially_fulfilled] [item]: [gap] — [suggestion]
65
66
  - [unfulfilled] [item]: [gap] — [suggestion]
67
+ Identifier Mismatches (show only mismatches; write `None` if all identifiers match):
68
+ - None
69
+ - [identifier]: DD=[designDocValue] Code=[codeValue] at [location] (confidence: [high/medium/low])
70
+ Quality Findings:
71
+ - [category] [location]: [description] — [rationale]
66
72
 
67
73
  Security Review: [status from security-reviewer]
68
74
  Findings by category:
@@ -116,14 +116,24 @@ ENFORCEMENT: Sub-agent prompts missing the constraint suffix MUST be re-issued w
116
116
 
117
117
  VERIFY approval status before proceeding. Once confirmed, INITIATE autonomous execution mode.
118
118
 
119
- ## Security Review (After All Tasks Complete)
120
-
121
- After all task cycles finish, collect all `filesModified` from every task-executor/task-executor-frontend response (deduplicated), then invoke security-reviewer before the completion report:
122
- 1. Spawn security-reviewer agent: "Design Doc: [path(s)]. Implementation files: [collected filesModified list]. Review security compliance."
123
- 2. Check response:
124
- - `approved` or `approved_with_notes` -> Proceed to completion report (include notes if present)
125
- - `needs_revision` -> Spawn layer-appropriate task-executor with `requiredFixes`, then quality-fixer, then re-invoke security-reviewer
126
- - `blocked` -> Escalate to user
119
+ ## Post-Implementation Verification (After All Tasks Complete)
120
+
121
+ After all task cycles finish, collect all `filesModified` from every task-executor/task-executor-frontend response (deduplicated), then run both verification agents before the completion report:
122
+ 1. Spawn code-verifier once per Design Doc: "Verify implementation consistency against the Design Doc. `doc_type: design-doc`. `document_path`: [single design doc path]. `code_paths`: [collected filesModified list]."
123
+ 2. Spawn security-reviewer agent: "Design Doc: [path(s)]. Implementation files: [collected filesModified list]. Review security compliance."
124
+ 3. Consolidate results:
125
+ - each code-verifier run passes when `summary.status` is `consistent` or `mostly_consistent`
126
+ - a code-verifier run fails when `summary.status` is `needs_review` or `inconsistent`
127
+ - security-reviewer passes when `status` is `approved` or `approved_with_notes`
128
+ - security-reviewer fails when `status` is `needs_revision`
129
+ - security-reviewer `blocked` -> Escalate to user
130
+ 4. If any verifier fails:
131
+ - Create a single fix task covering verifier discrepancies and security requiredFixes
132
+ - Spawn the layer-appropriate task-executor
133
+ - Spawn the layer-appropriate quality-fixer
134
+ - Re-run only the verifier(s) that failed
135
+ - Maximum retry count is 1 verification fix cycle; if any failed verifier still fails after re-run, escalate to the user
136
+ 5. If all verifiers pass -> Proceed to completion report
127
137
 
128
138
  **[STOP -- BLOCKING]** Upon detecting ANY requirement changes, halt execution immediately.
129
139
  **CANNOT proceed until user explicitly confirms the change scope.**
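The retry rule in the hunk above (re-run only the verifiers that failed, at most one fix cycle, then escalate) can be sketched as follows. The `Verifier` model, the `verifyWithSingleFixCycle` function, and the `applyFixes` callback are hypothetical stand-ins for the spawned agents, not part of the package.

```typescript
// Hypothetical model: each verifier is a named function returning true on pass.
type Verifier = { name: string; run: () => boolean };

type CycleResult = { outcome: "complete" | "escalate"; rerun: string[] };

// applyFixes stands in for the consolidated fix task plus the quality-fixer run.
function verifyWithSingleFixCycle(
  verifiers: Verifier[],
  applyFixes: (failed: string[]) => void,
): CycleResult {
  const failed = verifiers.filter((v) => !v.run());
  if (failed.length === 0) return { outcome: "complete", rerun: [] };

  const failedNames = failed.map((v) => v.name);
  applyFixes(failedNames);

  // Re-run only the verifiers that failed; passing ones are not repeated.
  const stillFailing = failed.filter((v) => !v.run());
  return { outcome: stillFailing.length === 0 ? "complete" : "escalate", rerun: failedNames };
}

// Usage: one code-verifier run per Design Doc plus a single security-reviewer run.
let fixesApplied = false;
const runCounts = new Map<string, number>();
const counted = (name: string, pass: () => boolean): Verifier => ({
  name,
  run: () => {
    runCounts.set(name, (runCounts.get(name) ?? 0) + 1);
    return pass();
  },
});

const result = verifyWithSingleFixCycle(
  [
    counted("code-verifier:design-doc-a", () => true),
    counted("code-verifier:design-doc-b", () => fixesApplied),
    counted("security-reviewer", () => true),
  ],
  () => {
    fixesApplied = true;
  },
);
```

The sketch leaves out the security-reviewer `blocked` status, which the hunk routes straight to the user rather than into the fix cycle.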
@@ -127,14 +127,24 @@ ENFORCEMENT: Sub-agent prompts missing the constraint suffix MUST be re-issued w
127
127
  3. Quality-fixer MUST run after each executor (no skipping)
128
128
  4. Commit MUST execute when quality-fixer returns `status: "approved"` (do not defer to end)
129
129
 
130
- ### Security Review (After All Tasks Complete)
131
-
132
- After all task cycles finish, collect all `filesModified` from every task-executor/task-executor-frontend response (deduplicated), then invoke security-reviewer before the completion report:
133
- 1. Spawn security-reviewer agent: "Design Doc: [path(s)]. Implementation files: [collected filesModified list]. Review security compliance."
134
- 2. Check response:
135
- - `approved` or `approved_with_notes` -> Proceed to completion report (include notes if present)
136
- - `needs_revision` -> Spawn layer-appropriate task-executor with `requiredFixes`, then quality-fixer, then re-invoke security-reviewer
137
- - `blocked` -> Escalate to user
130
+ ### Post-Implementation Verification (After All Tasks Complete)
131
+
132
+ After all task cycles finish, collect all `filesModified` from every task-executor/task-executor-frontend response (deduplicated), then run both verification agents before the completion report:
133
+ 1. Spawn code-verifier once per Design Doc: "Verify implementation consistency against the Design Doc. `doc_type: design-doc`. `document_path`: [single design doc path]. `code_paths`: [collected filesModified list]."
134
+ 2. Spawn security-reviewer agent: "Design Doc: [path(s)]. Implementation files: [collected filesModified list]. Review security compliance."
135
+ 3. Consolidate results:
136
+ - each code-verifier run passes when `summary.status` is `consistent` or `mostly_consistent`
137
+ - a code-verifier run fails when `summary.status` is `needs_review` or `inconsistent`
138
+ - security-reviewer passes when `status` is `approved` or `approved_with_notes`
139
+ - security-reviewer fails when `status` is `needs_revision`
140
+ - security-reviewer `blocked` -> Escalate to user
141
+ 4. If any verifier fails:
142
+ - Create a single fix task covering verifier discrepancies and security requiredFixes
143
+ - Spawn the layer-appropriate task-executor
144
+ - Spawn the layer-appropriate quality-fixer
145
+ - Re-run only the verifier(s) that failed
146
+ - Maximum retry count is 1 verification fix cycle; if any failed verifier still fails after re-run, escalate to the user
147
+ 5. If all verifiers pass -> Proceed to completion report
138
148
 
139
149
  ### Test Information Communication
140
150
  After acceptance-test-generator execution, when calling work-planner, communicate:
@@ -108,14 +108,24 @@ After user grants "batch approval for entire implementation phase", enter autono
108
108
  3. Spawn quality-fixer (or quality-fixer-frontend) agent: "Quality check and fixes"
109
109
  4. git commit -> Execute on `status: "approved"`
110
110
 
111
- ### Security Review (After All Tasks Complete)
112
-
113
- After all task cycles finish, collect all `filesModified` from every executor response (task-executor and task-executor-frontend, deduplicated), then invoke security-reviewer before the completion report:
114
- 1. Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [collected filesModified list]. Review security compliance."
115
- 2. Check response:
116
- - `approved` or `approved_with_notes` -> Proceed to completion report (include notes if present)
117
- - `needs_revision` -> Spawn layer-appropriate executor (task-executor or task-executor-frontend per task filename routing) with `requiredFixes`, then layer-appropriate quality-fixer, then re-invoke security-reviewer
118
- - `blocked` -> Escalate to user
111
+ ### Post-Implementation Verification (After All Tasks Complete)
112
+
113
+ After all task cycles finish, collect all `filesModified` from every executor response (task-executor and task-executor-frontend, deduplicated), then run both verification agents before the completion report:
114
+ 1. Spawn code-verifier agent: "Verify implementation consistency against the Design Doc. `doc_type: design-doc`. `document_path`: [path]. `code_paths`: [collected filesModified list]."
115
+ 2. Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [collected filesModified list]. Review security compliance."
116
+ 3. Consolidate results:
117
+ - code-verifier passes when `summary.status` is `consistent` or `mostly_consistent`
118
+ - code-verifier fails when `summary.status` is `needs_review` or `inconsistent`
119
+ - security-reviewer passes when `status` is `approved` or `approved_with_notes`
120
+ - security-reviewer fails when `status` is `needs_revision`
121
+ - security-reviewer `blocked` -> Escalate to user
122
+ 4. If either verifier fails:
123
+ - Create a single fix task covering verifier discrepancies and security requiredFixes
124
+ - Spawn the layer-appropriate executor
125
+ - Spawn the layer-appropriate quality-fixer
126
+ - Re-run only the verifier(s) that failed
127
+ - Maximum retry count is 1 verification fix cycle; if any failed verifier still fails after re-run, escalate to the user
128
+ 5. If both verifiers pass -> Proceed to completion report
119
129
 
120
130
  ### Test Information Communication
121
131
  After acceptance-test-generator execution, when spawning work-planner, communicate:
@@ -35,7 +35,7 @@ Design Doc (uses most recent if omitted): $ARGUMENTS
35
35
  Identify Design Doc in docs/design/ and check implementation files via git diff.
36
36
 
37
37
  ### Step 2: Execute code-reviewer
38
- Spawn code-reviewer agent: "Validate Design Doc compliance for the implementation. Design Doc path: [path]. Implementation files: [git diff file list]. Review mode: full. Return structured JSON report with complianceRate, verdict, acceptanceCriteria, and qualityIssues."
38
+ Spawn code-reviewer agent: "Validate Design Doc compliance for the implementation. Design Doc path: [path]. Implementation files: [git diff file list]. Review mode: full. Return structured JSON report per your Output Format specification."
39
39
 
40
40
  **Store output as**: `$STEP_2_OUTPUT`
41
41
 
@@ -61,10 +61,16 @@ Spawn security-reviewer agent: "Design Doc: [path]. Implementation files: [file
61
61
  ```
62
62
  Code Compliance: [complianceRate from code-reviewer]
63
63
  Verdict: [verdict from code-reviewer]
64
+ Identifier Match Rate: [identifierMatchRate from code-reviewer]
64
65
  Acceptance Criteria:
65
- - [fulfilled] [item]
66
+ - [fulfilled] [item] (confidence: [high/medium/low])
66
67
  - [partially_fulfilled] [item]: [gap] — [suggestion]
67
68
  - [unfulfilled] [item]: [gap] — [suggestion]
69
+ Identifier Mismatches (show only mismatches; write `None` if all identifiers match):
70
+ - None
71
+ - [identifier]: DD=[designDocValue] Code=[codeValue] at [location] (confidence: [high/medium/low])
72
+ Quality Findings:
73
+ - [category] [location]: [description] — [rationale]
68
74
 
69
75
  Security Review: [status from security-reviewer]
70
76
  Findings by category:
@@ -69,7 +69,7 @@ The following subagents are available:
69
69
  10. **technical-designer**: ADR/Design Doc creation
70
70
  11. **work-planner**: Work plan creation from Design Doc and test skeletons
71
71
  12. **document-reviewer**: Single document quality and rule compliance check
72
- 13. **code-verifier**: Document-code consistency verification for review inputs
72
+ 13. **code-verifier**: Document-code consistency verification for review inputs and post-implementation verification
73
73
  14. **design-sync**: Design Doc consistency verification across multiple documents
74
74
  15. **acceptance-test-generator**: Generate integration and E2E test skeletons from Design Doc ACs
75
75
 
@@ -182,7 +182,7 @@ Subagents respond in JSON format. The final response from each JSON-returning su
182
182
  - **task-executor**: status (escalation_needed/completed), escalation_type (design_compliance_violation/similar_function_found/similar_component_found/investigation_target_not_found/out_of_scope_file/test_environment_not_ready), testsAdded, requiresTestReview
183
183
  - **quality-fixer**: status (approved/blocked). For blocked responses, discriminate by `reason`: specification conflicts use `blockingIssues[]`; execution prerequisites use `missingPrerequisites[]`, and each item provides its own `resolutionSteps`
184
184
  - **document-reviewer**: verdict.decision (approved/approved_with_conditions/needs_revision/rejected)
185
- - **code-verifier**: summary, discrepancies, reverseCoverage
185
+ - **code-verifier**: summary.status, summary.consistencyScore, discrepancies, reverseCoverage
186
186
  - **design-sync**: sync_status (CONFLICTS_FOUND/NO_CONFLICTS) — text format with [SUMMARY] block
187
187
  - **integration-test-reviewer**: status (approved/needs_revision/blocked), requiredFixes
188
188
  - **security-reviewer**: status (approved/approved_with_notes/needs_revision/blocked), findings, notes, requiredFixes
@@ -300,9 +300,9 @@ Batch approval -> Start autonomous execution mode
300
300
  -> Orchestrator: Execute git commit
301
301
  -> Check remaining tasks:
302
302
  - Yes -> next task
303
- - No -> security-reviewer: Security review
304
- - approved/approved_with_notes -> Completion report
305
- - needs_revision -> layer-appropriate task-executor: Security fixes -> quality-fixer -> security-reviewer
303
+ - No -> code-verifier + security-reviewer: Post-implementation verification
304
+ - all pass -> Completion report
305
+ - any fail -> layer-appropriate task-executor: Verification fixes -> quality-fixer -> re-run failed verifiers
306
306
  - blocked -> Escalate to user
307
307
  ```
308
308
 
@@ -321,6 +321,16 @@ Use the task loop defined in the autonomous execution diagram above. The canonic
321
321
  3. quality-fixer quality gate
322
322
  4. git commit on approval
323
323
 
324
+ ### Post-Implementation Verification Pass/Fail Criteria
325
+
326
+ | Verifier | Pass | Fail | Blocked |
327
+ |----------|------|------|---------|
328
+ | code-verifier | `summary.status` is `consistent` or `mostly_consistent` | `summary.status` is `needs_review` or `inconsistent` | — |
329
+ | security-reviewer | `status` is `approved` or `approved_with_notes` | `status` is `needs_revision` | `status` is `blocked` |
330
+
331
+ Re-run only verifiers that failed on the previous verification cycle.
332
+ Maximum retry count is 1 verification fix cycle. If any failed verifier still fails after the re-run, escalate to the user.
333
+
324
334
  ## Main Orchestrator Roles
325
335
 
326
336
  1. **State Management**: Track current phase, each subagent's state, and next action
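The pass/fail table added in the hunk above maps directly onto a small decision function. The sketch below takes its status strings from that table; the type names, `nextAction`, and the `fixCyclesUsed` counter are assumptions introduced for illustration.

```typescript
// Status values come from the criteria table; all type and function names are hypothetical.
type CodeVerifierStatus = "consistent" | "mostly_consistent" | "needs_review" | "inconsistent";
type SecurityStatus = "approved" | "approved_with_notes" | "needs_revision" | "blocked";
type Outcome = "pass" | "fail" | "blocked";
type NextAction = "completion_report" | "fix_cycle" | "escalate_to_user";

function codeVerifierOutcome(status: CodeVerifierStatus): Outcome {
  return status === "consistent" || status === "mostly_consistent" ? "pass" : "fail";
}

function securityOutcome(status: SecurityStatus): Outcome {
  if (status === "blocked") return "blocked";
  return status === "needs_revision" ? "fail" : "pass";
}

// fixCyclesUsed counts verification fix cycles already spent (the maximum is 1).
function nextAction(
  code: CodeVerifierStatus,
  security: SecurityStatus,
  fixCyclesUsed: number,
): NextAction {
  const outcomes: Outcome[] = [codeVerifierOutcome(code), securityOutcome(security)];
  if (outcomes.some((o) => o === "blocked")) return "escalate_to_user";
  if (outcomes.every((o) => o === "pass")) return "completion_report";
  return fixCyclesUsed >= 1 ? "escalate_to_user" : "fix_cycle";
}
```

For example, `nextAction("inconsistent", "approved", 0)` yields `"fix_cycle"`, while the same statuses with `fixCyclesUsed` set to 1 yield `"escalate_to_user"`.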
@@ -64,7 +64,7 @@ All tests MUST be:
 
  - **Independent**: No dependencies between tests
  - **Reproducible**: Same input always produces same output
- - **Fast**: Complete test suite runs in reasonable time
+ - **Fast**: Complete the full test suite within the project's accepted feedback window and flag suites that materially slow local iteration or CI
  - **Self-checking**: Clear pass/fail without manual verification
  - **Timely**: Written close to the code they test
 
@@ -162,7 +162,7 @@ Test names should clearly describe:
 
  - Use setup hooks to prepare test environment
  - Use teardown hooks to clean up resources
- - Keep setup minimal and focused
+ - Keep setup scoped to the data, dependencies, and fixtures required for the behavior under test
  - Ensure teardown runs even if test fails
 
  ## Mocking and Test Doubles
@@ -177,7 +177,7 @@ Test names should clearly describe:
  ### Mocking Principles [MANDATORY]
 
  - Mock at boundaries, not internally — use real implementations for internal utilities
- - Keep mocks simple and focused
+ - Keep each mock limited to the behavior the test needs to control or observe
  - Verify mock expectations when relevant
  - Use adapters for external libraries/frameworks you do not control
 
@@ -345,7 +345,7 @@ Eliminate tests that fail intermittently:
  - Add test for every bug fix
  - Maintain comprehensive test suite
  - Run full suite regularly
- - Don't delete tests without good reason
+ - Delete a test only when the covered behavior no longer exists or the same behavior is covered by a stronger test at the correct level
 
  ### Legacy Code
 
@@ -242,7 +242,8 @@ These annotations are used when planning and prioritizing test implementation.
  ## Constraints and Quality Standards
 
  **Mandatory Compliance**:
- - Output only test skeletons (prohibit implementation code, assertions, mocks)
+ - Output test skeletons only: verification points, expected results, and pass criteria
+ - Downstream consumers treat these skeletons as design artifacts rather than runnable tests
  - Clearly state verification points, expected results, and pass criteria for each test
  - Preserve original AC statements in comments (ensure traceability)
  - Stay within test budget; report if budget insufficient for critical tests
@@ -273,7 +274,7 @@ These annotations are used when planning and prioritizing test implementation.
  - Framework/Language: Auto-detect from existing test files
  - Placement: Identify test directory with project-specific patterns
  - Naming: Follow existing file naming conventions
- - Output: Test skeleton only (exclude implementation code)
+ - Output: Test skeletons only (follow Constraints and Quality Standards for the boundary)
 
  **File Operations**:
  - Existing files: Append to end, prevent duplication (check existing tests)
@@ -59,27 +59,79 @@ Skill Status:
  ## Workflow
 
  ### 1. Load Baseline
- Read the Design Doc and extract:
+ Read the Design Doc in full and extract:
  - Functional requirements and acceptance criteria (list each AC individually)
  - Architecture design and data flow
+ - Interface contracts (function signatures, API endpoints, data structures)
+ - Identifier specifications explicitly written in the Design Doc as exact values, literals, labels, or named fields (resource names, endpoint paths, configuration keys, error codes, schema/model names)
  - Error handling policy
  - Non-functional requirements
 
- ### 2. Map Implementation to Acceptance Criteria
+ ### 2. Map Implementation to Design Doc
+
+ #### 2-1. Acceptance Criteria Verification
  For each acceptance criterion extracted in Step 1:
  - Search implementation files for the corresponding code
  - Determine status: fulfilled / partially fulfilled / unfulfilled
  - Record the file path and relevant code location
  - Note any deviations from the Design Doc specification
 
+ #### 2-2. Identifier Verification
+ For each identifier specification extracted in Step 1:
+ 1. Search implementation files for the exact string
+ 2. Compare code values against the Design Doc specification
+ 3. Flag discrepancies such as missing references, misspellings, or inconsistent naming
+ 4. Evaluate every identifier and update overall totals for matched and mismatched results
+ 5. Emit only mismatches in `identifierVerification`, with `{ identifier, designDocValue, codeValue, location, match, confidence, evidence }`
+
+ Identifier extraction constraints:
+ - Only verify identifiers that are explicitly written in the Design Doc as exact values, literals, labeled fields, or code-facing names
+ - Do not infer identifiers from descriptive prose, conceptual summaries, or implied naming conventions
+ - If the Design Doc names a concept without an exact code-facing value, treat it as a normal Design Doc claim, not an identifier check
+
+ #### 2-3. Evidence Collection
+ For each acceptance criterion and identifier check:
+ 1. Primary evidence: direct implementation in source files
+ 2. Secondary evidence: corresponding tests
+ 3. Tertiary evidence: config, schemas, or type definitions
+
+ `agreeing sources` means multiple sources independently support the same determination about the same acceptance criterion or identifier. Naming overlap alone is NOT agreement; the evidence must support the same behavior, contract, or exact value match.
+
+ Assign confidence based on evidence count:
+ - high: 3+ agreeing sources
+ - medium: 2 agreeing sources
+ - low: 1 source only
+
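The confidence rule added above is a direct mapping from the number of agreeing sources to a label. A minimal Python sketch follows; the function name `confidence_from_evidence` is hypothetical, and rejecting a count below 1 is an assumption, since the added text does not say what happens when no source exists.

```python
def confidence_from_evidence(agreeing_sources: int) -> str:
    """Map a count of independently agreeing sources to a confidence label."""
    if agreeing_sources >= 3:
        return "high"
    if agreeing_sources == 2:
        return "medium"
    if agreeing_sources == 1:
        return "low"
    # Assumption: a determination with no supporting source is not rated at all.
    raise ValueError("at least one source is required to assign confidence")
```

The count passed in should already exclude sources that merely share a name with the identifier, because the added text states that naming overlap alone is not agreement.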
  ### 3. Assess Code Quality
- Read each implementation file and check:
- - Function length (ideal: <50 lines, max: 200 lines)
- - Nesting depth (ideal: <=3 levels, max: 4 levels)
- - Single responsibility adherence
- - Error handling implementation
- - Appropriate logging
- - Test coverage for acceptance criteria
+ Read each implementation file and evaluate:
+
+ #### 3-1. Structural Quality
+ For each implementation file, read the concrete functions, handlers, or components in scope and evaluate them against the active coding-rules skill:
+ - Function organization: flag `maintainability` when a single function mixes multiple distinct concerns such as validation, orchestration, persistence, and presentation formatting
+ - Control-flow clarity: flag `maintainability` when branches, nested conditions, or early-exit patterns make the execution path materially difficult to follow
+ - Single responsibility adherence: flag `maintainability` when a function or file has more than one primary responsibility
+ - Naming clarity: flag `maintainability` when ambiguous names materially obscure intent, domain meaning, or responsibility
+
+ #### 3-2. Error Handling and Reliability
+ Read error paths and boundary handling directly in the code:
+ - Error handling implementation: verify failures are either propagated explicitly or handled with context
+ - Explicit failure paths over silent suppression: flag `reliability` when errors are swallowed, converted to defaults without justification, or otherwise hidden from callers and operators
+ - Boundary validation: flag `reliability` when external input, deserialized data, or cross-system responses enter important logic without the validation implied by the Design Doc, type contracts, or code boundary shape
+
+ #### 3-3. Test Coverage for Acceptance Criteria
+ - For each fulfilled AC, check whether tests exercise the expected behavior
+
+ Classify each quality finding into one of:
+ - `dd_violation`: implementation deviates from the Design Doc
+ - `maintainability`: code structure impedes change or comprehension
+ - `reliability`: missing safeguards could cause runtime failure
+ - `coverage_gap`: acceptance criteria lack meaningful test verification
+
+ Each finding MUST include a rationale:
+ - `dd_violation`: what the Design Doc says vs what code does
+ - `maintainability`: the concrete maintenance or comprehension risk
+ - `reliability`: the failure scenario and triggering conditions
+ - `coverage_gap`: the untested AC and why coverage matters
 
  ### 4. Check Architecture Compliance
  Verify against the Design Doc architecture:
@@ -89,9 +141,10 @@ Verify against the Design Doc architecture:
  - No unnecessary duplicate implementations (Pattern 5 from ai-development-guide skill)
  - Existing codebase analysis section includes similar functionality investigation results
 
- ### 5. Calculate Compliance
+ ### 5. Calculate Compliance and Consolidate
  - Compliance rate = (fulfilled items + 0.5 x partially fulfilled items) / total AC items x 100
- - Compile all AC statuses, quality issues with specific locations
+ - Identifier match rate = matched identifiers / total identifiers x 100
+ - Compile all AC statuses, identifier results, and quality findings with specific locations
  - Determine verdict based on compliance rate
 
  ### 6. Return JSON Result
@@ -102,50 +155,100 @@ Return the JSON result as the final response. See Output Format for the schema.
  ```json
  {
    "complianceRate": "[X]%",
+   "identifierMatchRate": "[X]%",
    "verdict": "[pass/needs-improvement/needs-redesign]",
 
    "acceptanceCriteria": [
      {
        "item": "[acceptance criteria name]",
        "status": "fulfilled|partially_fulfilled|unfulfilled",
+       "confidence": "high|medium|low",
        "location": "[file:line, if implemented]",
+       "evidence": ["[source1: file:line]", "[source2: test file:line]"],
        "gap": "[what is missing or deviating, if not fully fulfilled]",
        "suggestion": "[specific fix, if not fully fulfilled]"
      }
    ],
 
-   "qualityIssues": [
+   "identifierVerification": [
+     {
+       "identifier": "[identifier name]",
+       "designDocValue": "[value specified in Design Doc]",
+       "codeValue": "[value found in code, or 'not found']",
+       "location": "[file:line]",
+       "confidence": "high|medium|low",
+       "evidence": ["[source1: file:line]", "[source2: config file:line]"],
+       "match": false
+     }
+   ],
+
+   "qualityFindings": [
      {
-       "type": "[long-function/deep-nesting/multiple-responsibilities]",
-       "location": "[filename:function]",
+       "category": "dd_violation|maintainability|reliability|coverage_gap",
+       "location": "[filename:function or file:line]",
+       "description": "[specific issue]",
+       "rationale": "[why this matters]",
        "suggestion": "[specific improvement]"
      }
    ],
 
+   "summary": {
+     "acsTotal": 0,
+     "acsFulfilled": 0,
+     "acsPartial": 0,
+     "acsUnfulfilled": 0,
+     "identifiersTotal": 0,
+     "identifiersMatched": 0,
+     "lowConfidenceItems": 0,
+     "findingsByCategory": {
+       "dd_violation": 0,
+       "maintainability": 0,
+       "reliability": 0,
+       "coverage_gap": 0
+     }
+   },
+
    "nextAction": "[highest priority action needed]"
  }
  ```
 
+ `identifierVerification` MUST include mismatches only. Use `summary.identifiersTotal` and `summary.identifiersMatched` for overall counts.
+
  ## Verdict Criteria
 
  - **90%+**: pass — Minor adjustments only
  - **70-89%**: needs-improvement — Critical gaps exist
  - **<70%**: needs-redesign — Major revision required
 
+ Lower the verdict by one level only when at least one identifier mismatch has confidence `medium` or `high`.
+
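The compliance-rate formula and the verdict thresholds above, together with the new one-level downgrade, can be combined in a few lines. This is a hedged Python sketch, not the agent's actual logic: the names `compliance_rate`, `verdict`, and `LEVELS` are hypothetical. Two points are assumptions because the text does not specify them: rates between 89% and 90% are treated as needs-improvement, and a verdict that is already needs-redesign stays there when downgraded.

```python
# Verdict levels ordered from best to worst, as listed under Verdict Criteria.
LEVELS = ["pass", "needs-improvement", "needs-redesign"]


def compliance_rate(fulfilled: int, partial: int, total: int) -> float:
    """(fulfilled + 0.5 x partially fulfilled) / total AC items x 100."""
    if total <= 0:
        raise ValueError("total AC items must be positive")
    # Multiply before dividing to limit floating-point rounding noise.
    return (fulfilled + 0.5 * partial) * 100 / total


def verdict(rate: float, mismatch_confidences: list[str]) -> str:
    """Pick the verdict from the rate, then apply the identifier downgrade."""
    if rate >= 90:
        level = 0
    elif rate >= 70:
        level = 1  # assumption: everything in [70, 90) is needs-improvement
    else:
        level = 2
    # Downgrade one level only for a medium- or high-confidence mismatch.
    if any(c in ("medium", "high") for c in mismatch_confidences):
        level = min(level + 1, len(LEVELS) - 1)  # assumption: floor at the worst level
    return LEVELS[level]
```

For example, 8 fulfilled and 2 partially fulfilled criteria out of 10 give a 90% rate, which this sketch reports as pass unless an identifier mismatch of medium or high confidence is present.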
  ## Important Notes
 
  ### Review Principles
  - Use Design Doc as single source of truth; evaluate independent of implementation context
+ - Every finding must include file:line evidence
+ - Low-confidence determinations must be explicit
+ - Convert abstract skill rules into concrete, code-backed review findings rather than restating the rule alone
  - Provide solutions, not just problems; quantify wherever possible
- - Acknowledge good implementations; present improvements as actionable items
 
  ## Completion Criteria
 
- - [ ] All acceptance criteria individually evaluated
- - [ ] Compliance rate calculated
+ - [ ] All acceptance criteria individually evaluated with confidence
+ - [ ] Identifier specifications verified against implementation
+ - [ ] Compliance rate and identifier match rate calculated
+ - [ ] Quality findings classified with rationale
  - [ ] Verdict determined
  - [ ] Final response is the JSON output
 
+ ## Output Self-Check
+
+ - [ ] Every AC determination cites evidence
+ - [ ] Identifier comparisons use exact strings from the Design Doc and code
+ - [ ] Low-confidence items are explicit
+ - [ ] Every quality finding includes category, rationale, and file:line
+ - [ ] Every maintainability or reliability finding is backed by code that was actually read, not inferred from naming alone
+ - [ ] identifierVerification contains mismatches only, and each mismatch includes confidence and evidence
+
  ### Escalation Criteria
  Recommend higher-level review when: Design Doc itself has deficiencies, security concerns discovered, or critical performance issues found.
 
@@ -55,7 +55,7 @@ Scale determination and required document details follow the principles in docum
  - **Medium**: 3-5 files, spanning multiple components
  - **Large**: 6+ files, architecture-level changes
 
- ADR conditions (contract system changes, data flow changes, architecture changes, external dependency changes) require ADR regardless of scale
+ Note: ADR conditions (contract system changes, data flow changes, architecture changes, external dependency changes) require ADR regardless of scale
 
  ### Important: Clear Determination Expressions
  MUST use the following expressions to show clear determinations:
@@ -162,7 +162,7 @@ Return the JSON result as the final response. See Output Format for the schema.
  - [ ] Do I understand the user's true purpose?
  - [ ] Have I properly estimated the impact scope?
  - [ ] Have I correctly determined ADR necessity?
- - [ ] Have I not overlooked technical risks?
+ - [ ] Have I identified all technical risks and dependencies?
  - [ ] Have I listed scopeDependencies for uncertain scale?
  - [ ] Final response is the JSON output
 
@@ -162,7 +162,7 @@ Select and execute files with pattern `docs/plans/tasks/*-task-*.md` that have u
  **Unavailable**: Escalate with `status: "escalation_needed"`, `reason: "test_environment_not_ready"`
 
  #### Pre-implementation Verification (Pattern 5 Compliant)
- 1. **Read relevant Design Doc sections** and understand accurately
+ 1. **Read relevant Design Doc sections** and extract interface contracts, data structures, dependency constraints, and verification expectations
  2. **Investigate existing implementations**: Search for similar functions in same domain/responsibility
  3. **Cross-check against Investigation Notes**: Ensure planned implementation is consistent with the observations recorded in the task file
  4. **Execute determination**: Determine continue/escalation per "Mandatory Judgment Criteria" above
@@ -80,7 +80,7 @@ Must be performed before Design Doc creation:
  - Search existing code for keywords related to planned component
  - Look for components with same domain, responsibilities, or UI patterns
  - Decision and action:
- - Similar component found → Use that component (do not create new component)
+ - Similar component found → Reuse, compose, or extend that component path and document the reuse decision
  - Similar component is technical debt → Create ADR improvement proposal before implementation
  - No similar component → Proceed with new implementation
 
@@ -97,7 +97,7 @@ Must be performed before Design Doc creation:
  - Search existing code for keywords related to planned functionality
  - Look for implementations with same domain, responsibilities, or configuration patterns
  - Decision and action:
- - Similar functionality found → Use that implementation (do not create new implementation)
+ - Similar functionality found → Reuse or extend that implementation path and document the reuse decision
  - Similar functionality is technical debt → Create ADR improvement proposal before implementation
  - No similar functionality → Proceed with new implementation
 
@@ -111,7 +111,7 @@ Execute file output immediately. Final approval is managed by the orchestrator r
  1. **Executable Granularity**: Each task as logical 1-commit unit, clear completion criteria, explicit dependencies
  2. **Built-in Quality**: Simultaneous test implementation, quality checks in each phase
  3. **Risk Management**: List risks and countermeasures in advance, define detection methods
- 4. **Ensure Flexibility**: Prioritize essential purpose, avoid excessive detail
+ 4. **Ensure Flexibility**: Prioritize essential purpose and include only details required for task execution, verification, and dependency management
  5. **Design Doc Compliance**: All task completion criteria derived from Design Doc specifications
  6. **Implementation Pattern Consistency**: When including implementation samples, MUST ensure strict compliance with Design Doc implementation approach
 
package/README.md CHANGED
@@ -4,9 +4,9 @@
  [![Agent Skills](https://img.shields.io/badge/Agent%20Skills-Spec%20Compliant-blue)](https://developers.openai.com/codex/skills/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 
- **End-to-end AI coding workflows for [Codex CLI](https://developers.openai.com/codex/cli)** — specialized subagents handle requirements, design, implementation, and quality checks so you get code with explicit design docs, test coverage, and commit-level traceability — not just raw generations.
+ **Structured agentic coding workflows for [OpenAI Codex CLI](https://developers.openai.com/codex/cli)** — specialized AI coding agents plan, implement, test, and review changes with traceable docs, task-level commits, and quality gates.
 
- Built on the [Agent Skills specification](https://developers.openai.com/codex/skills/) and [Codex subagents](https://developers.openai.com/codex/subagents). Works with the latest GPT models.
+ Built on the [Agent Skills specification](https://developers.openai.com/codex/skills/) and [Codex subagents](https://developers.openai.com/codex/subagents). Designed for long-running tasks, large refactors, and reviewable changes.
 
  ---
 
@@ -25,36 +25,45 @@ $recipe-implement Add user authentication with JWT
 
  `$` is Codex CLI's syntax for invoking a skill explicitly. Type `$recipe-` to see all available recipes via tab completion.
 
- The framework runs a structured workflow: requirements → design → task decomposition → TDD implementation → quality gates — all through specialized subagents.
+ Small changes stay lightweight. Larger tasks get structure: requirements → design → task decomposition → TDD implementation → quality gates.
+
+ codex-workflows is the Codex-native counterpart of [Claude Code Workflows](https://github.com/shinpr/claude-code-workflows): same document-driven development style, adapted for Codex CLI, subagents, and GPT models.
 
  ---
 
  ## Why codex-workflows?
 
- **Without codex-workflows:**
- - Code generation is inconsistent across large tasks
- - Requirements and design decisions are implicit and lost after the session
- - Refactoring and debugging become harder as context grows
+ Codex is already strong at one-shot implementation. The problem starts when a change spans multiple files, needs design decisions to stay visible, or has to survive review, testing, and follow-up edits.
+
+ For larger tasks, explicit planning changes the job from raw generation into verification against a design, a task breakdown, and acceptance criteria. That matters because review loops are more reliable than first-shot generation once scope and ambiguity grow.
+
+ codex-workflows adds the missing structure around those jobs:
+ - Traceable artifacts: PRD → Design Doc → Task → Commit
+ - Built-in TDD and quality gates before code is ready to commit
+ - Agent context separation for large refactors, migrations, and PR-sized changes
+ - Diagnosis and reverse-engineering flows for bugs and legacy code
+
+ ## Not Designed For
 
- **With codex-workflows:**
- - Every change is traceable: PRD → Design Doc → Task → Commit
- - Built-in TDD and quality gates catch regressions before commit
- - Large tasks stay structured and reviewable through agent context separation
+ - One-shot toy scripts or vibe-coding sessions where speed matters more than traceability
+ - Repositories that do not use tests, lint, builds, or reviewable commits
+ - Teams that do not want design docs, task breakdowns, or explicit quality gates
 
  ---
 
  ## What It Does
 
- A single request becomes a structured development process:
+ A single request becomes a structured development process. The framework chooses the level of ceremony based on scope:
+
+ | Scale | File Count | What Happens |
+ |-------|------------|-------------|
+ | Small | 1-2 | Simplified plan → direct implementation |
+ | Medium | 3-5 | Design Doc → work plan → task execution |
+ | Large | 6+ | PRD → ADR → Design Doc → test skeletons → work plan → autonomous execution |
 
- 1. **Understand** the problem (scale, constraints, affected files)
- 2. **Analyze the existing codebase** (dependencies, data layer, risk areas)
- 3. **Design** the solution (ADR, Design Doc with acceptance criteria)
- 4. **Break it into tasks** (atomic, 1 commit each)
- 5. **Implement with tests** (TDD per task)
- 6. **Run quality checks** (lint, test, build — no failing checks)
+ For larger work, the path usually looks like this: understand the problem, analyze the codebase, design the change, break it into atomic tasks, implement with tests, and run quality checks before commit.
 
- Each step is handled by a specialized subagent in its own context, preventing context pollution and reducing error accumulation in long-running tasks:
+ Each step is handled by a specialized subagent in its own context, using context engineering to prevent context pollution and reduce error accumulation in long-running tasks:
 
  ```
  User Request
@@ -96,6 +105,8 @@ Problem → investigator → verifier (ACH + Devil's Advocate) → solver → Ac
  Existing code → scope-discoverer (discoveredUnits + prdUnits) → prd-creator → code-verifier → document-reviewer → Design Docs
  ```
 
+ This works best when repository knowledge is explicit and local. Short `AGENTS.md` files can act as entry points, while design docs, plans, and task files hold the deeper instructions that agents need to execute reliably.
+
  ---
 
  ## Installation
@@ -103,7 +114,7 @@ Existing code → scope-discoverer (discoveredUnits + prdUnits) → prd-creator
  ### Requirements
 
  - [Codex CLI](https://developers.openai.com/codex/cli) (latest)
- - Node.js >= 20
+ - Node.js >= 22
 
  ### Install
 
@@ -266,16 +277,6 @@ Codex spawns these as needed during recipe execution. Each agent runs in its own
 
  ## How It Works
 
- ### Scale-Based Workflow Selection
-
- The framework automatically determines the right level of ceremony:
-
- | Scale | File Count | What Happens |
- |-------|------------|-------------|
- | Small | 1-2 | Simplified plan → direct implementation |
- | Medium | 3-5 | Design Doc → work plan → task execution |
- | Large | 6+ | PRD → ADR → Design Doc → test skeletons → work plan → autonomous execution |
-
  ### Autonomous Execution Mode
 
  After work plan approval, the framework enters guided autonomous execution with escalation points:
@@ -287,7 +288,8 @@ After work plan approval, the framework enters guided autonomous execution with
 
  ### Context Separation
 
- Each subagent runs in a fresh context. This matters because:
+ Each subagent runs in a fresh context. This context-engineering pattern keeps long-running agentic coding tasks legible and reviewable:
+ - generation and verification happen in separate contexts, reducing author bias and carry-over assumptions
  - **document-reviewer** reviews without the author's bias
  - **investigator** collects evidence without confirmation bias
  - **code-reviewer** validates compliance without implementation context
@@ -349,10 +351,6 @@ A: Yes. Edit the TOML files in `.codex/agents/` — change model, sandbox_mode,
 
  A: `$recipe-implement` is the universal entry point. It runs requirement-analyzer first, detects affected layers from the codebase, and automatically routes to backend, frontend, or fullstack flow. `$recipe-fullstack-implement` skips the detection and goes straight into the fullstack flow (separate Design Docs per layer, design-sync, layer-aware task execution). Use `$recipe-implement` when you're not sure; use `$recipe-fullstack-implement` when you know upfront that the feature spans both layers.
 
- **Q: How does this relate to Claude Code Workflows?**
-
- A: codex-workflows is the Codex-native counterpart of [Claude Code Workflows](https://github.com/shinpr/claude-code-workflows). Same development philosophy, adapted for Codex CLI's subagent architecture and GPT model family.
-
  **Q: Does this work with MCP servers?**
 
  A: Yes. Codex skills and subagents work alongside [MCP](https://developers.openai.com/codex/mcp) — skills operate at the instruction layer while MCP operates at the tool transport layer. You can add MCP servers to any agent's TOML configuration.
@@ -363,6 +361,19 @@ A: Subagents escalate to the user when they encounter design deviations, ambiguo
 
  ---
 
+ ## Design Rationale
+
+ <details>
+ <summary>Background reading behind the workflow design</summary>
+
+ - [Planning Is the Real Superpower of Agentic Coding](https://www.norsica.jp/blog/planning-superpower-agentic-coding) — why explicit planning turns large-task execution from raw generation into verification against a design and task breakdown
+ - [Why LLMs Are Bad at 'First Try' and Great at Verification](https://www.norsica.jp/blog/llm-verification-over-generation) — why review loops and session separation are more reliable than first-shot generation on complex work
+ - [Stop Putting Everything in AGENTS.md](https://www.norsica.jp/blog/stop-putting-everything-in-agents-md) — why `AGENTS.md` should stay lean while rules, docs, and task instructions live near the point of use
+
+ </details>
+
+ ---
+
  ## License
 
  MIT License — free to use, modify, and distribute.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "codex-workflows",
-   "version": "0.4.1",
+   "version": "0.4.3",
    "description": "Task-oriented agentic coding framework for OpenAI Codex CLI — skills, recipes, and subagents for structured development workflows",
    "license": "MIT",
    "author": "Shinsuke Kagawa",
@@ -22,9 +22,12 @@
      "agent-skills",
      "agentic-coding",
      "ai-coding",
+     "ai-coding-agent",
      "subagents",
      "multi-agent",
+     "harness-engineering",
      "context-engineering",
+     "ai-development-workflow",
      "tdd",
      "code-generation"
    ],