@exaudeus/workrail 3.27.0 → 3.29.0

Files changed (160)
  1. package/dist/console/assets/{index-FtTaDku8.js → index-BZ6HkxGf.js} +1 -1
  2. package/dist/console/index.html +1 -1
  3. package/dist/manifest.json +3 -3
  4. package/docs/README.md +57 -0
  5. package/docs/adrs/001-hybrid-storage-backend.md +38 -0
  6. package/docs/adrs/002-four-layer-context-classification.md +38 -0
  7. package/docs/adrs/003-checkpoint-trigger-strategy.md +35 -0
  8. package/docs/adrs/004-opt-in-encryption-strategy.md +36 -0
  9. package/docs/adrs/005-agent-first-workflow-execution-tokens.md +105 -0
  10. package/docs/adrs/006-append-only-session-run-event-log.md +76 -0
  11. package/docs/adrs/007-resume-and-checkpoint-only-sessions.md +51 -0
  12. package/docs/adrs/008-blocked-nodes-architectural-upgrade.md +178 -0
  13. package/docs/adrs/009-bridge-mode-single-instance-mcp.md +195 -0
  14. package/docs/adrs/010-release-pipeline.md +89 -0
  15. package/docs/architecture/README.md +7 -0
  16. package/docs/architecture/refactor-audit.md +364 -0
  17. package/docs/authoring-v2.md +527 -0
  18. package/docs/authoring.md +873 -0
  19. package/docs/changelog-recent.md +201 -0
  20. package/docs/configuration.md +505 -0
  21. package/docs/ctc-mcp-proposal.md +518 -0
  22. package/docs/design/README.md +22 -0
  23. package/docs/design/agent-cascade-protocol.md +96 -0
  24. package/docs/design/autonomous-console-design-candidates.md +253 -0
  25. package/docs/design/autonomous-console-design-review.md +111 -0
  26. package/docs/design/autonomous-platform-mvp-discovery.md +525 -0
  27. package/docs/design/claude-code-source-deep-dive.md +713 -0
  28. package/docs/design/console-cyberpunk-ui-discovery.md +504 -0
  29. package/docs/design/console-execution-trace-candidates-final.md +160 -0
  30. package/docs/design/console-execution-trace-candidates.md +211 -0
  31. package/docs/design/console-execution-trace-design-candidates-v2.md +113 -0
  32. package/docs/design/console-execution-trace-design-review.md +74 -0
  33. package/docs/design/console-execution-trace-discovery.md +394 -0
  34. package/docs/design/console-execution-trace-final-review.md +77 -0
  35. package/docs/design/console-execution-trace-review.md +92 -0
  36. package/docs/design/console-performance-discovery.md +415 -0
  37. package/docs/design/console-ui-backlog.md +280 -0
  38. package/docs/design/daemon-architecture-discovery.md +853 -0
  39. package/docs/design/daemon-design-candidates.md +318 -0
  40. package/docs/design/daemon-design-review-findings.md +119 -0
  41. package/docs/design/daemon-engine-design-candidates.md +210 -0
  42. package/docs/design/daemon-engine-design-review.md +131 -0
  43. package/docs/design/daemon-execution-engine-discovery.md +280 -0
  44. package/docs/design/daemon-gap-analysis.md +554 -0
  45. package/docs/design/daemon-owns-console-plan.md +168 -0
  46. package/docs/design/daemon-owns-console-review.md +91 -0
  47. package/docs/design/daemon-owns-console.md +195 -0
  48. package/docs/design/data-model-erd.md +11 -0
  49. package/docs/design/design-candidates-consolidate-dev-staleness.md +98 -0
  50. package/docs/design/design-candidates-walk-cache-depth-limit.md +80 -0
  51. package/docs/design/design-review-consolidate-dev-staleness.md +54 -0
  52. package/docs/design/design-review-walk-cache-depth-limit.md +48 -0
  53. package/docs/design/implementation-plan-consolidate-dev-staleness.md +142 -0
  54. package/docs/design/implementation-plan-walk-cache-depth-limit.md +141 -0
  55. package/docs/design/layer3b-ghost-nodes-design-candidates.md +229 -0
  56. package/docs/design/layer3b-ghost-nodes-design-review.md +93 -0
  57. package/docs/design/layer3b-ghost-nodes-implementation-plan.md +219 -0
  58. package/docs/design/list-workflows-latency-fix-plan.md +128 -0
  59. package/docs/design/list-workflows-latency-fix-review.md +55 -0
  60. package/docs/design/list-workflows-latency-fix.md +109 -0
  61. package/docs/design/native-context-management-api.md +11 -0
  62. package/docs/design/performance-sweep-2026-04.md +96 -0
  63. package/docs/design/routines-guide.md +219 -0
  64. package/docs/design/sequence-diagrams.md +11 -0
  65. package/docs/design/subagent-design-principles.md +220 -0
  66. package/docs/design/temporal-patterns-design-candidates.md +312 -0
  67. package/docs/design/temporal-patterns-design-review-findings.md +163 -0
  68. package/docs/design/test-isolation-from-config-file.md +335 -0
  69. package/docs/design/v2-core-design-locks.md +2746 -0
  70. package/docs/design/v2-lock-registry.json +734 -0
  71. package/docs/design/workflow-authoring-v2.md +1044 -0
  72. package/docs/design/workflow-docs-spec.md +218 -0
  73. package/docs/design/workflow-extension-points.md +687 -0
  74. package/docs/design/workrail-auto-trigger-system.md +359 -0
  75. package/docs/design/workrail-config-file-discovery.md +513 -0
  76. package/docs/docker.md +110 -0
  77. package/docs/generated/v2-lock-closure-plan.md +26 -0
  78. package/docs/generated/v2-lock-coverage.json +797 -0
  79. package/docs/generated/v2-lock-coverage.md +177 -0
  80. package/docs/ideas/backlog.md +3927 -0
  81. package/docs/ideas/design-candidates-mcp-resilience.md +208 -0
  82. package/docs/ideas/design-review-findings-mcp-resilience.md +119 -0
  83. package/docs/ideas/implementation_plan.md +249 -0
  84. package/docs/ideas/third-party-workflow-setup-design-thinking.md +1948 -0
  85. package/docs/implementation/02-architecture.md +316 -0
  86. package/docs/implementation/04-testing-strategy.md +124 -0
  87. package/docs/implementation/09-simple-workflow-guide.md +835 -0
  88. package/docs/implementation/13-advanced-validation-guide.md +874 -0
  89. package/docs/implementation/README.md +21 -0
  90. package/docs/integrations/claude-code.md +300 -0
  91. package/docs/integrations/firebender.md +315 -0
  92. package/docs/migration/v0.1.0.md +147 -0
  93. package/docs/naming-conventions.md +45 -0
  94. package/docs/planning/README.md +104 -0
  95. package/docs/planning/github-ticketing-playbook.md +195 -0
  96. package/docs/plans/README.md +24 -0
  97. package/docs/plans/agent-managed-ticketing-design.md +605 -0
  98. package/docs/plans/agentic-orchestration-roadmap.md +112 -0
  99. package/docs/plans/assessment-gates-engine-handoff.md +536 -0
  100. package/docs/plans/content-coherence-and-references.md +151 -0
  101. package/docs/plans/library-extraction-plan.md +340 -0
  102. package/docs/plans/mr-review-workflow-redesign.md +1451 -0
  103. package/docs/plans/native-context-management-epic.md +11 -0
  104. package/docs/plans/perf-fixes-design-candidates.md +225 -0
  105. package/docs/plans/perf-fixes-design-review-findings.md +61 -0
  106. package/docs/plans/perf-fixes-new-issues-candidates.md +264 -0
  107. package/docs/plans/perf-fixes-new-issues-review.md +110 -0
  108. package/docs/plans/prompt-fragments.md +53 -0
  109. package/docs/plans/ui-ux-workflow-design-candidates.md +120 -0
  110. package/docs/plans/ui-ux-workflow-discovery.md +100 -0
  111. package/docs/plans/ui-ux-workflow-review.md +48 -0
  112. package/docs/plans/v2-followup-enhancements.md +587 -0
  113. package/docs/plans/workflow-categories-candidates.md +105 -0
  114. package/docs/plans/workflow-categories-discovery.md +110 -0
  115. package/docs/plans/workflow-categories-review.md +51 -0
  116. package/docs/plans/workflow-discovery-model-candidates.md +94 -0
  117. package/docs/plans/workflow-discovery-model-discovery.md +74 -0
  118. package/docs/plans/workflow-discovery-model-review.md +48 -0
  119. package/docs/plans/workflow-source-setup-phase-1.md +245 -0
  120. package/docs/plans/workflow-source-setup-phase-2.md +361 -0
  121. package/docs/plans/workflow-staleness-detection-candidates.md +104 -0
  122. package/docs/plans/workflow-staleness-detection-review.md +58 -0
  123. package/docs/plans/workflow-staleness-detection.md +80 -0
  124. package/docs/plans/workflow-v2-design.md +69 -0
  125. package/docs/plans/workflow-v2-roadmap.md +74 -0
  126. package/docs/plans/workflow-validation-design.md +98 -0
  127. package/docs/plans/workflow-validation-roadmap.md +108 -0
  128. package/docs/plans/workrail-platform-vision.md +420 -0
  129. package/docs/reference/agent-context-cleaner-snippet.md +94 -0
  130. package/docs/reference/agent-context-guidance.md +140 -0
  131. package/docs/reference/context-optimization.md +284 -0
  132. package/docs/reference/example-workflow-repository-template/.github/workflows/validate.yml +125 -0
  133. package/docs/reference/example-workflow-repository-template/README.md +268 -0
  134. package/docs/reference/example-workflow-repository-template/workflows/example-workflow.json +80 -0
  135. package/docs/reference/external-workflow-repositories.md +916 -0
  136. package/docs/reference/feature-flags-architecture.md +472 -0
  137. package/docs/reference/feature-flags.md +349 -0
  138. package/docs/reference/god-tier-workflow-validation.md +272 -0
  139. package/docs/reference/loop-optimization.md +209 -0
  140. package/docs/reference/loop-validation.md +176 -0
  141. package/docs/reference/loops.md +465 -0
  142. package/docs/reference/mcp-platform-constraints.md +59 -0
  143. package/docs/reference/recovery.md +88 -0
  144. package/docs/reference/releases.md +177 -0
  145. package/docs/reference/troubleshooting.md +105 -0
  146. package/docs/reference/workflow-execution-contract.md +998 -0
  147. package/docs/roadmap/README.md +22 -0
  148. package/docs/roadmap/legacy-planning-status.md +103 -0
  149. package/docs/roadmap/now-next-later.md +70 -0
  150. package/docs/roadmap/open-work-inventory.md +389 -0
  151. package/docs/tickets/README.md +39 -0
  152. package/docs/tickets/next-up.md +76 -0
  153. package/docs/workflow-management.md +317 -0
  154. package/docs/workflow-templates.md +423 -0
  155. package/docs/workflow-validation.md +184 -0
  156. package/docs/workflows.md +254 -0
  157. package/package.json +3 -1
  158. package/spec/authoring-spec.json +61 -16
  159. package/workflows/workflow-for-workflows.json +252 -93
  160. package/workflows/workflow-for-workflows.v2.json +188 -77
@@ -0,0 +1,1451 @@
1
+ # MR Review Workflow Redesign
2
+
3
+ ## Status
4
+
5
+ This is a working design document for redesigning `workflows/mr-review-workflow.agentic.v2.json`.
6
+
7
+ It is intentionally ahead of the current workflow JSON. The goal is to converge on the right workflow shape here first, then update the bundled workflow once the design feels solid.
8
+
9
+ ## Problem Statement
10
+
11
+ The current MR review workflow is meaningfully better than the older review flows, but it still has clear gaps.
12
+
13
+ It is strong on:
14
+
15
+ - reviewer-family parallelism
16
+ - contradiction-aware synthesis
17
+ - notes-first durability
18
+ - final validation and human-facing handoff
19
+
20
+ It is weaker than elite human review practice in the areas that determine whether the review is even operating on the right target and the right context:
21
+
22
+ - identifying the actual MR / PR rather than just a diff
23
+ - finding the true review boundary and merge base
24
+ - handling stacked branches and inherited changes
25
+ - reconstructing author intent from tickets and docs
26
+ - filtering noise and low-signal churn before deep review begins
27
+ - degrading gracefully when discovery surfaces are unavailable
28
+ - explaining to the user what the workflow could and could not access
29
+
30
+ ## Design Goal
31
+
32
+ Redesign the workflow so that it behaves more like a top-tier human reviewer:
33
+
34
+ 1. find the real review target
35
+ 2. find the real review boundary
36
+ 3. gather all realistically available context
37
+ 4. adapt review rigor to the shape of the change
38
+ 5. disclose confidence, uncertainty, and environment limitations clearly
39
+
40
+ ## Product Decisions
41
+
42
+ This section records the current recommended product direction for the redesign. These are stronger than brainstorming notes, but still revisable if later review reveals a better shape.
43
+
44
+ ### The workflow should try to find the actual MR, not just inspect local changes
45
+
46
+ The workflow should explicitly prefer discovering the actual MR / PR when possible.
47
+
48
+ Why:
49
+
50
+ - the PR often contains intent, scope, linked issues, reviewer discussion, and branch context that a raw diff does not
51
+ - PR metadata can help identify the correct review boundary
52
+ - strong human reviewers usually review the authored change in its review context, not just the visible patch
53
+
54
+ Recommended behavior:
55
+
56
+ - if the user already provides a PR URL or identifier, use that as the first-class review target
57
+ - if the user provides only a branch, patch, or vague review request, attempt to discover the corresponding MR / PR
58
+ - if no MR / PR can be found, continue with diff-based review rather than block
59
+ - if the review remains diff-only, lower context confidence and disclose the limitation in final handoff
60
+
61
+ ### The workflow should try to discover ticket and document context
62
+
63
+ The workflow should attempt to recover ticket and supporting-document context when available.
64
+
65
+ Priority sources include:
66
+
67
+ - linked issues or tickets from the PR or commit messages
68
+ - repo-local docs such as BRDs, RFCs, specs, rollout docs, migration docs, and product notes
69
+ - external systems reachable through installed CLI tools or MCPs
70
+ - web-retrieved docs when browsing is available and genuinely useful
71
+
72
+ Why:
73
+
74
+ - code review quality depends heavily on whether the implementation matches intended behavior, non-goals, constraints, and rollout expectations
75
+ - strong reviewers compare code against intent, not just against syntax and local correctness
76
+
77
+ Recommended behavior:
78
+
79
+ - attempt discovery opportunistically, without blocking the review
80
+ - extract intent, acceptance criteria, non-goals, risks, and rollout expectations into durable workflow context
81
+ - if ticket/doc context is missing, continue the review but lower context confidence
82
+
83
+ ### Missing enrichment sources should not block by default
84
+
85
+ The workflow should degrade gracefully when discovery surfaces are missing or insufficient.
86
+
87
+ Recommended behavior:
88
+
89
+ - do not block merely because `gh`, ticket MCPs, or web browsing are unavailable
90
+ - only block when the review target itself is missing or when the workflow cannot inspect enough material to review anything meaningful
91
+ - record what was accessible, what failed, what was not attempted, and why
92
+ - surface improvement suggestions in the final handoff, not as noisy mid-workflow nags unless the missing source is critical
93
+
94
+ ### The workflow should default to never-stop behavior for enrichment and confidence gaps
95
+
96
+ The redesign should make "continue with degraded confidence" the default behavior.
97
+
98
+ That means the workflow should not stop merely because it cannot:
99
+
100
+ - find the PR/MR
101
+ - find the ticket
102
+ - find supporting docs
103
+ - use `gh`
104
+ - use web browsing
105
+ - use ticket-system MCPs
106
+ - confidently establish the merge base on the first attempt
107
+
108
+ It should stop only when:
109
+
110
+ - there is no meaningful review target
111
+ - there is no inspectable material to review
112
+ - the user must provide a missing artifact that cannot be recovered any other way
113
+
114
+ This should be treated as a core product requirement, not a soft preference.
115
+
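The stop rule above can be read as a single predicate. The following is a minimal sketch, not WorkRail code: `ReviewState`, its field names, and `should_stop` are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewState:
    """Hypothetical snapshot of what the workflow has established so far."""

    has_review_target: bool
    has_inspectable_material: bool
    needs_unrecoverable_user_artifact: bool = False
    # Enrichment misses (PR, ticket, docs, gh, web, MCPs, merge base) are kept
    # for disclosure only; they never feed the stop decision.
    enrichment_gaps: tuple[str, ...] = ()


def should_stop(state: ReviewState) -> bool:
    """Return True only for the three hard-stop conditions listed above."""
    return (
        not state.has_review_target
        or not state.has_inspectable_material
        or state.needs_unrecoverable_user_artifact
    )
```

Under this sketch, a review with no PR, no ticket, and no `gh` still continues, because those misses only populate `enrichment_gaps`.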
116
+ ### The final handoff should include an explicit environment-status section
117
+
118
+ The workflow should end with a user-visible summary of review environment quality.
119
+
120
+ That section should explain:
121
+
122
+ - what the workflow successfully accessed
123
+ - what it attempted but could not access
124
+ - what it never attempted
125
+ - how those gaps affected boundary confidence, context confidence, or final recommendation confidence
126
+ - what tooling or workflow habits would improve future reviews
127
+
128
+ This is important because graceful degradation only helps the user if they can actually see what degraded.
129
+
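One possible way to keep the environment-status section consistent is to render it from structured data. This is a sketch under assumptions: the `EnvironmentStatus` class, its fields, and the section labels are hypothetical and do not reflect an existing WorkRail contract.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EnvironmentStatus:
    """Hypothetical container for what the review could and could not reach."""

    accessed: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # source -> reason
    not_attempted: dict[str, str] = field(default_factory=dict)  # source -> reason
    confidence_impact: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render all five parts; empty parts are shown as "none", not omitted."""

        def bullets(items):
            return [f"- {item}" for item in items] or ["- none"]

        lines = ["## Environment status", "", "Accessed:"]
        lines += bullets(self.accessed)
        lines += ["", "Attempted but unavailable:"]
        lines += bullets(f"{src}: {why}" for src, why in self.failed.items())
        lines += ["", "Never attempted:"]
        lines += bullets(f"{src}: {why}" for src, why in self.not_attempted.items())
        lines += ["", "Effect on confidence:"]
        lines += bullets(self.confidence_impact)
        lines += ["", "Suggested improvements:"]
        lines += bullets(self.suggestions)
        return "\n".join(lines)
```

Printing empty parts explicitly is a deliberate choice in this sketch: it lets the reader tell "nothing failed" apart from "the section was skipped".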
130
+ ### Ancestor and merge-base handling must become first-class
131
+
132
+ The redesign should explicitly treat review-boundary detection as a core responsibility, not a hidden subtask.
133
+
134
+ Recommended behavior:
135
+
136
+ - determine candidate base branch or parent branch
137
+ - attempt to find the true merge base / ancestor
138
+ - detect stacked branches, stale branches, and divergent branches
139
+ - separate branch-specific changes from inherited or upstream changes
140
+ - produce an explicit boundary-confidence assessment
141
+
142
+ If confidence remains low:
143
+
144
+ - continue with best effort
145
+ - lower recommendation confidence when boundary uncertainty materially affects findings
146
+ - disclose the uncertainty clearly in final handoff
147
+
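As a rough illustration of boundary detection, the sketch below measures how far the branch head is from each candidate base using `git merge-base` and `git rev-list --count`, then grades confidence. The grading rule (nearest base wins, ties mean low confidence) is an assumed heuristic, and the function names are hypothetical.

```python
from __future__ import annotations

import subprocess


def commits_since_merge_base(base: str, head: str = "HEAD") -> int:
    """Count commits on `head` since its merge base with `base`.

    Must be run inside a git repository.
    """
    merge_base = subprocess.run(
        ["git", "merge-base", base, head],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    count = subprocess.run(
        ["git", "rev-list", "--count", f"{merge_base}..{head}"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return int(count)


def assess_boundary(distances: dict[str, int]) -> tuple[str | None, str]:
    """Pick the nearest candidate base and grade confidence (assumed heuristic).

    `distances` maps candidate base branch -> commits since the merge base.
    """
    if not distances:
        return None, "low"
    ranked = sorted(distances.items(), key=lambda item: item[1])
    best_base, best_distance = ranked[0]
    if len(ranked) == 1:
        # Nothing to compare against, so the choice is unchallenged, not proven.
        return best_base, "medium"
    if ranked[1][1] == best_distance:
        # Two bases are equally near: the true parent is ambiguous.
        return best_base, "low"
    return best_base, "high"
```

A stacked branch tends to show up here as a non-default candidate that is much nearer than the default branch, for example `{"main": 12, "feature/parent": 3}`.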
148
+ ### The workflow should adapt by change type, not just size
149
+
150
+ The redesign should keep QUICK / STANDARD / THOROUGH, but add structured change-shape adaptation.
151
+
152
+ This should influence:
153
+
154
+ - reviewer-family selection
155
+ - validation depth
156
+ - whether simulation or rollout analysis is needed
157
+ - how much false-positive suppression is required
158
+ - how much boundary follow-up is warranted
159
+
160
+ ### The workflow should account for repo rules, user preferences, and coding philosophy
161
+
162
+ The redesign should treat review-policy context as first-class review input.
163
+
164
+ That includes:
165
+
166
+ - repo conventions
167
+ - user-specified rules and preferences
168
+ - architecture guidance
169
+ - coding philosophy
170
+ - review-specific instructions
171
+ - any explicit team or project constraints discoverable from docs, workflow guidance, or user-provided rules
172
+
173
+ Why:
174
+
175
+ - a technically sharp review can still be wrong for the repo if it ignores the user’s rules and architectural preferences
176
+ - strong human reviewers adapt their evaluation to the standards of the codebase they are reviewing
177
+
178
+ Recommended behavior:
179
+
180
+ - attempt to discover policy context alongside ticket/doc context
181
+ - extract durable summaries of rules, conventions, and constraints into workflow context
182
+ - use this policy context to calibrate findings, recommendations, and false-positive suppression
183
+
184
+ ### This redesign is necessary, but not sufficient for elite human parity
185
+
186
+ This redesign would make the workflow materially stronger and much more trustworthy.
187
+
188
+ However, even after these changes, it will likely still lag the best human reviewers in:
189
+
190
+ - severity calibration
191
+ - historical reasoning
192
+ - large-review partitioning
193
+ - subtle product judgment under weak evidence
194
+
195
+ That means this redesign should be treated as a major step toward high-quality review, not the final endpoint.
196
+
197
+ ## Non-Goals
198
+
199
+ - changing engine behavior
200
+ - adding new MCP tool implementations in this pass
201
+ - forcing the workflow to block whenever enrichment sources are missing
202
+ - making the human-facing review doc canonical workflow state
203
+ - fully solving severity calibration, historical reasoning, or large-MR partitioning in the same pass
204
+
205
+ ## WorkRail-Native Authoring Opportunities We Should Use
206
+
207
+ The redesign should take fuller advantage of WorkRail's current v2 surface area instead of expressing everything as plain prompt prose.
208
+
209
+ ### Features with config, not just simple toggles
210
+
211
+ The workflow should likely use:
212
+
213
+ - `wr.features.mode_guidance`
214
+ - `wr.features.durable_recap_guidance`
215
+ - `wr.features.capabilities`
216
+ - `wr.features.output_contracts`
217
+
218
+ And it should consider using the configurable form where helpful, especially for:
219
+
220
+ - collapsed capability probes
221
+ - artifact-backed capability observations
222
+ - consistent enforcement of output contracts across blocking and never-stop modes
223
+
224
+ ### Template-anchored capability probes
225
+
226
+ Where the workflow needs to learn whether `delegation` or `web_browsing` is actually usable, it should prefer explicit template-anchored probing over handwritten duplicated probe prose.
227
+
228
+ The clearest existing fit is:
229
+
230
+ - `wr.templates.capability_probe`
231
+
232
+ paired with:
233
+
234
+ - `wr.contracts.capability_observation`
235
+
236
+ This is especially relevant for the early enrichment phase.
237
+
238
+ ### Prompt refs instead of duplicated guidance
239
+
240
+ The redesign should plan to use `wr.refs.*` snippets for repeated canonical guidance rather than copying the same durable-state or synthesis instructions into many steps.
241
+
242
+ High-value likely uses include:
243
+
244
+ - notes-first durability guidance
245
+ - synthesis-under-disagreement guidance
246
+ - parallelize-cognition / serialize-synthesis guidance
247
+ - adversarial challenge guidance
248
+
249
+ ### PromptBlocks as the default step shape
250
+
251
+ The workflow should favor `promptBlocks` over large single-string prompts for major phases.
252
+
253
+ That makes it easier to:
254
+
255
+ - keep the prompts deterministic
256
+ - expose clear `goal`, `constraints`, `procedure`, `outputRequired`, and `verify` structure
257
+ - attach reusable references and future feature injections cleanly
258
+
259
+ ### Conditions and loop contracts as first-class control flow
260
+
261
+ The redesign already wants loop-based contradiction and follow-up handling. It should lean into current v2 patterns more explicitly by:
262
+
263
+ - defining named conditions where that improves clarity
264
+ - using `wr.contracts.loop_control` consistently for loop decisions
265
+ - treating loop continuation as data, not prose
266
+
267
+ ### Decision-trace and never-stop semantics awareness
268
+
269
+ The workflow should be written with WorkRail's durable `blocked` / `gap_recorded` semantics in mind.
270
+
271
+ That means:
272
+
273
+ - blocking vs never-stop should be intentional per capability/input requirement
274
+ - missing preferred capabilities should degrade with durable disclosure
275
+ - important confidence-relevant misses should be representable as explicit gaps, not only narrative caveats
276
+
277
+ ### Auditor-style delegation, not only executor-style delegation
278
+
279
+ The subagent design docs strongly support the auditor model.
280
+
281
+ The MR review redesign should make fuller use of that by treating many delegations as:
282
+
283
+ - audits of the main agent's gathered context
284
+ - challenges to the main agent's current hypothesis
285
+ - verification of the current recommendation
286
+
287
+ rather than always delegating broad independent ownership of a phase.
288
+
289
+ This is especially valuable for:
290
+
291
+ - context completeness / depth audits
292
+ - boundary-confidence audits
293
+ - philosophy-alignment audits
294
+ - final recommendation validation
295
+
296
+ ### Routine reuse should be explicit
297
+
298
+ The redesign currently references a few routines conceptually, but it should make clearer use of the current routine catalog.
299
+
300
+ High-value candidates include:
301
+
302
+ - `routine-context-gathering`
303
+ - `routine-hypothesis-challenge`
304
+ - `routine-execution-simulation`
305
+ - `routine-philosophy-alignment`
306
+ - `routine-final-verification`
307
+
308
+ These should be treated as current reusable building blocks, not future ideas.
309
+
310
+ ### Direct execution vs delegation vs injection should be chosen deliberately
311
+
312
+ The routines guide gives three valid consumption modes:
313
+
314
+ - delegation to a WorkRail Executor
315
+ - direct execution by the current agent
316
+ - compile-time injection via routine templates
317
+
318
+ The redesign should decide per use case:
319
+
320
+ - delegate when independent cognitive perspective is valuable
321
+ - execute directly when overhead is unnecessary
322
+ - inject when step visibility, confirmation behavior, and session traceability matter
323
+
324
+ ### Extension points can improve customization without weakening orchestration
325
+
326
+ The redesign previously deferred extension points for readability. That was reasonable, but the current WorkRail extension-point model is strong enough that we should explicitly plan where bounded customization would add value.
327
+
328
+ The best candidates appear to be:
329
+
330
+ - reviewer-family bundle policy
331
+ - philosophy-alignment review
332
+ - final verification
333
+
334
+ The parent workflow should still own sequencing, loop control, and canonical synthesis.
335
+
336
+ ### AgentRole is underused
337
+
338
+ The redesign should consider a stronger workflow-level `agentRole` and selective step-level overrides for:
339
+
340
+ - boundary detective mode
341
+ - evidence-first synthesizer mode
342
+ - adversarial validator mode
343
+ - philosophy auditor mode
344
+
345
+ This is lower leverage than control flow and routines, but still worth using intentionally.
346
+
347
+ ## Structure-Balance Framework
348
+
349
+ The redesign should optimize for structured freedom rather than either extreme:
350
+
351
+ - not a loose "trust the model" review flow
352
+ - not a rigid form-filling bureaucracy
353
+
354
+ The workflow should be rigid where determinism, safety, or honesty matter, and adaptive where LLM reasoning quality matters most.
355
+
356
+ ### Keep rigid
357
+
358
+ These are the parts that should stay explicitly structured and hard to skip:
359
+
360
+ - phase boundaries
361
+ - minimum required outputs before advancing
362
+ - confidence reporting
363
+ - loop / follow-up triggers
364
+ - blocked vs never-stop semantics
365
+ - final handoff sections
366
+ - explicit disclosure of gaps and unknowns
367
+ - the rule that reviewer/subagent output is evidence, not canonical truth
368
+
369
+ These are the workflow invariants. They prevent omission, hidden drift, and fake certainty.
370
+
371
+ ### Keep semi-structured
372
+
373
+ These should have strong guidance and matrices, but not exhaustive decision automation:
374
+
375
+ - shape/type routing
376
+ - confidence combination rules
377
+ - severity calibration
378
+ - artifact vs context split
379
+ - when to delegate vs inject vs execute directly
380
+ - when policy-context should materially affect findings
381
+
382
+ These are the parts where structured heuristics help, but judgment still matters.
383
+
384
+ ### Keep adaptive
385
+
386
+ These should deliberately leave room for model creativity and non-obvious reasoning:
387
+
388
+ - exploration order
389
+ - which evidence sources seem most promising first
390
+ - how to connect clues across PR, code, docs, history, and repo patterns
391
+ - how to synthesize multiple weak signals into a coherent concern
392
+ - how to phrase findings for maximum clarity and usefulness
393
+ - when an unusual MR deserves extra scrutiny beyond the default routing heuristics
394
+
395
+ These are the parts where LLMs can outperform rigid scripts.
396
+
397
+ ### Matrix and field admission rule
398
+
399
+ A matrix, field, or ledger element earns its place only if it does at least one of these:
400
+
401
+ - prevents a real recurring failure mode
402
+ - improves deterministic control flow or resumability
403
+ - improves user-visible honesty or explainability
404
+ - changes routing or review depth in a meaningful way
405
+
406
+ If it does none of those, it should be removed or downgraded to advisory guidance.
407
+
408
+ ### Preferred design bias
409
+
410
+ When in doubt:
411
+
412
+ - constrain outcomes, not cognition
413
+ - require explicit state, not rigid thought order
414
+ - keep taxonomies small
415
+ - prefer a few high-value matrices over many low-value classifications
416
+ - use structure to prevent omission, not to suppress intelligent exploration
417
+
418
+ ### Practical consequence for this redesign
419
+
420
+ This means:
421
+
422
+ - keep the confidence matrix
423
+ - keep the gap / non-blocking matrix
424
+ - keep the shape/type routing matrix
425
+ - keep the artifact vs context split
426
+ - avoid exploding shape/type categories beyond what actually changes behavior
427
+ - avoid adding ledgers or flags that do not affect routing, honesty, or final quality
428
+
429
+ ## Current Workflow Gaps
430
+
431
+ ### Review boundary correctness
432
+
433
+ The current workflow does not make review-boundary detection a first-class responsibility.
434
+
435
+ Missing or under-specified behavior:
436
+
437
+ - determine the actual PR/MR when possible
438
+ - identify candidate base branches
439
+ - find the true merge base / ancestor
440
+ - detect stacked branches
441
+ - detect stale or divergent branches
442
+ - separate branch-specific changes from inherited changes
443
+ - explain confidence in the chosen review surface
444
+
445
+ This is the highest-priority gap because a review can be thorough and still be wrong if it reviews the wrong surface.
446
+
447
+ ### Source discovery and context enrichment
448
+
449
+ The current workflow asks for MR purpose and ticket context, but it does not strongly instruct the agent to discover them from all available sources.
450
+
451
+ Missing or under-specified behavior:
452
+
453
+ - discover the actual PR body and metadata when available
454
+ - discover linked ticket / issue context
455
+ - discover repo-local specs, RFCs, design docs, rollout docs, and acceptance criteria
456
+ - search commit messages, branch names, and nearby docs for intent clues
457
+ - use web or other external sources only when available and useful
458
+
459
+ ### Capability-aware graceful degradation
460
+
461
+ The current workflow assumes tool-driven discovery in spirit, but it does not explicitly model discovery-surface availability or insufficiency.
462
+
463
+ Missing or under-specified behavior:
464
+
465
+ - probe whether GitHub CLI access is available
466
+ - probe whether ticket-system access exists
467
+ - probe whether web browsing is available
468
+ - probe whether repo-local docs exist and are discoverable
469
+ - record unavailable or insufficient sources without failing the whole review
470
+
471
+ ### Review-surface hygiene
472
+
473
+ The current workflow moves from context gathering to review too quickly.
474
+
475
+ Missing or under-specified behavior:
476
+
477
+ - classify generated files
478
+ - classify mechanical churn
479
+ - classify rename-only or move-only changes
480
+ - classify likely inherited upstream changes
481
+ - classify out-of-scope or low-signal material
482
+ - focus the fact packet on the true review surface rather than all visible changes equally
483
+
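Part of this classification can be done mechanically before any deep review. The sketch below parses one line of `git diff --name-status -M` output, where a status of `R100` marks a rename with identical content. The `GENERATED_PATTERNS` list and the category labels are hypothetical examples, not a recommended set.

```python
from __future__ import annotations

from fnmatch import fnmatch

# Hypothetical patterns for generated or low-signal files; tune per repository.
GENERATED_PATTERNS = ("*.lock", "package-lock.json", "dist/*", "*.min.js")


def classify_entry(line: str) -> tuple[str, str]:
    """Classify one tab-separated line of `git diff --name-status -M` output.

    Returns (path, category) where category is "rename_only", "generated",
    or "core". For renames, the new path is reported.
    """
    status, *paths = line.rstrip("\n").split("\t")
    path = paths[-1]
    if status.startswith("R") and status[1:] == "100":
        return path, "rename_only"
    if any(fnmatch(path, pattern) for pattern in GENERATED_PATTERNS):
        return path, "generated"
    return path, "core"
```

Inherited upstream changes and out-of-scope material cannot be detected from a single diff line; those need the boundary information gathered earlier in the review.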
484
+ ### Adaptation by change shape
485
+
486
+ The current workflow adapts mostly by review size and risk.
487
+
488
+ Missing or under-specified behavior:
489
+
490
+ - adapt reviewer-family selection by change type
491
+ - distinguish API changes from migrations, refactors, config edits, test-only changes, docs-only changes, security-sensitive changes, and performance-sensitive changes
492
+ - increase boundary rigor when ancestry is ambiguous
493
+ - reduce over-review for clearly mechanical or low-risk changes
494
+
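A first-pass change-shape signal can come from file paths alone. The sketch below covers only a subset of the shapes listed above (docs-only, test-only, migration, config, code); API, security-sensitive, and performance-sensitive changes need content inspection. All path conventions and labels here are assumptions.

```python
from __future__ import annotations

from pathlib import PurePosixPath

DOC_SUFFIXES = {".md", ".rst", ".txt"}
CONFIG_SUFFIXES = {".json", ".yml", ".yaml", ".toml"}


def shape_of(path: str) -> str:
    """Guess the shape of a single changed file from its path (assumed rules)."""
    p = PurePosixPath(path)
    parts = set(p.parts)
    if p.suffix in DOC_SUFFIXES or "docs" in parts:
        return "docs"
    if "test" in parts or "tests" in parts or ".test." in p.name or ".spec." in p.name:
        return "test"
    if "migrations" in parts:
        return "migration"
    if p.suffix in CONFIG_SUFFIXES:
        return "config"
    return "code"


def change_shape(paths: list[str]) -> str:
    """Collapse per-file shapes into one routing label for the whole change."""
    shapes = {shape_of(path) for path in paths}
    if not shapes:
        return "unknown"
    if shapes == {"docs"}:
        return "docs-only"
    if shapes == {"test"}:
        return "test-only"
    if "migration" in shapes:
        return "migration"
    if shapes <= {"config", "docs"}:
        return "config"
    return "code"
```

The label is meant as an input to routing (reviewer-family selection, validation depth), not as a verdict; a reviewer can still override it when the content disagrees with the paths.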
495
+ ### Final disclosure
496
+
497
+ The current final handoff does not strongly require the workflow to explain:
498
+
499
+ - what it successfully accessed
500
+ - what it attempted but could not access
501
+ - what it never attempted
502
+ - how those limits affected review quality
503
+ - what environment improvements would make future reviews stronger
504
+
505
+ ## Target Design Principles
506
+
507
+ ### Correctness before depth
508
+
509
+ A shallow review on the right boundary is better than a deep review on the wrong boundary.
510
+
511
+ ### Discover first, ask second
512
+
513
+ The workflow should aggressively use available tools and sources before asking the user for missing information.
514
+
515
+ ### Degrade gracefully
516
+
517
+ Missing enrichment sources should lower confidence and be disclosed, not automatically block the workflow.
518
+
519
+ ### Evidence over assumptions
520
+
521
+ The workflow should explicitly distinguish:
522
+
523
+ - directly observed facts
524
+ - inferred context
525
+ - missing evidence
526
+ - contradictory evidence
527
+
528
+ ### Human-readable truth, workflow-owned truth
529
+
530
+ Human-facing artifacts are useful, but durable workflow truth remains in notes and explicit context fields.
531
+
532
+ ### Review the change that matters
533
+
534
+ The workflow should separate core review surface from noise before deep analysis begins.
535
+
536
+ ### Honest confidence over false certainty
537
+
538
+ The workflow should prefer saying "I could not confidently establish the boundary" over quietly pretending it found the right ancestor.
539
+
540
+ ## Proposed Workflow Shape
541
+
542
+ ## Phase 0: Locate, Bound, Enrich, and Classify
543
+
544
+ This phase replaces the current front-half flow.
545
+
546
+ It should execute five structured sub-steps.
547
+
548
+ ### 0.1 Locate the review target
549
+
550
+ Determine, when possible:
551
+
552
+ - PR/MR URL or number
553
+ - branch name
554
+ - HEAD SHA
555
+ - diff source type
556
+ - whether the user provided:
557
+   - PR URL
558
+   - branch
559
+   - patch
560
+   - local diff
561
+   - only a vague review request
562
+
563
+ Recommended decision:
564
+
565
+ - if a discoverable PR/MR exists, treat it as the primary review target
566
+ - if no PR/MR exists or can be found, fall back to branch or diff review without blocking
567
+
568
+ ### 0.2 Find the true review boundary
569
+
570
+ Attempt to determine:
571
+
572
+ - candidate base branch
573
+ - merge base / ancestor
574
+ - whether the branch is stacked
575
+ - whether the branch is stale or divergent
576
+ - exact commits under review
577
+ - exact files under review
578
+ - inherited changes to exclude
579
+ - why the workflow believes this is the correct review surface
580
+
581
+ If the workflow cannot establish this confidently, it should:
582
+
583
+ - continue with best effort
584
+ - lower boundary confidence
585
+ - record warnings
586
+ - disclose the uncertainty in final handoff
587
+
588
+ This phase should be considered incomplete if it does not at least attempt merge-base / ancestor reasoning.
589
+
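The boundary rules above can be sketched as a small pure function. This is a minimal illustration, not WorkRail code: `BoundaryFacts`, `BoundaryResult`, `assessBoundary`, and the `Low` / `Medium` / `High` scale are hypothetical names, and the thresholds are assumptions. It takes facts the agent has already gathered and never blocks; it only lowers confidence and records warnings.

```typescript
// Hypothetical shapes for illustration; none of these names come from WorkRail.
type Confidence = "Low" | "Medium" | "High";

interface BoundaryFacts {
  baseCandidates: string[];        // candidate base branches; empty when none were found
  mergeBaseRef: string | null;     // outcome of merge-base / ancestor reasoning, if any
  stackedBranchSuspected: boolean;
  attemptedMergeBase: boolean;     // whether ancestor reasoning was tried at all
}

interface BoundaryResult {
  boundaryConfidence: Confidence;
  needsBoundaryFollowup: boolean;
  reviewScopeWarnings: string[];
  complete: boolean;               // false when merge-base reasoning was never attempted
}

function assessBoundary(facts: BoundaryFacts): BoundaryResult {
  const warnings: string[] = [];
  if (!facts.attemptedMergeBase) warnings.push("merge-base / ancestor reasoning was not attempted");
  if (facts.mergeBaseRef === null) warnings.push("no merge base could be established");
  if (facts.baseCandidates.length > 1) warnings.push("multiple candidate base branches");
  if (facts.stackedBranchSuspected) warnings.push("branch may be stacked on another branch");

  // Assumed thresholds: no merge base means Low; any other warning means Medium.
  let confidence: Confidence;
  if (facts.mergeBaseRef === null || !facts.attemptedMergeBase) confidence = "Low";
  else if (warnings.length > 0) confidence = "Medium";
  else confidence = "High";

  return {
    boundaryConfidence: confidence,
    needsBoundaryFollowup: confidence !== "High",
    reviewScopeWarnings: warnings,
    complete: facts.attemptedMergeBase,
  };
}
```

The result always carries a confidence value and a warning list, so the final handoff can disclose uncertainty even when the review continued on best effort.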
590
+ ### 0.3 Discover enrichments
591
+
592
+ Attempt to discover:
593
+
594
+ - PR metadata and body
595
+ - ticket / issue context
596
+ - repo-local product or design docs
597
+ - repo-local rules, conventions, and project guidance
598
+ - RFCs and specs
599
+ - rollout or migration docs
600
+ - acceptance criteria
601
+ - product risks and non-goals
602
+
603
+ The workflow should explicitly prefer recovering this context itself before asking the user for it.
604
+
605
+ Preferred discovery order:
606
+
607
+ 1. direct CLI / MCP surfaces
608
+ 2. repo-local docs and links
609
+ 3. branch names and commit messages
610
+ 4. PR body and issue links
611
+ 5. nearby documentation by naming convention
612
+ 6. web or browser access when available
613
+
614
+ The workflow should treat missing enrichments as confidence-relevant, not as automatic failure conditions.
615
+
616
+ It should treat policy-context discovery as part of enrichment, not as a separate optional nicety.
617
+
618
+ ### 0.4 Probe capability availability lazily
619
+
620
+ Without blocking unless correctness requires it, attempt to determine availability or insufficiency of:
621
+
622
+ - `delegation`
623
+ - `web_browsing`
624
+ - GitHub / PR CLI access
625
+ - ticket-system access
626
+ - repo-local docs access
627
+ - relevant attached artifacts
628
+
629
+ For workflow-global capabilities such as `delegation` and `web_browsing`, this should align with the v2 capability-observation model rather than inventing a custom side channel. Where useful, this likely means using existing patterns such as `wr.templates.capability_probe` and `wr.contracts.capability_observation`.
630
+
631
+ Discovery surfaces beyond first-class workflow capabilities, such as GitHub CLI, ticket systems, repo-local docs, or attached artifacts, should still be recorded durably as structured observations even if they are not modeled as top-level capability enums.
632
+
633
+ Each probed source should be recorded structurally as one of:
634
+
635
+ - `available`
636
+ - `unavailable`
637
+ - `not_attempted`
638
+ - `attempted_but_insufficient`
639
+
640
+ Where the final workflow authoring remains readable, the preferred implementation path is:
641
+
642
+ - `wr.features.capabilities` with collapsed probe visibility
643
+ - `wr.templates.capability_probe` for first-class capability checks
644
+ - artifact-backed recording via `wr.contracts.capability_observation`
645
+
646
+ ### 0.5 Classify
647
+
648
+ Classify based on:
649
+
650
+ - change size
651
+ - change shape
652
+ - change type
653
+ - risk level
654
+ - context completeness
655
+ - boundary confidence
656
+ - review-surface cleanliness
657
+
658
+ Set:
659
+
660
+ - `reviewMode`
661
+ - `shapeProfile`
662
+ - `riskLevel`
663
+ - `changeTypeProfile`
664
+ - `boundaryConfidence`
665
+ - `contextConfidence`
666
+ - `maxParallelism`
667
+ - `needsReviewerBundle`
668
+ - `needsSimulation`
669
+ - `needsBoundaryFollowup`
670
+ - `needsContextFollowup`
671
+ - `needsAuditorPass`
672
+
673
+ ## Phase 1: State Initial Review Hypothesis
674
+
675
+ This phase stays, but it should now be informed by:
676
+
677
+ - review boundary certainty
678
+ - source ledger findings
679
+ - discovered intent and acceptance criteria
680
+ - discovered policy context
681
+ - change-shape classification
682
+ - change-type classification
683
+
684
+ The agent should state:
685
+
686
+ - current recommendation direction
687
+ - primary concern area
688
+ - what evidence would most likely overturn the current view
689
+ - whether the largest risk is code correctness, review-boundary uncertainty, or missing context
690
+
691
+ ## Phase 2: Build Fact Packet and Review-Surface Ledger
692
+
693
+ The current fact-packet idea remains useful, but it should be expanded.
694
+
695
+ The workflow should build both:
696
+
697
+ - `reviewFactPacket`
698
+ - `reviewSurfaceLedger`
699
+
700
+ ### `reviewFactPacket`
701
+
702
+ Should include:
703
+
704
+ - MR title and purpose
705
+ - intended behavior change
706
+ - non-goals if discoverable
707
+ - ticket and doc-derived constraints
708
+ - repo and user policy constraints
709
+ - acceptance criteria
710
+ - affected modules, contracts, invariants, and consumers
711
+ - tests, rollout expectations, and migration expectations
712
+ - unresolved unknowns
713
+
714
+ ### `reviewSurfaceLedger`
715
+
716
+ Should include:
717
+
718
+ - exact review boundary description
719
+ - included commits
720
+ - excluded inherited commits
721
+ - core review surface files
722
+ - generated files
723
+ - mechanical churn
724
+ - rename-only / move-only files
725
+ - low-signal or out-of-scope files
726
+ - review-scope warnings
727
+
728
+ This step should also initialize a stronger coverage model and decide reviewer families using both change size and change type.
729
+
730
+ It should additionally record whether the review is operating with:
731
+
732
+ - strong boundary confidence
733
+ - weak boundary confidence
734
+ - strong intent/context confidence
735
+ - weak intent/context confidence
736
+
737
+ so later phases can adapt accordingly.
738
+
739
+ It should also persist whether policy-context confidence is:
740
+
741
+ - strong enough to evaluate against repo/user expectations
742
+ - weak enough that findings should be presented more cautiously
743
+
744
+ This phase is also a good place for an auditor-style context quality pass:
745
+
746
+ - a completeness-focused audit
747
+ - a depth-focused audit
748
+
749
+ If the workflow delegates these, they should audit the main agent's gathered packet rather than own the whole understanding phase.
750
+
751
+ ## Phase 3: Adaptive Reviewer-Family Bundle
752
+
753
+ Reviewer-family delegation should be selected using:
754
+
755
+ - `reviewMode`
756
+ - `riskLevel`
757
+ - `shapeProfile`
758
+ - `changeTypeProfile`
759
+ - `boundaryConfidence`
760
+ - `contextConfidence`
761
+
762
+ Examples:
763
+
764
+ - test-only change: lighter architecture scrutiny, stronger false-positive suppression
765
+ - migration change: stronger rollout, compatibility, and data-integrity scrutiny
766
+ - security-sensitive change: stronger runtime and adversarial review
767
+ - ambiguous boundary: stronger boundary-validation or context follow-up
768
+ - large mixed-shape change: stronger partitioning instincts and more cautious confidence
769
+ - mechanically noisy change: stronger noise suppression and lower appetite for style-only findings
770
+
771
+ Reviewer families should still be evidence producers, not decision makers.
772
+
773
+ The redesign should also distinguish between:
774
+
775
+ - reviewer-family execution work
776
+ - auditor-style critique of the current synthesis
777
+
778
+ Both are useful, but they are not the same cognitive unit.
779
+
780
+ The workflow should further strengthen:
781
+
782
+ - explicit pre-delegation hypothesis
783
+ - explicit post-delegation synthesis
784
+ - explicit rejection of weak or overreaching findings
785
+ - explicit handling of missed-issue and false-positive signals
786
+
787
+ This phase should explicitly consider use of:
788
+
789
+ - `routine-hypothesis-challenge` for adversarial reviewer challenge
790
+ - `routine-execution-simulation` when runtime behavior or branch-sensitive behavior is material
791
+ - `routine-philosophy-alignment` when policy-context is important enough to affect recommendation quality
792
+
793
+ ## Phase 4: Contradiction, Gap, and Boundary Resolution Loop
794
+
795
+ This should broaden the current contradiction loop into a more general resolution loop.
796
+
797
+ It should continue when there is material unresolved:
798
+
799
+ - reviewer disagreement
800
+ - coverage uncertainty
801
+ - false-positive risk
802
+ - boundary uncertainty
803
+ - context insufficiency
804
+
805
+ Targeted follow-up should be minimal and focused. The workflow should avoid re-running broad discovery unless it learns that the original boundary or context assumptions were wrong.
806
+
807
+ This loop is also where the workflow should reopen:
808
+
809
+ - merge-base reasoning when ancestry assumptions were weak
810
+ - ticket/doc discovery when missing context materially affects recommendation quality
811
+
812
+ ## Phase 5: Final Validation
813
+
814
+ The current final validation idea remains useful, but it should explicitly validate:
815
+
816
+ - recommendation strength
817
+ - severity calibration
818
+ - evidence quality
819
+ - operational / rollout concerns
820
+ - compatibility / migration risk
821
+ - whether unresolved context or boundary issues materially weaken the recommendation
822
+
823
+ Final validation should also ensure the handoff reflects uncertainty honestly instead of over-stating confidence.
824
+
825
+ The current WorkRail routine catalog suggests the redesign should strongly consider `routine-final-verification` as one of:
826
+
827
+ - a delegated verifier
828
+ - an injected routine template
829
+ - or a direct-execution structure borrowed into the final validation phase
830
+
831
+ ## Phase 6: Final Handoff and Environment Status
832
+
833
+ The final handoff should include both the review result and an explicit status report about the review environment.
834
+
835
+ ### Review result
836
+
837
+ Include:
838
+
839
+ - recommendation
840
+ - confidence band
841
+ - top findings
842
+ - rationale
843
+ - remaining uncertainties
844
+ - summary of review surface and excluded noise
845
+ - validation outcomes
846
+
847
+ ### Review environment status
848
+
849
+ Include:
850
+
851
+ - what the workflow accessed successfully
852
+ - what it attempted but could not access
853
+ - what it never attempted
854
+ - impact on review quality
855
+ - suggested environment improvements for future reviews
856
+
857
+ This should be informative, not accusatory and not blocking.
858
+
859
+ It should also explicitly state:
860
+
861
+ - whether the workflow found the actual PR/MR
862
+ - whether the workflow found ticket context
863
+ - whether the workflow found supporting docs
864
+ - whether the workflow is confident it reviewed the correct ancestor-relative surface
865
+
866
+ ## New Core Concepts
867
+
868
+ ## Review Source Ledger
869
+
870
+ The workflow should maintain a structured ledger describing where review context came from.
871
+
872
+ Suggested fields:
873
+
874
+ - `reviewTargetSource`
875
+ - `boundarySource`
876
+ - `mrMetadataSource`
877
+ - `ticketSource`
878
+ - `docSourcesFound`
879
+ - `docSourcesMissing`
880
+ - `policySourcesFound`
881
+ - `policySourcesMissing`
882
+ - `capabilityObservations`
883
+ - `contextGaps`
884
+
885
+ This ledger exists to improve both reasoning quality and final transparency.
886
+
887
+ It is still open whether this should be represented primarily as:
888
+
889
+ - explicit context keys
890
+ - a dedicated structured artifact
891
+ - or both, with context carrying only the routing-critical subset
892
+
893
+ If a dedicated artifact is used, the workflow should still keep routing-critical fields in context so conditions, loops, and later phases remain deterministic and lightweight.
894
+
895
+ ## Boundary Confidence Model
896
+
897
+ The workflow should model review-boundary certainty explicitly rather than burying it in prose.
898
+
899
+ Suggested fields:
900
+
901
+ - `baseCandidate`
902
+ - `mergeBaseConfidence`
903
+ - `stackedBranchSuspected`
904
+ - `reviewBoundaryConfidence`
905
+ - `boundaryResolutionMethod`
906
+ - `reviewScopeWarnings`
907
+ - `baseResolutionFailed`
908
+
909
+ This is likely one of the strongest predictors of whether the workflow will rival strong human review.
910
+
911
+ ## Change Type Profile
912
+
913
+ The workflow should classify the change into a structured profile rather than using only size/risk heuristics.
914
+
915
+ Suggested categories:
916
+
917
+ - `api_contract_change`
918
+ - `data_model_or_migration`
919
+ - `refactor`
920
+ - `infra_or_config`
921
+ - `test_only`
922
+ - `docs_only`
923
+ - `security_sensitive`
924
+ - `performance_sensitive`
925
+ - `ui_only`
926
+ - `mechanical_or_generated`
927
+
928
+ This profile should influence reviewer-family selection, simulation choices, and validation depth.
929
+
930
+ ## Shape Profile
931
+
932
+ The workflow should classify MR shape separately from MR type.
933
+
934
+ Suggested categories:
935
+
936
+ - `tiny_isolated_change`
937
+ - `medium_localized_change`
938
+ - `broad_crosscutting_change`
939
+ - `stacked_branch_change`
940
+ - `mechanically_noisy_change`
941
+ - `mixed_signal_change`
942
+ - `migration_heavy_change`
943
+
944
+ This profile should influence:
945
+
946
+ - review partitioning strategy
947
+ - boundary follow-up depth
948
+ - reviewer-family breadth
949
+ - confidence calibration
950
+ - false-positive suppression
951
+
952
+ ## Review Surface Hygiene Model
953
+
954
+ The workflow should explicitly separate:
955
+
956
+ - `core_review_surface`
957
+ - `generated_files`
958
+ - `mechanical_churn`
959
+ - `rename_or_move_only`
960
+ - `likely_inherited_changes`
961
+ - `out_of_scope_or_noise`
962
+
963
+ Without this, large reviews will continue to waste attention and overproduce low-value findings.
964
+
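A minimal sketch of how the buckets above could be populated. The bucket names come from the list; everything else is an assumption for illustration: `ChangedFile`, `bucketFor`, `buildSurfaceLedger`, and especially the path heuristics (`dist/`, `build/`, `.min.js`, `package-lock.json`), which a real workflow would replace with repo-specific signals. `out_of_scope_or_noise` is deliberately left to agent judgment and is never assigned by the heuristics.

```typescript
type SurfaceBucket =
  | "core_review_surface"
  | "generated_files"
  | "mechanical_churn"
  | "rename_or_move_only"
  | "likely_inherited_changes"
  | "out_of_scope_or_noise";

// Hypothetical input shape: facts the agent has already gathered per file.
interface ChangedFile {
  path: string;
  renamedWithoutEdits: boolean;         // pure rename or move, no content change
  introducedByExcludedCommit: boolean;  // change belongs to an inherited commit
}

function bucketFor(file: ChangedFile): SurfaceBucket {
  if (file.introducedByExcludedCommit) return "likely_inherited_changes";
  if (file.renamedWithoutEdits) return "rename_or_move_only";
  // Assumed heuristics only; adjust per repository.
  if (/(^|\/)(dist|build)\//.test(file.path) || /\.min\.js$/.test(file.path)) return "generated_files";
  if (/(^|\/)package-lock\.json$/.test(file.path)) return "mechanical_churn";
  return "core_review_surface";
}

function buildSurfaceLedger(files: ChangedFile[]): Record<SurfaceBucket, string[]> {
  const ledger: Record<SurfaceBucket, string[]> = {
    core_review_surface: [],
    generated_files: [],
    mechanical_churn: [],
    rename_or_move_only: [],
    likely_inherited_changes: [],
    out_of_scope_or_noise: [],
  };
  for (const file of files) ledger[bucketFor(file)].push(file.path);
  return ledger;
}
```

Only `core_review_surface` would feed deep analysis by default; the other buckets are summarized in the handoff as excluded noise.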
965
+ ## Capability Observation Model
966
+
967
+ Capability probing should produce durable observations rather than vague narrative.
968
+
969
+ Suggested recorded dimensions:
970
+
971
+ - source name
972
+ - status
973
+ - attempt method
974
+ - limitation reason
975
+ - whether the limitation materially reduced review quality
976
+
977
+ For first-class workflow capabilities, the redesign should prefer the existing v2 capability-observation path rather than inventing a bespoke mechanism.
978
+
979
+ For non-capability discovery surfaces, the main requirement is still durable structured observation, but the exact storage form remains an open authoring decision.
980
+
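The recorded dimensions above, combined with the four probe statuses from Phase 0.4, could take a shape like the following. This is a sketch under stated assumptions: `SourceObservation`, `EnvironmentStatus`, and `summarizeEnvironment` are hypothetical names, and this is not the schema of `wr.contracts.capability_observation`. It shows how structured observations roll up into the environment-status section of the final handoff.

```typescript
type ProbeStatus = "available" | "unavailable" | "not_attempted" | "attempted_but_insufficient";

// Hypothetical record mirroring the suggested dimensions.
interface SourceObservation {
  source: string;                 // e.g. "github_cli", "ticket_system"
  status: ProbeStatus;
  attemptMethod?: string;
  limitationReason?: string;
  reducedReviewQuality: boolean;
}

interface EnvironmentStatus {
  accessed: string[];
  attemptedButLimited: string[];
  neverAttempted: string[];
  qualityImpacts: string[];
}

function summarizeEnvironment(observations: SourceObservation[]): EnvironmentStatus {
  const summary: EnvironmentStatus = {
    accessed: [],
    attemptedButLimited: [],
    neverAttempted: [],
    qualityImpacts: [],
  };
  for (const obs of observations) {
    if (obs.status === "available") summary.accessed.push(obs.source);
    else if (obs.status === "not_attempted") summary.neverAttempted.push(obs.source);
    else summary.attemptedButLimited.push(obs.source); // unavailable or insufficient
    if (obs.reducedReviewQuality) {
      summary.qualityImpacts.push(`${obs.source}: ${obs.limitationReason ?? "no reason recorded"}`);
    }
  }
  return summary;
}
```

Because an unavailable source is just another record, probing never fails the review; it only changes what the handoff discloses.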
981
+ ## Suggested Top-Level Capability Direction
982
+
983
+ At the workflow level, the redesign likely wants:
984
+
985
+ ```json
986
+ {
987
+   "capabilities": {
988
+     "delegation": "preferred",
989
+     "web_browsing": "preferred"
990
+   }
991
+ }
992
+ ```
993
+
994
+ The workflow should still treat GitHub CLI, ticket systems, and repo-local docs as discovery surfaces to probe rather than first-class capability enums.
995
+
996
+ The final workflow should likely also use feature config intentionally, rather than relying on capability declarations alone.
997
+
998
+ Example direction:
999
+
1000
+ - `wr.features.capabilities` to standardize probing behavior
1001
+ - `wr.features.output_contracts` to standardize enforcement and disclosure behavior
1002
+
1003
+ ## Acceptance Criteria for the Redesign
1004
+
1005
+ The redesign should be considered successful if the future workflow:
1006
+
1007
+ 1. attempts to discover the actual MR/PR when possible
1008
+ 2. attempts to determine the true review boundary and exposes confidence in that boundary
1009
+ 3. records discovery-source availability and insufficiency durably
1010
+ 4. separates core review surface from noise before deep review
1011
+ 5. adapts reviewer selection using change shape as well as size/risk
1012
+ 6. uses final handoff to disclose access limits and their effect on confidence
1013
+ 7. remains non-blocking unless correctness truly requires user input or unavailable artifacts
1014
+ 8. keeps notes/context as workflow truth rather than making a review doc canonical
1015
+ 9. attempts merge-base / ancestor resolution even for stale or stacked branches
1016
+ 10. explicitly says when it is not confident it reviewed the correct surface
1017
+ 11. attempts to recover repo/user rules, conventions, and coding philosophy when available
1018
+ 12. uses policy-context confidence to calibrate how strongly it frames findings and recommendations
1019
+
1020
+ ## Risks and Tensions
1021
+
1022
+ ### Risk: overloaded Phase 0
1023
+
1024
+ This redesign concentrates five sub-steps (locate, bound, enrich, probe, classify) in the first phase.
1025
+
1026
+ Mitigation:
1027
+
1028
+ - keep the phase internally structured
1029
+ - use explicit sub-steps
1030
+ - require durable structured outputs, not just longer prose
1031
+ - use routines, templates, and auditors selectively so structure does not collapse into one giant handwritten prompt
1032
+
1033
+ ### Risk: environment-probing noise
1034
+
1035
+ Capability and source probing can become verbose or distracting.
1036
+
1037
+ Mitigation:
1038
+
1039
+ - probe lazily
1040
+ - record compactly
1041
+ - summarize cleanly in the final handoff
1042
+ - prefer collapsed capability probes and reusable probe templates where authoring stays readable
1043
+
1044
+ ### Risk: false precision in boundary confidence
1045
+
1046
+ The workflow may claim certainty it does not actually have.
1047
+
1048
+ Mitigation:
1049
+
1050
+ - require explicit reasoning for boundary confidence
1051
+ - record warnings when ancestry remains ambiguous
1052
+ - allow confidence downgrade without blocking
1053
+
1054
+ ### Risk: review-quality theater
1055
+
1056
+ The workflow could produce a polished review that looks rigorous while still lacking enough context to justify its confidence.
1057
+
1058
+ Mitigation:
1059
+
1060
+ - tie recommendation confidence to boundary confidence and context confidence
1061
+ - require the final handoff to name important unavailable sources
1062
+ - prefer explicit uncertainty over polished but misleading certainty
1063
+
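One way to tie recommendation confidence to boundary and context confidence is a simple cap: the recommendation can never be more confident than the weakest of the two. The rule, the `Low` / `Medium` / `High` scale, and the name `capRecommendationConfidence` are assumptions for illustration; the design only requires that the three values be linked.

```typescript
type Confidence = "Low" | "Medium" | "High";

const RANK: Record<Confidence, number> = { Low: 0, Medium: 1, High: 2 };

// Assumed rule: the recommendation is capped by the weaker of
// boundary confidence and context confidence.
function capRecommendationConfidence(
  proposed: Confidence,
  boundary: Confidence,
  context: Confidence,
): Confidence {
  let result = proposed;
  for (const limit of [boundary, context]) {
    if (RANK[limit] < RANK[result]) result = limit;
  }
  return result;
}
```

Under this rule, a review that looks polished but ran with `Low` context confidence cannot present a `High` confidence recommendation.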
1064
+ ### Risk: policy-context mismatch
1065
+
1066
+ The workflow could produce findings that are locally reasonable but misaligned with the user’s rules, repo conventions, or architectural philosophy.
1067
+
1068
+ Mitigation:
1069
+
1070
+ - discover policy context explicitly
1071
+ - record missing policy sources as confidence-relevant gaps
1072
+ - present findings more cautiously when policy context is weak
1073
+
1074
+ ### Risk: underusing WorkRail-native structure
1075
+
1076
+ The redesign could be conceptually strong but still author the final workflow as mostly handwritten prompts, leaving reuse, determinism, and customization power on the table.
1077
+
1078
+ Mitigation:
1079
+
1080
+ - prefer promptBlocks over long freeform prompts
1081
+ - use refs for repeated canonical guidance
1082
+ - use routines deliberately
1083
+ - use extension points only for bounded high-value seams
1084
+
1085
+ ## Assessment of the Proposed Shape
1086
+
1087
+ ### Is this the best shape?
1088
+
1089
+ This is the best next shape I would recommend today, though probably not the final one.
1090
+
1091
+ It addresses the most important structural weaknesses in the current workflow:
1092
+
1093
+ - wrong-surface review risk
1094
+ - weak intent reconstruction
1095
+ - insufficient graceful degradation
1096
+ - under-specified environment transparency
1097
+
1098
+ ### Will the review be thorough and useful?
1099
+
1100
+ Yes, this redesign should produce much more thorough and useful reviews than the current workflow, especially when the environment has enough discovery surfaces to enrich the review.
1101
+
1102
+ ### Will it rival the best human engineers?
1103
+
1104
+ Not reliably yet.
1105
+
1106
+ It should get much closer, but the best human reviewers still outperform in:
1107
+
1108
+ - nuanced severity judgment
1109
+ - historical and organizational context reconstruction
1110
+ - large-change decomposition
1111
+ - subtle product and rollout reasoning under ambiguity
1112
+
1113
+ ### Is it adaptable to the size of the changes?
1114
+
1115
+ Yes, and more importantly, the redesign makes it adaptable to both size and change shape.
1116
+
1117
+ That is a meaningful improvement over the current design, which is still too size/risk-centric.
1118
+
1119
+ ### Does it properly identify the correct ancestor?
1120
+
1121
+ Not yet in the current workflow.
1122
+
1123
+ In the redesigned workflow, ancestor and merge-base handling must become a required attempted behavior, with explicit confidence reporting when the result is uncertain.
1124
+
1125
+ ### Risk: overfitting reviewer families to categories
1126
+
1127
+ Too much change-type routing could make the workflow brittle.
1128
+
1129
+ Mitigation:
1130
+
1131
+ - keep a small change-type taxonomy
1132
+ - use it to influence, not fully determine, reviewer choice
1133
+
1134
+ ## Open Questions
1135
+
1136
+ ### Workflow-authoring questions
1137
+
1138
+ - Should capability observations use `wr.templates.capability_probe` / `wr.contracts.capability_observation` directly in the final workflow, or should some probes stay handwritten for readability?
1139
+ - Should non-capability source observations share the same artifact style, or live in a separate review-source ledger?
1140
+ - Should the review-source ledger be a dedicated artifact, explicit context fields, or both?
1141
+ - Should boundary-confidence handling live entirely inside Phase 0, or also have a reusable template/routine?
1142
+
1143
+ ### Product questions
1144
+
1145
+ - Should missing PR metadata reduce confidence mildly or strongly?
1146
+ - Should a very low boundary-confidence result reopen discovery automatically, or only surface a warning?
1147
+ - How strong should the end-of-workflow tooling recommendations be before they feel noisy rather than helpful?
1148
+
1149
+ ### Scope questions
1150
+
1151
+ - Should large-MR partitioning be part of this redesign, or explicitly deferred?
1152
+ - Should historical reasoning from prior commits or nearby blame/history be added now, or later?
1153
+ - Should severity-calibration improvements be bundled with this redesign, or follow after the boundary/context work lands?
1154
+
1155
+ ## Recommended Next Step
1156
+
1157
+ Use this document as the working source of truth until the design stabilizes.
1158
+
1159
+ Once the open questions are narrowed, the next step should be a second-pass revision of this document that:
1160
+
1161
+ - decides which fields are required durable context
1162
+ - decides which fields should be artifact-backed
1163
+ - defines the exact reviewer-family routing logic by change type
1164
+ - defines what Phase 0 must output before the workflow can advance
1165
+
1166
+ Only after that should `mr-review-workflow.agentic.v2.json` be updated.
1167
+
1168
+ ## Next Implementation Slice (Recommended)
1169
+
1170
+ The best next implementation pass should be **narrow, high-leverage, and compatibility-aware**.
1171
+
1172
+ Do **not** try to land the entire future-state redesign at once.
1173
+
1174
+ The recommended slice is:
1175
+
1176
+ 1. **replace the current Phase 0 with a real Locate / Bound / Enrich / Classify front phase**
1177
+ 2. **strengthen the fact packet so it includes review surface and discovered intent**
1178
+ 3. **add final environment-status disclosure**
1179
+ 4. **add minimal shape/type-aware routing where it clearly changes reviewer-family choice**
1180
+
1181
+ This slice should intentionally **not** try to solve every remaining elite-review gap in the same pass.
1182
+
1183
+ This section should be treated as the **canonical near-term implementation plan** for the redesign. If it conflicts with earlier broad design material, prefer this narrower slice for the next real workflow update.
1184
+
1185
+ ### Why this slice is best
1186
+
1187
+ It directly addresses the largest correctness and usefulness gaps:
1188
+
1189
+ - wrong review-boundary risk
1190
+ - weak MR/ticket/doc discovery
1191
+ - insufficient graceful degradation visibility
1192
+ - under-specified review-surface hygiene
1193
+
1194
+ without forcing the workflow into a giant taxonomy or a schema-fighting rewrite.
1195
+
1196
+ ## Engine vs Agent Responsibility Split
1197
+
1198
+ The next implementation slice should use WorkRail for **control, durability, and accountability**, while leaving investigation and synthesis flexible.
1199
+
1200
+ ### Engine should enforce
1201
+
1202
+ - phase boundaries
1203
+ - minimum required outputs
1204
+ - confidence/follow-up routing
1205
+ - durable disclosure of missing context and uncertainty
1206
+ - final handoff structure
1207
+
1208
+ ### Agent should own
1209
+
1210
+ - discovery path
1211
+ - source prioritization
1212
+ - tool choice
1213
+ - evidence synthesis
1214
+ - non-obvious issue detection
1215
+
1216
+ This redesign should constrain **what must be established or admitted**, not prescribe exact commands or investigative choreography.
1217
+
1218
+ ## Minimum Phase 0 Contract
1219
+
1220
+ The next pass should stop treating Phase 0 as mostly prose and make it a compact execution contract.
1221
+
1222
+ ### Minimum required context before advancing
1223
+
1224
+ These fields should always be set before Phase 0 completes:
1225
+
1226
+ - `reviewTargetKind` - enum-like classification such as `pr`, `branch`, `diff`, `patch`, or `unknown`
1227
+ - `reviewTargetSource` - short provenance label such as `user_link`, `gh_discovery`, `git_branch`, or `local_diff`
1228
+ - `reviewSurfaceSummary` - one compact prose summary of what is actually being reviewed
1229
+ - `reviewMode`
1230
+ - `riskLevel`
1231
+ - `boundaryConfidence`
1232
+ - `contextConfidence`
1233
+ - `shapeProfile`
1234
+ - `changeTypeProfile`
1235
+ - `needsReviewerBundle`
1236
+ - `needsBoundaryFollowup`
1237
+ - `needsContextFollowup`
1238
+ - `reviewScopeWarnings` - short list of warnings, not a large ledger
1239
+
1240
+ ### Required-if-known context
1241
+
1242
+ These fields should be set when they can be discovered without blocking:
1243
+
1244
+ - `prUrl`
1245
+ - `prNumber`
1246
+ - `baseCandidate`
1247
+ - `mergeBaseRef`
1248
+ - `ticketRefs` - list of ticket/issue identifiers when discoverable
1249
+ - `supportingDocsFound` - compact list of discovered supporting docs, not a boolean
1250
+ - `policySourcesFound` - compact list of repo/user policy sources, not a boolean
1251
+
1252
+ ### Minimum review-surface outputs
1253
+
1254
+ Phase 0 should also classify the visible change into a minimal review-surface shape.
1255
+
1256
+ It does **not** need a large ledger in the next slice, but it should at least distinguish:
1257
+
1258
+ - `coreReviewSurface`
1259
+ - `likelyNoiseOrMechanicalChurn`
1260
+ - `likelyInheritedOrOutOfScopeChanges`
1261
+
1262
+ This can stay compact. The important thing is that the workflow does not treat every visible changed file as equally worthy of deep review by default.
1263
+
1264
+ ### Minimum advance rule
1265
+
1266
+ Phase 0 may advance when all of these are true:
1267
+
1268
+ - there is a meaningful review target
1269
+ - there is inspectable material
1270
+ - `boundaryConfidence` is set
1271
+ - `contextConfidence` is set
1272
+ - `shapeProfile` is set
1273
+ - `changeTypeProfile` is set
1274
+ - the workflow has explicitly recorded whether boundary/context follow-up is needed
1275
+
1276
+ This keeps the workflow **non-blocking by default** while still making the phase structurally complete.
1277
+
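The advance rule above can be expressed as a check that returns the unmet conditions instead of a bare boolean, which makes gaps easy to disclose. This is a sketch: `Phase0Context` and `missingForAdvance` are hypothetical names, the two boolean inputs stand in for judgments the agent makes, and the field types are assumptions.

```typescript
type Confidence = "Low" | "Medium" | "High";

// Hypothetical context shape; optional fields are "not yet set".
interface Phase0Context {
  hasMeaningfulReviewTarget: boolean;
  hasInspectableMaterial: boolean;
  boundaryConfidence?: Confidence;
  contextConfidence?: Confidence;
  shapeProfile?: string;
  changeTypeProfile?: string;
  needsBoundaryFollowup?: boolean;
  needsContextFollowup?: boolean;
}

// Returns the list of unmet conditions; Phase 0 may advance when it is empty.
function missingForAdvance(ctx: Phase0Context): string[] {
  const missing: string[] = [];
  if (!ctx.hasMeaningfulReviewTarget) missing.push("meaningful review target");
  if (!ctx.hasInspectableMaterial) missing.push("inspectable material");
  if (ctx.boundaryConfidence === undefined) missing.push("boundaryConfidence");
  if (ctx.contextConfidence === undefined) missing.push("contextConfidence");
  if (ctx.shapeProfile === undefined) missing.push("shapeProfile");
  if (ctx.changeTypeProfile === undefined) missing.push("changeTypeProfile");
  if (ctx.needsBoundaryFollowup === undefined) missing.push("needsBoundaryFollowup");
  if (ctx.needsContextFollowup === undefined) missing.push("needsContextFollowup");
  return missing;
}
```

Note that a `Low` confidence value satisfies the rule: the gate checks that confidence was recorded, not that it is high.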
1278
+ ### Explicit fallback behavior
1279
+
1280
+ The next implementation slice should make these fallback behaviors explicit:
1281
+
1282
+ | Situation | Expected Phase 0 behavior |
1283
+ |---|---|
1284
+ | PR/MR not found, but branch/diff is inspectable | continue with branch/diff review, lower context confidence, disclose missing PR context later |
1285
+ | branch exists, but merge-base / ancestor is ambiguous | continue with downgraded boundary confidence, record boundary follow-up need, disclose the uncertainty later |
1286
+ | no ticket or supporting docs found | continue, lower context confidence, avoid overclaiming intent-sensitive findings |
1287
+ | only a patch/diff is available | continue if inspectable, but keep lower confidence on intent/boundary-dependent conclusions |
1288
+ | inspectable target is missing entirely | ask for the missing review artifact and stop |
1289
+
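The fallback table above maps naturally onto a small decision function. The sketch below is illustrative only: `Phase0Situation`, `Phase0Plan`, and `planPhase0` are hypothetical names, and the disclosure strings are placeholders. The only stopping case is a missing inspectable target; every other row continues with lowered confidence.

```typescript
interface Phase0Situation {
  inspectable: boolean;        // is there any inspectable review material?
  prFound: boolean;
  mergeBaseAmbiguous: boolean;
  ticketOrDocsFound: boolean;
  patchOnly: boolean;          // only a patch/diff is available
}

interface Phase0Plan {
  action: "continue" | "ask_for_artifact_and_stop";
  lowerContextConfidence: boolean;
  lowerBoundaryConfidence: boolean;
  needsBoundaryFollowup: boolean;
  disclosures: string[];
}

function planPhase0(s: Phase0Situation): Phase0Plan {
  const plan: Phase0Plan = {
    action: "continue",
    lowerContextConfidence: false,
    lowerBoundaryConfidence: false,
    needsBoundaryFollowup: false,
    disclosures: [],
  };
  if (!s.inspectable) {
    plan.action = "ask_for_artifact_and_stop";
    plan.disclosures.push("no inspectable review artifact was provided");
    return plan;
  }
  if (!s.prFound) {
    plan.lowerContextConfidence = true;
    plan.disclosures.push("PR/MR context was not found");
  }
  if (s.mergeBaseAmbiguous) {
    plan.lowerBoundaryConfidence = true;
    plan.needsBoundaryFollowup = true;
    plan.disclosures.push("merge-base / ancestor is ambiguous");
  }
  if (!s.ticketOrDocsFound) {
    plan.lowerContextConfidence = true;
    plan.disclosures.push("no ticket or supporting docs were found");
  }
  if (s.patchOnly) {
    plan.lowerContextConfidence = true;
    plan.lowerBoundaryConfidence = true;
    plan.disclosures.push("only a patch/diff was available");
  }
  return plan;
}
```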
1290
+ ## Narrow Artifact vs Context Decision
1291
+
1292
+ The next implementation slice should prefer **context-first** rather than introducing multiple new artifacts immediately.
1293
+
1294
+ ### Keep in context
1295
+
1296
+ Use context for routing-critical state:
1297
+
1298
+ - boundary confidence
1299
+ - context confidence
1300
+ - shape/type classification
1301
+ - follow-up triggers
1302
+ - base candidate / merge-base outcome
1303
+ - whether PR/ticket/docs were found
1304
+
1305
+ ### Add at most one optional artifact
1306
+
1307
+ If the workflow needs a human-readable artifact in the next pass, add only one optional artifact:
1308
+
1309
+ - `boundary-analysis`
1310
+
1311
+ Do **not** add multiple ledgers in the next implementation slice unless they clearly improve execution quality.
1312
+
1313
+ ## Minimal Routing Matrix
1314
+
1315
+ The next pass should use a **small routing table**, not a giant decision taxonomy.
1316
+
1317
+ For the next slice, keep these classifications intentionally small:
1318
+
1319
+ - `shapeProfile`: `isolated_change`, `crosscutting_change`, `mechanically_noisy_change`, `ambiguous_boundary`
1320
+ - `changeTypeProfile`: `api_contract_change`, `data_model_or_migration`, `security_sensitive`, `test_only`, `general_code_change`
1321
+
1322
+ Anything more detailed should stay out of the workflow until it clearly changes behavior enough to justify itself.
1323
+
1324
+ | Situation | Reviewer / follow-up bias |
1325
+ |---|---|
1326
+ | `boundaryConfidence = Low` | boundary/context follow-up before strong recommendation confidence |
1327
+ | `changeTypeProfile = api_contract_change` | stronger contract/consumer/backward-compatibility scrutiny |
1328
+ | `changeTypeProfile = data_model_or_migration` | stronger rollout / compatibility / simulation lens |
1329
+ | `changeTypeProfile = security_sensitive` | stronger adversarial/runtime-risk scrutiny and lower tolerance for weak evidence |
1330
+ | `changeTypeProfile = test_only` | lighter architecture scrutiny, stronger false-positive suppression |
1331
+ | `shapeProfile = mechanically_noisy_change` | stronger noise filtering, lower appetite for style-only findings |
1332
+ | `criticalSurfaceTouched = true` | include runtime/production-risk reviewer path |
1333
+
1334
+ If a classification does not change behavior at least this much, it should stay out of the workflow.
1335
+
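The routing table above is small enough to express as one function. The sketch below is illustrative only: `RoutingInput` and `routeReviewBias` are hypothetical names, and the returned strings simply restate the table's right-hand column.

```typescript
type Confidence = "High" | "Medium" | "Low";

type ShapeProfile =
  | "isolated_change"
  | "crosscutting_change"
  | "mechanically_noisy_change"
  | "ambiguous_boundary";

type ChangeTypeProfile =
  | "api_contract_change"
  | "data_model_or_migration"
  | "security_sensitive"
  | "test_only"
  | "general_code_change";

// Hypothetical input shape for this example.
interface RoutingInput {
  boundaryConfidence: Confidence;
  shapeProfile: ShapeProfile;
  changeTypeProfile: ChangeTypeProfile;
  criticalSurfaceTouched: boolean;
}

// Returns the reviewer / follow-up biases that apply; an empty list means no special routing.
function routeReviewBias(input: RoutingInput): string[] {
  const biases: string[] = [];
  if (input.boundaryConfidence === "Low") {
    biases.push("boundary/context follow-up before strong recommendation confidence");
  }
  switch (input.changeTypeProfile) {
    case "api_contract_change":
      biases.push("stronger contract/consumer/backward-compatibility scrutiny");
      break;
    case "data_model_or_migration":
      biases.push("stronger rollout / compatibility / simulation lens");
      break;
    case "security_sensitive":
      biases.push("stronger adversarial/runtime-risk scrutiny and lower tolerance for weak evidence");
      break;
    case "test_only":
      biases.push("lighter architecture scrutiny, stronger false-positive suppression");
      break;
    case "general_code_change":
      break; // the table has no row for this value
  }
  if (input.shapeProfile === "mechanically_noisy_change") {
    biases.push("stronger noise filtering, lower appetite for style-only findings");
  }
  if (input.criticalSurfaceTouched) {
    biases.push("include runtime/production-risk reviewer path");
  }
  return biases;
}

console.log(
  routeReviewBias({
    boundaryConfidence: "Low",
    shapeProfile: "isolated_change",
    changeTypeProfile: "test_only",
    criticalSurfaceTouched: true,
  })
);
```

Rows are additive in this sketch: several biases can apply to the same change. The table itself does not say whether rows combine, so treat that as an assumption.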
1336
+ ## Minimal Confidence Rules
1337
+
1338
+ The next implementation slice should use a small **self-assessment matrix** so the agent can gauge confidence dimension by dimension instead of relying on a vague overall impression.
1339
+
1340
+ ### Use a 5-dimension confidence assessment in the prompt
1341
+
1342
+ The next pass does **not** need a scoring engine. A compact assessment block in the prompt is enough for the near-term workflow update.
1343
+
1344
+ Longer term, this is a strong candidate for a native WorkRail **assessment / decision gate** primitive so the engine, not the agent, can apply the aggregation and routing rules deterministically.
1345
+
1346
+ This assessment should be treated as **decision support**, not automatic truth. It should help the agent gauge uncertainty consistently, cap or guide conclusions, and trigger follow-up when needed, but it should not replace synthesis judgment.
1347
+
1348
+ Recommended dimensions:
1349
+
1350
+ - `boundaryConfidence`
1351
+ - `intentConfidence`
1352
+ - `evidenceConfidence`
1353
+ - `coverageConfidence`
1354
+ - `consensusConfidence`
1355
+
1356
+ Recommended levels:
1357
+
1358
+ - `High`
1359
+ - `Medium`
1360
+ - `Low`
1361
+
1362
+ Recommended prompt shape:
1363
+
1364
+ - rate each dimension
1365
+ - explain each in one sentence
1366
+ - apply the aggregation rules
1367
+ - state final recommendation confidence and why
1368
+
1369
+ ### Anchors for each dimension
1370
+
1371
+ - **High**: strong direct support, little ambiguity
1372
+ - **Medium**: partial support, some important uncertainty
1373
+ - **Low**: weak support, major ambiguity, or likely missing context
1374
+
1375
+ ### Aggregation rules
1376
+
1377
+ - if `boundaryConfidence = Low`, final recommendation confidence = `Low`
1378
+ - else if `evidenceConfidence = Low`, final recommendation confidence = `Low`
1379
+ - else if 2 or more dimensions are `Medium`, final recommendation confidence = `Medium`
1380
+ - else if all key dimensions are `High`, final recommendation confidence = `High`
1381
+ - unresolved disagreement can only lower confidence, never raise it
1382
+
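The aggregation rules above can be sketched as a pure function. This is an illustration, not an engine feature, and it makes three assumptions the rules do not state: all five dimensions count as "key dimensions", any combination not covered by the listed rules (for example exactly one `Medium`) falls back to `Medium`, and unresolved disagreement lowers the result by exactly one level. The names `ConfidenceAssessment`, `aggregateConfidence`, and `lowerOneLevel` are hypothetical.

```typescript
type Confidence = "High" | "Medium" | "Low";

interface ConfidenceAssessment {
  boundaryConfidence: Confidence;
  intentConfidence: Confidence;
  evidenceConfidence: Confidence;
  coverageConfidence: Confidence;
  consensusConfidence: Confidence;
}

// Assumption: "lower" means stepping down one level; Low stays Low.
function lowerOneLevel(level: Confidence): Confidence {
  return level === "High" ? "Medium" : "Low";
}

function aggregateConfidence(
  a: ConfidenceAssessment,
  unresolvedDisagreement: boolean = false
): Confidence {
  const levels: Confidence[] = [
    a.boundaryConfidence,
    a.intentConfidence,
    a.evidenceConfidence,
    a.coverageConfidence,
    a.consensusConfidence,
  ];
  const mediumCount = levels.filter((l) => l === "Medium").length;

  let result: Confidence;
  if (a.boundaryConfidence === "Low") {
    result = "Low";
  } else if (a.evidenceConfidence === "Low") {
    result = "Low";
  } else if (mediumCount >= 2) {
    result = "Medium";
  } else if (levels.every((l) => l === "High")) {
    result = "High";
  } else {
    // Assumption: cases the rules leave open are treated conservatively as Medium.
    result = "Medium";
  }

  // Disagreement can only lower confidence, never raise it.
  return unresolvedDisagreement ? lowerOneLevel(result) : result;
}

const allHigh: ConfidenceAssessment = {
  boundaryConfidence: "High",
  intentConfidence: "High",
  evidenceConfidence: "High",
  coverageConfidence: "High",
  consensusConfidence: "High",
};
console.log(aggregateConfidence(allHigh));
console.log(aggregateConfidence({ ...allHigh, boundaryConfidence: "Low" }));
```

Writing the rules out this way exposes the cases they leave open, which would need an explicit decision before the rules move into a native assessment primitive.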
1383
+ ### Hard rules around the assessment
1384
+
1385
+ - if important supporting sources were unavailable, the final handoff must say so explicitly
1386
+ - if intent confidence is weak and a finding depends heavily on inferred intent, prefer the lower-confidence interpretation
1387
+ - if coverage confidence is weak, findings should be framed as more tentative even when local code evidence looks strong
1388
+ - if evidence later proves stronger or weaker than the initial read, synthesis may lower confidence further, but should not raise it above what the assessment justifies without an explicit reason
1389
+
1390
+ ### Follow-up triggers
1391
+
1392
+ - low boundary confidence -> boundary follow-up
1393
+ - low intent confidence with intent-sensitive findings -> context follow-up
1394
+ - low evidence confidence on a serious finding -> more validation or follow-up
1395
+ - low coverage confidence -> targeted coverage-expansion follow-up
1396
+ - unresolved contradictory reviewer output -> synthesis/follow-up loop
1397
+
1398
+ This should remain a **prompt-level self-assessment** in the next slice, not a large new workflow-state subsystem.
1399
+
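The follow-up triggers listed above map directly onto a few conditions. The sketch below is illustrative: the `FollowUp` labels, the `TriggerInput` shape, and the boolean flags `hasIntentSensitiveFindings`, `hasSeriousFinding`, and `hasUnresolvedContradiction` are hypothetical names introduced for this example.

```typescript
type Confidence = "High" | "Medium" | "Low";

// Hypothetical labels for the five follow-up kinds named in the list.
type FollowUp =
  | "boundary_follow_up"
  | "context_follow_up"
  | "validation_follow_up"
  | "coverage_expansion_follow_up"
  | "synthesis_follow_up_loop";

// Hypothetical input shape for this example.
interface TriggerInput {
  boundaryConfidence: Confidence;
  intentConfidence: Confidence;
  evidenceConfidence: Confidence;
  coverageConfidence: Confidence;
  hasIntentSensitiveFindings: boolean;
  hasSeriousFinding: boolean;
  hasUnresolvedContradiction: boolean;
}

function followUpTriggers(input: TriggerInput): FollowUp[] {
  const triggers: FollowUp[] = [];
  if (input.boundaryConfidence === "Low") {
    triggers.push("boundary_follow_up");
  }
  if (input.intentConfidence === "Low" && input.hasIntentSensitiveFindings) {
    triggers.push("context_follow_up");
  }
  if (input.evidenceConfidence === "Low" && input.hasSeriousFinding) {
    triggers.push("validation_follow_up");
  }
  if (input.coverageConfidence === "Low") {
    triggers.push("coverage_expansion_follow_up");
  }
  if (input.hasUnresolvedContradiction) {
    triggers.push("synthesis_follow_up_loop");
  }
  return triggers;
}

console.log(
  followUpTriggers({
    boundaryConfidence: "Low",
    intentConfidence: "Low",
    evidenceConfidence: "High",
    coverageConfidence: "Medium",
    hasIntentSensitiveFindings: false,
    hasSeriousFinding: true,
    hasUnresolvedContradiction: false,
  })
);
```

In the next slice the agent would apply these conditions itself from the prompt; the function form only shows that the triggers are simple enough not to need a new workflow-state subsystem.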
1400
+ ## Explicit Deferrals
1401
+
1402
+ The next implementation slice should **defer** these items unless the user explicitly wants a broader redesign:
1403
+
1404
+ - large-MR partitioning
1405
+ - historical reasoning / blame-style context
1406
+ - a full severity calibration framework
1407
+ - broad extension-point authoring
1408
+ - a multi-artifact ledger system
1409
+ - feature-config / capability-probe authoring that depends on unsupported schema/compiler support
1410
+
1411
+ These are still good ideas, but they are not the highest-leverage changes to land next.
1412
+
1413
+ ## Current Engine-Compatibility Constraint
1414
+
1415
+ The redesign doc is still intentionally ahead of the current engine in some places.
1416
+
1417
+ For the next real workflow update, avoid depending on authoring constructs that the current schema/compiler does not yet accept cleanly.
1418
+
1419
+ In particular:
1420
+
1421
+ - do not assume top-level `capabilities` is available unless schema support is added first
1422
+ - do not assume every v2 feature flag discussed in docs is currently accepted by the compiler
1423
+
1424
+ The next implementation slice should prefer current-schema-compatible prompts, context fields, and supported contracts unless the user explicitly wants engine/schema work as part of the same change.
1425
+
1426
+ ## Notes Quality Requirement
1427
+
1428
+ The workflow's notes should be useful for both:
1429
+
1430
+ - the user reading what happened
1431
+ - another agent resuming the review later
1432
+
1433
+ That means notes should read like compact **decision memos**, not raw logs or vague diary entries.
1434
+
1435
+ At minimum, the notes for important phases should make clear:
1436
+
1437
+ - what was learned
1438
+ - what was decided
1439
+ - what remains uncertain
1440
+ - what should happen next
1441
+
1442
+ ## Success Criteria for the Next Pass
1443
+
1444
+ The next MR-review workflow update should be considered successful if it lands all of these:
1445
+
1446
+ 1. it attempts to find the real PR/MR when available
1447
+ 2. it attempts merge-base / ancestor reasoning and reports confidence honestly
1448
+ 3. it discovers ticket/docs/policy context opportunistically without blocking by default
1449
+ 4. it separates review surface from obvious noise at least at a basic level
1450
+ 5. it adds a final environment-status summary that explains what was and was not accessible
1451
+ 6. it does all of the above without introducing bureaucratic over-structure or schema-incompatible authoring