@exaudeus/workrail 3.27.0 → 3.29.0
- package/dist/console/assets/{index-FtTaDku8.js → index-BZ6HkxGf.js} +1 -1
- package/dist/console/index.html +1 -1
- package/dist/manifest.json +3 -3
- package/docs/README.md +57 -0
- package/docs/adrs/001-hybrid-storage-backend.md +38 -0
- package/docs/adrs/002-four-layer-context-classification.md +38 -0
- package/docs/adrs/003-checkpoint-trigger-strategy.md +35 -0
- package/docs/adrs/004-opt-in-encryption-strategy.md +36 -0
- package/docs/adrs/005-agent-first-workflow-execution-tokens.md +105 -0
- package/docs/adrs/006-append-only-session-run-event-log.md +76 -0
- package/docs/adrs/007-resume-and-checkpoint-only-sessions.md +51 -0
- package/docs/adrs/008-blocked-nodes-architectural-upgrade.md +178 -0
- package/docs/adrs/009-bridge-mode-single-instance-mcp.md +195 -0
- package/docs/adrs/010-release-pipeline.md +89 -0
- package/docs/architecture/README.md +7 -0
- package/docs/architecture/refactor-audit.md +364 -0
- package/docs/authoring-v2.md +527 -0
- package/docs/authoring.md +873 -0
- package/docs/changelog-recent.md +201 -0
- package/docs/configuration.md +505 -0
- package/docs/ctc-mcp-proposal.md +518 -0
- package/docs/design/README.md +22 -0
- package/docs/design/agent-cascade-protocol.md +96 -0
- package/docs/design/autonomous-console-design-candidates.md +253 -0
- package/docs/design/autonomous-console-design-review.md +111 -0
- package/docs/design/autonomous-platform-mvp-discovery.md +525 -0
- package/docs/design/claude-code-source-deep-dive.md +713 -0
- package/docs/design/console-cyberpunk-ui-discovery.md +504 -0
- package/docs/design/console-execution-trace-candidates-final.md +160 -0
- package/docs/design/console-execution-trace-candidates.md +211 -0
- package/docs/design/console-execution-trace-design-candidates-v2.md +113 -0
- package/docs/design/console-execution-trace-design-review.md +74 -0
- package/docs/design/console-execution-trace-discovery.md +394 -0
- package/docs/design/console-execution-trace-final-review.md +77 -0
- package/docs/design/console-execution-trace-review.md +92 -0
- package/docs/design/console-performance-discovery.md +415 -0
- package/docs/design/console-ui-backlog.md +280 -0
- package/docs/design/daemon-architecture-discovery.md +853 -0
- package/docs/design/daemon-design-candidates.md +318 -0
- package/docs/design/daemon-design-review-findings.md +119 -0
- package/docs/design/daemon-engine-design-candidates.md +210 -0
- package/docs/design/daemon-engine-design-review.md +131 -0
- package/docs/design/daemon-execution-engine-discovery.md +280 -0
- package/docs/design/daemon-gap-analysis.md +554 -0
- package/docs/design/daemon-owns-console-plan.md +168 -0
- package/docs/design/daemon-owns-console-review.md +91 -0
- package/docs/design/daemon-owns-console.md +195 -0
- package/docs/design/data-model-erd.md +11 -0
- package/docs/design/design-candidates-consolidate-dev-staleness.md +98 -0
- package/docs/design/design-candidates-walk-cache-depth-limit.md +80 -0
- package/docs/design/design-review-consolidate-dev-staleness.md +54 -0
- package/docs/design/design-review-walk-cache-depth-limit.md +48 -0
- package/docs/design/implementation-plan-consolidate-dev-staleness.md +142 -0
- package/docs/design/implementation-plan-walk-cache-depth-limit.md +141 -0
- package/docs/design/layer3b-ghost-nodes-design-candidates.md +229 -0
- package/docs/design/layer3b-ghost-nodes-design-review.md +93 -0
- package/docs/design/layer3b-ghost-nodes-implementation-plan.md +219 -0
- package/docs/design/list-workflows-latency-fix-plan.md +128 -0
- package/docs/design/list-workflows-latency-fix-review.md +55 -0
- package/docs/design/list-workflows-latency-fix.md +109 -0
- package/docs/design/native-context-management-api.md +11 -0
- package/docs/design/performance-sweep-2026-04.md +96 -0
- package/docs/design/routines-guide.md +219 -0
- package/docs/design/sequence-diagrams.md +11 -0
- package/docs/design/subagent-design-principles.md +220 -0
- package/docs/design/temporal-patterns-design-candidates.md +312 -0
- package/docs/design/temporal-patterns-design-review-findings.md +163 -0
- package/docs/design/test-isolation-from-config-file.md +335 -0
- package/docs/design/v2-core-design-locks.md +2746 -0
- package/docs/design/v2-lock-registry.json +734 -0
- package/docs/design/workflow-authoring-v2.md +1044 -0
- package/docs/design/workflow-docs-spec.md +218 -0
- package/docs/design/workflow-extension-points.md +687 -0
- package/docs/design/workrail-auto-trigger-system.md +359 -0
- package/docs/design/workrail-config-file-discovery.md +513 -0
- package/docs/docker.md +110 -0
- package/docs/generated/v2-lock-closure-plan.md +26 -0
- package/docs/generated/v2-lock-coverage.json +797 -0
- package/docs/generated/v2-lock-coverage.md +177 -0
- package/docs/ideas/backlog.md +3927 -0
- package/docs/ideas/design-candidates-mcp-resilience.md +208 -0
- package/docs/ideas/design-review-findings-mcp-resilience.md +119 -0
- package/docs/ideas/implementation_plan.md +249 -0
- package/docs/ideas/third-party-workflow-setup-design-thinking.md +1948 -0
- package/docs/implementation/02-architecture.md +316 -0
- package/docs/implementation/04-testing-strategy.md +124 -0
- package/docs/implementation/09-simple-workflow-guide.md +835 -0
- package/docs/implementation/13-advanced-validation-guide.md +874 -0
- package/docs/implementation/README.md +21 -0
- package/docs/integrations/claude-code.md +300 -0
- package/docs/integrations/firebender.md +315 -0
- package/docs/migration/v0.1.0.md +147 -0
- package/docs/naming-conventions.md +45 -0
- package/docs/planning/README.md +104 -0
- package/docs/planning/github-ticketing-playbook.md +195 -0
- package/docs/plans/README.md +24 -0
- package/docs/plans/agent-managed-ticketing-design.md +605 -0
- package/docs/plans/agentic-orchestration-roadmap.md +112 -0
- package/docs/plans/assessment-gates-engine-handoff.md +536 -0
- package/docs/plans/content-coherence-and-references.md +151 -0
- package/docs/plans/library-extraction-plan.md +340 -0
- package/docs/plans/mr-review-workflow-redesign.md +1451 -0
- package/docs/plans/native-context-management-epic.md +11 -0
- package/docs/plans/perf-fixes-design-candidates.md +225 -0
- package/docs/plans/perf-fixes-design-review-findings.md +61 -0
- package/docs/plans/perf-fixes-new-issues-candidates.md +264 -0
- package/docs/plans/perf-fixes-new-issues-review.md +110 -0
- package/docs/plans/prompt-fragments.md +53 -0
- package/docs/plans/ui-ux-workflow-design-candidates.md +120 -0
- package/docs/plans/ui-ux-workflow-discovery.md +100 -0
- package/docs/plans/ui-ux-workflow-review.md +48 -0
- package/docs/plans/v2-followup-enhancements.md +587 -0
- package/docs/plans/workflow-categories-candidates.md +105 -0
- package/docs/plans/workflow-categories-discovery.md +110 -0
- package/docs/plans/workflow-categories-review.md +51 -0
- package/docs/plans/workflow-discovery-model-candidates.md +94 -0
- package/docs/plans/workflow-discovery-model-discovery.md +74 -0
- package/docs/plans/workflow-discovery-model-review.md +48 -0
- package/docs/plans/workflow-source-setup-phase-1.md +245 -0
- package/docs/plans/workflow-source-setup-phase-2.md +361 -0
- package/docs/plans/workflow-staleness-detection-candidates.md +104 -0
- package/docs/plans/workflow-staleness-detection-review.md +58 -0
- package/docs/plans/workflow-staleness-detection.md +80 -0
- package/docs/plans/workflow-v2-design.md +69 -0
- package/docs/plans/workflow-v2-roadmap.md +74 -0
- package/docs/plans/workflow-validation-design.md +98 -0
- package/docs/plans/workflow-validation-roadmap.md +108 -0
- package/docs/plans/workrail-platform-vision.md +420 -0
- package/docs/reference/agent-context-cleaner-snippet.md +94 -0
- package/docs/reference/agent-context-guidance.md +140 -0
- package/docs/reference/context-optimization.md +284 -0
- package/docs/reference/example-workflow-repository-template/.github/workflows/validate.yml +125 -0
- package/docs/reference/example-workflow-repository-template/README.md +268 -0
- package/docs/reference/example-workflow-repository-template/workflows/example-workflow.json +80 -0
- package/docs/reference/external-workflow-repositories.md +916 -0
- package/docs/reference/feature-flags-architecture.md +472 -0
- package/docs/reference/feature-flags.md +349 -0
- package/docs/reference/god-tier-workflow-validation.md +272 -0
- package/docs/reference/loop-optimization.md +209 -0
- package/docs/reference/loop-validation.md +176 -0
- package/docs/reference/loops.md +465 -0
- package/docs/reference/mcp-platform-constraints.md +59 -0
- package/docs/reference/recovery.md +88 -0
- package/docs/reference/releases.md +177 -0
- package/docs/reference/troubleshooting.md +105 -0
- package/docs/reference/workflow-execution-contract.md +998 -0
- package/docs/roadmap/README.md +22 -0
- package/docs/roadmap/legacy-planning-status.md +103 -0
- package/docs/roadmap/now-next-later.md +70 -0
- package/docs/roadmap/open-work-inventory.md +389 -0
- package/docs/tickets/README.md +39 -0
- package/docs/tickets/next-up.md +76 -0
- package/docs/workflow-management.md +317 -0
- package/docs/workflow-templates.md +423 -0
- package/docs/workflow-validation.md +184 -0
- package/docs/workflows.md +254 -0
- package/package.json +3 -1
- package/spec/authoring-spec.json +61 -16
- package/workflows/workflow-for-workflows.json +252 -93
- package/workflows/workflow-for-workflows.v2.json +188 -77
@@ -0,0 +1,1451 @@
# MR Review Workflow Redesign

## Status

This is a working design document for redesigning `workflows/mr-review-workflow.agentic.v2.json`.

It is intentionally ahead of the current workflow JSON. The goal is to converge on the right workflow shape here first, then update the bundled workflow once the design feels solid.

## Problem Statement

The current MR review workflow is meaningfully better than the older review flows, but it is still not in its best shape.

It is strong on:

- reviewer-family parallelism
- contradiction-aware synthesis
- notes-first durability
- final validation and human-facing handoff

It is weaker than elite human review practice in the areas that determine whether the review is even operating on the right target and the right context:

- identifying the actual MR / PR rather than just a diff
- finding the true review boundary and merge base
- handling stacked branches and inherited changes
- reconstructing author intent from tickets and docs
- filtering noise and low-signal churn before deep review begins
- degrading gracefully when discovery surfaces are unavailable
- explaining to the user what the workflow could and could not access

## Design Goal

Redesign the workflow so that it behaves more like a top-tier human reviewer:

1. find the real review target
2. find the real review boundary
3. gather all realistically available context
4. adapt review rigor to the shape of the change
5. disclose confidence, uncertainty, and environment limitations clearly

## Product Decisions

This section records the current recommended product direction for the redesign. These are stronger than brainstorming notes, but still revisable if later review reveals a better shape.

### The workflow should try to find the actual MR, not just inspect local changes

The workflow should explicitly prefer discovering the actual MR / PR when possible.

Why:

- the PR often contains intent, scope, linked issues, reviewer discussion, and branch context that a raw diff does not
- PR metadata can help identify the correct review boundary
- strong human reviewers usually review the authored change in its review context, not just the visible patch

Recommended behavior:

- if the user already provides a PR URL or identifier, use that as the first-class review target
- if the user provides only a branch, patch, or vague review request, attempt to discover the corresponding MR / PR
- if no MR / PR can be found, continue with diff-based review rather than block
- if the review remains diff-only, lower context confidence and disclose the limitation in final handoff
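
The recommended behavior above can be sketched as a small fallback chain. This is an illustrative sketch only, not WorkRail code: `ReviewTarget`, `resolve_review_target`, the `discover_pr` callback, and the two-level confidence scale are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReviewTarget:
    kind: str                # "pr" or "diff"
    reference: str           # PR URL/identifier, or the branch/patch the user gave
    context_confidence: str  # hypothetical two-level scale: "high" or "low"
    disclosure: Optional[str] = None  # limitation to surface in the final handoff


def resolve_review_target(
    user_input: str,
    pr_reference: Optional[str],
    discover_pr: Callable[[str], Optional[str]],
) -> ReviewTarget:
    """Prefer an explicit PR, then a discovered PR, then fall back to diff-only."""
    # 1. A PR URL or identifier supplied by the user is the first-class target.
    if pr_reference:
        return ReviewTarget("pr", pr_reference, "high")

    # 2. Otherwise attempt discovery; a failing discovery surface must not block.
    try:
        found = discover_pr(user_input)
    except Exception:
        found = None
    if found:
        return ReviewTarget("pr", found, "high")

    # 3. No MR/PR found: continue diff-only, lower confidence, record the limitation.
    return ReviewTarget(
        "diff",
        user_input,
        "low",
        "No MR/PR could be found; the review was performed on the diff only.",
    )
```

The discovery step is passed in as a callback so that the same fallback logic works whether discovery is backed by a CLI, an MCP tool, or nothing at all.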

### The workflow should try to discover ticket and document context

The workflow should attempt to recover ticket and supporting-document context when available.

Priority sources include:

- linked issues or tickets from the PR or commit messages
- repo-local docs such as BRDs, RFCs, specs, rollout docs, migration docs, and product notes
- external systems reachable through installed CLI tools or MCPs
- web-retrieved docs when browsing is available and genuinely useful

Why:

- code review quality depends heavily on whether the implementation matches intended behavior, non-goals, constraints, and rollout expectations
- strong reviewers compare code against intent, not just against syntax and local correctness

Recommended behavior:

- attempt discovery opportunistically and non-blockingly
- extract intent, acceptance criteria, non-goals, risks, and rollout expectations into durable workflow context
- if ticket/doc context is missing, continue the review but lower context confidence

### Missing enrichment sources should not block by default

The workflow should degrade gracefully when discovery surfaces are missing or insufficient.

Recommended behavior:

- do not block merely because `gh`, ticket MCPs, or web browsing are unavailable
- only block when the review target itself is missing or when the workflow cannot inspect enough material to review anything meaningful
- record what was accessible, what failed, what was not attempted, and why
- surface improvement suggestions in the final handoff, not as noisy mid-workflow nags unless the missing source is critical

### The workflow should be never-stop by default for enrichment and confidence gaps

The redesign should make "continue with degraded confidence" the default behavior.

That means the workflow should not stop merely because it cannot:

- find the PR/MR
- find the ticket
- find supporting docs
- use `gh`
- use web browsing
- use ticket-system MCPs
- confidently establish the merge base on the first attempt

It should stop only when:

- there is no meaningful review target
- there is no inspectable material to review
- the user must provide a missing artifact that cannot be recovered any other way

This should be treated as a core product requirement, not a soft preference.
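
One way to make the never-stop rule hard to violate is to express the stop conditions as a single predicate that deliberately ignores enrichment failures. The sketch below is hypothetical: `ReviewState` and its field names are assumptions made for illustration, not part of the workflow schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewState:
    has_review_target: bool
    has_inspectable_material: bool
    # True only when the user must supply an artifact that cannot be recovered otherwise.
    unrecoverable_missing_artifact: bool = False
    # Enrichment misses (PR, ticket, docs, gh, web, ticket MCPs, merge base) are
    # recorded for disclosure but never consulted by should_stop().
    enrichment_failures: List[str] = field(default_factory=list)


def should_stop(state: ReviewState) -> bool:
    """Return True only for the three stop conditions listed above."""
    return (
        not state.has_review_target
        or not state.has_inspectable_material
        or state.unrecoverable_missing_artifact
    )
```

Because `enrichment_failures` never enters the predicate, adding a new enrichment source cannot accidentally introduce a new blocking condition.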

### The final handoff should include an explicit environment-status section

The workflow should end with a user-visible summary of review environment quality.

That section should explain:

- what the workflow successfully accessed
- what it attempted but could not access
- what it never attempted
- how those gaps affected boundary confidence, context confidence, or final recommendation confidence
- what tooling or workflow habits would improve future reviews

This is important because graceful degradation only helps the user if they can actually see what degraded.
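
As a rough illustration, the five items above map naturally onto a small record that is rendered into the handoff. The `EnvironmentStatus` type, its field names, and the output layout are assumptions for this sketch; the design does not prescribe a format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnvironmentStatus:
    accessed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)
    not_attempted: List[str] = field(default_factory=list)
    confidence_impact: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


def render_environment_status(status: EnvironmentStatus) -> str:
    """Render every section, including empty ones, so gaps stay visible."""
    sections = [
        ("Accessed", status.accessed),
        ("Attempted but unavailable", status.failed),
        ("Not attempted", status.not_attempted),
        ("Confidence impact", status.confidence_impact),
        ("Suggested improvements", status.suggestions),
    ]
    lines = ["Environment status"]
    for title, items in sections:
        # An explicit "none" keeps the reader from mistaking omission for success.
        lines.append(f"- {title}: {', '.join(items) if items else 'none'}")
    return "\n".join(lines)
```

Rendering empty sections as "none" rather than dropping them is a deliberate choice here: the section exists to show what degraded, so silence should never be ambiguous.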

### Ancestor and merge-base handling must become first-class

The redesign should explicitly treat review-boundary detection as a core responsibility, not a hidden subtask.

Recommended behavior:

- determine candidate base branch or parent branch
- attempt to find the true merge base / ancestor
- detect stacked branches, stale branches, and divergent branches
- separate branch-specific changes from inherited or upstream changes
- produce an explicit boundary-confidence assessment

If confidence remains low:

- continue with best effort
- lower recommendation confidence when boundary uncertainty materially affects findings
- disclose the uncertainty clearly in final handoff
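
To make the boundary problem concrete, here is a toy model over an in-memory commit graph. In a real repository this information would come from Git itself (for example `git merge-base`); the `Parents` mapping and the helper names below are hypothetical and exist only to show why the choice of base changes the review surface for a stacked branch.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Toy commit graph: commit id -> list of parent commit ids.
Parents = Dict[str, List[str]]


def ancestors(commit: str, parents: Parents) -> Set[str]:
    """All commits reachable from `commit`, including itself."""
    seen: Set[str] = set()
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def nearest_common_ancestor(head: str, base: str, parents: Parents) -> Optional[str]:
    """First commit found walking back from `head` that `base` can also reach."""
    base_ancestors = ancestors(base, parents)
    visited: Set[str] = set()
    queue = deque([head])
    while queue:
        current = queue.popleft()
        if current in base_ancestors:
            return current
        if current in visited:
            continue
        visited.add(current)
        queue.extend(parents.get(current, []))
    return None


def branch_specific_commits(head: str, base: str, parents: Parents) -> Set[str]:
    """Commits reachable from `head` but not from `base` (the review surface)."""
    return ancestors(head, parents) - ancestors(base, parents)
```

With a stacked branch, comparing against the trunk pulls in the parent branch's commits as if they were authored in this MR, while comparing against the parent branch isolates the branch-specific work. That difference is what the boundary-confidence assessment needs to capture.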

### The workflow should adapt by change type, not just size

The redesign should keep QUICK / STANDARD / THOROUGH, but add structured change-shape adaptation.

This should influence:

- reviewer-family selection
- validation depth
- whether simulation or rollout analysis is needed
- how much false-positive suppression is required
- how much boundary follow-up is warranted
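
A minimal sketch of shape-based routing follows. Only the QUICK / STANDARD / THOROUGH modes come from this document; the shape labels, the reviewer-family names, and the rule for when rollout analysis is needed are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Hypothetical routing table: change shape -> reviewer families.
FAMILIES_BY_SHAPE: Dict[str, Tuple[str, ...]] = {
    "api_change": ("compatibility", "correctness"),
    "migration": ("correctness", "rollout"),
    "refactor": ("correctness", "maintainability"),
    "config_edit": ("rollout",),
    "test_only": ("test_quality",),
    "docs_only": ("docs",),
    "security_sensitive": ("correctness", "security"),
    "performance_sensitive": ("correctness", "performance"),
}

# Shapes assumed (for this sketch) to warrant rollout analysis outside QUICK mode.
ROLLOUT_SHAPES = {"api_change", "migration"}


@dataclass(frozen=True)
class ReviewPlan:
    mode: str
    reviewer_families: Tuple[str, ...]
    needs_rollout_analysis: bool


def plan_review(mode: str, shapes: Iterable[str]) -> ReviewPlan:
    """Combine the rigor mode with the detected change shapes."""
    if mode not in ("QUICK", "STANDARD", "THOROUGH"):
        raise ValueError(f"unknown mode: {mode}")
    shape_set = set(shapes)
    # Unknown shapes fall back to a generic correctness review.
    families = sorted(
        {f for s in shape_set for f in FAMILIES_BY_SHAPE.get(s, ("correctness",))}
    )
    needs_rollout = mode != "QUICK" and bool(shape_set & ROLLOUT_SHAPES)
    return ReviewPlan(mode, tuple(families), needs_rollout)
```

Keeping the table small matters more than the specific entries: a shape earns a row only if it changes which reviewers run or how deep validation goes.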

### The workflow should account for repo rules, user preferences, and coding philosophy

The redesign should treat review-policy context as first-class review input.

That includes:

- repo conventions
- user-specified rules and preferences
- architecture guidance
- coding philosophy
- review-specific instructions
- any explicit team or project constraints discoverable from docs, workflow guidance, or user-provided rules

Why:

- a technically sharp review can still be wrong for the repo if it ignores the user’s rules and architectural preferences
- strong human reviewers adapt their evaluation to the standards of the codebase they are reviewing

Recommended behavior:

- attempt to discover policy context alongside ticket/doc context
- extract durable summaries of rules, conventions, and constraints into workflow context
- use this policy context to calibrate findings, recommendations, and false-positive suppression

### This redesign is necessary, but not sufficient for elite human parity

This redesign would make the workflow materially stronger and much more trustworthy.

However, even after these changes, it will likely still lag the best human reviewers in:

- severity calibration
- historical reasoning
- large-review partitioning
- subtle product judgment under weak evidence

That means this redesign should be treated as a major step toward high-quality review, not the final endpoint.

## Non-Goals

- changing engine behavior
- adding new MCP tool implementations in this pass
- forcing the workflow to block whenever enrichment sources are missing
- making the human-facing review doc canonical workflow state
- fully solving severity calibration, historical reasoning, or large-MR partitioning in the same pass

## WorkRail-Native Authoring Opportunities We Should Use

The redesign should take fuller advantage of WorkRail's current v2 surface area instead of expressing everything as plain prompt prose.

### Features with config, not just simple toggles

The workflow should likely use:

- `wr.features.mode_guidance`
- `wr.features.durable_recap_guidance`
- `wr.features.capabilities`
- `wr.features.output_contracts`

And it should consider using the configurable form where helpful, especially for:

- collapsed capability probes
- artifact-backed capability observations
- consistent enforcement of output contracts across blocking and never-stop modes

### Template-anchored capability probes

Where the workflow needs to learn whether `delegation` or `web_browsing` is actually usable, it should prefer explicit template-anchored probing over handwritten duplicated probe prose.

The clearest existing fit is:

- `wr.templates.capability_probe`

paired with:

- `wr.contracts.capability_observation`

This is especially relevant for the early enrichment phase.

### Prompt refs instead of duplicated guidance

The redesign should plan to use `wr.refs.*` snippets for repeated canonical guidance rather than copying the same durable-state or synthesis instructions into many steps.

High-value likely uses include:

- notes-first durability guidance
- synthesis-under-disagreement guidance
- parallelize-cognition / serialize-synthesis guidance
- adversarial challenge guidance

### PromptBlocks as the default step shape

The workflow should favor `promptBlocks` over large single-string prompts for major phases.

That makes it easier to:

- keep the prompts deterministic
- expose clear `goal`, `constraints`, `procedure`, `outputRequired`, and `verify` structure
- attach reusable references and future feature injections cleanly

### Conditions and loop contracts as first-class control flow

The redesign already wants loop-based contradiction and follow-up handling. It should lean into current v2 patterns more explicitly by:

- defining named conditions where that improves clarity
- using `wr.contracts.loop_control` consistently for loop decisions
- treating loop continuation as data, not prose
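
"Loop continuation as data" can be illustrated with a tiny decision record. This is a conceptual sketch only: it does not reproduce the actual shape of `wr.contracts.loop_control`, and `LoopDecision`, `decide_followup`, and their fields are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopDecision:
    decision: str  # "continue" or "stop"
    reason: str    # short, durable explanation of why


def decide_followup(
    open_contradictions: int, iteration: int, max_iterations: int
) -> LoopDecision:
    """Return a structured loop decision instead of burying it in prose."""
    if open_contradictions == 0:
        return LoopDecision("stop", "no open contradictions")
    if iteration >= max_iterations:
        # Stopping with contradictions still open should be disclosed, not hidden.
        return LoopDecision(
            "stop", "iteration budget exhausted with contradictions still open"
        )
    return LoopDecision("continue", f"{open_contradictions} contradiction(s) still open")
```

A structured decision like this can be checked and replayed on resume, which a sentence such as "another pass seems worthwhile" cannot.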

### Decision-trace and never-stop semantics awareness

The workflow should be written with WorkRail's durable `blocked` / `gap_recorded` semantics in mind.

That means:

- blocking vs never-stop should be intentional per capability/input requirement
- missing preferred capabilities should degrade with durable disclosure
- important confidence-relevant misses should be representable as explicit gaps, not only narrative caveats

### Auditor-style delegation, not only executor-style delegation

The subagent design docs strongly support the auditor model.

The MR review redesign should make fuller use of that by treating many delegations as:

- audits of the main agent's gathered context
- challenges to the main agent's current hypothesis
- verification of the current recommendation

rather than always delegating broad independent ownership of a phase.

This is especially valuable for:

- context completeness / depth audits
- boundary-confidence audits
- philosophy-alignment audits
- final recommendation validation

### Routine reuse should be explicit

The redesign currently references a few routines conceptually, but it should make clearer use of the current routine catalog.

High-value candidates include:

- `routine-context-gathering`
- `routine-hypothesis-challenge`
- `routine-execution-simulation`
- `routine-philosophy-alignment`
- `routine-final-verification`

These should be treated as current reusable building blocks, not future ideas.

### Direct execution vs delegation vs injection should be chosen deliberately

The routines guide gives three valid consumption modes:

- delegation to a WorkRail Executor
- direct execution by the current agent
- compile-time injection via routine templates

The redesign should decide per use case:

- delegate when independent cognitive perspective is valuable
- execute directly when overhead is unnecessary
- inject when step visibility, confirmation behavior, and session traceability matter

### Extension points can improve customization without weakening orchestration

The redesign previously deferred extension points for readability. That was reasonable, but the current WorkRail extension-point model is strong enough that we should explicitly plan where bounded customization would add value.

The best candidates appear to be:

- reviewer-family bundle policy
- philosophy-alignment review
- final verification

The parent workflow should still own sequencing, loop control, and canonical synthesis.

### AgentRole is underused

The redesign should consider a stronger workflow-level `agentRole` and selective step-level overrides for:

- boundary detective mode
- evidence-first synthesizer mode
- adversarial validator mode
- philosophy auditor mode

This is lower leverage than control flow and routines, but still worth using intentionally.

## Structure-Balance Framework

The redesign should optimize for structured freedom rather than either extreme:

- not a loose "trust the model" review flow
- not a rigid form-filling bureaucracy

The workflow should be rigid where determinism, safety, or honesty matter, and adaptive where LLM reasoning quality matters most.

### Keep rigid

These are the parts that should stay explicitly structured and hard to skip:

- phase boundaries
- minimum required outputs before advancing
- confidence reporting
- loop / follow-up triggers
- blocked vs never-stop semantics
- final handoff sections
- explicit disclosure of gaps and unknowns
- the rule that reviewer/subagent output is evidence, not canonical truth

These are the workflow invariants. They prevent omission, hidden drift, and fake certainty.

### Keep semi-structured

These should have strong guidance and matrices, but not exhaustive decision automation:

- shape/type routing
- confidence combination rules
- severity calibration
- artifact vs context split
- when to delegate vs inject vs execute directly
- when policy-context should materially affect findings

These are the parts where structured heuristics help, but judgment still matters.

### Keep adaptive

These should deliberately leave room for model creativity and non-obvious reasoning:

- exploration order
- which evidence sources seem most promising first
- how to connect clues across PR, code, docs, history, and repo patterns
- how to synthesize multiple weak signals into a coherent concern
- how to phrase findings for maximum clarity and usefulness
- when an unusual MR deserves extra scrutiny beyond the default routing heuristics

These are the parts where LLMs can outperform rigid scripts.

### Matrix and field admission rule

A matrix, field, or ledger element earns its place only if it does at least one of these:

- prevents a real recurring failure mode
- improves deterministic control flow or resumability
- improves user-visible honesty or explainability
- changes routing or review depth in a meaningful way

If it does none of those, it should be removed or downgraded to advisory guidance.

### Preferred design bias

When in doubt:

- constrain outcomes, not cognition
- require explicit state, not rigid thought order
- keep taxonomies small
- prefer a few high-value matrices over many low-value classifications
- use structure to prevent omission, not to suppress intelligent exploration

### Practical consequence for this redesign

This means:

- keep the confidence matrix
- keep the gap / non-blocking matrix
- keep the shape/type routing matrix
- keep the artifact vs context split
- avoid exploding shape/type categories beyond what actually changes behavior
- avoid adding ledgers or flags that do not affect routing, honesty, or final quality

## Current Workflow Gaps

### Review boundary correctness

The current workflow does not make review-boundary detection a first-class responsibility.

Missing or under-specified behavior:

- determine the actual PR/MR when possible
- identify candidate base branches
- find the true merge base / ancestor
- detect stacked branches
- detect stale or divergent branches
- separate branch-specific changes from inherited changes
- explain confidence in the chosen review surface

This is the highest-priority gap because a review can be thorough and still be wrong if it reviews the wrong surface.

### Source discovery and context enrichment

The current workflow asks for MR purpose and ticket context, but it does not strongly instruct the agent to discover them from all available sources.

Missing or under-specified behavior:

- discover the actual PR body and metadata when available
- discover linked ticket / issue context
- discover repo-local specs, RFCs, design docs, rollout docs, and acceptance criteria
- search commit messages, branch names, and nearby docs for intent clues
- use web or other external sources only when available and useful

### Capability-aware graceful degradation

The current workflow assumes tool-driven discovery in spirit, but it does not explicitly model discovery-surface availability or insufficiency.

Missing or under-specified behavior:

- probe whether GitHub CLI access is available
- probe whether ticket-system access exists
- probe whether web browsing is available
- probe whether repo-local docs exist and are discoverable
- record unavailable or insufficient sources without failing the whole review
|
|
470
|
+
|
|
471
|
+
### Review-surface hygiene
|
|
472
|
+
|
|
473
|
+
The current workflow moves from context gathering to review too quickly.
|
|
474
|
+
|
|
475
|
+
Missing or under-specified behavior:
|
|
476
|
+
|
|
477
|
+
- classify generated files
|
|
478
|
+
- classify mechanical churn
|
|
479
|
+
- classify rename-only or move-only changes
|
|
480
|
+
- classify likely inherited upstream changes
|
|
481
|
+
- classify out-of-scope or low-signal material
|
|
482
|
+
- focus the fact packet on the true review surface rather than all visible changes equally
|
|
483
|
+
|
|
484
|
+
### Adaptation by change shape
|
|
485
|
+
|
|
486
|
+
The current workflow adapts mostly by review size and risk.
|
|
487
|
+
|
|
488
|
+
Missing or under-specified behavior:
|
|
489
|
+
|
|
490
|
+
- adapt reviewer-family selection by change type
|
|
491
|
+
- distinguish API changes from migrations, refactors, config edits, test-only changes, docs-only changes, security-sensitive changes, and performance-sensitive changes
|
|
492
|
+
- increase boundary rigor when ancestry is ambiguous
|
|
493
|
+
- reduce over-review for clearly mechanical or low-risk changes
|
|
494
|
+
|
|
495
|
+
### Final disclosure
|
|
496
|
+
|
|
497
|
+
The current final handoff does not strongly require the workflow to explain:
|
|
498
|
+
|
|
499
|
+
- what it successfully accessed
|
|
500
|
+
- what it attempted but could not access
|
|
501
|
+
- what it never attempted
|
|
502
|
+
- how those limits affected review quality
|
|
503
|
+
- what environment improvements would make future reviews stronger
|
|
504
|
+
|
|
505
|
+
## Target Design Principles
|
|
506
|
+
|
|
507
|
+
### Correctness before depth
|
|
508
|
+
|
|
509
|
+
A shallow review on the right boundary is better than a deep review on the wrong boundary.
|
|
510
|
+
|
|
511
|
+
### Discover first, ask second
|
|
512
|
+
|
|
513
|
+
The workflow should aggressively use available tools and sources before asking the user for missing information.
|
|
514
|
+
|
|
515
|
+
### Degrade gracefully
|
|
516
|
+
|
|
517
|
+
Missing enrichment sources should lower confidence and be disclosed, not automatically block the workflow.
|
|
518
|
+
|
|
519
|
+
### Evidence over assumptions
|
|
520
|
+
|
|
521
|
+
The workflow should explicitly distinguish:
|
|
522
|
+
|
|
523
|
+
- directly observed facts
|
|
524
|
+
- inferred context
|
|
525
|
+
- missing evidence
|
|
526
|
+
- contradictory evidence
|
|
527
|
+
|
|
528
|
+
### Human-readable truth, workflow-owned truth
|
|
529
|
+
|
|
530
|
+
Human-facing artifacts are useful, but durable workflow truth remains in notes and explicit context fields.
|
|
531
|
+
|
|
532
|
+
### Review the change that matters
|
|
533
|
+
|
|
534
|
+
The workflow should separate core review surface from noise before deep analysis begins.
|
|
535
|
+
|
|
536
|
+
### Honest confidence over false certainty
|
|
537
|
+
|
|
538
|
+
The workflow should prefer saying "I could not confidently establish the boundary" over quietly pretending it found the right ancestor.
|
|
539
|
+
|
|
540
|
+
## Proposed Workflow Shape
|
|
541
|
+
|
|
542
|
+
## Phase 0: Locate, Bound, Enrich, and Classify
|
|
543
|
+
|
|
544
|
+
This phase replaces the current front-half flow.
|
|
545
|
+
|
|
546
|
+
It should execute five structured sub-steps.
|
|
547
|
+
|
|
548
|
+
### 0.1 Locate the review target
|
|
549
|
+
|
|
550
|
+
Determine, when possible:
|
|
551
|
+
|
|
552
|
+
- PR/MR URL or number
|
|
553
|
+
- branch name
|
|
554
|
+
- HEAD SHA
|
|
555
|
+
- diff source type
|
|
556
|
+
- whether the user provided:
|
|
557
|
+
- PR URL
|
|
558
|
+
- branch
|
|
559
|
+
- patch
|
|
560
|
+
- local diff
|
|
561
|
+
- only a vague review request
|
|
562
|
+
|
|
563
|
+
Recommended decision:
|
|
564
|
+
|
|
565
|
+
- if a discoverable PR/MR exists, treat it as the primary review target
|
|
566
|
+
- if no PR/MR exists or can be found, fall back to branch or diff review without blocking
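
The recommended decision above can be sketched as a small resolver. This is only an illustration: `ReviewTarget`, `resolve_review_target`, and the fallback order among branch, patch, and local diff are assumptions, not part of the current workflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewTarget:
    # "pr", "branch", "patch", "local_diff", or "unknown" (hypothetical labels)
    kind: str
    reference: Optional[str]


def resolve_review_target(
    pr_ref: Optional[str] = None,
    branch: Optional[str] = None,
    patch_path: Optional[str] = None,
    has_local_diff: bool = False,
) -> ReviewTarget:
    """Prefer a discoverable PR/MR; otherwise fall back without blocking."""
    if pr_ref:
        return ReviewTarget("pr", pr_ref)
    # Assumed fallback order; the design only requires that fallback not block.
    if branch:
        return ReviewTarget("branch", branch)
    if patch_path:
        return ReviewTarget("patch", patch_path)
    if has_local_diff:
        return ReviewTarget("local_diff", None)
    # Only a vague review request: proceed, but record that nothing was located.
    return ReviewTarget("unknown", None)
```

The resolver never raises, so a missing PR/MR lowers what later phases know without stopping the review.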
|
|
567
|
+
|
|
568
|
+
### 0.2 Find the true review boundary
|
|
569
|
+
|
|
570
|
+
Attempt to determine:
|
|
571
|
+
|
|
572
|
+
- candidate base branch
|
|
573
|
+
- merge base / ancestor
|
|
574
|
+
- whether the branch is stacked
|
|
575
|
+
- whether the branch is stale or divergent
|
|
576
|
+
- exact commits under review
|
|
577
|
+
- exact files under review
|
|
578
|
+
- inherited changes to exclude
|
|
579
|
+
- why the workflow believes this is the correct review surface
|
|
580
|
+
|
|
581
|
+
If the workflow cannot establish this confidently, it should:
|
|
582
|
+
|
|
583
|
+
- continue with best effort
|
|
584
|
+
- lower boundary confidence
|
|
585
|
+
- record warnings
|
|
586
|
+
- disclose the uncertainty in final handoff
|
|
587
|
+
|
|
588
|
+
This phase should be considered incomplete if it does not at least attempt merge-base / ancestor reasoning.
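
One way to make the merge-base attempt concrete is a hedged heuristic: pick the candidate base with the fewest commits between it and HEAD, and lower confidence when candidates tie. The names `BoundaryResult` and `choose_base`, the confidence bands, and the heuristic itself are assumptions for illustration. The input counts could be gathered with `git rev-list --count <candidate>..HEAD`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BoundaryResult:
    base_candidate: Optional[str]
    stacked_branch_suspected: bool
    confidence: str  # "high", "low", or "none" (illustrative bands)
    warnings: List[str] = field(default_factory=list)


def choose_base(commits_ahead: Dict[str, int], default_branch: str = "main") -> BoundaryResult:
    """Pick the candidate with the fewest commits on HEAD that it lacks.

    `commits_ahead` maps each candidate base branch to that count.
    """
    if not commits_ahead:
        return BoundaryResult(None, False, "none", ["no base candidates could be resolved"])

    smallest = min(commits_ahead.values())
    closest = sorted(name for name, count in commits_ahead.items() if count == smallest)
    warnings: List[str] = []
    confidence = "high"

    if len(closest) > 1:
        # Ambiguous ancestry: continue with best effort, but record the doubt.
        warnings.append("ambiguous base: " + ", ".join(closest) + " are equally close")
        confidence = "low"

    base = default_branch if default_branch in closest else closest[0]
    stacked = base != default_branch
    if stacked:
        warnings.append(f"closest base is {base}, not {default_branch}; branch may be stacked")
    return BoundaryResult(base, stacked, confidence, warnings)
```

The function always returns a result with warnings rather than raising, which matches the "continue with best effort" behavior described above.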
|
|
589
|
+
|
|
590
|
+
### 0.3 Discover enrichments
|
|
591
|
+
|
|
592
|
+
Attempt to discover:
|
|
593
|
+
|
|
594
|
+
- PR metadata and body
|
|
595
|
+
- ticket / issue context
|
|
596
|
+
- repo-local product or design docs
|
|
597
|
+
- repo-local rules, conventions, and project guidance
|
|
598
|
+
- RFCs and specs
|
|
599
|
+
- rollout or migration docs
|
|
600
|
+
- acceptance criteria
|
|
601
|
+
- product risks and non-goals
|
|
602
|
+
|
|
603
|
+
The workflow should explicitly prefer recovering this context itself before asking the user for it.
|
|
604
|
+
|
|
605
|
+
Preferred discovery order:
|
|
606
|
+
|
|
607
|
+
1. direct CLI / MCP surfaces
|
|
608
|
+
2. repo-local docs and links
|
|
609
|
+
3. branch names and commit messages
|
|
610
|
+
4. PR body and issue links
|
|
611
|
+
5. nearby documentation by naming convention
|
|
612
|
+
6. web or browser access when available
|
|
613
|
+
|
|
614
|
+
The workflow should treat missing enrichments as confidence-relevant, not as automatic failure conditions.
|
|
615
|
+
|
|
616
|
+
It should treat policy-context discovery as part of enrichment, not as a separate optional nicety.
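
The preferred discovery order and the "missing enrichments are not failures" rule could be sketched as follows. `discover_in_order` and the `Discoverer` type are hypothetical helpers; each discoverer stands in for whatever tool call the agent actually chooses.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A discoverer returns recovered context text, or None when it finds nothing.
Discoverer = Callable[[], Optional[str]]


def discover_in_order(
    sources: List[Tuple[str, Discoverer]],
) -> Tuple[Dict[str, str], List[str]]:
    """Try every source in the given order; a miss becomes a gap, never an error."""
    found: Dict[str, str] = {}
    gaps: List[str] = []
    for name, discover in sources:
        try:
            result = discover()
        except Exception as exc:
            # A failing tool is confidence-relevant, not fatal.
            gaps.append(f"{name}: {exc}")
            continue
        if result:
            found[name] = result
        else:
            gaps.append(f"{name}: nothing found")
    return found, gaps
```

The returned `gaps` list is what later phases would use to lower context confidence and to populate the final disclosure.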
|
|
617
|
+
|
|
618
|
+
### 0.4 Probe capability availability lazily
|
|
619
|
+
|
|
620
|
+
Attempt to determine the availability or insufficiency of the following, blocking only when correctness requires it:
|
|
621
|
+
|
|
622
|
+
- `delegation`
|
|
623
|
+
- `web_browsing`
|
|
624
|
+
- GitHub / PR CLI access
|
|
625
|
+
- ticket-system access
|
|
626
|
+
- repo-local docs access
|
|
627
|
+
- relevant attached artifacts
|
|
628
|
+
|
|
629
|
+
For workflow-global capabilities such as `delegation` and `web_browsing`, this should align with the v2 capability-observation model rather than inventing a custom side channel. Where useful, this likely means using existing patterns such as `wr.templates.capability_probe` and `wr.contracts.capability_observation`.
|
|
630
|
+
|
|
631
|
+
Discovery surfaces beyond first-class workflow capabilities, such as GitHub CLI, ticket systems, repo-local docs, or attached artifacts, should still be recorded durably as structured observations even if they are not modeled as top-level capability enums.
|
|
632
|
+
|
|
633
|
+
Each probed source should be recorded structurally as one of:
|
|
634
|
+
|
|
635
|
+
- `available`
|
|
636
|
+
- `unavailable`
|
|
637
|
+
- `not_attempted`
|
|
638
|
+
- `attempted_but_insufficient`
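
A minimal sketch of lazy probing with these four statuses is shown here. It does not use `wr.templates.capability_probe` or `wr.contracts.capability_observation`, whose shapes are not shown in this document; `ProbeStatus` and `LazyProbes` are hypothetical names.

```python
from enum import Enum
from typing import Callable, Dict


class ProbeStatus(Enum):
    AVAILABLE = "available"
    UNAVAILABLE = "unavailable"
    NOT_ATTEMPTED = "not_attempted"
    ATTEMPTED_BUT_INSUFFICIENT = "attempted_but_insufficient"


class LazyProbes:
    """Runs each probe at most once, and only when its status is first requested."""

    def __init__(self, probes: Dict[str, Callable[[], ProbeStatus]]) -> None:
        self._probes = probes
        self._results: Dict[str, ProbeStatus] = {}

    def status(self, source: str) -> ProbeStatus:
        if source not in self._results:
            probe = self._probes[source]  # KeyError for an unregistered source
            try:
                self._results[source] = probe()
            except Exception:
                # Assumption: a probe that crashes counts as an unavailable source.
                self._results[source] = ProbeStatus.UNAVAILABLE
        return self._results[source]

    def snapshot(self) -> Dict[str, str]:
        """Report every registered source; unprobed ones stay `not_attempted`."""
        return {
            name: self._results.get(name, ProbeStatus.NOT_ATTEMPTED).value
            for name in self._probes
        }
```

Because unprobed sources are reported as `not_attempted`, the snapshot can distinguish "never tried" from "tried and failed" without extra bookkeeping.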
|
|
639
|
+
|
|
640
|
+
Where it keeps the final workflow authoring readable, the preferred implementation path is:
|
|
641
|
+
|
|
642
|
+
- `wr.features.capabilities` with collapsed probe visibility
|
|
643
|
+
- `wr.templates.capability_probe` for first-class capability checks
|
|
644
|
+
- artifact-backed recording via `wr.contracts.capability_observation`
|
|
645
|
+
|
|
646
|
+
### 0.5 Classify
|
|
647
|
+
|
|
648
|
+
Classify based on:
|
|
649
|
+
|
|
650
|
+
- change size
|
|
651
|
+
- change shape
|
|
652
|
+
- change type
|
|
653
|
+
- risk level
|
|
654
|
+
- context completeness
|
|
655
|
+
- boundary confidence
|
|
656
|
+
- review-surface cleanliness
|
|
657
|
+
|
|
658
|
+
Set:
|
|
659
|
+
|
|
660
|
+
- `reviewMode`
|
|
661
|
+
- `shapeProfile`
|
|
662
|
+
- `riskLevel`
|
|
663
|
+
- `changeTypeProfile`
|
|
664
|
+
- `boundaryConfidence`
|
|
665
|
+
- `contextConfidence`
|
|
666
|
+
- `maxParallelism`
|
|
667
|
+
- `needsReviewerBundle`
|
|
668
|
+
- `needsSimulation`
|
|
669
|
+
- `needsBoundaryFollowup`
|
|
670
|
+
- `needsContextFollowup`
|
|
671
|
+
- `needsAuditorPass`
|
|
672
|
+
|
|
673
|
+
## Phase 1: State Initial Review Hypothesis
|
|
674
|
+
|
|
675
|
+
This phase stays, but it should now be informed by:
|
|
676
|
+
|
|
677
|
+
- review boundary certainty
|
|
678
|
+
- source ledger findings
|
|
679
|
+
- discovered intent and acceptance criteria
|
|
680
|
+
- discovered policy context
|
|
681
|
+
- change-shape classification
|
|
682
|
+
- change-type classification
|
|
683
|
+
|
|
684
|
+
The agent should state:
|
|
685
|
+
|
|
686
|
+
- current recommendation direction
|
|
687
|
+
- primary concern area
|
|
688
|
+
- what evidence would most likely overturn the current view
|
|
689
|
+
- whether the largest risk is code correctness, review-boundary uncertainty, or missing context
|
|
690
|
+
|
|
691
|
+
## Phase 2: Build Fact Packet and Review-Surface Ledger
|
|
692
|
+
|
|
693
|
+
The current fact-packet idea remains useful, but it should be expanded.
|
|
694
|
+
|
|
695
|
+
The workflow should build both:
|
|
696
|
+
|
|
697
|
+
- `reviewFactPacket`
|
|
698
|
+
- `reviewSurfaceLedger`
|
|
699
|
+
|
|
700
|
+
### `reviewFactPacket`
|
|
701
|
+
|
|
702
|
+
Should include:
|
|
703
|
+
|
|
704
|
+
- MR title and purpose
|
|
705
|
+
- intended behavior change
|
|
706
|
+
- non-goals if discoverable
|
|
707
|
+
- ticket and doc-derived constraints
|
|
708
|
+
- repo and user policy constraints
|
|
709
|
+
- acceptance criteria
|
|
710
|
+
- affected modules, contracts, invariants, and consumers
|
|
711
|
+
- tests, rollout expectations, and migration expectations
|
|
712
|
+
- unresolved unknowns
|
|
713
|
+
|
|
714
|
+
### `reviewSurfaceLedger`
|
|
715
|
+
|
|
716
|
+
Should include:
|
|
717
|
+
|
|
718
|
+
- exact review boundary description
|
|
719
|
+
- included commits
|
|
720
|
+
- excluded inherited commits
|
|
721
|
+
- core review surface files
|
|
722
|
+
- generated files
|
|
723
|
+
- mechanical churn
|
|
724
|
+
- rename-only / move-only files
|
|
725
|
+
- low-signal or out-of-scope files
|
|
726
|
+
- review-scope warnings
|
|
727
|
+
|
|
728
|
+
This step should also initialize a stronger coverage model and decide reviewer families using both change size and change type.
|
|
729
|
+
|
|
730
|
+
It should additionally record whether the review is operating with:
|
|
731
|
+
|
|
732
|
+
- strong boundary confidence
|
|
733
|
+
- weak boundary confidence
|
|
734
|
+
- strong intent/context confidence
|
|
735
|
+
- weak intent/context confidence
|
|
736
|
+
|
|
737
|
+
so later phases can adapt accordingly.
|
|
738
|
+
|
|
739
|
+
It should also persist whether policy-context confidence is:
|
|
740
|
+
|
|
741
|
+
- strong enough to evaluate against repo/user expectations
|
|
742
|
+
- weak enough that findings should be presented more cautiously
|
|
743
|
+
|
|
744
|
+
This phase is also a good place for an auditor-style context quality pass:
|
|
745
|
+
|
|
746
|
+
- a completeness-focused audit
|
|
747
|
+
- a depth-focused audit
|
|
748
|
+
|
|
749
|
+
If the workflow delegates these, they should audit the main agent's gathered packet rather than own the whole understanding phase.
|
|
750
|
+
|
|
751
|
+
## Phase 3: Adaptive Reviewer-Family Bundle
|
|
752
|
+
|
|
753
|
+
Reviewer-family delegation should be selected using:
|
|
754
|
+
|
|
755
|
+
- `reviewMode`
|
|
756
|
+
- `riskLevel`
|
|
757
|
+
- `shapeProfile`
|
|
758
|
+
- `changeTypeProfile`
|
|
759
|
+
- `boundaryConfidence`
|
|
760
|
+
- `contextConfidence`
|
|
761
|
+
|
|
762
|
+
Examples:
|
|
763
|
+
|
|
764
|
+
- test-only change: lighter architecture scrutiny, stronger false-positive suppression
|
|
765
|
+
- migration change: stronger rollout, compatibility, and data-integrity scrutiny
|
|
766
|
+
- security-sensitive change: stronger runtime and adversarial review
|
|
767
|
+
- ambiguous boundary: stronger boundary-validation or context follow-up
|
|
768
|
+
- large mixed-shape change: stronger partitioning instincts and more cautious confidence
|
|
769
|
+
- mechanically noisy change: stronger noise suppression and lower appetite for style-only findings
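
The examples above amount to a lookup from classification signals to review emphasis. The sketch below is hedged: `EMPHASIS_BY_SIGNAL`, `review_emphasis`, and the `"low"` confidence band are assumptions, and the table influences emphasis rather than fully determining reviewer choice.

```python
from typing import Dict, List

# Emphasis hints copied from the examples; the keys reuse the suggested
# change-type and shape category names from this document.
EMPHASIS_BY_SIGNAL: Dict[str, List[str]] = {
    "test_only": [
        "lighter architecture scrutiny",
        "stronger false-positive suppression",
    ],
    "data_model_or_migration": [
        "stronger rollout, compatibility, and data-integrity scrutiny",
    ],
    "security_sensitive": ["stronger runtime and adversarial review"],
    "mechanically_noisy_change": [
        "stronger noise suppression",
        "lower appetite for style-only findings",
    ],
}


def review_emphasis(change_types: List[str], shape: str, boundary_confidence: str) -> List[str]:
    """Collect emphasis hints; unknown signals add nothing rather than failing."""
    hints: List[str] = []
    for signal in [*change_types, shape]:
        hints.extend(EMPHASIS_BY_SIGNAL.get(signal, []))
    if boundary_confidence == "low":
        hints.append("stronger boundary-validation or context follow-up")
    return hints
```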
|
|
770
|
+
|
|
771
|
+
Reviewer families should still be evidence producers, not decision makers.
|
|
772
|
+
|
|
773
|
+
The redesign should also distinguish between:
|
|
774
|
+
|
|
775
|
+
- reviewer-family execution work
|
|
776
|
+
- auditor-style critique of the current synthesis
|
|
777
|
+
|
|
778
|
+
Both are useful, but they are not the same cognitive unit.
|
|
779
|
+
|
|
780
|
+
The workflow should further strengthen:
|
|
781
|
+
|
|
782
|
+
- explicit pre-delegation hypothesis
|
|
783
|
+
- explicit post-delegation synthesis
|
|
784
|
+
- explicit rejection of weak or overreaching findings
|
|
785
|
+
- explicit handling of missed-issue and false-positive signals
|
|
786
|
+
|
|
787
|
+
This phase should explicitly consider use of:
|
|
788
|
+
|
|
789
|
+
- `routine-hypothesis-challenge` for adversarial reviewer challenge
|
|
790
|
+
- `routine-execution-simulation` when runtime behavior or branch-sensitive behavior is material
|
|
791
|
+
- `routine-philosophy-alignment` when policy-context is important enough to affect recommendation quality
|
|
792
|
+
|
|
793
|
+
## Phase 4: Contradiction, Gap, and Boundary Resolution Loop
|
|
794
|
+
|
|
795
|
+
This should broaden the current contradiction loop into a more general resolution loop.
|
|
796
|
+
|
|
797
|
+
It should continue while any of the following remains materially unresolved:
|
|
798
|
+
|
|
799
|
+
- reviewer disagreement
|
|
800
|
+
- coverage uncertainty
|
|
801
|
+
- false-positive risk
|
|
802
|
+
- boundary uncertainty
|
|
803
|
+
- context insufficiency
|
|
804
|
+
|
|
805
|
+
Targeted follow-up should be minimal and focused. The workflow should avoid re-running broad discovery unless it learns that the original boundary or context assumptions were wrong.
|
|
806
|
+
|
|
807
|
+
This loop is also where the workflow should reopen:
|
|
808
|
+
|
|
809
|
+
- merge-base reasoning when ancestry assumptions were weak
|
|
810
|
+
- ticket/doc discovery when missing context materially affects recommendation quality
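
A rough sketch of the loop's control flow, under stated assumptions: the concern names, `run_resolution_loop`, and the `max_rounds` cap are hypothetical, and `follow_up` stands in for whatever targeted action the agent takes.

```python
from typing import Callable, Dict, List

UNRESOLVED_KINDS = (
    "reviewer_disagreement",
    "coverage_uncertainty",
    "false_positive_risk",
    "boundary_uncertainty",
    "context_insufficiency",
)


def run_resolution_loop(
    unresolved: Dict[str, bool],
    follow_up: Callable[[str], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Run targeted follow-ups until nothing material remains or the cap is hit.

    `follow_up(kind)` returns True when that concern was resolved.
    The return value lists concerns still open, for the final handoff.
    """
    open_kinds = [kind for kind in UNRESOLVED_KINDS if unresolved.get(kind)]
    for _ in range(max_rounds):
        if not open_kinds:
            break
        # Only concerns that are still open get another targeted follow-up.
        open_kinds = [kind for kind in open_kinds if not follow_up(kind)]
    return open_kinds
```

Whatever the loop cannot close is returned rather than dropped, so the uncertainty can be disclosed instead of hidden.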
|
|
811
|
+
|
|
812
|
+
## Phase 5: Final Validation
|
|
813
|
+
|
|
814
|
+
The current final validation idea remains useful, but it should explicitly validate:
|
|
815
|
+
|
|
816
|
+
- recommendation strength
|
|
817
|
+
- severity calibration
|
|
818
|
+
- evidence quality
|
|
819
|
+
- operational / rollout concerns
|
|
820
|
+
- compatibility / migration risk
|
|
821
|
+
- whether unresolved context or boundary issues materially weaken the recommendation
|
|
822
|
+
|
|
823
|
+
Final validation should also ensure the handoff reflects uncertainty honestly instead of over-stating confidence.
|
|
824
|
+
|
|
825
|
+
The current WorkRail routine catalog suggests the redesign should strongly consider `routine-final-verification` as either:
|
|
826
|
+
|
|
827
|
+
- a delegated verifier
|
|
828
|
+
- an injected routine template
|
|
829
|
+
- or a direct-execution structure borrowed into the final validation phase
|
|
830
|
+
|
|
831
|
+
## Phase 6: Final Handoff and Environment Status
|
|
832
|
+
|
|
833
|
+
The final handoff should include both the review result and an explicit status report about the review environment.
|
|
834
|
+
|
|
835
|
+
### Review result
|
|
836
|
+
|
|
837
|
+
Include:
|
|
838
|
+
|
|
839
|
+
- recommendation
|
|
840
|
+
- confidence band
|
|
841
|
+
- top findings
|
|
842
|
+
- rationale
|
|
843
|
+
- remaining uncertainties
|
|
844
|
+
- summary of review surface and excluded noise
|
|
845
|
+
- validation outcomes
|
|
846
|
+
|
|
847
|
+
### Review environment status
|
|
848
|
+
|
|
849
|
+
Include:
|
|
850
|
+
|
|
851
|
+
- what the workflow accessed successfully
|
|
852
|
+
- what it attempted but could not access
|
|
853
|
+
- what it never attempted
|
|
854
|
+
- impact on review quality
|
|
855
|
+
- suggested environment improvements for future reviews
|
|
856
|
+
|
|
857
|
+
This should be informative, not accusatory and not blocking.
|
|
858
|
+
|
|
859
|
+
It should also explicitly state:
|
|
860
|
+
|
|
861
|
+
- whether the workflow found the actual PR/MR
|
|
862
|
+
- whether the workflow found ticket context
|
|
863
|
+
- whether the workflow found supporting docs
|
|
864
|
+
- whether the workflow is confident it reviewed the correct ancestor-relative surface
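
The environment status section could be derived mechanically from recorded probe statuses. In this sketch the report keys and `environment_status` are hypothetical, and mapping `unavailable` to "attempted but could not access" is an assumption.

```python
from typing import Dict, List


def environment_status(observations: Dict[str, str]) -> Dict[str, List[str]]:
    """Group source names by probe status for the final handoff."""
    report: Dict[str, List[str]] = {
        "accessed": [],
        "attempted_but_failed": [],
        "never_attempted": [],
    }
    for source, status in sorted(observations.items()):
        if status == "available":
            report["accessed"].append(source)
        elif status in ("unavailable", "attempted_but_insufficient"):
            report["attempted_but_failed"].append(source)
        elif status == "not_attempted":
            report["never_attempted"].append(source)
        else:
            # Refuse to guess: an unknown status should not be silently bucketed.
            raise ValueError(f"unknown probe status for {source}: {status}")
    return report
```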
|
|
865
|
+
|
|
866
|
+
## New Core Concepts
|
|
867
|
+
|
|
868
|
+
## Review Source Ledger
|
|
869
|
+
|
|
870
|
+
The workflow should maintain a structured ledger describing where review context came from.
|
|
871
|
+
|
|
872
|
+
Suggested fields:
|
|
873
|
+
|
|
874
|
+
- `reviewTargetSource`
|
|
875
|
+
- `boundarySource`
|
|
876
|
+
- `mrMetadataSource`
|
|
877
|
+
- `ticketSource`
|
|
878
|
+
- `docSourcesFound`
|
|
879
|
+
- `docSourcesMissing`
|
|
880
|
+
- `policySourcesFound`
|
|
881
|
+
- `policySourcesMissing`
|
|
882
|
+
- `capabilityObservations`
|
|
883
|
+
- `contextGaps`
|
|
884
|
+
|
|
885
|
+
This ledger exists to improve both reasoning quality and final transparency.
|
|
886
|
+
|
|
887
|
+
It is still open whether this should be represented primarily as:
|
|
888
|
+
|
|
889
|
+
- explicit context keys
|
|
890
|
+
- a dedicated structured artifact
|
|
891
|
+
- or both, with context carrying only the routing-critical subset
|
|
892
|
+
|
|
893
|
+
If a dedicated artifact is used, the workflow should still keep routing-critical fields in context so conditions, loops, and later phases remain deterministic and lightweight.
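
The "both" option can be illustrated with a ledger object that projects a small routing subset. The field names come from the suggested list; the field types, `routing_context`, and its four boolean keys are assumptions, since the representation is still an open question.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReviewSourceLedger:
    reviewTargetSource: Optional[str] = None
    boundarySource: Optional[str] = None
    mrMetadataSource: Optional[str] = None
    ticketSource: Optional[str] = None
    docSourcesFound: List[str] = field(default_factory=list)
    docSourcesMissing: List[str] = field(default_factory=list)
    policySourcesFound: List[str] = field(default_factory=list)
    policySourcesMissing: List[str] = field(default_factory=list)
    capabilityObservations: Dict[str, str] = field(default_factory=dict)
    contextGaps: List[str] = field(default_factory=list)

    def routing_context(self) -> Dict[str, bool]:
        """Small, deterministic subset that conditions and loops could read."""
        return {
            "hasMrMetadata": self.mrMetadataSource is not None,
            "hasTicketContext": self.ticketSource is not None,
            "hasPolicyContext": bool(self.policySourcesFound),
            "hasContextGaps": bool(self.contextGaps),
        }
```

The full ledger would live in the artifact, while only the booleans travel in context.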
|
|
894
|
+
|
|
895
|
+
## Boundary Confidence Model
|
|
896
|
+
|
|
897
|
+
The workflow should model review-boundary certainty explicitly rather than burying it in prose.
|
|
898
|
+
|
|
899
|
+
Suggested fields:
|
|
900
|
+
|
|
901
|
+
- `baseCandidate`
|
|
902
|
+
- `mergeBaseConfidence`
|
|
903
|
+
- `stackedBranchSuspected`
|
|
904
|
+
- `reviewBoundaryConfidence`
|
|
905
|
+
- `boundaryResolutionMethod`
|
|
906
|
+
- `reviewScopeWarnings`
|
|
907
|
+
- `baseResolutionFailed`
|
|
908
|
+
|
|
909
|
+
This is likely one of the strongest predictors of whether the workflow will rival strong human review.
|
|
910
|
+
|
|
911
|
+
## Change Type Profile
|
|
912
|
+
|
|
913
|
+
The workflow should classify the change into a structured profile rather than using only size/risk heuristics.
|
|
914
|
+
|
|
915
|
+
Suggested categories:
|
|
916
|
+
|
|
917
|
+
- `api_contract_change`
|
|
918
|
+
- `data_model_or_migration`
|
|
919
|
+
- `refactor`
|
|
920
|
+
- `infra_or_config`
|
|
921
|
+
- `test_only`
|
|
922
|
+
- `docs_only`
|
|
923
|
+
- `security_sensitive`
|
|
924
|
+
- `performance_sensitive`
|
|
925
|
+
- `ui_only`
|
|
926
|
+
- `mechanical_or_generated`
|
|
927
|
+
|
|
928
|
+
This profile should influence reviewer-family selection, simulation choices, and validation depth.
|
|
929
|
+
|
|
930
|
+
## Shape Profile
|
|
931
|
+
|
|
932
|
+
The workflow should classify MR shape separately from MR type.
|
|
933
|
+
|
|
934
|
+
Suggested categories:
|
|
935
|
+
|
|
936
|
+
- `tiny_isolated_change`
|
|
937
|
+
- `medium_localized_change`
|
|
938
|
+
- `broad_crosscutting_change`
|
|
939
|
+
- `stacked_branch_change`
|
|
940
|
+
- `mechanically_noisy_change`
|
|
941
|
+
- `mixed_signal_change`
|
|
942
|
+
- `migration_heavy_change`
|
|
943
|
+
|
|
944
|
+
This profile should influence:
|
|
945
|
+
|
|
946
|
+
- review partitioning strategy
|
|
947
|
+
- boundary follow-up depth
|
|
948
|
+
- reviewer-family breadth
|
|
949
|
+
- confidence calibration
|
|
950
|
+
- false-positive suppression
|
|
951
|
+
|
|
952
|
+
## Review Surface Hygiene Model
|
|
953
|
+
|
|
954
|
+
The workflow should explicitly separate:
|
|
955
|
+
|
|
956
|
+
- `core_review_surface`
|
|
957
|
+
- `generated_files`
|
|
958
|
+
- `mechanical_churn`
|
|
959
|
+
- `rename_or_move_only`
|
|
960
|
+
- `likely_inherited_changes`
|
|
961
|
+
- `out_of_scope_or_noise`
|
|
962
|
+
|
|
963
|
+
Without this, large reviews will continue to waste attention and overproduce low-value findings.
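
A first mechanical pass over the separation might look like the sketch below. It covers only the buckets that path and status heuristics can reasonably decide; `mechanical_churn` and `out_of_scope_or_noise` need judgment and are left out. `bucket_changes` and `GENERATED_HINTS` are hypothetical, and the status letters follow `git diff --name-status`, where `R100` marks a rename with no content change.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (status, path, is_inherited); `is_inherited` would come from boundary analysis.
Change = Tuple[str, str, bool]

# Assumed patterns for generated output; a real repo would supply its own.
GENERATED_HINTS = ("dist/", ".min.js", "package-lock.json", ".snap")


def bucket_changes(changes: List[Change]) -> Dict[str, List[str]]:
    """Sort changed paths into hygiene buckets, most specific rule first."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for status, path, inherited in changes:
        if inherited:
            buckets["likely_inherited_changes"].append(path)
        elif status == "R100":
            buckets["rename_or_move_only"].append(path)
        elif any(hint in path for hint in GENERATED_HINTS):
            buckets["generated_files"].append(path)
        else:
            buckets["core_review_surface"].append(path)
    return dict(buckets)
```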
|
|
964
|
+
|
|
965
|
+
## Capability Observation Model
|
|
966
|
+
|
|
967
|
+
Capability probing should produce durable observations rather than vague narrative.
|
|
968
|
+
|
|
969
|
+
Suggested recorded dimensions:
|
|
970
|
+
|
|
971
|
+
- source name
|
|
972
|
+
- status
|
|
973
|
+
- attempt method
|
|
974
|
+
- limitation reason
|
|
975
|
+
- whether the limitation materially reduced review quality
|
|
976
|
+
|
|
977
|
+
For first-class workflow capabilities, the redesign should prefer the existing v2 capability-observation path rather than inventing a bespoke mechanism.
|
|
978
|
+
|
|
979
|
+
For non-capability discovery surfaces, the main requirement is still durable structured observation, but the exact storage form remains an open authoring decision.
|
|
980
|
+
|
|
981
|
+
## Suggested Top-Level Capability Direction
|
|
982
|
+
|
|
983
|
+
At the workflow level, the redesign likely wants:
|
|
984
|
+
|
|
985
|
+
```json
|
|
986
|
+
{
|
|
987
|
+
  "capabilities": {
|
|
988
|
+
    "delegation": "preferred",
|
|
989
|
+
    "web_browsing": "preferred"
|
|
990
|
+
  }
|
|
991
|
+
}
|
|
992
|
+
```
|
|
993
|
+
|
|
994
|
+
The workflow should still treat GitHub CLI, ticket systems, and repo-local docs as discovery surfaces to probe rather than first-class capability enums.
|
|
995
|
+
|
|
996
|
+
The final workflow should likely also use feature config intentionally, rather than relying on capability declarations alone.
|
|
997
|
+
|
|
998
|
+
Example direction:
|
|
999
|
+
|
|
1000
|
+
- `wr.features.capabilities` to standardize probing behavior
|
|
1001
|
+
- `wr.features.output_contracts` to standardize enforcement and disclosure behavior
|
|
1002
|
+
|
|
1003
|
+
## Acceptance Criteria for the Redesign
|
|
1004
|
+
|
|
1005
|
+
The redesign should be considered successful if the future workflow:
|
|
1006
|
+
|
|
1007
|
+
1. attempts to discover the actual MR/PR when possible
|
|
1008
|
+
2. attempts to determine the true review boundary and exposes confidence in that boundary
|
|
1009
|
+
3. records discovery-source availability and insufficiency durably
|
|
1010
|
+
4. separates core review surface from noise before deep review
|
|
1011
|
+
5. adapts reviewer selection using change shape as well as size/risk
|
|
1012
|
+
6. uses final handoff to disclose access limits and their effect on confidence
|
|
1013
|
+
7. remains non-blocking unless correctness truly requires user input or unavailable artifacts
|
|
1014
|
+
8. keeps notes/context as workflow truth rather than making a review doc canonical
|
|
1015
|
+
9. attempts merge-base / ancestor resolution even for stale or stacked branches
|
|
1016
|
+
10. explicitly says when it is not confident it reviewed the correct surface
|
|
1017
|
+
11. attempts to recover repo/user rules, conventions, and coding philosophy when available
|
|
1018
|
+
12. uses policy-context confidence to calibrate how strongly it frames findings and recommendations
|
|
1019
|
+
|
|
1020
|
+
## Risks and Tensions
|
|
1021
|
+
|
|
1022
|
+
### Risk: overloaded Phase 0
|
|
1023
|
+
|
|
1024
|
+
This redesign puts a lot into the first phase.
|
|
1025
|
+
|
|
1026
|
+
Mitigation:
|
|
1027
|
+
|
|
1028
|
+
- keep the phase internally structured
|
|
1029
|
+
- use explicit sub-steps
|
|
1030
|
+
- require durable structured outputs, not just longer prose
|
|
1031
|
+
- use routines, templates, and auditors selectively so structure does not collapse into one giant handwritten prompt
|
|
1032
|
+
|
|
1033
|
+
### Risk: environment-probing noise
|
|
1034
|
+
|
|
1035
|
+
Capability and source probing can become verbose or distracting.
|
|
1036
|
+
|
|
1037
|
+
Mitigation:
|
|
1038
|
+
|
|
1039
|
+
- probe lazily
|
|
1040
|
+
- record compactly
|
|
1041
|
+
- summarize cleanly in the final handoff
|
|
1042
|
+
- prefer collapsed capability probes and reusable probe templates where authoring stays readable
|
|
1043
|
+
|
|
1044
|
+
### Risk: false precision in boundary confidence
|
|
1045
|
+
|
|
1046
|
+
The workflow may pretend certainty it does not actually have.
|
|
1047
|
+
|
|
1048
|
+
Mitigation:
|
|
1049
|
+
|
|
1050
|
+
- require explicit reasoning for boundary confidence
|
|
1051
|
+
- record warnings when ancestry remains ambiguous
|
|
1052
|
+
- allow confidence downgrade without blocking
|
|
1053
|
+
|
|
1054
|
+
### Risk: review-quality theater
|
|
1055
|
+
|
|
1056
|
+
The workflow could produce a polished review that looks rigorous while still lacking enough context to justify its confidence.
|
|
1057
|
+
|
|
1058
|
+
Mitigation:
|
|
1059
|
+
|
|
1060
|
+
- tie recommendation confidence to boundary confidence and context confidence
|
|
1061
|
+
- require the final handoff to name important unavailable sources
|
|
1062
|
+
- prefer explicit uncertainty over polished but misleading certainty
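
The first mitigation can be expressed as a "weakest input wins" rule. This is a sketch assuming a three-band scale; `CONFIDENCE_ORDER` and `recommendation_confidence` are hypothetical names.

```python
CONFIDENCE_ORDER = ["low", "medium", "high"]  # assumed three-band scale


def recommendation_confidence(boundary: str, context: str, review: str) -> str:
    """The recommendation is never more confident than its weakest input."""
    return min((boundary, context, review), key=CONFIDENCE_ORDER.index)
```

With this rule, a review with strong findings but weak boundary confidence still reports low overall confidence, which is the behavior the mitigation asks for.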
|
|
1063
|
+
|
|
1064
|
+
### Risk: policy-context mismatch
|
|
1065
|
+
|
|
1066
|
+
The workflow could produce findings that are locally reasonable but misaligned with the user’s rules, repo conventions, or architectural philosophy.
|
|
1067
|
+
|
|
1068
|
+
Mitigation:
|
|
1069
|
+
|
|
1070
|
+
- discover policy context explicitly
|
|
1071
|
+
- record missing policy sources as confidence-relevant gaps
|
|
1072
|
+
- present findings more cautiously when policy context is weak
|
|
1073
|
+
|
|
1074
|
+
### Risk: underusing WorkRail-native structure
|
|
1075
|
+
|
|
1076
|
+
The redesign could be conceptually strong but still author the final workflow as mostly handwritten prompts, leaving reuse, determinism, and customization power on the table.
|
|
1077
|
+
|
|
1078
|
+
Mitigation:
|
|
1079
|
+
|
|
1080
|
+
- prefer promptBlocks over long freeform prompts
|
|
1081
|
+
- use refs for repeated canonical guidance
|
|
1082
|
+
- use routines deliberately
|
|
1083
|
+
- use extension points only for bounded high-value seams
|
|
1084
|
+
|
|
1085
|
+
## Assessment of the Proposed Shape
|
|
1086
|
+
|
|
1087
|
+
### Is this the best shape?
|
|
1088
|
+
|
|
1089
|
+
This is the best next shape I would currently recommend, but probably not the final best possible shape.
|
|
1090
|
+
|
|
1091
|
+
It addresses the most important structural weaknesses in the current workflow:
|
|
1092
|
+
|
|
1093
|
+
- wrong-surface review risk
|
|
1094
|
+
- weak intent reconstruction
|
|
1095
|
+
- insufficient graceful degradation
|
|
1096
|
+
- under-specified environment transparency
|
|
1097
|
+
|
|
1098
|
+
### Will the review be thorough and useful?
|
|
1099
|
+
|
|
1100
|
+
Yes, this redesign should produce much more thorough and useful reviews than the current workflow, especially when the environment has enough discovery surfaces to enrich the review.
|
|
1101
|
+
|
|
1102
|
+
### Will it rival the best human engineers?
|
|
1103
|
+
|
|
1104
|
+
Not reliably yet.
|
|
1105
|
+
|
|
1106
|
+
It should get much closer, but the best human reviewers still outperform in:
|
|
1107
|
+
|
|
1108
|
+
- nuanced severity judgment
|
|
1109
|
+
- historical and organizational context reconstruction
|
|
1110
|
+
- large-change decomposition
|
|
1111
|
+
- subtle product and rollout reasoning under ambiguity
|
|
1112
|
+
|
|
1113
|
+
### Is it adaptable to the size of the changes?
|
|
1114
|
+
|
|
1115
|
+
Yes, and more importantly, the redesign makes it adaptable to both size and change shape.
|
|
1116
|
+
|
|
1117
|
+
That is a meaningful improvement over the current design, which is still too size/risk-centric.
|
|
1118
|
+
|
|
1119
|
+
### Does it properly identify the correct ancestor?
|
|
1120
|
+
|
|
1121
|
+
Not yet in the current workflow.
|
|
1122
|
+
|
|
1123
|
+
In the redesigned workflow, attempting ancestor and merge-base resolution must be required behavior, with explicit confidence reporting when the result is uncertain.
|
|
1124
|
+
|
|
1125
|
+
### Risk: overfitting reviewer families to categories
|
|
1126
|
+
|
|
1127
|
+
Too much change-type routing could make the workflow brittle.
|
|
1128
|
+
|
|
1129
|
+
Mitigation:
|
|
1130
|
+
|
|
1131
|
+
- keep a small change-type taxonomy
|
|
1132
|
+
- use it to influence, not fully determine, reviewer choice
|
|
1133
|
+
|
|
1134
|
+
## Open Questions
|
|
1135
|
+
|
|
1136
|
+
### Workflow-authoring questions
|
|
1137
|
+
|
|
1138
|
+
- Should capability observations use `wr.templates.capability_probe` / `wr.contracts.capability_observation` directly in the final workflow, or should some probes stay handwritten for readability?
|
|
1139
|
+
- Should non-capability source observations share the same artifact style, or live in a separate review-source ledger?
|
|
1140
|
+
- Should the review-source ledger be a dedicated artifact, explicit context fields, or both?
|
|
1141
|
+
- Should boundary-confidence handling live entirely inside Phase 0, or also have a reusable template/routine?
|
|
1142
|
+
|
|
1143
|
+
### Product questions
|
|
1144
|
+
|
|
1145
|
+
- Should missing PR metadata reduce confidence mildly or strongly?
|
|
1146
|
+
- Should a very low boundary-confidence result reopen discovery automatically, or only surface a warning?
|
|
1147
|
+
- How strong should the end-of-workflow tooling recommendations be before they feel noisy rather than helpful?
|
|
1148
|
+
|
|
1149
|
+
### Scope questions
|
|
1150
|
+
|
|
1151
|
+
- Should large-MR partitioning be part of this redesign, or explicitly deferred?
|
|
1152
|
+
- Should historical reasoning from prior commits or nearby blame/history be added now, or later?
|
|
1153
|
+
- Should severity-calibration improvements be bundled with this redesign, or follow after the boundary/context work lands?
|
|
1154
|
+
|
|
1155
|
+
## Recommended Next Step
|
|
1156
|
+
|
|
1157
|
+
Use this document as the working source of truth until the design stabilizes.
|
|
1158
|
+
|
|
1159
|
+
Once the open questions are narrowed, the next step should be a second-pass revision of this document that:
|
|
1160
|
+
|
|
1161
|
+
- decides which fields are required durable context
|
|
1162
|
+
- decides which fields should be artifact-backed
|
|
1163
|
+
- defines the exact reviewer-family routing logic by change type
|
|
1164
|
+
- defines what Phase 0 must output before the workflow can advance
|
|
1165
|
+
|
|
1166
|
+
Only after that should `mr-review-workflow.agentic.v2.json` be updated.
|
|
1167
|
+
|
|
1168
|
+
## Next Implementation Slice (Recommended)
|
|
1169
|
+
|
|
1170
|
+
The best next implementation pass should be **narrow, high-leverage, and compatibility-aware**.
|
|
1171
|
+
|
|
1172
|
+
Do **not** try to land the entire future-state redesign at once.
|
|
1173
|
+
|
|
1174
|
+
The recommended slice is:
|
|
1175
|
+
|
|
1176
|
+
1. **replace the current Phase 0 with a real Locate / Bound / Enrich / Classify front phase**
|
|
1177
|
+
2. **strengthen the fact packet so it includes review surface and discovered intent**
|
|
1178
|
+
3. **add final environment-status disclosure**
|
|
1179
|
+
4. **add minimal shape/type-aware routing where it clearly changes reviewer-family choice**
|
|
1180
|
+
|
|
1181
|
+
This slice should intentionally **not** try to solve every remaining elite-review gap in the same pass.
|
|
1182
|
+
|
|
1183
|
+
This section should be treated as the **canonical near-term implementation plan** for the redesign. If it conflicts with earlier broad design material, prefer this narrower slice for the next real workflow update.
|
|
1184
|
+
|
|
1185
|
+
### Why this slice is best
|
|
1186
|
+
|
|
1187
|
+
It directly addresses the largest correctness and usefulness gaps:
|
|
1188
|
+
|
|
1189
|
+
- wrong review-boundary risk
|
|
1190
|
+
- weak MR/ticket/doc discovery
|
|
1191
|
+
- insufficient visibility into graceful degradation
|
|
1192
|
+
- under-specified review-surface hygiene
|
|
1193
|
+
|
|
1194
|
+
without forcing the workflow into a giant taxonomy or a schema-fighting rewrite.
|
|
1195
|
+
|
|
1196
|
+
## Engine vs Agent Responsibility Split
|
|
1197
|
+
|
|
1198
|
+
The next implementation slice should use WorkRail for **control, durability, and accountability**, while leaving investigation and synthesis flexible.
|
|
1199
|
+
|
|
1200
|
+
### Engine should enforce
|
|
1201
|
+
|
|
1202
|
+
- phase boundaries
|
|
1203
|
+
- minimum required outputs
|
|
1204
|
+
- confidence/follow-up routing
|
|
1205
|
+
- durable disclosure of missing context and uncertainty
|
|
1206
|
+
- final handoff structure
|
|
1207
|
+
|
|
1208
|
+
### Agent should own
|
|
1209
|
+
|
|
1210
|
+
- discovery path
|
|
1211
|
+
- source prioritization
|
|
1212
|
+
- tool choice
|
|
1213
|
+
- evidence synthesis
|
|
1214
|
+
- non-obvious issue detection
|
|
1215
|
+
|
|
1216
|
+
This redesign should constrain **what must be established or admitted**, not prescribe exact commands or investigative choreography.
|
|
1217
|
+
|
|
1218
|
+
## Minimum Phase 0 Contract
|
|
1219
|
+
|
|
1220
|
+
The next pass should stop treating Phase 0 as mostly prose and make it a compact execution contract.
|
|
1221
|
+
|
|
1222
|
+
### Minimum required context before advancing
|
|
1223
|
+
|
|
1224
|
+
These fields should always be set before Phase 0 completes:
|
|
1225
|
+
|
|
1226
|
+
- `reviewTargetKind` - enum-like classification such as `pr`, `branch`, `diff`, `patch`, or `unknown`
|
|
1227
|
+
- `reviewTargetSource` - short provenance label such as `user_link`, `gh_discovery`, `git_branch`, or `local_diff`
|
|
1228
|
+
- `reviewSurfaceSummary` - one compact prose summary of what is actually being reviewed
|
|
1229
|
+
- `reviewMode`
|
|
1230
|
+
- `riskLevel`
|
|
1231
|
+
- `boundaryConfidence`
|
|
1232
|
+
- `contextConfidence`
|
|
1233
|
+
- `shapeProfile`
|
|
1234
|
+
- `changeTypeProfile`
|
|
1235
|
+
- `needsReviewerBundle`
|
|
1236
|
+
- `needsBoundaryFollowup`
|
|
1237
|
+
- `needsContextFollowup`
|
|
1238
|
+
- `reviewScopeWarnings` - short list of warnings, not a large ledger
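
As a rough illustration, the required fields above could be typed and checked along these lines. This is a minimal sketch, not the workflow's actual schema: the field names come from the list, but the value types (the `High`/`Medium`/`Low` levels for both confidence fields, booleans for the `needs*` flags, plain strings elsewhere) and the helper `missingRequiredFields` are assumptions made for illustration.

```typescript
// Hypothetical typing of the required Phase 0 context.
// Field names are from the list above; value types are assumptions.
type Confidence = "High" | "Medium" | "Low";

interface Phase0RequiredContext {
  reviewTargetKind: "pr" | "branch" | "diff" | "patch" | "unknown";
  reviewTargetSource: string; // e.g. "user_link", "gh_discovery"
  reviewSurfaceSummary: string;
  reviewMode: string;
  riskLevel: string;
  boundaryConfidence: Confidence;
  contextConfidence: Confidence;
  shapeProfile: string;
  changeTypeProfile: string;
  needsReviewerBundle: boolean;
  needsBoundaryFollowup: boolean;
  needsContextFollowup: boolean;
  reviewScopeWarnings: string[]; // short list, not a ledger
}

const REQUIRED_FIELDS: (keyof Phase0RequiredContext)[] = [
  "reviewTargetKind",
  "reviewTargetSource",
  "reviewSurfaceSummary",
  "reviewMode",
  "riskLevel",
  "boundaryConfidence",
  "contextConfidence",
  "shapeProfile",
  "changeTypeProfile",
  "needsReviewerBundle",
  "needsBoundaryFollowup",
  "needsContextFollowup",
  "reviewScopeWarnings",
];

// Returns the names of required fields that have not been set yet.
function missingRequiredFields(ctx: Partial<Phase0RequiredContext>): string[] {
  return REQUIRED_FIELDS.filter((field) => ctx[field] === undefined);
}
```

An empty result means every required field is present; it says nothing about whether the values are well founded.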
|
|
1239
|
+
|
|
1240
|
+
### Required-if-known context
|
|
1241
|
+
|
|
1242
|
+
These fields should be set when they can be discovered without blocking:
|
|
1243
|
+
|
|
1244
|
+
- `prUrl`
|
|
1245
|
+
- `prNumber`
|
|
1246
|
+
- `baseCandidate`
|
|
1247
|
+
- `mergeBaseRef`
|
|
1248
|
+
- `ticketRefs` - list of ticket/issue identifiers when discoverable
|
|
1249
|
+
- `supportingDocsFound` - compact list of discovered supporting docs, not a boolean
|
|
1250
|
+
- `policySourcesFound` - compact list of repo/user policy sources, not a boolean
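
The required-if-known fields map naturally onto optional properties. The sketch below assumes string and number value types and adds a hypothetical helper, `undiscoveredContext`, that lists what was not found so the final handoff can disclose it; neither the types nor the helper come from the workflow itself.

```typescript
// Hypothetical typing of the required-if-known Phase 0 context.
// All properties are optional because discovery must not block.
interface Phase0OptionalContext {
  prUrl?: string;
  prNumber?: number;
  baseCandidate?: string;
  mergeBaseRef?: string;
  ticketRefs?: string[];
  supportingDocsFound?: string[]; // compact list, not a boolean
  policySourcesFound?: string[]; // compact list, not a boolean
}

const OPTIONAL_FIELDS: (keyof Phase0OptionalContext)[] = [
  "prUrl",
  "prNumber",
  "baseCandidate",
  "mergeBaseRef",
  "ticketRefs",
  "supportingDocsFound",
  "policySourcesFound",
];

// Lists fields that were never set or were set to an empty list.
function undiscoveredContext(ctx: Phase0OptionalContext): string[] {
  return OPTIONAL_FIELDS.filter((field) => {
    const value = ctx[field];
    if (value === undefined) return true;
    return Array.isArray(value) && value.length === 0;
  });
}
```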
|
|
1251
|
+
|
|
1252
|
+
### Minimum review-surface outputs
|
|
1253
|
+
|
|
1254
|
+
Phase 0 should also classify the visible change into a minimal review-surface shape.
|
|
1255
|
+
|
|
1256
|
+
It does **not** need a large ledger in the next slice, but it should at least distinguish:
|
|
1257
|
+
|
|
1258
|
+
- `coreReviewSurface`
|
|
1259
|
+
- `likelyNoiseOrMechanicalChurn`
|
|
1260
|
+
- `likelyInheritedOrOutOfScopeChanges`
|
|
1261
|
+
|
|
1262
|
+
This can stay compact. The important thing is that the workflow does not treat every visible changed file as equally worthy of deep review by default.
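
One compact way to hold this classification is three lists of file paths. The sketch below assumes that representation (the source names the three buckets but not their shape) and adds a hypothetical check, `unclassifiedFiles`, for changed files that landed in no bucket.

```typescript
// Hypothetical representation: each bucket is a list of changed file paths.
interface ReviewSurface {
  coreReviewSurface: string[];
  likelyNoiseOrMechanicalChurn: string[];
  likelyInheritedOrOutOfScopeChanges: string[];
}

// Returns changed files that were not assigned to any bucket,
// so the agent can classify them instead of silently ignoring them.
function unclassifiedFiles(changedFiles: string[], surface: ReviewSurface): string[] {
  const classified = new Set<string>(
    surface.coreReviewSurface.concat(
      surface.likelyNoiseOrMechanicalChurn,
      surface.likelyInheritedOrOutOfScopeChanges,
    ),
  );
  return changedFiles.filter((file) => !classified.has(file));
}
```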
|
|
1263
|
+
|
|
1264
|
+
### Minimum advance rule
|
|
1265
|
+
|
|
1266
|
+
Phase 0 may advance when all of these are true:
|
|
1267
|
+
|
|
1268
|
+
- there is a meaningful review target
|
|
1269
|
+
- there is inspectable material
|
|
1270
|
+
- `boundaryConfidence` is set
|
|
1271
|
+
- `contextConfidence` is set
|
|
1272
|
+
- `shapeProfile` is set
|
|
1273
|
+
- `changeTypeProfile` is set
|
|
1274
|
+
- the workflow has explicitly recorded whether boundary/context follow-up is needed
|
|
1275
|
+
|
|
1276
|
+
This keeps the workflow **non-blocking by default** while still making the phase structurally complete.
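
The advance rule can be read as a presence check rather than a quality gate. The sketch below is one possible encoding; the input shape and the names `hasMeaningfulReviewTarget`, `hasInspectableMaterial`, and `phase0AdvanceBlockers` are hypothetical. Note that a `Low` confidence value does not block advancement: only an unset value does.

```typescript
type Confidence = "High" | "Medium" | "Low";

// Hypothetical input shape for the advance check.
interface Phase0AdvanceInput {
  hasMeaningfulReviewTarget: boolean;
  hasInspectableMaterial: boolean;
  boundaryConfidence?: Confidence;
  contextConfidence?: Confidence;
  shapeProfile?: string;
  changeTypeProfile?: string;
  needsBoundaryFollowup?: boolean; // must be recorded, true or false
  needsContextFollowup?: boolean; // must be recorded, true or false
}

// Returns the unmet conditions; an empty list means Phase 0 may advance.
function phase0AdvanceBlockers(input: Phase0AdvanceInput): string[] {
  const blockers: string[] = [];
  if (!input.hasMeaningfulReviewTarget) blockers.push("no meaningful review target");
  if (!input.hasInspectableMaterial) blockers.push("no inspectable material");
  if (input.boundaryConfidence === undefined) blockers.push("boundaryConfidence not set");
  if (input.contextConfidence === undefined) blockers.push("contextConfidence not set");
  if (input.shapeProfile === undefined) blockers.push("shapeProfile not set");
  if (input.changeTypeProfile === undefined) blockers.push("changeTypeProfile not set");
  if (input.needsBoundaryFollowup === undefined || input.needsContextFollowup === undefined) {
    blockers.push("boundary/context follow-up need not recorded");
  }
  return blockers;
}
```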
|
|
1277
|
+
|
|
1278
|
+
### Explicit fallback behavior
|
|
1279
|
+
|
|
1280
|
+
The next implementation slice should make these fallback behaviors explicit:
|
|
1281
|
+
|
|
1282
|
+
| Situation | Expected Phase 0 behavior |
|
|
1283
|
+
|---|---|
|
|
1284
|
+
| PR/MR not found, but branch/diff is inspectable | continue with branch/diff review, lower context confidence, disclose missing PR context later |
|
|
1285
|
+
| branch exists, but merge-base / ancestor is ambiguous | continue with downgraded boundary confidence, record boundary follow-up need, disclose the uncertainty later |
|
|
1286
|
+
| no ticket or supporting docs found | continue, lower context confidence, avoid overclaiming intent-sensitive findings |
|
|
1287
|
+
| only a patch/diff is available | continue if inspectable, but keep lower confidence on intent/boundary-dependent conclusions |
|
|
1288
|
+
| inspectable target is missing entirely | ask for the missing review artifact and stop |
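
The table above could be sketched as a single function. Everything about the shape here is an assumption: the discovery flags, the outcome fields, and the choice to accumulate the effects of several rows when more than one situation applies. Only the per-row behavior is taken from the table.

```typescript
// Hypothetical summary of what Phase 0 discovery found.
interface Phase0Discovery {
  prFound: boolean;
  hasBranch: boolean;
  hasDiffOrPatch: boolean;
  mergeBaseAmbiguous: boolean;
  ticketOrDocsFound: boolean;
}

// Hypothetical outcome shape; effects of multiple rows accumulate.
interface FallbackOutcome {
  action: "continue" | "ask_and_stop";
  lowerContextConfidence: boolean;
  lowerBoundaryConfidence: boolean;
  needsBoundaryFollowup: boolean;
  notes: string[];
}

function phase0Fallback(d: Phase0Discovery): FallbackOutcome {
  const outcome: FallbackOutcome = {
    action: "continue",
    lowerContextConfidence: false,
    lowerBoundaryConfidence: false,
    needsBoundaryFollowup: false,
    notes: [],
  };
  // Nothing inspectable at all: the only blocking case.
  if (!d.prFound && !d.hasBranch && !d.hasDiffOrPatch) {
    outcome.action = "ask_and_stop";
    outcome.notes.push("ask for the missing review artifact");
    return outcome;
  }
  if (!d.prFound) {
    outcome.lowerContextConfidence = true;
    outcome.notes.push("disclose missing PR context later");
  }
  if (d.hasBranch && d.mergeBaseAmbiguous) {
    outcome.lowerBoundaryConfidence = true;
    outcome.needsBoundaryFollowup = true;
    outcome.notes.push("disclose merge-base uncertainty later");
  }
  if (!d.ticketOrDocsFound) {
    outcome.lowerContextConfidence = true;
    outcome.notes.push("avoid overclaiming intent-sensitive findings");
  }
  if (!d.prFound && !d.hasBranch && d.hasDiffOrPatch) {
    outcome.notes.push("keep lower confidence on intent/boundary-dependent conclusions");
  }
  return outcome;
}
```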
|
|
1289
|
+
|
|
1290
|
+
## Narrow Artifact vs Context Decision
|
|
1291
|
+
|
|
1292
|
+
The next implementation slice should prefer a **context-first** approach over introducing multiple new artifacts immediately.
|
|
1293
|
+
|
|
1294
|
+
### Keep in context
|
|
1295
|
+
|
|
1296
|
+
Use context for routing-critical state:
|
|
1297
|
+
|
|
1298
|
+
- boundary confidence
|
|
1299
|
+
- context confidence
|
|
1300
|
+
- shape/type classification
|
|
1301
|
+
- follow-up triggers
|
|
1302
|
+
- base candidate / merge-base outcome
|
|
1303
|
+
- whether PR/ticket/docs were found
|
|
1304
|
+
|
|
1305
|
+
### Add at most one optional artifact
|
|
1306
|
+
|
|
1307
|
+
If the workflow needs a human-readable artifact in the next pass, add only one optional artifact:
|
|
1308
|
+
|
|
1309
|
+
- `boundary-analysis`
|
|
1310
|
+
|
|
1311
|
+
Do **not** add multiple ledgers in the next implementation slice unless they clearly improve execution quality.
|
|
1312
|
+
|
|
1313
|
+
## Minimal Routing Matrix
|
|
1314
|
+
|
|
1315
|
+
The next pass should use a **small routing table**, not a giant decision taxonomy.
|
|
1316
|
+
|
|
1317
|
+
For the next slice, keep these classifications intentionally small:
|
|
1318
|
+
|
|
1319
|
+
- `shapeProfile`: `isolated_change`, `crosscutting_change`, `mechanically_noisy_change`, `ambiguous_boundary`
|
|
1320
|
+
- `changeTypeProfile`: `api_contract_change`, `data_model_or_migration`, `security_sensitive`, `test_only`, `general_code_change`
|
|
1321
|
+
|
|
1322
|
+
Anything more detailed should stay out of the workflow until it clearly changes behavior enough to justify itself.
|
|
1323
|
+
|
|
1324
|
+
| Situation | Reviewer / follow-up bias |
|
|
1325
|
+
|---|---|
|
|
1326
|
+
| `boundaryConfidence = Low` | boundary/context follow-up before strong recommendation confidence |
|
|
1327
|
+
| `changeTypeProfile = api_contract_change` | stronger contract/consumer/backward-compatibility scrutiny |
|
|
1328
|
+
| `changeTypeProfile = data_model_or_migration` | stronger rollout / compatibility / simulation lens |
|
|
1329
|
+
| `changeTypeProfile = security_sensitive` | stronger adversarial/runtime-risk scrutiny and lower tolerance for weak evidence |
|
|
1330
|
+
| `changeTypeProfile = test_only` | lighter architecture scrutiny, stronger false-positive suppression |
|
|
1331
|
+
| `shapeProfile = mechanically_noisy_change` | stronger noise filtering, lower appetite for style-only findings |
|
|
1332
|
+
| `criticalSurfaceTouched = true` | include runtime/production-risk reviewer path |
|
|
1333
|
+
|
|
1334
|
+
If a classification does not change behavior at least this much, it should stay out of the workflow.
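
To show how small the routing table is, here is a sketch that turns the classifications into a list of reviewer biases. The enum values and bias descriptions are taken from the lists and table above; the function `routingBiases` and its input shape are hypothetical, and returning plain strings is a simplification chosen for illustration.

```typescript
type Confidence = "High" | "Medium" | "Low";
type ShapeProfile =
  | "isolated_change"
  | "crosscutting_change"
  | "mechanically_noisy_change"
  | "ambiguous_boundary";
type ChangeTypeProfile =
  | "api_contract_change"
  | "data_model_or_migration"
  | "security_sensitive"
  | "test_only"
  | "general_code_change";

// Hypothetical input shape for routing.
interface RoutingInput {
  boundaryConfidence: Confidence;
  shapeProfile: ShapeProfile;
  changeTypeProfile: ChangeTypeProfile;
  criticalSurfaceTouched: boolean;
}

// Returns one bias per matching table row; values with no row add nothing.
function routingBiases(input: RoutingInput): string[] {
  const biases: string[] = [];
  if (input.boundaryConfidence === "Low") {
    biases.push("boundary/context follow-up before strong recommendation confidence");
  }
  switch (input.changeTypeProfile) {
    case "api_contract_change":
      biases.push("stronger contract/consumer/backward-compatibility scrutiny");
      break;
    case "data_model_or_migration":
      biases.push("stronger rollout / compatibility / simulation lens");
      break;
    case "security_sensitive":
      biases.push("stronger adversarial/runtime-risk scrutiny and lower tolerance for weak evidence");
      break;
    case "test_only":
      biases.push("lighter architecture scrutiny, stronger false-positive suppression");
      break;
    default:
      break; // general_code_change has no row in the table
  }
  if (input.shapeProfile === "mechanically_noisy_change") {
    biases.push("stronger noise filtering, lower appetite for style-only findings");
  }
  if (input.criticalSurfaceTouched) {
    biases.push("include runtime/production-risk reviewer path");
  }
  return biases;
}
```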
|
|
1335
|
+
|
|
1336
|
+
## Minimal Confidence Rules
|
|
1337
|
+
|
|
1338
|
+
The next implementation slice should use a small **self-assessment matrix** so the agent can gauge confidence dimension by dimension instead of picking a vague overall feeling.
|
|
1339
|
+
|
|
1340
|
+
### Use a 5-dimension confidence assessment in the prompt
|
|
1341
|
+
|
|
1342
|
+
The next pass does **not** need a scoring engine. A compact assessment block in the prompt is enough for the near-term workflow update.
|
|
1343
|
+
|
|
1344
|
+
Longer term, this is a strong candidate for a native WorkRail **assessment / decision gate** primitive so the engine, not the agent, can apply the aggregation and routing rules deterministically.
|
|
1345
|
+
|
|
1346
|
+
This assessment should be treated as **decision support**, not automatic truth. It should help the agent gauge uncertainty consistently, cap or guide conclusions, and trigger follow-up when needed, but it should not replace synthesis judgment.
|
|
1347
|
+
|
|
1348
|
+
Recommended dimensions:
|
|
1349
|
+
|
|
1350
|
+
- `boundaryConfidence`
|
|
1351
|
+
- `intentConfidence`
|
|
1352
|
+
- `evidenceConfidence`
|
|
1353
|
+
- `coverageConfidence`
|
|
1354
|
+
- `consensusConfidence`
|
|
1355
|
+
|
|
1356
|
+
Recommended levels:
|
|
1357
|
+
|
|
1358
|
+
- `High`
|
|
1359
|
+
- `Medium`
|
|
1360
|
+
- `Low`
|
|
1361
|
+
|
|
1362
|
+
Recommended prompt shape:
|
|
1363
|
+
|
|
1364
|
+
- rate each dimension
|
|
1365
|
+
- explain each in one sentence
|
|
1366
|
+
- apply the aggregation rules
|
|
1367
|
+
- state final recommendation confidence and why
|
|
1368
|
+
|
|
1369
|
+
### Anchors for each dimension
|
|
1370
|
+
|
|
1371
|
+
- **High**: strong direct support, little ambiguity
|
|
1372
|
+
- **Medium**: partial support, some important uncertainty
|
|
1373
|
+
- **Low**: weak support, major ambiguity, or likely missing context
|
|
1374
|
+
|
|
1375
|
+
### Aggregation rules
|
|
1376
|
+
|
|
1377
|
+
- if `boundaryConfidence = Low`, final recommendation confidence = `Low`
|
|
1378
|
+
- else if `evidenceConfidence = Low`, final recommendation confidence = `Low`
|
|
1379
|
+
- else if 2 or more dimensions are `Medium`, final recommendation confidence = `Medium`
|
|
1380
|
+
- else if all key dimensions are `High`, final recommendation confidence = `High`
|
|
1381
|
+
- unresolved disagreement can only lower confidence, never raise it
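
The rules above are meant to be applied by the agent in the prompt, but writing them as code makes the ordering explicit. The sketch below makes three assumptions that the rules themselves leave open: all five dimensions count as "key" dimensions, any combination the rules do not cover (for example, exactly one `Medium` with the rest `High`) falls back to `Medium`, and unresolved disagreement lowers a `High` result by one step.

```typescript
type Confidence = "High" | "Medium" | "Low";

interface ConfidenceAssessment {
  boundaryConfidence: Confidence;
  intentConfidence: Confidence;
  evidenceConfidence: Confidence;
  coverageConfidence: Confidence;
  consensusConfidence: Confidence;
}

function aggregateConfidence(
  a: ConfidenceAssessment,
  unresolvedDisagreement: boolean = false,
): Confidence {
  const levels: Confidence[] = [
    a.boundaryConfidence,
    a.intentConfidence,
    a.evidenceConfidence,
    a.coverageConfidence,
    a.consensusConfidence,
  ];
  const mediumCount = levels.filter((level) => level === "Medium").length;

  let result: Confidence;
  if (a.boundaryConfidence === "Low") {
    result = "Low";
  } else if (a.evidenceConfidence === "Low") {
    result = "Low";
  } else if (mediumCount >= 2) {
    result = "Medium";
  } else if (levels.every((level) => level === "High")) {
    result = "High"; // ASSUMPTION: all five dimensions are "key"
  } else {
    result = "Medium"; // ASSUMPTION: conservative default for uncovered cases
  }

  // Disagreement can only lower confidence, never raise it.
  // ASSUMPTION: it lowers High by one step and leaves lower results as-is.
  if (unresolvedDisagreement && result === "High") {
    result = "Medium";
  }
  return result;
}
```

Because the `Low` checks on boundary and evidence come first, strong ratings elsewhere cannot compensate for a weak boundary or weak evidence.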
|
|
1382
|
+
|
|
1383
|
+
### Hard rules around the assessment
|
|
1384
|
+
|
|
1385
|
+
- if important supporting sources were unavailable, the final handoff must say so explicitly
|
|
1386
|
+
- if intent confidence is weak and a finding depends heavily on inferred intent, prefer the lower-confidence interpretation
|
|
1387
|
+
- if coverage confidence is weak, findings should be framed as more tentative even when local code evidence looks strong
|
|
1388
|
+
- if evidence later proves stronger or weaker than the initial read, synthesis may lower confidence further, but it should not raise confidence above what the assessment justifies without an explicit reason
|
|
1389
|
+
|
|
1390
|
+
### Follow-up triggers
|
|
1391
|
+
|
|
1392
|
+
- low boundary confidence -> boundary follow-up
|
|
1393
|
+
- low intent confidence with intent-sensitive findings -> context follow-up
|
|
1394
|
+
- low evidence confidence on a serious finding -> more validation or follow-up
|
|
1395
|
+
- low coverage confidence -> targeted coverage-expansion follow-up
|
|
1396
|
+
- unresolved contradictory reviewer output -> synthesis/follow-up loop
|
|
1397
|
+
|
|
1398
|
+
This should remain a **prompt-level self-assessment** in the next slice, not a large new workflow-state subsystem.
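
Purely to illustrate the trigger logic, and not as a proposal for new workflow state, the follow-up triggers can be written as a small function. The three boolean inputs and the follow-up labels are hypothetical names introduced here.

```typescript
type Confidence = "High" | "Medium" | "Low";

// Hypothetical input: four confidence dimensions plus three
// flags describing the findings produced so far.
interface FollowupInput {
  boundaryConfidence: Confidence;
  intentConfidence: Confidence;
  evidenceConfidence: Confidence;
  coverageConfidence: Confidence;
  hasIntentSensitiveFindings: boolean;
  hasSeriousFinding: boolean;
  hasUnresolvedReviewerContradiction: boolean;
}

type Followup =
  | "boundary_followup"
  | "context_followup"
  | "more_validation"
  | "coverage_expansion"
  | "synthesis_loop";

// Maps each trigger in the list above to one follow-up label.
function followupsFor(input: FollowupInput): Followup[] {
  const followups: Followup[] = [];
  if (input.boundaryConfidence === "Low") followups.push("boundary_followup");
  if (input.intentConfidence === "Low" && input.hasIntentSensitiveFindings) {
    followups.push("context_followup");
  }
  if (input.evidenceConfidence === "Low" && input.hasSeriousFinding) {
    followups.push("more_validation");
  }
  if (input.coverageConfidence === "Low") followups.push("coverage_expansion");
  if (input.hasUnresolvedReviewerContradiction) followups.push("synthesis_loop");
  return followups;
}
```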
|
|
1399
|
+
|
|
1400
|
+
## Explicit Deferrals
|
|
1401
|
+
|
|
1402
|
+
The next implementation slice should **defer** these items unless the user explicitly wants a broader redesign:
|
|
1403
|
+
|
|
1404
|
+
- large-MR partitioning
|
|
1405
|
+
- historical reasoning / blame-style context
|
|
1406
|
+
- a full severity calibration framework
|
|
1407
|
+
- broad extension-point authoring
|
|
1408
|
+
- a multi-artifact ledger system
|
|
1409
|
+
- feature-config / capability-probe authoring that depends on unsupported schema/compiler support
|
|
1410
|
+
|
|
1411
|
+
These are still good ideas, but they are not the highest-leverage next landing.
|
|
1412
|
+
|
|
1413
|
+
## Current Engine-Compatibility Constraint
|
|
1414
|
+
|
|
1415
|
+
The redesign doc is still intentionally ahead of the current engine in some places.
|
|
1416
|
+
|
|
1417
|
+
For the next real workflow update, avoid depending on authoring constructs that the current schema/compiler does not yet accept cleanly.
|
|
1418
|
+
|
|
1419
|
+
In particular:
|
|
1420
|
+
|
|
1421
|
+
- do not assume top-level `capabilities` is available unless schema support is added first
|
|
1422
|
+
- do not assume every v2 feature flag discussed in docs is currently accepted by the compiler
|
|
1423
|
+
|
|
1424
|
+
The next implementation slice should prefer current-schema-compatible prompts, context fields, and supported contracts unless the user explicitly wants engine/schema work as part of the same change.
|
|
1425
|
+
|
|
1426
|
+
## Notes Quality Requirement
|
|
1427
|
+
|
|
1428
|
+
The workflow's notes should be useful for both:
|
|
1429
|
+
|
|
1430
|
+
- the user reading what happened
|
|
1431
|
+
- another agent resuming the review later
|
|
1432
|
+
|
|
1433
|
+
That means notes should read like compact **decision memos**, not raw logs or vague diary entries.
|
|
1434
|
+
|
|
1435
|
+
At minimum, the notes for important phases should make clear:
|
|
1436
|
+
|
|
1437
|
+
- what was learned
|
|
1438
|
+
- what was decided
|
|
1439
|
+
- what remains uncertain
|
|
1440
|
+
- what should happen next
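
One way to keep notes in the decision-memo shape is to give them a fixed four-part structure. The sketch below is an assumption about how that might look; `PhaseNote` and `renderPhaseNote` are hypothetical names, and the markdown layout is only one option.

```typescript
// Hypothetical structure mirroring the four points above.
interface PhaseNote {
  phase: string;
  learned: string[];
  decided: string[];
  uncertain: string[];
  next: string[];
}

// Renders a note as a compact markdown memo. Empty sections are
// kept and marked, so a resuming agent can see nothing was recorded.
function renderPhaseNote(note: PhaseNote): string {
  const section = (title: string, items: string[]): string => {
    const body =
      items.length > 0
        ? items.map((item) => `- ${item}`).join("\n")
        : "- (none recorded)";
    return `### ${title}\n${body}`;
  };
  return [
    `## ${note.phase}`,
    section("What was learned", note.learned),
    section("What was decided", note.decided),
    section("What remains uncertain", note.uncertain),
    section("What should happen next", note.next),
  ].join("\n\n");
}
```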
|
|
1441
|
+
|
|
1442
|
+
## Success Criteria for the Next Pass
|
|
1443
|
+
|
|
1444
|
+
The next MR-review workflow update should be considered successful if it lands all of these:
|
|
1445
|
+
|
|
1446
|
+
1. it attempts to find the real PR/MR when available
|
|
1447
|
+
2. it attempts merge-base / ancestor reasoning and reports confidence honestly
|
|
1448
|
+
3. it discovers ticket/docs/policy context opportunistically without blocking by default
|
|
1449
|
+
4. it separates review surface from obvious noise at least at a basic level
|
|
1450
|
+
5. it adds a final environment-status summary that explains what was and was not accessible
|
|
1451
|
+
6. it does all of the above without introducing bureaucratic over-structure or schema-incompatible authoring
|