@event4u/agent-config 1.14.0 → 1.15.0

Files changed (106)
  1. package/.agent-src/commands/agent-handoff.md +1 -1
  2. package/.agent-src/commands/bug-fix.md +2 -2
  3. package/.agent-src/commands/chat-history-checkpoint.md +2 -2
  4. package/.agent-src/commands/chat-history-clear.md +1 -1
  5. package/.agent-src/commands/chat-history-resume.md +2 -2
  6. package/.agent-src/commands/chat-history.md +2 -2
  7. package/.agent-src/commands/check-current-md.md +43 -32
  8. package/.agent-src/commands/commit-in-chunks.md +43 -23
  9. package/.agent-src/commands/compress.md +34 -2
  10. package/.agent-src/commands/feature-roadmap.md +2 -2
  11. package/.agent-src/commands/fix-portability.md +2 -2
  12. package/.agent-src/commands/onboard.md +14 -5
  13. package/.agent-src/commands/optimize-augmentignore.md +9 -0
  14. package/.agent-src/commands/refine-ticket.md +9 -7
  15. package/.agent-src/commands/review-changes.md +35 -8
  16. package/.agent-src/commands/roadmap-create.md +13 -2
  17. package/.agent-src/commands/roadmap-execute.md +9 -7
  18. package/.agent-src/commands/set-cost-profile.md +8 -0
  19. package/.agent-src/commands/sync-agent-settings.md +9 -0
  20. package/.agent-src/commands/tests-execute.md +2 -3
  21. package/.agent-src/rules/artifact-engagement-recording.md +1 -1
  22. package/.agent-src/rules/augment-portability.md +56 -37
  23. package/.agent-src/rules/chat-history-cadence.md +109 -0
  24. package/.agent-src/rules/chat-history-ownership.md +123 -0
  25. package/.agent-src/rules/chat-history-visibility.md +96 -0
  26. package/.agent-src/rules/cli-output-handling.md +1 -1
  27. package/.agent-src/rules/command-suggestion.md +3 -2
  28. package/.agent-src/rules/commit-policy.md +44 -34
  29. package/.agent-src/rules/direct-answers.md +1 -1
  30. package/.agent-src/rules/language-and-tone.md +19 -15
  31. package/.agent-src/rules/non-destructive-by-default.md +18 -18
  32. package/.agent-src/rules/roadmap-progress-sync.md +133 -74
  33. package/.agent-src/rules/role-mode-adherence.md +1 -1
  34. package/.agent-src/rules/size-enforcement.md +2 -1
  35. package/.agent-src/rules/user-interaction.md +28 -4
  36. package/.agent-src/scripts/update_roadmap_progress.py +56 -4
  37. package/.agent-src/skills/blade-ui/SKILL.md +29 -10
  38. package/.agent-src/skills/command-writing/SKILL.md +15 -4
  39. package/.agent-src/skills/existing-ui-audit/SKILL.md +24 -9
  40. package/.agent-src/skills/fe-design/SKILL.md +20 -15
  41. package/.agent-src/skills/file-editor/SKILL.md +9 -0
  42. package/.agent-src/skills/livewire/SKILL.md +26 -7
  43. package/.agent-src/skills/refine-ticket/SKILL.md +30 -24
  44. package/.agent-src/skills/roadmap-management/SKILL.md +22 -16
  45. package/.agent-src/skills/skill-writing/SKILL.md +3 -3
  46. package/.agent-src/skills/upstream-contribute/SKILL.md +2 -2
  47. package/.agent-src/templates/agent-settings.md +1 -1
  48. package/.agent-src/templates/roadmaps.md +9 -8
  49. package/.agent-src/templates/scripts/memory_lookup.py +1 -1
  50. package/.agent-src/templates/scripts/work_engine/__init__.py +2 -2
  51. package/.agent-src/templates/scripts/work_engine/cli.py +64 -461
  52. package/.agent-src/templates/scripts/work_engine/cli_args.py +116 -0
  53. package/.agent-src/templates/scripts/work_engine/delivery_state.py +3 -3
  54. package/.agent-src/templates/scripts/work_engine/directives/backend/__init__.py +1 -1
  55. package/.agent-src/templates/scripts/work_engine/directives/backend/implement.py +1 -1
  56. package/.agent-src/templates/scripts/work_engine/directives/backend/memory.py +1 -1
  57. package/.agent-src/templates/scripts/work_engine/directives/backend/plan.py +1 -1
  58. package/.agent-src/templates/scripts/work_engine/directives/backend/report.py +1 -1
  59. package/.agent-src/templates/scripts/work_engine/dispatcher.py +1 -1
  60. package/.agent-src/templates/scripts/work_engine/emitters.py +43 -0
  61. package/.agent-src/templates/scripts/work_engine/errors.py +19 -0
  62. package/.agent-src/templates/scripts/work_engine/hook_bootstrap.py +76 -0
  63. package/.agent-src/templates/scripts/work_engine/input_builders.py +163 -0
  64. package/.agent-src/templates/scripts/work_engine/migration/v0_to_v1.py +34 -2
  65. package/.agent-src/templates/scripts/work_engine/persona_policy.py +1 -1
  66. package/.agent-src/templates/scripts/work_engine/resolvers/prompt.py +1 -1
  67. package/.agent-src/templates/scripts/work_engine/state_io.py +202 -0
  68. package/.claude-plugin/marketplace.json +1 -1
  69. package/AGENTS.md +6 -4
  70. package/CHANGELOG.md +83 -8
  71. package/README.md +24 -23
  72. package/docs/MIGRATION.md +122 -0
  73. package/docs/architecture.md +83 -34
  74. package/docs/contracts/STABILITY.md +95 -0
  75. package/docs/contracts/adr-chat-history-split.md +132 -0
  76. package/docs/contracts/adr-command-suggestion.md +146 -0
  77. package/docs/contracts/adr-implement-ticket-runtime.md +122 -0
  78. package/docs/contracts/adr-product-ui-track.md +384 -0
  79. package/docs/contracts/adr-prompt-driven-execution.md +187 -0
  80. package/docs/contracts/agent-memory-contract.md +149 -0
  81. package/docs/contracts/artifact-engagement-flow.md +262 -0
  82. package/docs/contracts/command-clusters.md +126 -0
  83. package/docs/contracts/command-suggestion-flow.md +148 -0
  84. package/docs/contracts/implement-ticket-flow.md +628 -0
  85. package/docs/contracts/linear-ai-rules-inclusion.md +143 -0
  86. package/docs/contracts/linear-ai-three-layers.md +131 -0
  87. package/docs/contracts/rule-interactions.md +107 -0
  88. package/docs/contracts/rule-interactions.yml +142 -0
  89. package/docs/contracts/ui-stack-extension.md +236 -0
  90. package/docs/contracts/ui-track-flow.md +338 -0
  91. package/docs/getting-started.md +2 -2
  92. package/docs/installation.md +42 -6
  93. package/docs/migrations/commands-1.15.0.md +112 -0
  94. package/docs/ui-track-mental-model.md +121 -0
  95. package/package.json +1 -1
  96. package/scripts/build_linear_digest.py +4 -4
  97. package/scripts/check_portability.py +2 -0
  98. package/scripts/check_public_links.py +185 -0
  99. package/scripts/check_references.py +1 -0
  100. package/scripts/lint_no_new_atomic_commands.py +179 -0
  101. package/scripts/lint_rule_interactions.py +149 -0
  102. package/scripts/memory_lookup.py +1 -1
  103. package/scripts/release.py +297 -64
  104. package/scripts/skill_linter.py +14 -0
  105. package/scripts/update_counts.py +10 -0
  106. package/.agent-src/rules/chat-history.md +0 -200
@@ -0,0 +1,628 @@
1
+ ---
2
+ stability: beta
3
+ ---
4
+
5
+ # `/implement-ticket` — Flow Contract
6
+
7
+ > Technical contracts for the delivery orchestrator shipped under
8
+ > [`road-to-implement-ticket.md`](../../agents/roadmaps/road-to-implement-ticket.md).
9
+ > This document is the stable reference; the roadmap tracks phased
10
+ > delivery.
11
+ >
12
+ > - **Created:** 2026-04-22
13
+ > - **Status:** Phase 1 shipped 2026-04-23 — `DeliveryState` +
14
+ > dispatcher live under
15
+ > [`.agent-src.uncompressed/templates/scripts/implement_ticket/`](../../.agent-src.uncompressed/templates/scripts/implement_ticket/).
16
+ > Step wiring (Phase 2) still open. Schema **v1** envelope
17
+ > (`work_engine.state` / `work_engine.migration.v0_to_v1`) shipped
18
+ > 2026-04-27 as R1 Phase 2 — see [State schema v1](#state-schema-v1)
19
+ > below. R1 Phase 6 replay harness shipped 2026-04-28 — see
20
+ > [Replay protocol](#replay-protocol--strict-verb-comparison-r1-phase-6).
21
+ > - **Runtime:** Python 3.10+ (see
22
+ > [`adr-implement-ticket-runtime.md`](adr-implement-ticket-runtime.md)).
23
+ > This doc stays shape-focused; implementation details belong to
24
+ > the code and the ADR.
25
+
26
+ ## What this doc is
27
+
28
+ The **shape** of the flow: states, steps, outcomes, report schema,
29
+ metrics. The roadmap tracks what ships when. The runtime is chosen
30
+ in Phase 0 and recorded in a short ADR once decided.
31
+
32
+ ## What this doc is *not*
33
+
34
+ - A runtime implementation spec.
35
+ - A DSL for user-authored flows (explicitly out of scope).
36
+ - A memory retrieval spec — that lives in
37
+ [`agent-memory-contract.md`](agent-memory-contract.md).
38
+
39
+ ## The linear flow
40
+
41
+ ```
42
+ refine → memory → analyze → plan → implement → test → verify → report
43
+ ```
44
+
45
+ Eight steps, fixed order, no branching. Each step is a thin
46
+ composition over existing skills. `/implement-ticket` adds no new
47
+ logic — it adds **sequencing + state + block semantics**.
48
+
49
+ ## `DeliveryState` — the only shared object
50
+
51
+ Every step reads and writes this shape. No hidden state, no side
52
+ channels. Fields (runtime may be dataclass, typed dict, JSON
53
+ document — the shape is normative, the container is not):
54
+
55
+ | Field | Purpose |
56
+ |---|---|
57
+ | `ticket` | ID, title, body, acceptance criteria, source system |
58
+ | `persona` | Resolved from `.agent-settings.yml` `roles.active_role` |
59
+ | `memory` | Up to 12 hits across four allowed types |
60
+ | `plan` | Structured plan from `feature-plan` |
61
+ | `changes` | List of proposed / applied edits with file:line refs |
62
+ | `tests` | What ran, verdicts, durations |
63
+ | `verify` | `review-changes` verdict + `verify-before-complete` gate |
64
+ | `outcomes` | Per-step `success | blocked | partial` + message |
65
+ | `questions` | Pending numbered questions when blocked |
66
+ | `report` | Final delivery report (populated by `report` step) |
67
+
68
+ No step may invent fields not declared here. Extensions require a
69
+ roadmap amendment + this doc updated.
70
+
71
+ ## State schema v1
72
+
73
+ R1 Phase 2 introduces the **wire-format envelope** that lets the
74
+ engine accept inputs other than tickets in later releases without
75
+ another schema bump. The envelope wraps the legacy slice without
76
+ moving any of its fields:
77
+
78
+ | Field | Type | Purpose |
79
+ |---|---|---|
80
+ | `version` | int (`1`) | Integer schema version. Loader rejects any other value. |
81
+ | `input.kind` | string | Typed input variant. `"ticket"` (R1) and `"prompt"` (R2). |
82
+ | `input.data` | object | Payload. Carries the v0 `ticket` dict verbatim, OR for `kind="prompt"` the prompt envelope (`raw`, `reconstructed_ac`, `assumptions`, optional `confidence`). |
83
+ | `intent` | string | Coarse intent label. Default `"backend-coding"`. |
84
+ | `directive_set` | string | Directive bundle name. One of `backend`, `ui`, `ui-trivial`, `mixed`. Only `backend` is wired in R1/R2. |
85
+ | _legacy slice_ | … | `persona`, `memory`, `plan`, `changes`, `tests`, `verify`, `outcomes`, `questions`, `report` keep their v0 names and meaning. |
86
+
87
+ Canonical filename is `.work-state.json` (was
88
+ `.implement-ticket-state.json` in v0). Field order on disk is fixed
89
+ — envelope first, legacy slice second — so state-snapshot diffs
90
+ across re-runs and across the freeze-guard replay stay readable.
91
+
92
+ The schema is **strict** on the envelope (unknown `input.kind` or
93
+ `directive_set` raise `SchemaError`) and **additive** on top-level
94
+ keys (unknown extras are dropped on load, not re-emitted on dump).
95
+ A reader that pre-dates a future field cannot crash on it, but
96
+ also cannot silently relay it forward without an explicit upgrade.
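
The strict-envelope, additive-extras rule above can be sketched as a small loader. This is a minimal illustration under stated assumptions, not the package's code: `load_state` and the constant names are hypothetical, and only `SchemaError`, the allowed values, and the field order come from the text.

```python
from typing import Any


class SchemaError(ValueError):
    """Envelope violates the v1 schema (class name taken from the doc)."""


# Allowed values and key order as listed in the schema table above.
ALLOWED_KINDS = {"ticket", "prompt"}
ALLOWED_DIRECTIVE_SETS = {"backend", "ui", "ui-trivial", "mixed"}
ENVELOPE_KEYS = ("version", "input", "intent", "directive_set")
LEGACY_KEYS = ("persona", "memory", "plan", "changes", "tests",
               "verify", "outcomes", "questions", "report")


def load_state(payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical loader: strict on the envelope, drops unknown extras."""
    version = payload.get("version")
    if type(version) is not int or version != 1:
        raise SchemaError(f"unsupported version: {version!r}")
    envelope_input = payload.get("input")
    if not isinstance(envelope_input, dict):
        raise SchemaError("input must be an object")
    if envelope_input.get("kind") not in ALLOWED_KINDS:
        raise SchemaError(f"unknown input.kind: {envelope_input.get('kind')!r}")
    if payload.get("directive_set") not in ALLOWED_DIRECTIVE_SETS:
        raise SchemaError(f"unknown directive_set: {payload.get('directive_set')!r}")
    # Envelope first, legacy slice second; any other top-level key is dropped.
    return {k: payload[k] for k in ENVELOPE_KEYS + LEGACY_KEYS if k in payload}
```

A key the reader does not know (for example a field from a later release) is dropped on load, so it never reaches the next dump.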
97
+
98
+ ### Migration v0 → v1
99
+
100
+ A v0 file (no `version` key, ticket under flat `ticket`) is
101
+ upgraded by `work_engine.migration.v0_to_v1`:
102
+
103
+ ```bash
104
+ python3 -m work_engine.migration.v0_to_v1 .implement-ticket-state.json
105
+ ```
106
+
107
+ The migration:
108
+
109
+ 1. Wraps `state.ticket` into `input = {"kind": "ticket", "data": <ticket>}`.
110
+ 2. Fills `intent = "backend-coding"` and `directive_set = "backend"`
111
+ (the only working directive bundle in R1).
112
+ 3. Writes the v1 file as `.work-state.json` next to the source.
113
+ 4. Renames the v0 file to `.implement-ticket-state.json.bak`
114
+ (override with `--no-backup`). If a `.bak` already exists it
115
+ rotates to `.bak.1`, `.bak.2`, … so prior rollback surfaces are
116
+ never overwritten silently. Hard-fails after 999 rotated slots.
117
+ 5. Refuses to overwrite an existing destination — accidental
118
+ double-migration on CI fails loud.
119
+
120
+ End-to-end migration UX (default state-file rename, legacy-file
121
+ detection on load, rollback recipe) lives in the user-facing
122
+ [`docs/MIGRATION.md`](../MIGRATION.md).
123
+
124
+ `migrate_payload` is **idempotent** on v1 input and **rejects** any
125
+ declared `version` other than `0` (absent) or `1`. The library
126
+ contract is covered by `tests/work_engine/test_v0_to_v1_migration.py`,
127
+ which exercises three real Phase 1 baseline snapshots (GT-1
128
+ cycle 1, GT-3 cycle 4, GT-5 cycle 5) so the migrator is proven
129
+ against actual engine output rather than synthetic fixtures.
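
As a rough sketch of the payload-level contract described above (wrap the ticket, fill the defaults, stay idempotent on v1, reject other versions), assuming a plain-dict state. The function name `migrate_payload` appears in the text; the body, the error type, and the key ordering here are assumptions.

```python
import copy
from typing import Any


def migrate_payload(payload: dict[str, Any]) -> dict[str, Any]:
    """Illustrative v0 -> v1 upgrade; not the shipped implementation."""
    version = payload.get("version", 0)  # an absent key is treated as v0
    if version == 1:
        return copy.deepcopy(payload)  # idempotent on v1 input
    if version != 0:
        raise ValueError(f"cannot migrate declared version {version!r}")
    # Everything except the flat ticket keeps its v0 name and meaning.
    legacy = {k: copy.deepcopy(v) for k, v in payload.items()
              if k not in ("ticket", "version")}
    return {
        "version": 1,
        "input": {"kind": "ticket", "data": copy.deepcopy(payload.get("ticket", {}))},
        "intent": "backend-coding",
        "directive_set": "backend",
        **legacy,
    }
```

Running the function twice yields the same document as running it once, which is the property that makes an accidental re-run of the library call harmless.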
130
+
131
+ ## Prompt envelopes and confidence bands (R2)
132
+
133
+ Prompt envelopes (`input.kind="prompt"`) carry a free-form goal
134
+ instead of a refined ticket. The `refine` step routes on shape
135
+ (presence of `raw` key) and on the first pass emits an
136
+ `@agent-directive: refine-prompt` halt — the agent runs the
137
+ matching skill, which reconstructs `acceptance_criteria` +
138
+ `assumptions`. On the rebound, `scoring/confidence.py` produces a
139
+ frozen `ConfidenceScore(band, score, dimensions, reasons,
140
+ ui_intent)` and the dispatcher branches on `band`:
141
+
142
+ | Band | Threshold | Trigger | What the user sees | Release |
143
+ |---|---|---|---|---|
144
+ | `high` | `score ≥ 0.8` | All five rubric dimensions clean; no UI keywords | Silent proceed — reconstructed AC + assumptions land in the delivery report under "Confidence" | None — flows straight into `analyze` |
145
+ | `medium` | `0.5 ≤ score < 0.8` | One or two weak dimensions; assumptions inferred | `PARTIAL` halt with the reconstructed AC + numbered assumptions; user picks `1. Confirm` / `2. Edit assumptions` / `3. Abort` | Agent sets `state.ticket['confidence_confirmed'] = True`; refine re-runs and emits `SUCCESS` |
146
+ | `low` | `score < 0.5` | Multiple weak dimensions or destructive scope | `BLOCKED` halt with **one** clarifying question targeted at the weakest dimension (`ask-when-uncertain` Iron Law) | User answers → agent rewrites the prompt → re-score; **`confidence_confirmed` cannot release low band** |
147
+
148
+ Independently, `ui_intent=True` (UI keyword in `raw`) short-circuits
149
+ into a `BLOCKED` halt with a pointer to `road-to-product-ui-track`
150
+ (R3), regardless of the numeric band.
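
The numeric part of the band table can be sketched as follows. The thresholds are the values in the table; `band_for` and `refine_outcome` are hypothetical names, and the sketch covers only the score thresholds, the `ui_intent` short-circuit, and the `confidence_confirmed` release, not the rubric dimensions themselves.

```python
BAND_HIGH_MIN = 0.8    # threshold from the table above
BAND_MEDIUM_MIN = 0.5  # threshold from the table above


def band_for(score: float) -> str:
    """Map a numeric confidence score to its band."""
    if score >= BAND_HIGH_MIN:
        return "high"
    if score >= BAND_MEDIUM_MIN:
        return "medium"
    return "low"


def refine_outcome(score: float, ui_intent: bool,
                   confidence_confirmed: bool = False) -> str:
    """Hypothetical outcome of the refine step for a prompt envelope."""
    if ui_intent:
        return "blocked"  # UI intent halts regardless of the numeric band
    band = band_for(score)
    if band == "high":
        return "success"
    if band == "medium":
        # Medium is released only after the user confirms the assumptions.
        return "success" if confidence_confirmed else "partial"
    return "blocked"  # confirmation cannot release the low band
```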
151
+
152
+ **AC projection contract.** On every `SUCCESS` exit from
153
+ `_run_prompt`, the dispatcher mirrors `data['reconstructed_ac']`
154
+ into `data['acceptance_criteria']`. Downstream gates (`analyze`,
155
+ `plan`, `implement`) read the legacy slot and stay shape-agnostic
156
+ about the upstream input variant. The mirror is a **list copy**, not
157
+ a reference share — mutating the legacy slot does not corrupt the
158
+ prompt slot, and vice versa. Regression locked by
159
+ `test_refine_prompt_dispatch.py::TestHighBand::test_mirrors_reconstructed_ac_to_acceptance_criteria`
160
+ plus the parallel medium-confirmed assertion.
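
The list-copy requirement in the AC projection contract is easy to get wrong, so here is a minimal sketch of it. The helper name `project_acceptance_criteria` is an assumption; the two dictionary keys are the ones named above.

```python
def project_acceptance_criteria(data: dict) -> None:
    """Mirror the reconstructed AC into the legacy slot as an independent list."""
    # list(...) creates a new list object, so later appends or removals on
    # one slot do not show up in the other.
    data["acceptance_criteria"] = list(data["reconstructed_ac"])
```

Assigning `data["reconstructed_ac"]` directly would share one list between both slots, which is the reference share the contract rules out.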
161
+
162
+ **Refreshing band thresholds.** `BAND_HIGH_MIN = 0.8` and
163
+ `BAND_MEDIUM_MIN = 0.5` are module constants in
164
+ `scripts/work_engine/scoring/confidence.py`. The skill, the ADR
165
+ (`adr-prompt-driven-execution.md`, R2 Phase 6), and this doc cite
166
+ that module — there is no second source of truth. Tuning thresholds
167
+ therefore requires:
168
+
169
+ 1. Update the constants in `confidence.py`.
170
+ 2. Re-run the per-dimension fixtures
171
+ (`tests/work_engine/test_scoring_confidence.py`).
172
+ 3. Re-capture GT-P1..GT-P4 if a fixture's band assignment shifts
173
+ (`python3 -m tests.golden.capture`); review the diff before
174
+ locking. If GT-P3 (low) or GT-P4 (UI rejection) flip bands the
175
+ threshold change is rejected — those fixtures are pinned to
176
+ their bands by design.
177
+ 4. Update the band-threshold table above.
178
+
179
+ ## `Step` contract
180
+
181
+ Each step is a function over `DeliveryState` that returns one of:
182
+
183
+ - `success` — populates its slice, continues to the next step.
184
+ - `blocked` — populates `questions` with numbered options, stops
185
+ the flow. Orchestrator emits the questions and exits.
186
+ - `partial` — populates its slice AND `questions`; user chooses to
187
+ continue or stop. Orchestrator asks explicitly.
188
+
189
+ Steps **must** declare, in their skill frontmatter, the
190
+ ambiguities they surface — so `blocked` never comes as a surprise.
191
+
192
+ ## Agent directives
193
+
194
+ Some steps cannot run from pure Python — `implement` performs edits,
195
+ `test` and `verify` drive long-running subprocesses, `report` only
196
+ renders once all prior slices are populated. These steps halt with
197
+ `blocked` and carry an **agent directive** as the first entry of
198
+ `questions`:
199
+
200
+ ```
201
+ questions = [
202
+ "@agent-directive: implement-plan", # index 0 — agent-facing
203
+ "> Ticket TICKET-42 — 3 files touched, plan in state.plan.",
204
+ "> 1. Continue — changes applied per plan",
205
+ "> 2. Abort — plan is wrong",
206
+ ]
207
+ ```
208
+
209
+ Contract:
210
+
211
+ - The directive **is always** `questions[0]` when present.
212
+ - The prefix `@agent-directive:` is public contract — changing it is
213
+ a breaking change.
214
+ - The directive verb (e.g. `implement-plan`, `run-tests`) names the
215
+ skill or command the agent should invoke next.
216
+ - Optional `key=value` pairs follow the verb on the same line. Rich
217
+ payloads belong on `DeliveryState`, not in the directive.
218
+ - Numbered user options follow the directive and behave exactly per
219
+ the `user-interaction` rule — the user still decides after the
220
+ agent reports back.
221
+
222
+ Helpers `agent_directive(name, **payload)` and
223
+ `is_agent_directive(line)` live alongside `DeliveryState` in the
224
+ `implement_ticket` package.
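
A possible shape for the two helpers, assuming the `key=value` pairs are space-separated after the verb. The signatures `agent_directive(name, **payload)` and `is_agent_directive(line)` come from the text; the bodies and the constant name are illustrative only.

```python
AGENT_DIRECTIVE_PREFIX = "@agent-directive:"  # public contract per the doc


def agent_directive(name: str, **payload: object) -> str:
    """Build a directive line: prefix, verb, then optional key=value pairs."""
    pairs = " ".join(f"{key}={value}" for key, value in payload.items())
    line = f"{AGENT_DIRECTIVE_PREFIX} {name}"
    return f"{line} {pairs}" if pairs else line


def is_agent_directive(line: str) -> bool:
    """True when the line starts with the directive prefix."""
    return line.startswith(AGENT_DIRECTIVE_PREFIX)
```

With these, an orchestrator would check `is_agent_directive(questions[0])` before showing the remaining entries to the user.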
225
+
226
+ ## Resume semantics
227
+
228
+ The dispatcher is idempotent on already-completed steps. When a
229
+ step's name is already marked `success` in `state.outcomes`, the
230
+ dispatcher **skips** it and continues. This is how Option-A
231
+ delegation works end-to-end:
232
+
233
+ 1. Dispatcher runs until an agent-directive step halts with
234
+ `blocked`.
235
+ 2. Orchestrator reads the directive, invokes the matching skill,
236
+ captures the result onto the matching `DeliveryState` slice
237
+ (`state.changes`, `state.tests`, etc.).
238
+ 3. Orchestrator sets `state.outcomes[step] = "success"` and
239
+ re-invokes `dispatch(state, steps)`.
240
+ 4. Dispatcher skips completed steps and resumes at the first one
241
+ still pending.
242
+
243
+ Only the exact string `"success"` triggers the skip. A `"blocked"`
244
+ or `"partial"` marker from a prior run **reruns** the step so the
245
+ current state is re-evaluated rather than trusting stale evidence.
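
The skip rule can be illustrated with a toy dispatcher. The call shape `dispatch(state, steps)` is taken from the text; representing steps as `(name, callable)` pairs and state as a plain dict is an assumption made for brevity.

```python
from typing import Callable

Step = Callable[[dict], str]  # returns "success", "blocked", or "partial"


def dispatch(state: dict, steps: list[tuple[str, Step]]) -> str:
    """Run steps in order, skipping any already marked exactly "success"."""
    outcomes = state.setdefault("outcomes", {})
    for name, step in steps:
        if outcomes.get(name) == "success":
            continue  # completed in an earlier run
        # "blocked" / "partial" markers from a prior run fall through here,
        # so the step is re-evaluated against the current state.
        outcome = step(state)
        outcomes[name] = outcome
        if outcome != "success":
            return outcome  # halt; the orchestrator takes over
    return "success"
```

After a halt, the orchestrator would write the captured result into the state, set `state["outcomes"][step] = "success"`, and call `dispatch` again.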
246
+
247
+ ## Hook lifecycle (side-channel)
248
+
249
+ The dispatcher and CLI emit lifecycle hooks at fixed points
250
+ (`before_step`, `after_step`, `on_halt`, `on_error` on the
251
+ dispatcher side; `before_load`, `after_load`, `before_dispatch`,
252
+ `after_dispatch`, `before_save`, `after_save` on the CLI side).
253
+ Hooks are **observers** — they may halt the engine via `HookHalt`
254
+ but never replace step logic. Default-off (`hooks.enabled: false`);
255
+ golden-replay flows stay byte-stable when hooks do not register.
256
+ Built-in hooks cover trace, halt-surface audit, state-shape
257
+ validation, directive-set guard, and the four chat-history
258
+ boundary writes. Full reference: [`work-engine-hooks.md`](work-engine-hooks.md).
259
+
260
+ ## Memory retrieval contract
261
+
262
+ Bounded per the top-level roadmap rule:
263
+
264
+ - **Max 12 hits total.**
265
+ - **Four allowed types:** `domain-invariants`,
266
+ `architecture-decisions`, `incident-learnings`,
267
+ `historical-patterns`. All four exist in the
268
+ [templates directory](../../.agent-src.uncompressed/templates/agents/memory/).
269
+ - **Keys:** files touched by the plan, symbols referenced by the
270
+ ticket.
271
+ - **Decision-change rule:** a memory hit that did not change an
272
+ outcome is dropped from the report. If none changed an outcome,
273
+ the `memory` section of the report is omitted — not padded.
274
+ - Follows the retrieval shape in
275
+ [`agent-memory-contract.md`](agent-memory-contract.md).
276
+
277
+ ## Persona policies
278
+
279
+ Read from `.agent-settings.yml` `roles.active_role` and resolved
280
+ via `resolve_policy()` in
281
+ [`persona_policy.py`](../../.agent-src.uncompressed/templates/scripts/implement_ticket/persona_policy.py).
282
+ Policies live alongside the dispatcher so the flow can consume
283
+ them directly; the shared
284
+ [`role-contracts`](../../.agent-src.uncompressed/guidelines/agent-infra/role-contracts.md)
285
+ guideline remains the source of truth for persona behaviour in the
286
+ wider agent surface.
287
+
288
+ Three personas ship today:
289
+
290
+ | Persona | `allows_implement` | `allows_test` | `allows_verify` | `widen_tests` | `suggests_next_commands` |
291
+ |---|:-:|:-:|:-:|:-:|:-:|
292
+ | `senior-engineer` (default) | ✅ | ✅ | ✅ | ❌ | ✅ |
293
+ | `qa` | ✅ | ✅ | ✅ | ✅ | ✅ |
294
+ | `advisory` | ❌ | ❌ | ❌ | ❌ | ❌ |
295
+
296
+ Behaviour:
297
+
298
+ - `senior-engineer` — runs every step, targeted tests, full
299
+ delivery report with `/commit` + `/create-pr` suggestions when
300
+ verify succeeds.
301
+ - `qa` — identical to `senior-engineer` except the `run-tests`
302
+ directive carries `scope=full` so regressions outside the
303
+ changed paths are caught.
304
+ - `advisory` — plan-only mode. `implement`, `test`, and `verify`
305
+ short-circuit to SUCCESS without work; the delivery report
306
+ renders without a "Suggested next commands" section because
307
+ nothing was changed.
308
+
309
+ Unknown persona names fall back to `senior-engineer`. The policy
310
+ is frozen and cached per name — step handlers can call
311
+ `resolve_policy(state.persona)` as often as they need.
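
One way to get the "frozen and cached per name" behaviour is a frozen dataclass behind `functools.lru_cache`. The flag names and values mirror the persona table above and `resolve_policy` is named in the text; the class name `PersonaPolicy` and the lookup table are assumptions.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class PersonaPolicy:
    allows_implement: bool
    allows_test: bool
    allows_verify: bool
    widen_tests: bool
    suggests_next_commands: bool


# Values copied from the persona table above.
_POLICIES = {
    "senior-engineer": PersonaPolicy(True, True, True, False, True),
    "qa": PersonaPolicy(True, True, True, True, True),
    "advisory": PersonaPolicy(False, False, False, False, False),
}


@lru_cache(maxsize=None)
def resolve_policy(name: str) -> PersonaPolicy:
    """Return the policy for a persona; unknown names fall back to the default."""
    return _POLICIES.get(name, _POLICIES["senior-engineer"])
```

Because the dataclass is frozen, a step handler that tries to flip a flag gets an exception instead of silently changing the policy for later steps.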
312
+
313
+ **No CLI flag overrides `/mode`.** That is the whole point of the
314
+ session-global persona contract — one source, readable by any
315
+ skill.
316
+
317
+ ## Block-on-ambiguity semantics
318
+
319
+ When a step returns `blocked`, the orchestrator:
320
+
321
+ 1. Emits the numbered questions per
322
+ [`user-interaction`](../../.agent-src.uncompressed/rules/user-interaction.md).
323
+ 2. Writes a partial report up to the last successful step.
324
+ 3. Exits with a `blocked` status — no guess, no fallback.
325
+
326
+ Resuming is not automatic. The user answers and re-invokes
327
+ `/implement-ticket` (or a follow-up command) with the answer in
328
+ the context. V1 explicitly does **not** attempt resumable sessions.
329
+
330
+ ### Declared ambiguity surfaces
331
+
332
+ Every step declares — in code — the conditions under which it
333
+ can return `blocked`. The declarations live as module-level
334
+ `AMBIGUITIES` tuples (see
335
+ [`directives/backend/__init__.py`](../../.agent-src.uncompressed/templates/scripts/work_engine/directives/backend/__init__.py)
336
+ `.all_ambiguities()`). The
337
+ [`test_ambiguity_coverage.py`](../../tests/implement_ticket/test_ambiguity_coverage.py)
338
+ suite locks the contract: adding a new `blocked` path without
339
+ declaring it fails the build.
340
+
341
+ | Step | Codes | Shape |
342
+ |---|---|---|
343
+ | `refine` | `missing_id`, `trivial_title`, `missing_or_vague_ac` | deterministic gate |
344
+ | `memory` | — | always succeeds (zero hits is valid) |
345
+ | `analyze` | `upstream_refine_failed`, `upstream_memory_failed`, `lost_ac` | deterministic gate |
346
+ | `plan` | `upstream_analyze_failed`, `empty_plan_delegate`, `malformed_plan` | delegation gate |
347
+ | `implement` | `upstream_plan_failed`, `empty_changes_delegate`, `malformed_changes` | delegation gate |
348
+ | `test` | `upstream_implement_failed`, `empty_tests_delegate`, `malformed_tests`, `bad_test_verdict` | delegation gate |
349
+ | `verify` | `upstream_test_failed`, `empty_verify_delegate`, `malformed_verify`, `bad_verify_verdict` | delegation gate |
350
+ | `report` | — | pure renderer, always succeeds |
351
+
352
+ Delegation-gate `empty_*_delegate` codes emit an `@agent-directive:`
353
+ so the orchestrator runs the matching skill (`feature-plan`,
354
+ `apply-plan`, `run-tests`, `review-changes`) and resumes. All
355
+ other codes halt the flow with numbered options for the user.
356
+
357
+ ## Delivery report schema
358
+
359
+ A copyable markdown block with fixed sections (any section may be
360
+ empty, but all headings are present unless explicitly marked
361
+ *droppable*):
362
+
363
+ 1. **Ticket** — one-line restatement.
364
+ 2. **Persona** — active role + policy summary.
365
+ 3. **Plan** — ordered steps actually executed.
366
+ 4. **Changes** — files, line ranges, one-sentence purpose each.
367
+ 5. **Tests** — what ran, verdicts, durations.
368
+ 6. **Verify** — review verdict + confidence level.
369
+ 7. **Memory that mattered** *(droppable)* — only hits that changed
370
+ an outcome. When no hit carries a `changed_outcome` marker the
371
+ entire section (heading included) is omitted so the reader
372
+ doesn't mistake "nothing influential" for "memory is broken".
373
+ 8. **Follow-ups** — deferred work, with file:line anchors.
374
+ 9. **Suggested next commands** *(droppable)* — `/commit`,
375
+ `/create-pr`, etc. Never run automatically. Advisory personas
376
+ produce a plan-only report and omit this section entirely
377
+ because nothing was changed.
378
+
379
+ Implementation: see
380
+ [`directives/backend/report.py`](../../.agent-src.uncompressed/templates/scripts/work_engine/directives/backend/report.py).
381
+ Section renderers are pure and deterministic; consumers can rely
382
+ on the heading order and on each section either rendering with
383
+ content or being omitted per the rules above.
384
+
385
+ ## Metrics
386
+
387
+ Emitted as structured JSON to a local log (location chosen in
388
+ Phase 0 spike) so the roadmap's metrics anchor (Q38) can be
389
+ measured without instrumentation sprawl:
390
+
391
+ - `time_to_verified_change_ms`
392
+ - `block_rate` (per 20-run window)
393
+ - `memory_decision_rate`
394
+ - `repeat_user_runs_per_week`
395
+ - `report_rejections`
396
+
397
+ ## Capture protocol — Golden Transcripts (R1 Phase 1)
398
+
399
+ The Universal Execution Engine roadmap (`R1`) freezes the engine's
400
+ observable behaviour before any refactor. The artefact that holds
401
+ that freeze is the **Capture Pack** under
402
+ `tests/golden/baseline/GT-{1..5}/`. This section is the operator
403
+ manual for producing and re-producing those packs.
404
+
405
+ ### Scenarios
406
+
407
+ | GT | Surface locked | Cycles |
408
+ |-----|------------------------------------------|--------|
409
+ | 1 | happy path (plan→apply→tests→review→report) | 5 |
410
+ | 2 | refine-step ambiguity halt (vague AC) | 1 |
411
+ | 3 | run-tests failed verdict + recovery | 6 |
412
+ | 4 | advisory persona — plan-only delivery | 2 |
413
+ | 5 | state-resume from disk between cycles | 5 |
414
+
415
+ ### Inputs
416
+
417
+ - Toy domain: `tests/golden/sandbox/repo/` — a 4-function
418
+ calculator (`add`, `subtract`, `power`-stub, `divide`) plus a
419
+ pytest config and tests. Deterministic, no I/O.
420
+ - Ticket fixtures: `tests/golden/sandbox/tickets/gt-{1..5}-*.json`.
421
+ Schema matches `implement_ticket`'s `ticket_loader`.
422
+ - Recipes: `tests/golden/sandbox/recipes/gt{1..5}_*.py`. Each
423
+ exposes `META` (gt_id, ticket fixture, persona, cycle cap) and
424
+ `build_recipe(workspace) -> {directive_verb: callable}`. The
425
+ recipe is the deterministic stand-in for the agent: every halt
426
+ is resolved by hard-coded edits + state-mutations.
427
+
428
+ ### Invocation
429
+
430
+ Each cycle is a fresh `./agent-config implement-ticket` subprocess
431
+ seeded from the persisted state file. The runner
432
+ (`tests/golden/sandbox/runner.py`) chains them:
433
+
434
+ ```bash
435
+ ./agent-config implement-ticket \
436
+ --ticket-file tests/golden/sandbox/tickets/gt-1-happy.json \
437
+ --state-file <workspace>/.agent-state/implement-ticket.json \
438
+ --workspace <workspace> \
439
+ --output-format json
440
+ # subsequent cycles drop --ticket-file; the engine loads the
441
+ # ticket from the saved state.
442
+ ```
443
+
444
+ The runner is invoked via the capture driver:
445
+
446
+ ```bash
447
+ python3 -m tests.golden.capture # all five GTs
448
+ python3 -m tests.golden.capture --scenarios GT-3
449
+ ```
450
+
451
+ ### Kill points & resume
452
+
453
+ The runner re-executes the engine on every cycle, so resume from
454
+ disk is exercised by **every** GT — not just GT-5. GT-5 simply
455
+ records the contract under a different operation (negate vs.
456
+ multiply) so byte-equal regression detection covers an additional
457
+ state shape. There is no "two-segment" runner mode; the segmentation
458
+ is implicit in the per-cycle subprocess fork.
459
+
460
+ ### Capture Pack layout
461
+
462
+ ```
463
+ tests/golden/baseline/GT-N/
464
+ ├── transcript.json # per-cycle stdout/stderr + exit codes
465
+ ├── state-snapshots/ # state file after each cycle (cycle-NN.json)
466
+ ├── halt-markers.json # extracted directives + numbered questions
467
+ ├── exit-codes.json # per-cycle exit codes only
468
+ ├── delivery-report.md # final report (or stub if flow halted)
469
+ └── fixture/ # frozen copy of the input ticket
470
+ ```
471
+
472
+ The driver also writes `tests/golden/baseline/summary.json` (one
473
+ row per GT: outcome, exit code, cycle count) and
474
+ `tests/golden/CHECKSUMS.txt` (sorted SHA256 of every file under
475
+ `tests/golden/baseline/` plus the input fixtures). Regeneration
476
+ recipe and relock policy: [`tests/golden/CAPTURING.md`](../../tests/golden/CAPTURING.md).
477
+
478
+ ### Determinism guarantees
479
+
480
+ - `PYTHONHASHSEED=0`, `PYTHONIOENCODING=utf-8`,
481
+ `LC_ALL=C.UTF-8`, `NO_COLOR=1` injected by the runner.
482
+ - Workspace is a fresh `tempfile.TemporaryDirectory` per scenario;
483
+ the toy repo is materialised into it before cycle 1.
484
+ - `agents/memory/` lookups resolve relative to the workspace, so
485
+ every run sees zero curated entries — no host-state leakage.
486
+ - Recipes never read the clock, the network, or unbound randomness.
487
+ - pytest verdict normalisation lives in
488
+ `tests/golden/sandbox/recipes/_helpers.py::run_pytest`
489
+ (exit 0 → success, exit 1/2 → failed, otherwise → mixed).
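
The verdict mapping in the last bullet is small enough to state directly. This sketch covers only the exit-code mapping; the function name is hypothetical and the real `run_pytest` helper presumably also launches the subprocess.

```python
def normalise_pytest_exit(code: int) -> str:
    """Map a pytest exit code to a verdict string, per the rule above."""
    if code == 0:
        return "success"
    if code in (1, 2):
        return "failed"
    return "mixed"  # every other exit code
```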
490
+
491
+ ### Regenerating the baseline
492
+
493
+ Only when the engine's observable behaviour intentionally changes:
494
+
495
+ ```bash
496
+ python3 -m tests.golden.capture
497
+ git diff tests/golden/baseline tests/golden/CHECKSUMS.txt
498
+ ```
499
+
500
+ Review the diff; it should match the documented behavioural change
501
+ in this file's revision history. Then commit. Drive-by changes to
502
+ the baseline are blocked by the freeze-guard CI workflow (added in
503
+ Phase 1 Step 7).
504
+
505
+ ### Anti-patterns
+
+ - Editing a Capture Pack file by hand. The pack is generated; edit
+   the engine or the recipe instead.
+ - Adding a sixth GT without amending the table above and the Phase-6
+   replay harness in lock-step.
+ - Reading from `agents/memory/` in a recipe. Recipes seed state
+   directly; memory belongs to the engine under test, not the test.
+ - Letting the `_helpers.run_pytest` verdict mapping drift from the
+   engine's `state.tests.verdict` contract — they are coupled.
+
+ ## Replay protocol — Strict-Verb comparison (R1 Phase 6)
+
+ The Capture Pack alone is a frozen artefact; the **replay harness**
+ under [`tests/golden/harness.py`](../../tests/golden/harness.py) is what
+ turns that artefact into a continuous behavioural contract. It loads
+ each baseline, drives the same recipe against the *live* `work_engine`,
+ and reports structural drift. Every PR that touches the engine,
+ recipes, or runner must pass this gate.
+
+ ### What is locked vs. what may drift
+
+ The harness uses **Strict-Verb** comparison: the *shape* and *semantic
+ verbs* are normative; free-text wording inside that shape may drift.
+
+ | Surface | Comparison rule | Locked | May drift |
+ |---|---|---|---|
+ | Exit code per cycle | exact equality | the integer | — |
+ | `recipe_action` per cycle | exact equality | the action string | — |
+ | State snapshot per cycle | recursive *structure* match | key names, types, list lengths | leaf string contents |
+ | Halt-marker `questions` list | Strict-Verb classification | line count, per-line class (`directive` / `numbered` / `blockquote` / `text`), `@agent-directive:` verb identity, count of `> N.` options | wording after the verb, prose inside blockquotes, descriptive text after `> N. …` |
+ | Delivery report | ordered `^## ` heading list | section presence and order | section *bodies* |
+ | Manifest (`CHECKSUMS.txt`) | byte equality after path normalisation | every checksum | — |
+
+ The contract is intentionally tighter on *control surfaces* (verbs,
+ exit codes, state shape, headings, checksums) than on *free-text
+ fields* (numbered-option labels, report bodies, leaf strings). Refactors
+ that rename a field, drop an option, change a directive verb, or swap
+ an exit code FAIL the gate. Refactors that polish a description string
+ PASS — and that is the point.
+
+ ### Where the gate runs
+
+ - `task golden-replay` — local, named entry point, sub-second.
+   Invoked from `task ci` *before* `task test` for failure-first
+   ordering (Phase 6 Step 3).
+ - `.github/workflows/tests.yml` step **"Golden Replay (R1 engine
+   refactor freeze-guard)"** — runs before the full pytest sweep so
+   drift surfaces as a named PR check, not a buried test name.
+ - `.github/workflows/freeze-guard.yml` — independent integrity gate:
+   `manifest-integrity` re-runs `sha256sum -c CHECKSUMS.txt`,
+   `live-replay` re-runs the capture driver and diffs the manifest.
+   This catches drift the harness can't see (e.g. silent baseline
+   edits without engine changes).
+
+ The harness and freeze-guard are intentionally redundant. The harness
+ proves *engine behaviour matches baseline*; freeze-guard proves
+ *baseline matches what was committed*. A failure in either one calls
+ for review.
+
+ ### Refreshing the baseline
+
+ Refresh only when an engine change is **intentionally** behaviour-altering.
+ The PR description must justify each new checksum. Procedure:
+
+ ```bash
+ python3 -m tests.golden.capture # regenerate Capture Packs + manifest
+ git diff tests/golden/baseline tests/golden/CHECKSUMS.txt
+ ```
+
+ Then in the PR:
+
+ 1. Mention the rationale — which roadmap step / ADR drove the change,
+    what observable surface moved.
+ 2. Show the per-GT diff summary (which scenarios re-locked, which
+    stayed byte-equal).
+ 3. Update this section's revision history if a *contract column*
+    above moved (e.g. a new locked surface, a new drift-tolerated
+    field).
+
+ Drive-by baseline edits — even one-character whitespace tweaks —
+ are blocked by `freeze-guard.yml::manifest-integrity` at PR time.
+
+ ### Anti-patterns (replay)
+
+ - Loosening the harness comparator to "fix" a failing replay. The
+   harness is the contract; the engine is the variable.
+ - Re-running `python3 -m tests.golden.capture` to "make the diff go
+   away" without justification. The diff *is* the question.
+ - Treating the harness and freeze-guard as duplicate checks. They
+   catch different drift classes — both must stay green.
+ - Adding a sixth Capture Pack without adding a corresponding entry
+   to `RECIPE_MODULES` in `harness.py` and a parametrize row in
+   `test_replay.py`.
+
+ ## Non-goals
+
+ - Auto-commit / auto-push / auto-PR (belongs to `/commit`,
+   `/create-pr`).
+ - Multi-repo orchestration.
+ - User-authored custom flows.
+ - Parallel step execution.
+ - New memory types or retrieval shapes.
+
+ ## Revisit triggers
+
+ - First 10 real runs show `block_rate < 10%` → loosen
+   block-on-ambiguity (too timid).
+ - First 10 real runs show `block_rate > 60%` → tighten step
+   declarations (ambiguity noise).
+ - A second flow (`/implement-bug`) is proposed → amend this doc to
+   hoist shared contracts BEFORE drafting the second flow.
+
+ ## See also
+
+ - [`agents/roadmaps/road-to-implement-ticket.md`](../../agents/roadmaps/road-to-implement-ticket.md)
+ - [`agents/roadmaps/road-to-universal-execution-engine.md`](../../agents/roadmaps/road-to-universal-execution-engine.md)
+ - `tests/golden/` — capture sandbox, recipes, and Capture Packs
+ - [`../../tests/golden/harness.py`](../../tests/golden/harness.py) — Strict-Verb replay harness
+ - [`../../.github/workflows/freeze-guard.yml`](../../.github/workflows/freeze-guard.yml) — manifest-integrity + live-replay gates
+ - [`agent-memory-contract.md`](agent-memory-contract.md)
+ - [`../../.agent-src.uncompressed/guidelines/agent-infra/role-contracts.md`](../../.agent-src.uncompressed/guidelines/agent-infra/role-contracts.md)
+ - [`../../.agent-src.uncompressed/rules/user-interaction.md`](../../.agent-src.uncompressed/rules/user-interaction.md)
+ - [`../../.agent-src.uncompressed/rules/scope-control.md`](../../.agent-src.uncompressed/rules/scope-control.md)
+ - [`../../.agent-src.uncompressed/rules/minimal-safe-diff.md`](../../.agent-src.uncompressed/rules/minimal-safe-diff.md)