pi-goal-x 0.6.0

@@ -0,0 +1,764 @@
1
+ # PRD: Agentic Goal Runtime
2
+
3
+ ## 1. Summary
4
+
5
+ `pi-goal` currently behaves like a growing workflow state machine: drafting gates, focus gates, continuation gates, tool visibility gates, lifecycle gates, completion auditing, compaction recovery, and multi-goal focus rules all interact inside the extension runtime. This has improved safety, but it also increases hidden coupling and makes the system brittle as more multi-agent behavior is added.
6
+
7
+ This PRD proposes a shift toward an agentic runtime modeled after `pi-autoresearch`: keep durable facts in append-only artifacts, use deterministic reconstruction for long-running context, and move most strategy into prompts, skills, and independent reviewer agents. The extension should enforce only the small set of invariants that protect irreversible state transitions, user ownership, path safety, and audit integrity.
8
+
9
+ The new system should feel less like a rigid state machine and more like an agent-operated workspace with reliable transaction logs and semantic review.
10
+
11
+ ## 2. Problem Statement
12
+
13
+ The current implementation has accumulated many hard runtime constraints that try to prevent the assistant from making mistakes at every stage. Examples include:
14
+
15
+ - Drafting requires a question-like tool before `propose_goal_draft`.
16
+ - Drafting blocks workhorse tools such as `read`, `bash`, `grep`, `find`, `write`, and `edit`.
17
+ - Active goals block drafting/question tools.
18
+ - Repeated `get_goal` calls can be blocked.
19
+ - Tool availability is repeatedly synchronized based on fine-grained lifecycle phase.
20
+ - Prompt text often references runtime gates, and tests assert exact rejection strings.
21
+
22
+ These constraints were added in response to real failures, but they create several product and engineering problems:
23
+
24
+ 1. **Complexity leaks into every feature.** A new lifecycle feature must reason about drafting, focus, continuation, compaction, post-stop behavior, and tool gating.
25
+ 2. **Agent behavior becomes over-constrained.** The model cannot use reasonable judgment in edge cases, such as doing minimal reconnaissance before drafting a better goal.
26
+ 3. **Tests encode machinery, not outcomes.** Experiments increasingly verify that the runtime blocked specific tool calls rather than verifying that the final goal behavior was correct.
27
+ 4. **State source of truth is fragmented.** Current state is reconstructed from active markdown files, focus entries, session entries, and runtime memory.
28
+ 5. **Multi-agent cooperation is bolted on.** The independent auditor is a good direction, but its results should become durable context rather than a one-off tool response.
29
+ 6. **Compaction resilience remains prompt-heavy.** Long-running goal context should be deterministic and artifact-backed, not dependent on chat history or LLM summarization.
30
+
31
+ ## 3. Product Goals
32
+
33
+ ### 3.1 Primary Goals
34
+
35
+ - Convert `pi-goal` from a hard state-machine-first system into an agentic lifecycle system.
36
+ - Introduce an append-only goal ledger as the durable factual record of lifecycle events.
37
+ - Use deterministic goal summaries for compaction and session recovery.
38
+ - Move strategy and behavioral guidance from runtime gates into prompts, skills, and reviewer agents.
39
+ - Preserve safety for irreversible transitions: goal creation, focus ownership, completion, archive/delete, stale continuations, and file path safety.
40
+ - Make independent auditor feedback durable and actionable across future turns.
41
+ - Reduce hidden coupling in `extensions/goal.ts` and make future behavior easier to evolve.
42
+
43
+ ### 3.2 Secondary Goals
44
+
45
+ - Make experiments evaluate outcomes instead of exact gate mechanics.
46
+ - Support future goal-specialized skills such as `goal-draft`, `goal-execute`, `goal-sisyphus`, and `goal-finalize`.
47
+ - Allow minimal agent reconnaissance during drafting when it improves the goal contract, without starting substantive execution before user confirmation.
48
+ - Improve debuggability: every important lifecycle transition should be inspectable in a log.
49
+ - Preserve backward compatibility with existing `.pi/goals/active_goal_*.md` files.
50
+
51
+ ## 4. Non-Goals
52
+
53
+ - Do not remove user confirmation before goal creation.
54
+ - Do not allow hidden/direct `create_goal` as a normal creation path.
55
+ - Do not remove the independent completion auditor.
56
+ - Do not let agents mark goals complete without auditor approval.
57
+ - Do not let agents autonomously switch human-owned focus.
58
+ - Do not reintroduce hard token-cost control or auto-continue-cap gates without explicit product approval.
59
+ - Do not redesign the entire pi extension framework.
60
+ - Do not require users to migrate existing goal files manually.
61
+ - Do not implement a separate Sisyphus step counter; Sisyphus remains prompt/criteria style.
62
+
63
+ ## 5. Users and Use Cases
64
+
65
+ ### 5.1 Primary User
66
+
67
+ A developer using pi for long-running coding, research, writing, release, or maintenance work. They want the assistant to keep track of goals durably without becoming brittle or bureaucratic.
68
+
69
+ ### 5.2 Secondary Users
70
+
71
+ - Extension maintainer reviewing behavior and debugging failures.
72
+ - Future skills or subagents that need stable goal context.
73
+ - Independent auditor agents evaluating completion quality.
74
+ - Experiment harnesses validating goal behavior.
75
+
76
+ ### 5.3 Core Use Cases
77
+
78
+ 1. **Start a goal from a clear request.** The agent drafts a concrete goal, shows a confirmation dialog, and starts work after user confirmation.
79
+ 2. **Start a goal from a vague request.** The agent asks clarifying questions, proposes a goal when clear enough, and avoids premature execution.
80
+ 3. **Continue a long-running goal.** The agent reads deterministic goal context and chooses the next concrete action.
81
+ 4. **Recover after compaction.** The agent resumes from ledger and deterministic summary, not from fragile chat memory.
82
+ 5. **Handle multiple open goals.** The user owns focus selection; the agent does not silently switch targets.
83
+ 6. **Pause or abort safely.** The lifecycle transition is logged and future prompts show why it happened.
84
+ 7. **Attempt completion.** The agent requests completion; an independent auditor approves or rejects; the result is durable context.
85
+ 8. **React to auditor rejection.** Future turns include the auditor objections and guide the agent to address them before retrying.
86
+
87
+ ## 6. Current State Overview
88
+
89
+ The current runtime uses several state sources and gates:
90
+
91
+ - Active goal markdown files in `.pi/goals/active_goal_*.md`.
92
+ - Archived files in `.pi/goals/archived/`.
93
+ - Session focus entries via `pi-goal-focus`.
94
+ - In-memory `goalsById`, `focusedGoalId`, `confirmationIntent`, `tweakDraftingFor`, `runningGoalId`, continuation queues, and accounting state.
95
+ - Tool visibility synchronization through `syncGoalTools()`.
96
+ - Runtime tool-call blocking for lifecycle transactions, post-stop same-turn behavior, stale continuation prompts, and strict completion auditing; goal confirmation is mostly prompt-guided.
97
+ - Independent completion auditor in `extensions/goal-auditor.ts`.
98
+ - Prompt guidance in `extensions/prompts/goal-prompts.ts`.
99
+
100
+ The system is safe but increasingly hard to reason about because many behaviors are encoded twice: once in prompt text and again as hard runtime gates.
101
+
102
+ ## 7. Desired Product Behavior
103
+
104
+ ### 7.1 Agentic Contract
105
+
106
+ The agent should be treated as the primary planner and executor. The extension should provide durable facts, transaction tools, prompts, and independent review, but should not micromanage ordinary reasoning steps.
107
+
108
+ The runtime should answer:
109
+
110
+ - What goals exist?
111
+ - Which goal is focused?
112
+ - What lifecycle events happened?
113
+ - What transaction is allowed right now?
114
+ - What irreversible action needs confirmation or review?
115
+
116
+ The agent should decide:
117
+
118
+ - Whether to ask a clarifying question.
119
+ - Whether minimal reconnaissance is needed to draft a better goal.
120
+ - What next action advances the goal.
121
+ - Whether the goal is plausibly complete enough to request auditing.
122
+ - How to respond to auditor feedback.
123
+
124
+ ### 7.2 Hard Invariants
125
+
126
+ The following behaviors remain runtime-enforced:
127
+
128
+ 1. **User-confirmed discussion creation.** A durable goal from `/goals` or `/sisyphus` is created only after `propose_goal_draft` confirmation.
129
+ 2. **Explicit direct set creation.** `/goals-set` and `/sisyphus-set` are user-only shortcuts that create immediately from the supplied objective.
130
+ 3. **No hidden direct creation.** `create_goal` remains rejected or unavailable as a normal agent path.
131
+ 4. **Mode consistency.** A draft proposal cannot silently change `/goals` into Sisyphus or `/sisyphus` into a regular goal.
132
+ 5. **Stale continuation protection.** A queued continuation for an old goal cannot perform work for a different current goal.
133
+ 6. **Human-owned focus.** The agent cannot silently switch focus between open goals.
134
+ 7. **Completion audit.** `update_goal(status="complete")` archives only if the independent auditor returns exactly one approving marker.
135
+ 8. **Path safety.** Goal files and archives must remain under expected `.pi/goals` paths.
136
+ 9. **Post-stop transaction boundary.** After pause, abort, approved completion, or applied tweak, the same turn should not continue substantive work.
137
+ 10. **No hard cost-control or cap lifecycle.** Resource control is outside this runtime; auto-continue relies on semantic stop conditions and the empty-turn guard.
138
+ 11. **Archive/delete safety.** Terminal lifecycle operations must not destroy unrelated files or resurrect stale state.
139
+
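Invariants 4 and 5 above reduce to small pure checks. The sketch below is illustrative only: `Continuation`, `isStaleContinuation`, `modeForCommand`, and `draftMatchesMode` are hypothetical names, not functions from the existing extension.

```ts
type GoalMode = "goal" | "sisyphus";

// Hypothetical shape: the goal a queued continuation was created for.
interface Continuation {
  goalId: string;
}

// Invariant 5: a continuation may only run for the goal it was queued for.
// With no current goal, every queued continuation counts as stale.
function isStaleContinuation(continuation: Continuation, currentGoalId: string | null): boolean {
  return currentGoalId === null || continuation.goalId !== currentGoalId;
}

// Invariant 4: the mode is fixed by the command the user invoked.
function modeForCommand(command: "/goals" | "/sisyphus"): GoalMode {
  return command === "/sisyphus" ? "sisyphus" : "goal";
}

// A draft is acceptable only if its Sisyphus flag agrees with that mode.
function draftMatchesMode(command: "/goals" | "/sisyphus", draftIsSisyphus: boolean): boolean {
  return (modeForCommand(command) === "sisyphus") === draftIsSisyphus;
}
```

Keeping these checks free of runtime state would let them be unit-tested without pi hooks.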
140
+ ### 7.3 Soft Guidance
141
+
142
+ The following behaviors should move from hard runtime gates to prompt/skill guidance:
143
+
144
+ 1. **Drafting should usually ask one focused question.** But a fully specified request may proceed directly to draft proposal.
145
+ 2. **Drafting should usually avoid workhorse tools.** But minimal reconnaissance is allowed if it improves the goal contract and does not begin substantive execution.
146
+ 3. **Active goals should focus on work tools.** But asking the user a real clarifying question is allowed when blocked or ambiguous.
147
+ 4. **Repeated `get_goal` is discouraged.** Tool response can nudge the agent toward work, but should not hard-stop the turn.
148
+ 5. **Tweak drafting should avoid task execution.** But reading relevant context may be acceptable before applying a goal tweak.
149
+ 6. **Sisyphus discipline is prompt/criteria based.** It should not rely on step-count machinery.
150
+
151
+ ## 8. Proposed Architecture
152
+
153
+ ### 8.1 Layer 1: Goal Ledger
154
+
155
+ A new append-only JSONL ledger records lifecycle facts.
156
+
157
+ Candidate path:
158
+
159
+ ```text
+ .pi/goals/goal_events.jsonl
+ ```
162
+
163
+ Initial event types:
164
+
165
+ ```ts
+ type GoalLedgerEvent =
+   | { type: "goal_created"; goalId: string; objective: string; sisyphus: boolean; autoContinue: boolean; at: string }
+   | { type: "goal_focused"; goalId: string; reason: string; at: string }
+   | { type: "goal_unfocused"; reason: string; at: string }
+   | { type: "goal_paused"; goalId: string; reason: string; suggestedAction?: string; at: string }
+   | { type: "goal_resumed"; goalId: string; reason: string; at: string }
+   | { type: "goal_tweaked"; goalId: string; changeSummary: string; at: string }
+   | { type: "completion_requested"; goalId: string; summary?: string; at: string }
+   | { type: "audit_started"; goalId: string; provider?: string; model?: string; thinkingLevel?: string; at: string }
+   | { type: "audit_result"; goalId: string; verdict: "approved" | "disapproved" | "error"; report: string; at: string }
+   | { type: "goal_completed"; goalId: string; archivePath?: string; at: string }
+   | { type: "goal_aborted"; goalId: string; reason: string; archivePath?: string; at: string };
+ ```
179
+
180
+ Ledger requirements:
181
+
182
+ - Append-only writes.
183
+ - Tolerate missing ledger for legacy projects.
184
+ - Include enough information for debugging and deterministic summaries.
185
+ - Avoid storing secrets.
186
+ - Preserve ordering by file order and timestamp.
187
+ - Malformed lines must not crash normal use; they should be reported in diagnostics.
188
+
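The malformed-line requirement could be met by a tolerant parser along the lines of the sketch below. `parseLedger`, `LedgerParseResult`, and `serializeEvent` are hypothetical names, and the event shape is reduced to the fields needed for illustration; it works on text only, so file I/O and the missing-ledger case are left to the caller.

```ts
// Reduced, hypothetical event shape; the full union is listed above.
interface LedgerEventLike {
  type: string;
  at: string;
  goalId?: string;
}

interface LedgerParseResult {
  events: LedgerEventLike[];
  // 1-based line numbers that could not be parsed, for diagnostics.
  malformedLines: number[];
}

function isLedgerEventLike(value: unknown): value is LedgerEventLike {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.type === "string" && typeof record.at === "string";
}

// Parse JSONL text without throwing: bad lines are collected, not fatal.
function parseLedger(text: string): LedgerParseResult {
  const events: LedgerEventLike[] = [];
  const malformedLines: number[] = [];
  text.split("\n").forEach((raw, index) => {
    const line = raw.trim();
    if (line === "") return; // blank lines, such as a trailing newline, are skipped
    try {
      const value: unknown = JSON.parse(line);
      if (isLedgerEventLike(value)) {
        events.push(value);
      } else {
        malformedLines.push(index + 1);
      }
    } catch {
      malformedLines.push(index + 1);
    }
  });
  return { events, malformedLines };
}

// One event per line; appending this string keeps the file append-only.
function serializeEvent(event: LedgerEventLike): string {
  return JSON.stringify(event) + "\n";
}
```

A caller could treat a missing ledger file as empty text, which yields no events and no diagnostics.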
189
+ ### 8.2 Layer 2: Goal Documents
190
+
191
+ Existing active goal markdown files remain user- and agent-readable documents.
192
+
193
+ Their role changes from "primary state machine record" to "living goal document".
194
+
195
+ Each document should expose:
196
+
197
+ - Objective.
198
+ - Success criteria.
199
+ - Boundaries and constraints.
200
+ - Current lifecycle status.
201
+ - Progress notes.
202
+ - Latest pause/blocker, if any.
203
+ - Latest auditor feedback, if any.
204
+ - Next suggested action, if known.
205
+
206
+ Machine reconstruction should prefer ledger events and explicit frontmatter fields over prose.
207
+
208
+ ### 8.3 Layer 3: Runtime Transaction Tools
209
+
210
+ The runtime keeps tools for irreversible transitions:
211
+
212
+ - `propose_goal_draft`
213
+ - `get_goal`
214
+ - `update_goal`
215
+ - `pause_goal`
216
+ - `abort_goal`
217
+ - `apply_goal_tweak`
218
+ - legacy `create_goal` rejection
219
+ - legacy `step_complete` no-op
220
+
221
+ Tool implementations should be transaction-oriented:
222
+
223
+ - validate transaction identity;
224
+ - append ledger event;
225
+ - update goal file;
226
+ - update focus if needed;
227
+ - update UI;
228
+ - return concise result.
229
+
230
+ The runtime should not try to encode every strategy rule as a gate.
231
+
232
+ ### 8.4 Layer 4: Prompts and Skills
233
+
234
+ Prompts provide the agentic operating protocol:
235
+
236
+ - How to draft.
237
+ - How to execute.
238
+ - How to audit before requesting completion.
239
+ - How to handle Sisyphus style.
240
+ - How to respond to blockers.
241
+ - How to incorporate auditor rejection.
242
+
243
+ Future skills may split this strategy out of the core extension:
244
+
245
+ - `goal-draft`
246
+ - `goal-execute`
247
+ - `goal-sisyphus`
248
+ - `goal-finalize`
249
+ - `goal-auditor-review`
250
+
251
+ The extension should eventually have shorter prompts that reference these skills or embed concise strategy blocks.
252
+
253
+ ### 8.5 Layer 5: Independent Reviewer Agents
254
+
255
+ The existing auditor becomes the first reviewer agent.
256
+
257
+ Future reviewer roles could include:
258
+
259
+ - completion auditor;
260
+ - risk reviewer;
261
+ - release reviewer;
262
+ - stale-goal summarizer;
263
+ - plan critic.
264
+
265
+ Reviewer outputs should be written to the ledger and summarized into future prompts.
266
+
267
+ ## 9. Functional Requirements
268
+
269
+ ### FR1: Append Goal Lifecycle Events
270
+
271
+ The system must append durable ledger events for all important lifecycle transitions.
272
+
273
+ Minimum events for first implementation:
274
+
275
+ - goal created;
276
+ - focus changed;
277
+ - goal paused;
278
+ - goal resumed;
279
+ - goal tweaked;
280
+ - completion requested;
281
+ - auditor result;
282
+ - goal completed;
283
+ - goal aborted.
284
+
285
+ Acceptance criteria:
286
+
287
+ - Every successful lifecycle transaction has a corresponding JSONL event.
288
+ - Auditor rejection also writes an event.
289
+ - Existing behavior remains unchanged during the shadow-log phase.
290
+
291
+ ### FR2: Reconstruct Goal Context from Ledger
292
+
293
+ The system must provide a pure reconstruction function that turns ledger events into a summary state.
294
+
295
+ Acceptance criteria:
296
+
297
+ - Reconstruction identifies latest focus event.
298
+ - Reconstruction identifies latest auditor result per goal.
299
+ - Reconstruction identifies terminal events.
300
+ - Reconstruction works with empty or missing ledger.
301
+ - Reconstruction handles malformed lines according to documented policy.
302
+
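A pure reconstruction function of this kind might be a fold over events in file order, as in the sketch below. The event union is a reduced subset of `GoalLedgerEvent`, and `ReconstructedState` and `reconstruct` are hypothetical names chosen for illustration.

```ts
// Reduced subset of the ledger event union, for illustration only.
type LedgerEvent =
  | { type: "goal_created"; goalId: string; at: string }
  | { type: "goal_focused"; goalId: string; at: string }
  | { type: "goal_unfocused"; at: string }
  | { type: "audit_result"; goalId: string; verdict: "approved" | "disapproved" | "error"; report: string; at: string }
  | { type: "goal_completed"; goalId: string; at: string }
  | { type: "goal_aborted"; goalId: string; at: string };

interface ReconstructedState {
  // undefined: no focus event seen; null: explicit no-focus; string: focused goal.
  focus: string | null | undefined;
  latestAudit: Record<string, { verdict: string; report: string; at: string }>;
  terminal: Record<string, "completed" | "aborted">;
  knownGoalIds: string[];
}

// Pure fold: no I/O, no clock access, later events override earlier ones.
function reconstruct(events: readonly LedgerEvent[]): ReconstructedState {
  const state: ReconstructedState = { focus: undefined, latestAudit: {}, terminal: {}, knownGoalIds: [] };
  for (const event of events) {
    switch (event.type) {
      case "goal_created":
        if (!state.knownGoalIds.includes(event.goalId)) state.knownGoalIds.push(event.goalId);
        break;
      case "goal_focused":
        state.focus = event.goalId;
        break;
      case "goal_unfocused":
        state.focus = null;
        break;
      case "audit_result":
        state.latestAudit[event.goalId] = { verdict: event.verdict, report: event.report, at: event.at };
        break;
      case "goal_completed":
        state.terminal[event.goalId] = "completed";
        break;
      case "goal_aborted":
        state.terminal[event.goalId] = "aborted";
        break;
    }
  }
  return state;
}
```

Distinguishing "no focus event" from "explicit no-focus" in the result keeps the focus-ownership rules expressible without extra state.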
303
+ ### FR3: Deterministic Compaction Summary
304
+
305
+ The system must provide a deterministic goal summary for compaction and post-compaction recovery.
306
+
307
+ Acceptance criteria:
308
+
309
+ - Summary includes focused goal, open goals, status, objective, latest lifecycle events, and latest auditor result.
310
+ - Summary does not depend on LLM summarization.
311
+ - Summary remains useful after long sessions.
312
+ - Tests cover active, paused, no-focus, multi-goal, and auditor-rejected cases.
313
+
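One way to keep the summary deterministic is to render it from plain data with a fixed sort order and a fixed event window, as sketched below. `GoalSnapshot`, `SummaryInput`, and `renderGoalSummary` are hypothetical names, and the output layout is an assumption rather than a specified format.

```ts
interface GoalSnapshot {
  goalId: string;
  status: "active" | "paused";
  objective: string;
  latestAudit?: { verdict: "approved" | "disapproved" | "error"; report: string };
}

interface SummaryInput {
  focusedGoalId: string | null;
  goals: GoalSnapshot[];
  // Pre-rendered event lines, already in ledger order.
  recentEvents: string[];
}

// Same input always yields the same text: no clock, no randomness, no LLM.
function renderGoalSummary(input: SummaryInput, maxEvents = 20): string {
  const lines: string[] = [];
  lines.push(`Focused goal: ${input.focusedGoalId ?? "(none)"}`);
  // Sort by id so the output does not depend on input order.
  const goals = [...input.goals].sort((a, b) => (a.goalId < b.goalId ? -1 : a.goalId > b.goalId ? 1 : 0));
  lines.push(`Open goals: ${goals.length}`);
  for (const goal of goals) {
    lines.push(`- ${goal.goalId} [${goal.status}] ${goal.objective}`);
    if (goal.latestAudit) {
      lines.push(`  latest audit: ${goal.latestAudit.verdict}: ${goal.latestAudit.report}`);
    }
  }
  // Keep only the tail of the event list to bound summary size.
  const tail = maxEvents > 0 ? input.recentEvents.slice(-maxEvents) : [];
  lines.push(`Recent events (${tail.length} of ${input.recentEvents.length}):`);
  for (const event of tail) lines.push(`- ${event}`);
  return lines.join("\n");
}
```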
314
+ ### FR4: Durable Auditor Feedback
315
+
316
+ Auditor results must become durable future context.
317
+
318
+ Acceptance criteria:
319
+
320
+ - `completion_requested` is logged before running the auditor.
321
+ - `audit_started` is logged with selected config when known.
322
+ - `audit_result` is logged for approval, disapproval, no marker, both markers, model error, config error, and abort.
323
+ - If disapproved, the next goal prompt includes the auditor's objections or a concise summary.
324
+ - If approved, completion proceeds and `goal_completed` is logged.
325
+
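The event ordering required above, and the way a rejection could feed the next prompt, can be sketched as follows. `completionEvents` and `feedbackForNextPrompt` are hypothetical helpers; a single `at` timestamp is reused for brevity, whereas a real implementation would presumably stamp each event separately.

```ts
type Verdict = "approved" | "disapproved" | "error";

type CompletionEvent =
  | { type: "completion_requested"; goalId: string; at: string }
  | { type: "audit_started"; goalId: string; at: string }
  | { type: "audit_result"; goalId: string; verdict: Verdict; report: string; at: string }
  | { type: "goal_completed"; goalId: string; at: string };

// Events in the order they would be appended for one completion attempt.
function completionEvents(goalId: string, verdict: Verdict, report: string, at: string): CompletionEvent[] {
  const events: CompletionEvent[] = [
    { type: "completion_requested", goalId, at },
    { type: "audit_started", goalId, at },
    { type: "audit_result", goalId, verdict, report, at },
  ];
  // goal_completed is logged only after an approving audit result.
  if (verdict === "approved") events.push({ type: "goal_completed", goalId, at });
  return events;
}

// Returns text for the next goal prompt, or null when nothing needs addressing.
function feedbackForNextPrompt(events: readonly CompletionEvent[], maxChars = 500): string | null {
  for (let i = events.length - 1; i >= 0; i--) {
    const event = events[i];
    if (event !== undefined && event.type === "audit_result") {
      if (event.verdict === "approved") return null;
      const text = event.report.length > maxChars ? event.report.slice(0, maxChars) + "..." : event.report;
      return `Auditor ${event.verdict}: ${text}`;
    }
  }
  return null;
}
```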
326
+ ### FR5: Relax Drafting Strategy Gates
327
+
328
+ The system should stop enforcing behavioral drafting strategy as hard runtime state.
329
+
330
+ Acceptance criteria:
331
+
332
+ - Fully specified user requests can reach `propose_goal_draft` without requiring a synthetic question gate.
333
+ - Minimal read-only reconnaissance during drafting is allowed when used to improve the goal contract.
334
+ - Substantive task execution before confirmation remains discouraged by prompt and experiments, not blocked by broad runtime gates.
335
+ - User confirmation remains required before a durable goal starts.
336
+
337
+ ### FR6: Relax Active-Goal Conversation Gates
338
+
339
+ The system should allow the agent to ask real clarification questions during active goals when needed.
340
+
341
+ Acceptance criteria:
342
+
343
+ - Active-goal prompts still prefer concrete work tools.
344
+ - Question-like tools are not hard-blocked solely because a goal is active.
345
+ - `pause_goal` remains the correct structured channel for real blockers.
346
+ - Experiments verify that the agent does not use questions as an excuse to avoid obvious next work.
347
+
348
+ ### FR7: Replace Repeated `get_goal` Block with Nudge
349
+
350
+ Repeated `get_goal` calls should not hard-stop a turn.
351
+
352
+ Acceptance criteria:
353
+
354
+ - The tool may include a warning such as "You already inspected this goal; prefer work tools now."
355
+ - The runtime does not set post-stop state solely because of repeated `get_goal`.
356
+ - Empty-turn auto-continue protections still prevent infinite no-progress loops.
357
+
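The nudge could be attached to the tool response without touching turn state, as in this sketch. `GetGoalResponse` and `buildGetGoalResponse` are hypothetical names, and counting calls per turn is an assumption about how repetition would be detected.

```ts
interface GetGoalResponse {
  body: string;
  warning?: string;
}

// callsThisTurn includes the current call, so 1 means the first inspection.
function buildGetGoalResponse(goalText: string, callsThisTurn: number): GetGoalResponse {
  if (callsThisTurn <= 1) return { body: goalText };
  // Repeated calls still return the goal; they only add a soft warning.
  return { body: goalText, warning: "You already inspected this goal; prefer work tools now." };
}
```

Because the function only builds a response, it cannot set post-stop state, which matches the second criterion above.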
358
+ ### FR8: Preserve Focus Ownership
359
+
360
+ The system must preserve human-owned focus.
361
+
362
+ Acceptance criteria:
363
+
364
+ - Explicit no-focus remains no-focus.
365
+ - Stale focus does not auto-focus a remaining single goal.
366
+ - Single-open auto-focus only applies when no explicit focus entry or ledger focus event exists.
367
+ - The agent has no normal tool to silently switch focus.
368
+
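The first three criteria can be expressed as one small resolution function. This is a sketch under assumptions: `FocusRecord` and `resolveFocus` are hypothetical names, and the three-way record is one possible way to merge session focus entries with ledger focus events.

```ts
type FocusRecord =
  | { kind: "none-recorded" } // no session focus entry and no ledger focus event
  | { kind: "explicit-none" } // the user explicitly cleared focus
  | { kind: "goal"; goalId: string }; // the user focused a specific goal

function resolveFocus(record: FocusRecord, openGoalIds: readonly string[]): string | null {
  switch (record.kind) {
    case "explicit-none":
      return null; // explicit no-focus stays no-focus
    case "goal":
      // Stale focus resolves to no focus, never to some other open goal.
      return openGoalIds.includes(record.goalId) ? record.goalId : null;
    case "none-recorded": {
      // Single-open auto-focus applies only when nothing was ever recorded.
      const [only] = openGoalIds;
      return openGoalIds.length === 1 && only !== undefined ? only : null;
    }
  }
}
```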
369
+ ### FR9: Preserve Completion Integrity
370
+
371
+ Completion must remain protected by independent semantic auditing.
372
+
373
+ Acceptance criteria:
374
+
375
+ - Only a clean `<approved/>` permits archive/completion.
376
+ - `<disapproved/>`, no marker, both markers, errors, and aborts all reject completion.
377
+ - Rejected completion keeps goal open.
378
+ - Rejection is durable future context.
379
+
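The marker rules above amount to a fail-closed classifier. In the sketch below, `classifyAuditReport` and `mayArchive` are hypothetical names, and treating a repeated `<approved/>` marker as an error is an assumption drawn from the "exactly one approving marker" wording.

```ts
type AuditVerdict = "approved" | "disapproved" | "error";

// Count non-overlapping occurrences of a literal marker.
function countOccurrences(text: string, marker: string): number {
  return text.split(marker).length - 1;
}

// Fail closed: only exactly one <approved/> with no <disapproved/> approves.
function classifyAuditReport(report: string): AuditVerdict {
  const approved = countOccurrences(report, "<approved/>");
  const disapproved = countOccurrences(report, "<disapproved/>");
  if (approved === 1 && disapproved === 0) return "approved";
  if (disapproved >= 1 && approved === 0) return "disapproved";
  return "error"; // no marker, both markers, or repeated approval markers
}

// Archive/completion is permitted only on a clean approval.
function mayArchive(report: string): boolean {
  return classifyAuditReport(report) === "approved";
}
```

Model errors, config errors, and aborts never produce a report to classify, so the caller would record them as `"error"` directly.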
380
+ ### FR10: Keep Backward Compatibility
381
+
382
+ Existing projects and goal files must continue to work.
383
+
384
+ Acceptance criteria:
385
+
386
+ - Missing ledger does not break `get_goal`, `/goal-list`, `/goal-focus`, `/goal-resume`, or completion.
387
+ - Existing `.pi/goals/active_goal_*.md` files remain readable.
388
+ - Existing archived goals remain untouched.
389
+ - Existing tests pass during shadow-log phase.
390
+
391
+ ## 10. Non-Functional Requirements
392
+
393
+ ### Reliability
394
+
395
+ - Ledger append must be robust and low-risk.
396
+ - A failed ledger append should not corrupt goal files.
397
+ - If append fails during a terminal transaction, the system should fail closed or report clearly.
398
+
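A possible append-failure policy, following the recommendation recorded under Open Questions (fail closed for terminal transactions and auditor results, warn for the rest), is sketched below. `decideAfterAppend` and the chosen set of fail-closed event types are assumptions, not settled design.

```ts
type AppendOutcome = { ok: true } | { ok: false; error: string };
type Decision = { proceed: true; warning?: string } | { proceed: false; error: string };

// Event types for which a failed append must stop the transaction (assumed set).
const FAIL_CLOSED_EVENTS = new Set<string>(["goal_completed", "goal_aborted", "audit_result"]);

function decideAfterAppend(eventType: string, outcome: AppendOutcome): Decision {
  if (outcome.ok) return { proceed: true };
  const message = `ledger append failed for ${eventType}: ${outcome.error}`;
  if (FAIL_CLOSED_EVENTS.has(eventType)) {
    // Fail closed: do not touch goal files if the durable record is missing.
    return { proceed: false, error: message };
  }
  // Non-terminal events: continue, but surface the failure clearly.
  return { proceed: true, warning: message };
}
```

Deciding before any goal file is modified is what keeps a failed append from corrupting goal files.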
399
+ ### Observability
400
+
401
+ - Ledger path should be shown in diagnostics or docs.
402
+ - Lifecycle tool responses may mention relevant ledger/auditor context.
403
+ - Tests should inspect ledger content directly.
404
+
405
+ ### Maintainability
406
+
407
+ - New modules should be small and pure where possible.
408
+ - Reconstruction should be unit-tested separately from pi runtime hooks.
409
+ - Prompt policy should be centralized, not scattered across gate reasons.
410
+
411
+ ### Performance
412
+
413
+ - Ledger reading should handle reasonably long sessions.
414
+ - Compaction summary should be limited to recent events, e.g. the last 20 or last 50 relevant events.
415
+ - No expensive auditor or reconstruction work should run on every trivial UI refresh.
416
+
417
+ ### Security and Safety
418
+
419
+ - Ledger must not store tool outputs wholesale by default.
420
+ - Secrets should not be copied into ledger events.
421
+ - Goal file paths must remain constrained to `.pi/goals`.
422
+ - Auditor prompts must treat goal objective as untrusted user data.
423
+
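The path constraint could be checked on project-relative paths as in the sketch below. `normalizeRelative` and `isGoalPath` are hypothetical helpers that operate on POSIX-style strings only. They do not resolve symlinks, so a real implementation would likely also resolve paths through the filesystem before trusting them.

```ts
// Normalize a POSIX-style relative path; returns null if it escapes the root.
function normalizeRelative(path: string): string[] | null {
  if (path.startsWith("/")) return null; // absolute paths are rejected
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") {
      if (out.length === 0) return null; // would climb above the project root
      out.pop();
      continue;
    }
    out.push(segment);
  }
  return out;
}

// True only for paths that name something strictly inside .pi/goals.
function isGoalPath(relativePath: string): boolean {
  // Reject backslashes and NUL bytes outright rather than trying to interpret them.
  if (relativePath.includes("\\") || relativePath.includes("\0")) return false;
  const segments = normalizeRelative(relativePath);
  if (segments === null || segments.length < 3) return false;
  return segments[0] === ".pi" && segments[1] === "goals";
}
```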
424
+ ## 11. Migration Strategy
425
+
426
+ ### Stage A: Shadow Ledger
427
+
428
+ Add ledger writes without changing behavior.
429
+
430
+ - Current state machine remains authoritative.
431
+ - Ledger is used only for diagnostics and tests.
432
+ - This is low-risk and establishes durable facts.
433
+
434
+ ### Stage B: Summary from Ledger
435
+
436
+ Use ledger to build deterministic summaries.
437
+
438
+ - Compaction summary uses ledger + active goal files.
439
+ - Runtime behavior still mostly unchanged.
440
+ - This proves ledger utility without destabilizing lifecycle.
441
+
442
+ ### Stage C: Prompt-First Soft Gates
443
+
444
+ Relax soft strategy gates one by one.
445
+
446
+ - Update prompts.
447
+ - Update experiments.
448
+ - Remove exact string tests for soft rejection behavior.
449
+ - Keep hard transaction invariants.
450
+
451
+ ### Stage D: Reconstruct Runtime from Ledger
452
+
453
+ Make ledger reconstruction participate in `loadState`.
454
+
455
+ - Prefer active files for canonical goal content initially.
456
+ - Use ledger for focus, latest auditor feedback, and terminal history.
457
+ - Gradually reduce reliance on session-only entries.
458
+
459
+ ### Stage E: Remove Old State-Machine Residue
460
+
461
+ Delete or simplify unused gates and phase machinery.
462
+
463
+ - `questionsAsked`, normal-goal `draftId`, and drafting nudges are removed from the `/goals` confirmation path.
464
+ - Drafting tool gate becomes no-op or is removed.
465
+ - Repeated `get_goal` block removed.
466
+ - Prompt text no longer references runtime gates for soft behavior.
467
+
468
+ ## 12. Rollout Plan
469
+
470
+ ### Milestone 1: PRD and Design Freeze
471
+
472
+ Deliverables:
473
+
474
+ - This PRD.
475
+ - Architecture doc update.
476
+ - Gate inventory: hard invariant vs soft guidance.
477
+
478
+ Exit criteria:
479
+
480
+ - Maintainer agrees on direction.
481
+ - No runtime behavior changed.
482
+
483
+ ### Milestone 2: Shadow Ledger
484
+
485
+ Deliverables:
486
+
487
+ - `extensions/goal-ledger.ts`.
488
+ - `tests/goal-ledger.test.ts`.
489
+ - Event appends in lifecycle transaction tools.
490
+
491
+ Exit criteria:
492
+
493
+ - `npm run check` passes.
494
+ - `npm test` passes.
495
+ - Existing experiments remain valid.
496
+ - Ledger file is created and inspectable.
497
+
498
+ ### Milestone 3: Deterministic Summary
499
+
500
+ Deliverables:
501
+
502
+ - `extensions/goal-compaction.ts`.
503
+ - Compaction hook integration.
504
+ - Tests for summary generation.
505
+
506
+ Exit criteria:
507
+
508
+ - Compaction summary contains enough context to continue goal work.
509
+ - No LLM-generated summary is required for goal state.
510
+
511
+ ### Milestone 4: Auditor Feedback Loop
512
+
513
+ Deliverables:
514
+
515
+ - Completion request and auditor events.
516
+ - Rejection feedback in future prompts.
517
+ - Goal document update or summary injection for auditor objections.
518
+
519
+ Exit criteria:
520
+
521
+ - Auditor rejection is visible after compaction and restart.
522
+ - Agent can continue from rejection feedback.
523
+
524
+ ### Milestone 5: Soft Gate Relaxation
525
+
526
+ Deliverables:
527
+
528
+ - Relax the required-question gate.
529
+ - Relax drafting workhorse block.
530
+ - Relax active question block.
531
+ - Replace repeated `get_goal` block with nudge.
532
+ - Simplify `/goals` and `/sisyphus` confirmation to a thin intent instead of a hidden draft-id/question-counter state machine.
533
+ - Add `/goals-set` and `/sisyphus-set` for direct user-owned creation when no drafting discussion is wanted.
534
+ - Update prompts and tests.
535
+
536
+ Exit criteria:
537
+
538
+ - User-confirmed creation still works.
539
+ - Vague topics still lead to clarification in experiments.
540
+ - Complete specs converge faster.
541
+ - Minimal drafting reconnaissance is allowed but substantive pre-confirmation execution is discouraged and caught by outcome tests.
542
+
543
+ ### Milestone 6: Experiment Realignment
544
+
545
+ Deliverables:
546
+
547
+ - Updated experiment rubrics.
548
+ - New cases for ledger recovery and auditor feedback.
549
+ - Removed dependence on exact soft-gate rejection strings.
550
+
551
+ Exit criteria:
552
+
553
+ - Experiment suite evaluates product outcomes.
554
+ - New agentic behavior is stable across provider/model combinations.
555
+
556
+ ### Milestone 7: Runtime Simplification
557
+
558
+ Deliverables:
559
+
560
+ - Simplified `syncGoalTools()`.
561
+ - Removed no-longer-needed phase gate code.
562
+ - Reduced prompt duplication.
563
+ - Updated architecture docs.
564
+
565
+ Exit criteria:
566
+
567
+ - `extensions/goal.ts` is smaller or at least conceptually split.
568
+ - Hard invariants remain tested.
569
+ - Soft behavior lives in prompts/skills/experiments.
570
+
571
+ ## 13. Acceptance Criteria for the Whole Project
572
+
573
+ The project is successful when:
574
+
575
+ - Goal lifecycle facts are durably logged in append-only form.
576
+ - Goal context can be deterministically summarized after compaction.
577
+ - Auditor results are durable and influence future agent behavior.
578
+ - Creation, completion, focus, stale continuation, and path safety remain hard-protected.
579
+ - Drafting and execution feel less bureaucratic and more agentic.
580
+ - Fully specified goals do not require artificial questioning ceremony.
581
+ - Vague goals still generally produce clarification before commitment.
582
+ - Minimal context inspection during drafting is allowed when useful.
583
+ - Experiments focus on outcomes instead of exact tool-block mechanics.
584
+ - Existing goal files continue to work without manual migration.
585
+
586
+ ## 14. Risks and Mitigations
587
+
588
+ ### Risk: Agent starts task execution before goal confirmation
589
+
590
+ Mitigation:
591
+
592
+ - The prompt explicitly tells the agent not to execute before confirmation.
593
+ - Experiments detect substantive file changes before confirmed goal creation.
594
+ - `propose_goal_draft` remains the only creation transaction.
595
+ - If needed, add lightweight detection/warning rather than broad hard blocks.
596
+
597
+ ### Risk: Vague requests create weak goals too easily
598
+
599
+ Mitigation:
600
+
601
+ - Prompt requires concrete objective, success criteria, boundaries, constraints, and blocker rule.
602
+ - Confirmation dialog lets user reject weak drafts.
603
+ - Experiments include vague-topic cases.
604
+ - Auditor later rejects weak completion if goal was under-specified and not actually satisfied.
605
+
606
+ ### Risk: Ledger and markdown disagree
607
+
608
+ Mitigation:
609
+
610
+ - Define precedence rules.
611
+ - During shadow phase, markdown remains canonical for goal content.
612
+ - Ledger drives lifecycle history and summary metadata.
613
+ - Tests cover disagreement cases.
614
+
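One possible precedence rule for the shadow phase is sketched below: goal content and status come from markdown, audit metadata comes from the ledger, and conflicts are reported rather than silently resolved. `MarkdownGoal`, `LedgerFacts`, `MergedGoal`, and `mergeGoal` are hypothetical names, and the specific conflict shown is only an example of a disagreement case.

```ts
interface MarkdownGoal {
  goalId: string;
  objective: string;
  status: "active" | "paused";
}

interface LedgerFacts {
  terminal?: "completed" | "aborted";
  latestAuditVerdict?: "approved" | "disapproved" | "error";
}

interface MergedGoal {
  goalId: string;
  objective: string; // always from markdown (canonical goal content)
  status: "active" | "paused"; // from markdown during the shadow phase
  latestAuditVerdict?: "approved" | "disapproved" | "error"; // from the ledger
  disagreements: string[]; // surfaced in diagnostics
}

function mergeGoal(markdown: MarkdownGoal, ledger: LedgerFacts | undefined): MergedGoal {
  const merged: MergedGoal = {
    goalId: markdown.goalId,
    objective: markdown.objective,
    status: markdown.status,
    disagreements: [],
  };
  if (ledger === undefined) return merged; // legacy project without a ledger
  if (ledger.latestAuditVerdict !== undefined) merged.latestAuditVerdict = ledger.latestAuditVerdict;
  if (ledger.terminal !== undefined) {
    // An active file exists for a goal the ledger considers finished.
    merged.disagreements.push(`ledger records goal as ${ledger.terminal} but an active file still exists`);
  }
  return merged;
}
```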
615
+ ### Risk: Ledger grows too large
616
+
617
+ Mitigation:
618
+
619
+ - Summaries include only recent events.
620
+ - Keep event payloads concise.
621
+ - Consider per-goal event files or archival later.
622
+
623
+ ### Risk: Relaxing gates reintroduces old failures
624
+
625
+ Mitigation:
626
+
627
+ - Relax one gate at a time.
628
+ - Keep hard transaction gates.
629
+ - Update experiments before and after each relaxation.
630
+ - Use auditor feedback loop as semantic safety net.
631
+
632
+ ### Risk: Multi-agent auditor becomes too expensive or slow
633
+
634
+ Mitigation:
635
+
636
+ - Auditor runs only on completion request.
637
+ - Configurable provider/model/thinking remains available.
638
+ - Fail closed on config/model errors.
639
+ - Ledger logs auditor errors for debugging.
640
+
641
+ ## 15. Open Questions
642
+
643
+ 1. Should the ledger be global (`goal_events.jsonl`) or per-goal (`events/<goalId>.jsonl`)?
644
+ - Recommendation: start global for easier focus/pool reconstruction.
645
+
646
+ 2. Should ledger append failure fail the lifecycle transaction?
647
+ - Recommendation: for terminal transactions and auditor results, fail closed; for non-terminal diagnostics, warning and continuing may be acceptable.
648
+
649
+ 3. Should goal documents be automatically updated with latest auditor feedback?
650
+ - Recommendation: yes, but only concise summaries; full report stays in ledger or tool output.
651
+
652
+ 4. Should soft gate relaxation be configurable?
653
+ - Recommendation: not initially. Avoid adding another settings surface until behavior stabilizes.
654
+
655
+ 5. Should future skills be bundled in this package or separate pi skills?
656
+ - Recommendation: start as bundled docs/prompts; split into skills if prompts become long or specialized.
657
+
658
+ ## 16. Test Plan
659
+
660
+ ### Unit Tests
661
+
662
+ - Ledger append/read/reconstruct.
663
+ - Malformed ledger handling.
664
+ - Compaction summary generation.
665
+ - Auditor event recording.
666
+ - Focus reconstruction.
667
+ - Terminal event handling.
668
+ - Legacy no-ledger fallback.
669
+
670
+ ### Integration Tests
671
+
672
+ - Create goal -> ledger event -> active file.
673
+ - Pause/resume -> ledger events -> prompt summary.
674
+ - Completion rejected -> ledger event -> next prompt includes feedback.
675
+ - Completion approved -> ledger events -> archive.
676
+ - Compaction mid-goal -> deterministic summary includes current state.
677
+ - Multiple open goals -> no autonomous focus switch.
678
+
679
+ ### Experiment Updates
680
+
681
+ Keep outcome tests for:
682
+
683
+ - vague goal clarification;
684
+ - full-spec goal fast path;
685
+ - Sisyphus ordered style;
686
+ - completion quality;
687
+ - abort/pause/clear;
688
+ - focus ownership;
689
+ - compaction recovery.
690
+
691
+ Remove or rewrite tests that require:
692
+
693
+ - mandatory question tool before proposal;
694
+ - absolute workhorse-tool ban during drafting;
695
+ - exact runtime block string for active goal questions;
696
+ - exact repeated `get_goal` block behavior.
697
+
698
+ ## 17. Metrics
699
+
700
+ ### Product Metrics
701
+
702
+ - Fewer false pauses or artificial questions on fully specified goals.
703
+ - Fewer brittle failures caused by tool visibility mismatch.
704
+ - Successful continuation after compaction without user restating context.
705
+ - Auditor rejection leads to meaningful follow-up work.
706
+
707
+ ### Engineering Metrics
708
+
709
+ - Reduced number of hard `tool_call` block branches.
710
+ - Reduced prompt references to "runtime gate" for soft behavior.
711
+ - Increased test coverage for ledger reconstruction and summaries.
712
+ - Fewer exact-string tests for behavior that should be prompt-guided.
713
+
714
+ ## 18. Implementation Notes
715
+
716
+ Initial files likely affected:
717
+
718
+ - `extensions/goal-ledger.ts` new.
719
+ - `extensions/goal-compaction.ts` new.
720
+ - `extensions/goal.ts` append ledger events and later simplify gates.
721
+ - `extensions/prompts/goal-prompts.ts` include ledger/auditor feedback.
722
+ - `extensions/goal-draft.ts` eventually relax soft validation.
723
+ - `extensions/goal-tool-names.ts` eventually simplify phase-based tool exposure.
724
+ - `tests/goal-ledger.test.ts` new.
725
+ - `tests/goal-prompts.test.ts`, `tests/goal-draft.test.ts`, `tests/goal-tool-names.test.ts` updates.
726
+ - `experiments/*` rubrics updates.
727
+
728
+ Implementation should proceed in small PRs. The first implementation PR should only add shadow ledger behavior and tests, with no soft-gate relaxation.
729
+
730
+ ## 19. Decision Record
731
+
732
+ Proposed decisions:
733
+
734
+ - Use append-only ledger as durable lifecycle history.
735
+ - Keep existing goal markdown files for compatibility and human/model readability.
736
+ - Keep hard invariants for irreversible transactions.
737
+ - Move behavioral strategy to prompts, skills, auditor agents, and experiments.
738
+ - Introduce deterministic compaction summary from persisted artifacts.
739
+ - Relax soft gates only after ledger and summary infrastructure exist.
740
+
741
+ ## 20. Appendix: Hard vs Soft Inventory
742
+
743
+ ### Hard Runtime Invariants
744
+
745
+ - Confirm before durable goal creation.
746
+ - Reject direct hidden `create_goal` creation path.
747
+ - Reject draft proposals that mismatch the user's requested goal mode.
748
+ - Abort stale continuation identity.
749
+ - Preserve human-owned focus.
750
+ - Require independent auditor approval for completion.
751
+ - Keep path safety for active/archive files.
752
+ - Avoid resource lifecycle gates unless the product explicitly reintroduces them.
753
+ - Prevent same-turn substantive work after terminal/pause/tweak transaction.
754
+
755
+ ### Soft Prompt-Guided Behaviors
756
+
757
+ - Ask one focused question during drafting when useful.
758
+ - Avoid unnecessary reconnaissance during drafting.
759
+ - Avoid substantive task execution before confirmation.
760
+ - Prefer work tools over repeated `get_goal` during active execution.
761
+ - Ask user only when there is a real blocker or ambiguity.
762
+ - Preserve Sisyphus ordering and patient style.
763
+ - Do not use proxy metrics as completion proof.
764
+ - Incorporate auditor feedback before retrying completion.