waypoint-codex 0.10.1 → 0.10.3

package/README.md CHANGED
@@ -147,6 +147,8 @@ Waypoint ships a strong default skill pack for real coding work:
  These are repo-local, so the workflow travels with the project.
  `conversation-retrospective`, `break-it-qa`, `frontend-ship-audit`, and `backend-ship-audit` are on-demand skills, not default autonomous agent steps.
 
+ In practice, Waypoint now expects `conversation-retrospective` to run automatically after major completed work pieces so durable learnings, user feedback, errors, and skill improvements do not stay trapped in chat.
+
  ## Reviewer agents
 
  Waypoint scaffolds these reviewer agents by default:
@@ -161,6 +163,8 @@ For planning work, run `plan-reviewer` before presenting a non-trivial implement
 
  When the user approves a reviewed plan or explicitly says to proceed, the intended Waypoint behavior is autonomous execution: keep going through implementation, verification, review, and repo-memory updates unless a real blocker or materially risky unresolved decision requires a pause. If reviewers, subagents, CI, or other external work are still running, Waypoint should wait as long as necessary rather than interrupting them for speed.
 
+ When browser-based reproduction or verification is part of the work, Waypoint should also send screenshots of the relevant UI states so the user can see the evidence directly.
+
  ## What makes it different
 
  Waypoint is not trying to hide everything behind hooks and background machinery.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "waypoint-codex",
-   "version": "0.10.1",
+   "version": "0.10.3",
    "description": "Codex-native repository operating system: scaffolding, docs routing, repo-local skills, doctor, and sync.",
    "license": "MIT",
    "type": "module",
@@ -118,6 +118,7 @@ Anti-cheating rules:
  - Use `playwright-interactive`.
  - Exercise the actual UI instead of mocking the flow in code.
  - Keep the scope focused on the feature the user asked you to verify.
+ - Capture screenshots of the important states you observe so the user can see the evidence directly.
 
  ## Step 7: Try To Break It On Purpose
 
@@ -151,6 +152,7 @@ As you test, keep expanding the break log with new "What if...?" cases that emer
  - Update docs when the verification exposes stale assumptions about how the feature works.
  - Update the break log entry for each attempted action with what happened and whether the feature survived.
  - Require a short observed-result note for every executed item. "Worked" is too weak; capture what actually happened.
+ - Save screenshots for the key broken, risky, or fixed states as you go.
 
  Do not stop at the first bug.
 
@@ -174,6 +176,7 @@ Summarize:
  - the path to the break log markdown file
  - how many attack items were recorded and exercised
  - how coverage was distributed across steps and categories
+ - which screenshots you captured and what each one shows
  - what break attempts you tried
  - which issues you found
  - what you fixed
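
The summary items in the hunk above could be derived from structured break-log entries rather than assembled by hand. The following is a minimal sketch under stated assumptions: `BreakLogEntry`, its fields, and `summarize` are hypothetical names chosen for illustration and do not appear in the package, which keeps the break log as a markdown file.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BreakLogEntry:
    """One attempted break action (hypothetical shape, not from the package)."""
    step: str
    category: str
    action: str
    observed: str = ""               # observed-result note; empty means not exercised
    survived: Optional[bool] = None  # None until the item has been exercised
    screenshots: Dict[str, str] = field(default_factory=dict)  # path -> what it shows


def summarize(entries: List[BreakLogEntry]) -> dict:
    """Build the closeout summary: counts, coverage, screenshots, and issues."""
    exercised = [e for e in entries if e.observed.strip()]
    return {
        "recorded": len(entries),
        "exercised": len(exercised),
        # Coverage is counted per (step, category) pair over exercised items only.
        "coverage": dict(Counter((e.step, e.category) for e in exercised)),
        "screenshots": {p: d for e in entries for p, d in e.screenshots.items()},
        "issues": [e.action for e in exercised if e.survived is False],
    }
```

An entry with an empty `observed` note counts as recorded but not exercised, which mirrors the rule that every executed item needs an observed-result note.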
@@ -28,9 +28,12 @@ Review the current conversation and separate:
  - durable project knowledge
  - live execution state
  - transient chatter
+ - direct user feedback, corrections, complaints, and preferences
 
  Persist without asking follow-up questions when the correct destination is clear.
 
+ Treat explicit user feedback as a high-priority signal. If the user corrected the approach, rejected a behavior, called out friction, or stated a standing preference, prefer preserving that over the agent's earlier assumptions.
+
  Write durable knowledge to the smallest truthful home the repo already uses:
 
  - the main docs or knowledge layer for architecture, behavior, decisions, debugging knowledge, durable plans, and reusable operating guidance
@@ -48,11 +51,33 @@ Do not leave important truths only in chat.
 
  Identify which skills were actually used in this conversation, or which existing skills clearly should have covered the workflow but left avoidable gaps.
 
+ For each used or clearly relevant skill, explicitly decide whether it:
+
+ - succeeded
+ - partially succeeded
+ - failed
+
+ Base that judgment on the actual conversation, especially:
+
+ - direct user feedback
+ - whether the skill helped complete the task
+ - whether the agent had to work around missing guidance
+ - whether concrete errors, dead ends, or repeated corrections happened while using it
+
+ Distinguish between:
+
+ - a skill problem
+ - an execution mistake by the agent
+ - an external/tooling failure
+ - a one-off user preference that should not be generalized
+
+ Only change the skill when the problem is truly in the skill guidance.
+
  For each affected skill:
 
  - read the existing skill before editing it
  - update only reusable guidance, not one-off transcript details
- - add missing guardrails, path hints, failure modes, decision rules, or references that would have made the conversation easier to complete
+ - add missing guardrails, path hints, failure modes, error-handling guidance, decision rules, or references that would have made the conversation easier to complete
  - keep `SKILL.md` concise; prefer targeted structural improvements over turning the skill into a diary
 
  If the environment has both a source-of-truth skill and one or more mirrored or installed copies, update the source-of-truth version and any copies the user expects to stay in sync.
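
The evaluation rule added in the hunk above (judge the outcome, attribute the cause, and edit the skill only when its guidance was at fault) can be sketched as a small data model. This is an illustrative sketch, not the package's implementation: `Outcome`, `Cause`, `SkillEvaluation`, and `should_edit_skill` are hypothetical names, and the skill itself expresses this rule as prose guidance only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    PARTIALLY_SUCCEEDED = "partially succeeded"
    FAILED = "failed"


class Cause(Enum):
    SKILL_PROBLEM = "skill problem"
    EXECUTION_MISTAKE = "execution mistake by the agent"
    EXTERNAL_FAILURE = "external/tooling failure"
    ONE_OFF_PREFERENCE = "one-off user preference"


@dataclass
class SkillEvaluation:
    """Verdict for one used or clearly relevant skill (hypothetical shape)."""
    skill: str
    outcome: Outcome
    cause: Optional[Cause] = None  # None when nothing went wrong
    evidence: str = ""             # e.g. the user feedback that drove the verdict


def should_edit_skill(evaluation: SkillEvaluation) -> bool:
    """Edit the skill only when the problem is in the skill guidance itself."""
    return evaluation.cause is Cause.SKILL_PROBLEM
```

Keeping the cause separate from the outcome means a failed run caused by a tooling outage or an agent mistake is still recorded, but does not trigger an edit to the skill.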
@@ -91,7 +116,9 @@ Do not invent a refresh command when the repo does not have one.
  Summarize:
 
  - what durable knowledge you saved and where
+ - which skills you evaluated and whether they succeeded, partially succeeded, or failed
  - which skills you improved
+ - which concrete errors, failure modes, or repeated friction points you captured
  - which new skill ideas you recorded, if any
  - what you intentionally left unpersisted because it was transient
 
@@ -1,4 +1,4 @@
  interface:
    display_name: "Conversation Retrospective"
    short_description: "Harvest the live conversation into repo memory"
-   default_prompt: "Use this skill to analyze the active conversation, save durable knowledge into the repo's existing docs, memory, guidance, handoff, or tracker surfaces, improve any skills that were used or exposed gaps, and record new skill ideas without asking follow-up questions when the correct destination is clear."
+   default_prompt: "Use this skill to analyze the active conversation, preserve durable knowledge and user feedback in the repo's existing memory surfaces, evaluate whether used skills succeeded or failed, capture concrete errors and friction points, improve skills whose guidance was insufficient, and record new skill ideas without asking follow-up questions when the correct destination is clear."
@@ -54,6 +54,9 @@ If something important lives only in your head or in the chat transcript, the re
  - When waiting on reviewers, subagents, CI, automated review, or external jobs, wait as long as required. There is no fixed timeout where waiting itself becomes the problem.
  - Never interrupt in-flight work just to force a partial result, salvage something quickly, or avoid making the user wait longer.
  - Only stop waiting when the work has actually finished, clearly failed, or the user explicitly redirects the work.
+ - When browser work is part of reproduction or verification, send screenshots of the relevant UI states to the user so they can visually confirm what you observed.
+ - Capture the states that matter, such as the broken state, the fixed state, or an important intermediate state that explains the issue.
+ - If the current environment cannot provide screenshots, state that explicitly instead of silently omitting visual evidence.
 
  ## Execution autonomy
 
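
The three screenshot rules added in the hunk above amount to a simple reporting contract: list the states that were captured, or say outright that capture was not possible. A minimal sketch follows, assuming a hypothetical helper `evidence_section` and a plain mapping of state label to file path; neither is part of the package.

```python
from typing import Dict


def evidence_section(screenshots: Dict[str, str], capture_available: bool = True) -> str:
    """Render the visual-evidence part of a report.

    screenshots maps a state label (e.g. "broken state") to a file path.
    """
    if not capture_available:
        # State the gap explicitly instead of silently omitting visual evidence.
        return "Screenshots: not available in this environment."
    if not screenshots:
        return "Screenshots: none captured."
    lines = ["Screenshots:"]
    for label, path in screenshots.items():
        lines.append(f"- {label}: {path}")
    return "\n".join(lines)
```

For example, `evidence_section({"broken state": "shots/before.png", "fixed state": "shots/after.png"})` yields a three-line list, while `evidence_section({}, capture_available=False)` yields the explicit unavailability note.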
@@ -92,12 +95,15 @@ Do not document every trivial implementation detail. Document the non-obvious, d
  - `work-tracker` when large multi-step work needs durable progress tracking in `.waypoint/track/`
  - `docs-sync` when routed docs may be stale, missing, or inconsistent with the codebase
  - `code-guide-audit` when a specific feature or file set needs a targeted coding-guide compliance check
+ - `conversation-retrospective` after major completed work pieces so the active conversation is distilled into durable memory, user feedback and errors are preserved, exercised skills are improved, and real new-skill candidates are recorded
  - `break-it-qa` when a browser-facing feature should be attacked with invalid inputs, refreshes, repeated clicks, wrong action order, or other adversarial manual QA
  - `frontend-ship-audit` and `backend-ship-audit` only when the user explicitly requests a ship-readiness audit; do not trigger them autonomously as part of the default Waypoint workflow
  - `workspace-compress` after meaningful chunks, before stopping, and before review when the live handoff needs compression
  - `pre-pr-hygiene` before pushing or opening/updating a PR for substantial work
  - `pr-review` once a PR has active review comments or automated review in progress
 
+ Treat `conversation-retrospective` as a default closeout step for major work pieces, not as a rare manual tool.
+
  ## When to use the reviewer agents
 
  Waypoint scaffolds these focused second-pass specialists by default:
@@ -68,6 +68,7 @@ Prefer existing persisted context over re-interviewing the user.
 
  If the user approves a plan or explicitly tells you to proceed, treat that as authorization to execute the work end to end. Do not stop mid-implementation for incremental permission unless a real blocker, hidden-risk decision, or explicit user redirect requires a pause.
  When work is in flight elsewhere — reviewer agents, subagents, CI, automated review, external jobs, or other waiting periods — wait as long as required. There is no fixed waiting limit, and slowness alone is not a reason to interrupt or abandon the work.
+ When using a browser to reproduce a bug, verify behavior, or confirm that a fix works, send the user screenshots of the relevant UI states so they can see the evidence directly. If screenshots are not possible in the current environment, say so explicitly.
 
  Working rules:
  - Keep `.waypoint/WORKSPACE.md` current as the live execution state, with timestamped new or materially revised entries in multi-topic sections
@@ -77,6 +78,7 @@ Working rules:
  - Use `work-tracker` when a long-running implementation, remediation, or verification campaign needs durable progress tracking
  - Use `docs-sync` when the docs may be stale or a change altered shipped behavior, contracts, routes, or commands
  - Use `code-guide-audit` for a targeted coding-guide compliance pass on a specific feature, file set, or change slice
+ - Use `conversation-retrospective` after major completed work pieces to preserve durable learnings, capture user feedback and errors, improve any skills that were exercised, and record real new-skill candidates
  - Do not invoke `break-it-qa`, `frontend-ship-audit`, or `backend-ship-audit` yourself from the managed AGENTS block workflow; they are user-facing skills for explicit human-requested QA or ship-readiness audits, not default agent steps
  - Before presenting a non-trivial implementation plan to the user, run `plan-reviewer` and iterate on the plan until it has no meaningful review findings left
  - Before considering a non-trivial implementation slice complete, run `code-reviewer`; use a recent self-authored commit as the default scope anchor when one cleanly represents that slice