@penclipai/adapter-codex-local 2026.606.0 → 2026.607.0-canary.0
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@penclipai/adapter-codex-local",
|
|
3
|
-
"version": "2026.
|
|
3
|
+
"version": "2026.607.0-canary.0",
|
|
4
4
|
"license": "MIT",
|
|
5
5
|
"homepage": "https://github.com/penclipai/paperclip-cn",
|
|
6
6
|
"bugs": {
|
|
@@ -38,7 +38,7 @@
|
|
|
38
38
|
"skills"
|
|
39
39
|
],
|
|
40
40
|
"dependencies": {
|
|
41
|
-
"@penclipai/adapter-utils": "2026.
|
|
41
|
+
"@penclipai/adapter-utils": "2026.607.0-canary.0",
|
|
42
42
|
"picocolors": "^1.1.1"
|
|
43
43
|
},
|
|
44
44
|
"devDependencies": {
|
|
@@ -13,6 +13,10 @@ description: >
|
|
|
13
13
|
|
|
14
14
|
You run in **heartbeats** — short execution windows triggered by Paperclip. Each heartbeat, you wake up, check your work, do something useful, and exit. You do not run continuously.
|
|
15
15
|
|
|
16
|
+
## Terminology
|
|
17
|
+
|
|
18
|
+
In Paperclip, **task** and **issue** refer to the same work item. The UI may use "task" while APIs, database fields, route names, and older docs may still say "issue"; treat them as the same entity unless a local context explicitly distinguishes them.
|
|
19
|
+
|
|
16
20
|
## Authentication
|
|
17
21
|
|
|
18
22
|
Env vars auto-injected: `PAPERCLIP_AGENT_ID`, `PAPERCLIP_COMPANY_ID`, `PAPERCLIP_API_URL`, `PAPERCLIP_RUN_ID`. Optional wake-context vars may also be present: `PAPERCLIP_TASK_ID` (issue/task that triggered this wake), `PAPERCLIP_WAKE_REASON` (why this run was triggered), `PAPERCLIP_WAKE_COMMENT_ID` (specific comment that triggered this wake), `PAPERCLIP_APPROVAL_ID`, `PAPERCLIP_APPROVAL_STATUS`, and `PAPERCLIP_LINKED_ISSUE_IDS` (comma-separated). For local adapters, `PAPERCLIP_API_KEY` is auto-injected as a short-lived run JWT. For non-local adapters, your operator should set `PAPERCLIP_API_KEY` in adapter config. All requests use `Authorization: Bearer $PAPERCLIP_API_KEY`. All endpoints under `/api`, all JSON. Never hard-code the API URL.
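A minimal sketch of how an agent might assemble an authenticated request from the injected variables. The names `Env`, `requireVar`, and `buildRequest` are hypothetical and not part of the package. The sketch assumes `PAPERCLIP_API_URL` is the server origin and that the `/api` prefix is carried in the path; the `Content-Type` header is likewise an assumption based on "all JSON". Only the variable names and the `Authorization: Bearer` scheme come from the text above.

```typescript
// Hypothetical helper types and functions, for illustration only.
type Env = Record<string, string | undefined>;

interface PaperclipRequest {
  url: string;
  headers: Record<string, string>;
}

// Fail fast when a required variable is missing or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Build the URL and headers from the environment instead of hard-coding the API URL.
// Assumption: PAPERCLIP_API_URL is the origin, and `path` includes the `/api` prefix.
function buildRequest(env: Env, path: string): PaperclipRequest {
  const base = requireVar(env, "PAPERCLIP_API_URL").replace(/\/+$/, "");
  const key = requireVar(env, "PAPERCLIP_API_KEY");
  const suffix = path.startsWith("/") ? path : `/${path}`;
  return {
    url: `${base}${suffix}`,
    headers: {
      Authorization: `Bearer ${key}`,
      "Content-Type": "application/json",
    },
  };
}
```

In a Node-based agent the first argument would typically be `process.env`; taking it as a parameter keeps the sketch testable without touching the real environment.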
|
|
@@ -190,6 +194,67 @@ POST /api/companies/{companyId}/approvals
|
|
|
190
194
|
|
|
191
195
|
`issueIds` links the approval into the issue thread. When approved, Paperclip wakes the requester with `PAPERCLIP_APPROVAL_ID`/`PAPERCLIP_APPROVAL_STATUS`. Keep the payload concise and decision-ready.
|
|
192
196
|
|
|
197
|
+
## Issue-Thread Interactions
|
|
198
|
+
|
|
199
|
+
Issue-thread interactions are first-class cards that render in the issue thread and capture a typed board/user response. Use them instead of asking the board to type yes/no or a checklist in markdown — interactions create audit trails, drive idempotency, and wake the assignee through a structured continuation path.
|
|
200
|
+
|
|
201
|
+
Four kinds are supported. Pick the smallest kind that fits the decision shape:
|
|
202
|
+
|
|
203
|
+
| Kind | When to use | When **not** to use |
|
|
204
|
+
| ------------------------------- | -------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
|
|
205
|
+
| `request_confirmation` | Single yes/no decision bound to a target (e.g. accept a plan revision, approve a launch). | Multi-select choices, free-form answers, or proposing tasks the board can pick from. |
|
|
206
|
+
| `request_checkbox_confirmation` | Board must select any subset of a known list (up to 200 options) and then confirm or reject. | Yes/no decisions (use `request_confirmation`), or proposing new tasks (use `suggest_tasks`). |
|
|
207
|
+
| `ask_user_questions` | Short structured form: a handful of typed questions, each with answers/options/text. | Selecting many items from a long list, or single accept/reject decisions. |
|
|
208
|
+
| `suggest_tasks` | Proposing concrete tasks for the board to accept; accepted tasks become real subtasks. | Asking the board to confirm a plan or arbitrary selection. Tasks are the unit; not arbitrary ids. |
|
|
209
|
+
|
|
210
|
+
Key shared semantics:
|
|
211
|
+
|
|
212
|
+
- **Continuation policy.** `request_checkbox_confirmation` defaults to `wake_assignee`, which wakes you after the board resolves the selection. `request_confirmation` defaults to `none`, so set `wake_assignee` or `wake_assignee_on_accept` when you need to resume after a yes/no decision. `none` never wakes you — only use it when you truly do not need to resume.
|
|
213
|
+
- **Target binding and staleness.** `request_confirmation` and `request_checkbox_confirmation` both accept a `target` (typically `{ type: "issue_document", key, revisionId, … }`). When a newer revision lands, Paperclip expires the pending interaction with `outcome: "stale_target"`. Rebuild against the latest revision and create a fresh interaction.
|
|
214
|
+
- **Supersede on user comment.** Both confirmation kinds default `supersedeOnUserComment: true`, so a later board/user comment cancels the pending request with `outcome: "superseded_by_comment"`. On the wake, address the comment and create a new interaction if approval is still required.
|
|
215
|
+
- **Idempotency.** Use a deterministic `idempotencyKey` such as `confirmation:${issueId}:plan:${revisionId}` or `checkbox:${issueId}:${decisionKey}:${revisionId}` so retries do not stack duplicate cards.
|
|
216
|
+
- **Source issue posture.** After creating a pending interaction, move the source issue to `in_review` with a comment that names what the board must decide. The pending interaction is the explicit waiting path.
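The idempotency key templates in the list above can be sketched as two small builders. The function names `confirmationKey` and `checkboxKey` are hypothetical; the key formats are the ones quoted in the **Idempotency** bullet.

```typescript
// Hypothetical helpers: deterministic keys so a retried create call reuses the same card.

// Key for a request_confirmation bound to a plan revision.
function confirmationKey(issueId: string, revisionId: string): string {
  return `confirmation:${issueId}:plan:${revisionId}`;
}

// Key for a request_checkbox_confirmation; `decisionKey` names the decision
// (for example "cleanup-files") so distinct decisions on one issue do not collide.
function checkboxKey(issueId: string, decisionKey: string, revisionId: string): string {
  return `checkbox:${issueId}:${decisionKey}:${revisionId}`;
}
```

Because the revision id is part of the key, a new plan revision yields a new key, which fits the guidance to create a fresh interaction after a `stale_target` outcome.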
|
|
217
|
+
|
|
218
|
+
Create a `request_checkbox_confirmation` (board selects any subset, then confirms):
|
|
219
|
+
|
|
220
|
+
```json
|
|
221
|
+
POST /api/issues/{issueId}/interactions
|
|
222
|
+
{
|
|
223
|
+
"kind": "request_checkbox_confirmation",
|
|
224
|
+
"idempotencyKey": "checkbox:{issueId}:cleanup-files:{planRevisionId}",
|
|
225
|
+
"title": "Confirm files to delete",
|
|
226
|
+
"summary": "Pick the files you want removed before I run the cleanup.",
|
|
227
|
+
"continuationPolicy": "wake_assignee",
|
|
228
|
+
"payload": {
|
|
229
|
+
"version": 1,
|
|
230
|
+
"prompt": "Check the files you want deleted.",
|
|
231
|
+
"detailsMarkdown": "I will run the deletion against everything you check, then report back here.",
|
|
232
|
+
"options": [
|
|
233
|
+
{ "id": "draft-report-march", "label": "Old draft report", "description": "QA test pass, March." },
|
|
234
|
+
{ "id": "tmp-export-2025", "label": "tmp/export-2025.csv" }
|
|
235
|
+
],
|
|
236
|
+
"defaultSelectedOptionIds": ["draft-report-march"],
|
|
237
|
+
"minSelected": 0,
|
|
238
|
+
"maxSelected": null,
|
|
239
|
+
"acceptLabel": "Delete selected",
|
|
240
|
+
"rejectLabel": "Request changes",
|
|
241
|
+
"rejectRequiresReason": true,
|
|
242
|
+
"rejectReasonLabel": "What should change?",
|
|
243
|
+
"supersedeOnUserComment": true,
|
|
244
|
+
"target": {
|
|
245
|
+
"type": "issue_document",
|
|
246
|
+
"issueId": "{issueId}",
|
|
247
|
+
"key": "plan",
|
|
248
|
+
"revisionId": "{latestPlanRevisionId}"
|
|
249
|
+
}
|
|
250
|
+
}
|
|
251
|
+
}
|
|
252
|
+
```
|
|
253
|
+
|
|
254
|
+
When the board accepts, your wake delivers `result.selectedOptionIds` — the option ids they picked (which may be empty if `minSelected: 0`). Rejection delivers `result.reason` and a `commentId`.
|
|
255
|
+
|
|
256
|
+
For full payload schemas, validation limits (option count, label lengths, min/max rules), accept/reject route bodies, and result fields, see `references/api-reference.md` -> **Checkbox confirmations**.
|
|
257
|
+
|
|
193
258
|
## Niche Workflow Pointers
|
|
194
259
|
|
|
195
260
|
Load `references/workflows.md` when the task matches one of these:
|
|
@@ -686,6 +686,118 @@ Rules:
|
|
|
686
686
|
- A pending interaction is an explicit waiting path. Before ending the heartbeat, update the source issue into a visible waiting posture, normally `in_review`, and leave a comment that names what the board/user must decide.
|
|
687
687
|
- For plan approval, update the `plan` issue document first, create the confirmation against the latest plan revision, set the source issue to `in_review`, and wait for acceptance before creating implementation subtasks.
|
|
688
688
|
|
|
689
|
+
### Checkbox confirmations
|
|
690
|
+
|
|
691
|
+
Use `request_checkbox_confirmation` when the board needs to **select any subset of a known list** (up to 200 options) and then confirm or reject. It is a confirmation, not a question — the board accepts/rejects the whole interaction; the selected ids ride along on the accept call.
|
|
692
|
+
|
|
693
|
+
When to choose this kind over the others:
|
|
694
|
+
|
|
695
|
+
- Choose `request_checkbox_confirmation` over `ask_user_questions` when the decision is a single multi-select (especially with more than a handful of options or near the ~100-option range). `ask_user_questions` is for short structured forms, not long lists.
|
|
696
|
+
- Choose `request_checkbox_confirmation` over `request_confirmation` when the board's decision is "yes, but only these items," not a pure yes/no.
|
|
697
|
+
- Choose `request_checkbox_confirmation` over `suggest_tasks` when the items are not concrete tasks to be created. `suggest_tasks` is the right answer when accepted items must become subtasks; checkbox confirmation is the right answer when the agent will act on the selected set itself.
|
|
698
|
+
|
|
699
|
+
Create a checkbox confirmation:
|
|
700
|
+
|
|
701
|
+
```json
|
|
702
|
+
POST /api/issues/{issueId}/interactions
|
|
703
|
+
{
|
|
704
|
+
"kind": "request_checkbox_confirmation",
|
|
705
|
+
"idempotencyKey": "checkbox:{issueId}:cleanup-files:{planRevisionId}",
|
|
706
|
+
"title": "Confirm files to delete",
|
|
707
|
+
"summary": "Pick the files you want removed before I run the cleanup.",
|
|
708
|
+
"continuationPolicy": "wake_assignee",
|
|
709
|
+
"payload": {
|
|
710
|
+
"version": 1,
|
|
711
|
+
"prompt": "Check the files you want deleted.",
|
|
712
|
+
"detailsMarkdown": "I will run the deletion against everything you check, then report back here.",
|
|
713
|
+
"options": [
|
|
714
|
+
{ "id": "draft-report-march", "label": "Old draft report", "description": "QA test pass, March." },
|
|
715
|
+
{ "id": "tmp-export-2025", "label": "tmp/export-2025.csv" }
|
|
716
|
+
],
|
|
717
|
+
"defaultSelectedOptionIds": ["draft-report-march"],
|
|
718
|
+
"minSelected": 0,
|
|
719
|
+
"maxSelected": null,
|
|
720
|
+
"acceptLabel": "Delete selected",
|
|
721
|
+
"rejectLabel": "Request changes",
|
|
722
|
+
"rejectRequiresReason": true,
|
|
723
|
+
"rejectReasonLabel": "What should change?",
|
|
724
|
+
"allowDeclineReason": true,
|
|
725
|
+
"declineReasonPlaceholder": "Tell me what to revise.",
|
|
726
|
+
"supersedeOnUserComment": true,
|
|
727
|
+
"target": {
|
|
728
|
+
"type": "issue_document",
|
|
729
|
+
"issueId": "{issueId}",
|
|
730
|
+
"key": "plan",
|
|
731
|
+
"revisionId": "{latestPlanRevisionId}"
|
|
732
|
+
}
|
|
733
|
+
}
|
|
734
|
+
}
|
|
735
|
+
```
|
|
736
|
+
|
|
737
|
+
Payload field reference (`RequestCheckboxConfirmationPayload`):
|
|
738
|
+
|
|
739
|
+
| Field | Type | Default | Notes |
|
|
740
|
+
| --------------------------- | ------------------------------------------ | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
741
|
+
| `version` | `1` | required | Versioned for forward compatibility. |
|
|
742
|
+
| `prompt` | string (1–1000 chars) | required | Headline rendered above the checkbox list. |
|
|
743
|
+
| `detailsMarkdown` | string (≤ 20000 chars) \| `null` | `null` | Optional markdown context above the list. |
|
|
744
|
+
| `options` | `[{ id, label, description? }]` | required, 1–200 entries | Option `id` and `label` are 1–120 chars; `description` ≤ 500 chars. Option ids must be unique within the payload. |
|
|
745
|
+
| `defaultSelectedOptionIds` | string array | `[]` | Pre-checks these option ids in the UI. Each id must reference an option in `options`. Length must not exceed `maxSelected` when set. |
|
|
746
|
+
| `minSelected` | integer ≥ 0 | `0` | Server rejects acceptances below this floor. Cannot exceed `options.length`. |
|
|
747
|
+
| `maxSelected` | integer ≥ 0 \| `null` | `null` (unbounded) | Must satisfy `maxSelected ≥ minSelected` and `maxSelected ≤ options.length` when set. |
|
|
748
|
+
| `acceptLabel` | string (1–80) \| `null` | `null` (UI default) | Button label for accept. |
|
|
749
|
+
| `rejectLabel` | string (1–80) \| `null` | `null` (UI default) | Button label for reject/request-changes. |
|
|
750
|
+
| `rejectRequiresReason` | boolean | `false` | When `true`, the board must supply a non-empty `reason` on reject; the server returns 422 otherwise. |
|
|
751
|
+
| `rejectReasonLabel` | string (1–160) \| `null` | `null` | Field label for the reject reason. |
|
|
752
|
+
| `allowDeclineReason` | boolean | `true` | Whether to render the reason input at all. |
|
|
753
|
+
| `declineReasonPlaceholder` | string (1–240) \| `null` | `null` | Placeholder text in the reason input. |
|
|
754
|
+
| `supersedeOnUserComment` | boolean | `true` (set server-side) | When `true`, a board/user comment after the interaction supersedes it with `outcome: "superseded_by_comment"`. |
|
|
755
|
+
| `target` | `RequestConfirmationTarget` \| `null` | `null` | Reuses the `request_confirmation` target schema. Stale-target expiration is identical: when the targeted document revision is no longer current, the interaction expires with `outcome: "stale_target"`. |
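As a rough illustration of the limits in the table above, the following sketch checks a payload on the client before sending it. `CheckboxPayload` and `validateCheckboxPayload` are hypothetical names, the interface covers only a subset of the fields, and the server remains the authority: it may count string lengths differently (this sketch counts UTF-16 code units) or apply checks not listed here.

```typescript
// Hypothetical client-side pre-check; only a subset of payload fields is modeled.
interface CheckboxOption {
  id: string;
  label: string;
  description?: string;
}

interface CheckboxPayload {
  version: 1;
  prompt: string;
  options: CheckboxOption[];
  defaultSelectedOptionIds?: string[];
  minSelected?: number;
  maxSelected?: number | null;
}

// Returns a list of human-readable problems; an empty list means no problem was found.
function validateCheckboxPayload(p: CheckboxPayload): string[] {
  const errors: string[] = [];
  const optionCount = p.options.length;

  if (p.prompt.length < 1 || p.prompt.length > 1000) errors.push("prompt: 1-1000 chars required");
  if (optionCount < 1 || optionCount > 200) errors.push("options: 1-200 entries required");

  const ids = new Set<string>();
  for (const option of p.options) {
    if (option.id.length < 1 || option.id.length > 120) errors.push(`option id length out of range: ${option.id}`);
    if (option.label.length < 1 || option.label.length > 120) errors.push(`option label length out of range: ${option.id}`);
    if (option.description !== undefined && option.description.length > 500) {
      errors.push(`option description too long: ${option.id}`);
    }
    if (ids.has(option.id)) errors.push(`duplicate option id: ${option.id}`);
    ids.add(option.id);
  }

  // Documented defaults: minSelected 0, maxSelected null (unbounded).
  const min = p.minSelected ?? 0;
  const max = p.maxSelected ?? null;
  if (!Number.isInteger(min) || min < 0 || min > optionCount) {
    errors.push("minSelected: integer in 0..options.length required");
  }
  if (max !== null && (!Number.isInteger(max) || max < 0 || max < min || max > optionCount)) {
    errors.push("maxSelected: integer in minSelected..options.length required");
  }

  const defaults = p.defaultSelectedOptionIds ?? [];
  for (const id of defaults) {
    if (!ids.has(id)) errors.push(`default references unknown option: ${id}`);
  }
  if (max !== null && defaults.length > max) {
    errors.push("defaultSelectedOptionIds: length exceeds maxSelected");
  }
  return errors;
}
```

Returning every problem at once, rather than throwing on the first, lets an agent fix the payload in a single pass before the create call.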
|
|
756
|
+
|
|
757
|
+
Envelope defaults that differ from other kinds:
|
|
758
|
+
|
|
759
|
+
- `continuationPolicy` defaults to `"wake_assignee"` for `request_checkbox_confirmation` (same as `suggest_tasks` and `ask_user_questions`). Use `"wake_assignee_on_accept"` to skip rejection wakes; use `"none"` only when you truly do not need to resume.
|
|
760
|
+
|
|
761
|
+
Accept (board action, requires board/user role; agents creating the interaction cannot accept):
|
|
762
|
+
|
|
763
|
+
```json
|
|
764
|
+
POST /api/issues/{issueId}/interactions/{interactionId}/accept
|
|
765
|
+
{ "selectedOptionIds": ["draft-report-march", "tmp-export-2025"] }
|
|
766
|
+
```
|
|
767
|
+
|
|
768
|
+
If `selectedOptionIds` is omitted on accept, the server falls back to the payload's `defaultSelectedOptionIds`. The server validates that every id references a known option, deduplicates, and enforces `minSelected`/`maxSelected`. Unknown ids return 422.
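The accept rules just described (fallback to defaults, unknown-id check, deduplication, min/max enforcement) can be modeled as below. This is a hypothetical sketch of the described behavior, not the server's code: `SelectionRules`, `Resolution`, and `resolveSelection` are invented names, and the order of the checks and the counting of selections after deduplication are assumptions.

```typescript
// Hypothetical model of the accept-time selection rules; not the server implementation.
interface SelectionRules {
  optionIds: string[];
  defaultSelectedOptionIds: string[];
  minSelected: number;
  maxSelected: number | null;
}

type Resolution =
  | { ok: true; selectedOptionIds: string[] }
  | { ok: false; reason: "unknown_id" | "below_min" | "above_max" };

function resolveSelection(rules: SelectionRules, submitted?: string[]): Resolution {
  // An omitted selectedOptionIds falls back to the payload's defaults.
  const source = submitted ?? rules.defaultSelectedOptionIds;
  const known = new Set<string>(rules.optionIds);
  const seen = new Set<string>();
  const selected: string[] = [];

  for (const id of source) {
    // The text says unknown ids are rejected (HTTP 422 on the real endpoint).
    if (!known.has(id)) return { ok: false, reason: "unknown_id" };
    // Deduplicate while preserving first-seen order.
    if (!seen.has(id)) {
      seen.add(id);
      selected.push(id);
    }
  }

  // Assumption: bounds are applied to the deduplicated selection.
  if (selected.length < rules.minSelected) return { ok: false, reason: "below_min" };
  if (rules.maxSelected !== null && selected.length > rules.maxSelected) {
    return { ok: false, reason: "above_max" };
  }
  return { ok: true, selectedOptionIds: selected };
}
```

Note that an explicit empty array is a real selection and does not trigger the fallback; only an omitted field does.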
|
|
769
|
+
|
|
770
|
+
Reject:
|
|
771
|
+
|
|
772
|
+
```json
|
|
773
|
+
POST /api/issues/{issueId}/interactions/{interactionId}/reject
|
|
774
|
+
{ "reason": "Keep the March draft; only delete tmp/export-2025.csv." }
|
|
775
|
+
```
|
|
776
|
+
|
|
777
|
+
`reason` is required when `rejectRequiresReason: true`, otherwise optional.
|
|
778
|
+
|
|
779
|
+
Resolved result (`RequestCheckboxConfirmationResult`):
|
|
780
|
+
|
|
781
|
+
```json
|
|
782
|
+
{
|
|
783
|
+
"version": 1,
|
|
784
|
+
"outcome": "accepted",
|
|
785
|
+
"selectedOptionIds": ["draft-report-march", "tmp-export-2025"]
|
|
786
|
+
}
|
|
787
|
+
```
|
|
788
|
+
|
|
789
|
+
Other outcomes match `request_confirmation`:
|
|
790
|
+
|
|
791
|
+
- `rejected` — `{ outcome: "rejected", reason, commentId }`. `selectedOptionIds` is absent.
|
|
792
|
+
- `superseded_by_comment` — `{ outcome: "superseded_by_comment", commentId }`. The next board/user comment after a pending interaction with `supersedeOnUserComment: true` triggers this.
|
|
793
|
+
- `stale_target` — `{ outcome: "stale_target", staleTarget }`. Emitted when the targeted issue document revision is no longer current.
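One way an agent might branch on the four outcomes listed above is a discriminated union keyed on `outcome`. The type and function names are hypothetical, `staleTarget` is typed as `unknown` because its shape is not given here, `version` is treated as optional on every variant as an assumption, and the returned action labels are illustrative rather than part of the API.

```typescript
// Hypothetical typing of the resolved result; field shapes beyond those listed are assumptions.
type CheckboxResult =
  | { version?: number; outcome: "accepted"; selectedOptionIds: string[] }
  | { version?: number; outcome: "rejected"; reason?: string; commentId?: string }
  | { version?: number; outcome: "superseded_by_comment"; commentId: string }
  | { version?: number; outcome: "stale_target"; staleTarget: unknown };

// Illustrative labels for what the agent does next on its wake.
type NextAction =
  | "act_on_selection"
  | "handle_rejection"
  | "address_comment_then_recreate"
  | "rebuild_target_then_recreate";

function nextAction(result: CheckboxResult): NextAction {
  switch (result.outcome) {
    case "accepted":
      // selectedOptionIds may be empty when minSelected is 0.
      return "act_on_selection";
    case "rejected":
      return "handle_rejection";
    case "superseded_by_comment":
      // Address the comment, then create a new interaction if approval is still required.
      return "address_comment_then_recreate";
    case "stale_target":
      // Rebuild against the latest revision and create a fresh interaction.
      return "rebuild_target_then_recreate";
    default:
      throw new Error("Unhandled interaction outcome");
  }
}
```

The `default` branch guards against outcomes added in a later payload version, which the compiler cannot see in today's union.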
|
|
794
|
+
|
|
795
|
+
Best practice:
|
|
796
|
+
|
|
797
|
+
- Use a deterministic idempotency key like `checkbox:${issueId}:${decisionKey}:${revisionId}` so retries (e.g. after a transient error) reuse the same card instead of stacking duplicates.
|
|
798
|
+
- After creating a pending checkbox confirmation, move the source issue to `in_review` with a comment that names exactly what the board must decide. Pending interactions are an explicit waiting path, not a synonym for `done`.
|
|
799
|
+
- When a `superseded_by_comment` or `stale_target` wake fires, address the new comment or rebuild the target, then create a fresh checkbox confirmation with an idempotency key that includes the new revision id.
|
|
800
|
+
|
|
689
801
|
### Checking approval status
|
|
690
802
|
|
|
691
803
|
```
|
|
@@ -792,8 +904,8 @@ Terminal states: `done`, `cancelled`
|
|
|
792
904
|
| GET | `/api/issues/:issueId/comments/:commentId` | Get a specific comment by ID |
|
|
793
905
|
| POST | `/api/issues/:issueId/comments` | Add comment (@-mentions trigger wakeups) |
|
|
794
906
|
| GET | `/api/issues/:issueId/interactions` | List issue-thread interactions |
|
|
795
|
-
| POST | `/api/issues/:issueId/interactions` | Create issue-thread interaction (`suggest_tasks`, `ask_user_questions`, `request_confirmation`) |
|
|
796
|
-
| POST | `/api/issues/:issueId/interactions/:interactionId/accept` | Accept suggested tasks or confirmation
|
|
907
|
+
| POST | `/api/issues/:issueId/interactions` | Create issue-thread interaction (`suggest_tasks`, `ask_user_questions`, `request_confirmation`, `request_checkbox_confirmation`) |
|
|
908
|
+
| POST | `/api/issues/:issueId/interactions/:interactionId/accept` | Accept suggested tasks or confirmation (body: `selectedClientKeys` for `suggest_tasks`; `selectedOptionIds` for `request_checkbox_confirmation`) |
|
|
797
909
|
| POST | `/api/issues/:issueId/interactions/:interactionId/reject` | Reject suggested tasks or confirmation |
|
|
798
910
|
| POST | `/api/issues/:issueId/interactions/:interactionId/respond` | Respond to structured questions |
|
|
799
911
|
| GET | `/api/issues/:issueId/documents` | List issue documents |
|
|
@@ -1,161 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
name: diagnose-why-work-stopped
|
|
3
|
-
description: >
|
|
4
|
-
How to handle "why did this work stop / why is this looping?" assignments.
|
|
5
|
-
Forensics first on the named tree, surface the exact stop-point, frame the
|
|
6
|
-
fix as a general product rule that respects three invariants (productive
|
|
7
|
-
work continues, only real blockers stop work, no infinite loops), and
|
|
8
|
-
deliver a plan — no code changes — gated by board/CTO approval before
|
|
9
|
-
child issues are created. Use whenever the issue title or body asks for
|
|
10
|
-
forensics on a stalled, looping, or "went too deep" tree.
|
|
11
|
-
---
|
|
12
|
-
|
|
13
|
-
# Diagnose Why Work Stopped
|
|
14
|
-
|
|
15
|
-
A repeatable procedure for the recurring class of issues where the user (or a manager) points at a stalled / looping / over-recovered issue tree and asks "why did this stop / why is this looping / how do we make sure this doesn't happen again?"
|
|
16
|
-
|
|
17
|
-
This skill is **diagnostic + product-design**, not engineering. The output is a written root cause and an approved plan. No code changes leave this skill.
|
|
18
|
-
|
|
19
|
-
Canonical execution model: read `doc/execution-semantics.md` before diagnosing or proposing a new liveness/recovery rule. Use that document as the source of truth for status, action-path, post-run disposition, bounded continuation, productivity review, pause-hold, watchdog, and explicit recovery semantics. If the investigation finds a true product-rule gap, the plan should say whether `doc/execution-semantics.md` needs a matching update.
|
|
20
|
-
|
|
21
|
-
## When to use
|
|
22
|
-
|
|
23
|
-
Trigger on an assignment whose title or body matches any of:
|
|
24
|
-
|
|
25
|
-
- "why did this work stop", "why did this stall", "why did this just stop"
|
|
26
|
-
- "infinite loop", "looping", "spinning", "going too deep", "recovery went too deep"
|
|
27
|
-
- "liveness — what happened here", "this tree stopped working", "stuck"
|
|
28
|
-
- "approach it from a product perspective", "general product principle / rule"
|
|
29
|
-
- An attached link to a specific stalled / looping / over-recovered issue tree
|
|
30
|
-
|
|
31
|
-
Also use when the user asks for forensics, root cause, or a write-up *before* any product change.
|
|
32
|
-
|
|
33
|
-
## When NOT to use
|
|
34
|
-
|
|
35
|
-
- The assignment asks you to ship a code change directly. Use normal engineering flow.
|
|
36
|
-
- The assignment is a normal bug report against a specific feature. Use normal investigation.
|
|
37
|
-
- You are the original implementer being asked to fix your own bug. Use normal debugging.
|
|
38
|
-
|
|
39
|
-
## Three invariants you must preserve
|
|
40
|
-
|
|
41
|
-
Every diagnosis and every proposed rule must hold these three invariants together. The user has restated them on at least four issues; treat them as load-bearing:
|
|
42
|
-
|
|
43
|
-
1. **Productive work continues.** Agents that have a clear next action must keep working without needing the user to wake them. ([PAP-2674](/PAP/issues/PAP-2674), [PAP-2708](/PAP/issues/PAP-2708))
|
|
44
|
-
2. **Only real blockers stop work.** Stops happen when something genuinely cannot proceed (missing approval, missing dependency, human owner). Pseudo-stops (in_review with no action path, cancelled leaves, malformed metadata) must be detected and routed, not left silent. ([PAP-2335](/PAP/issues/PAP-2335), [PAP-2674](/PAP/issues/PAP-2674))
|
|
45
|
-
3. **No infinite loops.** Stranded-work recovery and continuation loops must be bounded and distinguishable from genuinely productive continuation. ([PAP-2602](/PAP/issues/PAP-2602), [PAP-2486](/PAP/issues/PAP-2486))
|
|
46
|
-
|
|
47
|
-
If a proposed rule violates any of the three, drop it or rework it. State explicitly in the plan how each invariant is held.
|
|
48
|
-
|
|
49
|
-
## Procedure
|
|
50
|
-
|
|
51
|
-
### 0. Read the current execution contract
|
|
52
|
-
|
|
53
|
-
Before walking the tree, read `doc/execution-semantics.md` and keep its terms intact:
|
|
54
|
-
|
|
55
|
-
- live path / waiting path / recovery path
|
|
56
|
-
- post-run disposition: terminal, explicitly live, explicitly waiting, invalid
|
|
57
|
-
- bounded `run_liveness_continuation`
|
|
58
|
-
- productivity review vs liveness recovery
|
|
59
|
-
- active subtree pause holds
|
|
60
|
-
- silent active-run watchdog
|
|
61
|
-
|
|
62
|
-
Do not invent a new rule until you can state how it differs from the current execution semantics document.
|
|
63
|
-
|
|
64
|
-
### 1. Forensics on the named tree — before anything else
|
|
65
|
-
|
|
66
|
-
Do this in the same heartbeat. Do not propose a rule until you have a concrete stop point.
|
|
67
|
-
|
|
68
|
-
- Open the linked issue (and its blocker chain, parents, recovery siblings, recent runs).
|
|
69
|
-
- Walk the tree node-by-node and find the exact issue + state combination that stops the world. Common shapes seen in the company so far:
|
|
70
|
-
- `in_review` with no typed execution participant, no active run, no pending interaction, no recovery issue ([PAP-2335](/PAP/issues/PAP-2335), [PAP-2674](/PAP/issues/PAP-2674)).
|
|
71
|
-
- `in_progress` after a successful run with no future action path queued ([PAP-2674](/PAP/issues/PAP-2674)).
|
|
72
|
-
- Blocker chain whose leaf is `cancelled` / malformed / cross-company-inaccessible ([PAP-2602](/PAP/issues/PAP-2602)).
|
|
73
|
-
- `issue.continuation_recovery` waking the same issue >N times after successful runs ([PAP-2602](/PAP/issues/PAP-2602)).
|
|
74
|
-
- Stranded-work recovery treating its own recovery issues as more recoverable source work ([PAP-2486](/PAP/issues/PAP-2486)).
|
|
75
|
-
- Quote the evidence: run ids, comment timestamps, status transitions. "Inferred" is acceptable only when an API boundary blocks direct evidence — say so explicitly and mark the claim provisional ([PAP-2631](/PAP/issues/PAP-2631)).
|
|
76
|
-
|
|
77
|
-
Respect the API boundary. If the linked issue is in another company and your agent token returns 403, do not bypass scoping. Either request a board-approved diagnostic path or proceed from inferred PAP-side evidence and label it.
|
|
78
|
-
|
|
79
|
-
### 2. Survey recent related work
|
|
80
|
-
|
|
81
|
-
Before proposing a new product rule, read what already shipped this week in the same area. The user has explicitly called this out: ([PAP-2602](/PAP/issues/PAP-2602)) "review our recent work on liveness that we shipped in the last couple of days." A new rule that contradicts code merged 48 hours ago is rework, not improvement.
|
|
82
|
-
|
|
83
|
-
Quick survey:
|
|
84
|
-
- Recent merged PRs in the affected area.
|
|
85
|
-
- Recent done issues whose title mentions liveness, recovery, productivity, continuation, or the affected subsystem.
|
|
86
|
-
- Any active plan documents on parent issues. The fix may belong as a revision to an existing plan, not as a new top-level proposal.
|
|
87
|
-
|
|
88
|
-
State in the forensics: "I reviewed X, Y, Z. The new gap is …"
|
|
89
|
-
|
|
90
|
-
### 3. Classify each non-progressing issue in the tree
|
|
91
|
-
|
|
92
|
-
For every issue in the affected tree that is not `done` / `cancelled` / actively running, decide:
|
|
93
|
-
|
|
94
|
-
- **Truly needs human or board intervention** — name the owner and the action.
|
|
95
|
-
- **Agent-actionable but not currently routed** — name the rule that would have routed it, and the agent that should have been waked.
|
|
96
|
-
- **Already covered** — point at the active run, queued wake, recovery issue, or pending interaction.
|
|
97
|
-
|
|
98
|
-
This is the table the user has asked for repeatedly ([PAP-2335](/PAP/issues/PAP-2335)). Without it the plan is abstract.
|
|
99
|
-
|
|
100
|
-
### 4. Frame as a general product rule
|
|
101
|
-
|
|
102
|
-
The user does not want a one-off patch on the named tree. They want the rule. Three checks:
|
|
103
|
-
|
|
104
|
-
- The rule is **stated as a contract**, not as an if/else patch. Example contract: "every agent-owned non-terminal issue must finish each heartbeat with a terminal state, an explicit waiting path, or an explicit live path" ([PAP-2674](/PAP/issues/PAP-2674)).
|
|
105
|
-
- The rule is reconciled against `doc/execution-semantics.md`. Prefer citing and applying the existing contract; propose a document change only when the current doc is incomplete or contradicted by accepted/implemented behavior.
|
|
106
|
-
- The rule **explicitly preserves the three invariants** above. Show the work.
|
|
107
|
-
|
|
108
|
-
If the rule would have blocked a recent productive run from succeeding, drop or narrow it.
|
|
109
|
-
|
|
110
|
-
### 5. Plan, do not code
|
|
111
|
-
|
|
112
|
-
Write the plan into the issue's `plan` document. Cover:
|
|
113
|
-
|
|
114
|
-
- Forensics summary (root cause + evidence).
|
|
115
|
-
- The general product rule, stated as a contract.
|
|
116
|
-
- Whether the existing `doc/execution-semantics.md` contract already covers the case, or what exact documentation update is needed.
|
|
117
|
-
- Phased subtasks: typically `Phase 0` resolves the named live tree (carefully, not destructively), `Phase 1` codifies the contract in docs, then implementation phases for detection, recovery, UI surfacing, security review, QA, and CTO review.
|
|
118
|
-
- Explicit assignees per phase; favor team specialty (CodexCoder for server, ClaudeCoder for FE, UXDesigner for visible state, SecurityEngineer for ownership/permissions, QA for validation).
|
|
119
|
-
- Blocking dependencies wired with `blockedByIssueIds`, parallel branches identified.
|
|
120
|
-
|
|
121
|
-
Do not create the child issues yet. Do not push code.
|
|
122
|
-
|
|
123
|
-
### 6. Request approval, then decompose
|
|
124
|
-
|
|
125
|
-
- Open a `request_confirmation` interaction targeting the latest plan revision. Idempotency key `confirmation:{issueId}:plan:{revisionId}`.
|
|
126
|
-
- Wait for board/CTO acceptance. If the user posts a new comment that supersedes the plan, the prior confirmation is invalidated — open a fresh confirmation tied to the new revision ([PAP-2602](/PAP/issues/PAP-2602) cycled three revisions; that is fine).
|
|
127
|
-
- Only after acceptance: create the phased child issues with the right assignees and dependencies, then block this parent on the final QA / CTO review issue so the parent only wakes when the chain finishes.
|
|
128
|
-
|
|
129
|
-
### 7. Phase 0 hygiene on the named tree
|
|
130
|
-
|
|
131
|
-
Phase 0 cleans up the live tree without papering over evidence:
|
|
132
|
-
|
|
133
|
-
- Move stalled `in_review` leaves with no participant to `todo` with a precise next action and named owner ([PAP-2335](/PAP/issues/PAP-2335)).
|
|
134
|
-
- Detach cancelled/dead blockers from chains they were holding hostage; do not silently mark issues `done` to clear backlog.
|
|
135
|
-
- Leave a comment on the original named issue summarizing what changed and why; never hide the recovery chain history.
|
|
136
|
-
|
|
137
|
-
### 8. Final close-out
|
|
138
|
-
|
|
139
|
-
When the phase chain is complete, post a board-level summary comment on the parent issue: what changed, what the new contract is, what the rollout step is (e.g. "restart the control-plane to pick up the new response shape"), and the live state of the originally-named tree. Then close the parent.
|
|
140
|
-
|
|
141
|
-
## Pitfalls
|
|
142
|
-
|
|
143
|
-
- **Coding before approval.** The user has said "make a plan first" on every recent diagnostic issue. Producing code in the forensic phase wastes the round-trip.
|
|
144
|
-
- **Restating one invariant at the cost of another.** Bound continuation too tightly and productive work stalls; loosen recovery and infinite loops return. Always check all three.
|
|
145
|
-
- **Skipping the recent-work survey.** Proposing a contract that contradicts what shipped 24 hours ago is the easiest way to get the plan rejected.
|
|
146
|
-
- **Letting "in_review" mean done.** A leaf assigned to another agent with no participant or active run is not progress; treat it as a stop.
|
|
147
|
-
- **Bypassing company scoping.** Cross-company forensics needs a board-approved diagnostic path, not a database read.
|
|
148
|
-
- **Recursive recovery.** Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop ([PAP-2486](/PAP/issues/PAP-2486)). Detect it and refuse to deepen.
|
|
149
|
-
- **Hiding the chain.** Don't silently delete or hide the symptomatic recovery issues — the operator needs the audit trail.
|
|
150
|
-
|
|
151
|
-
## Verification checklist (before posting the plan)
|
|
152
|
-
|
|
153
|
-
- [ ] The exact stop point in the named tree is identified with run ids / comment ids.
|
|
154
|
-
- [ ] Recent shipped work in the same area was surveyed and is referenced.
|
|
155
|
-
- [ ] Every non-progressing issue is classified human-needed / agent-actionable / already-covered.
|
|
156
|
-
- [ ] The proposed rule is stated as a contract, not a patch.
|
|
157
|
-
- [ ] All three invariants are explicitly preserved.
|
|
158
|
-
- [ ] No code change has landed in this heartbeat.
|
|
159
|
-
- [ ] A `request_confirmation` against the latest plan revision is open.
|
|
160
|
-
- [ ] Phase 0 of the plan addresses the live named tree without destroying evidence.
|
|
161
|
-
- [ ] Implementation phases name specialty-appropriate assignees and `blockedByIssueIds` dependencies.
|
|
@@ -1,154 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
name: paperclip-create-plugin
|
|
3
|
-
description: >
|
|
4
|
-
Create and develop external Paperclip plugins with the CLI-first workflow.
|
|
5
|
-
Use when scaffolding a new plugin, working on a local plugin against a running
|
|
6
|
-
Paperclip instance, or updating plugin authoring docs. Covers `penclip
|
|
7
|
-
plugin init`, the local install loop via `penclip plugin install <path>`,
|
|
8
|
-
worker/UI rebuild and reload semantics, and the required success checklist.
|
|
9
|
-
---
|
|
10
|
-
|
|
11
|
-
# Create and develop a Paperclip plugin
|
|
12
|
-
|
|
13
|
-
Use this skill when the task is to create, scaffold, or iterate on a Paperclip plugin against a local Paperclip instance.
|
|
14
|
-
|
|
15
|
-
## 1. Default: build the plugin OUTSIDE Paperclip core
|
|
16
|
-
|
|
17
|
-
Plugins are their own packages. Unless the task **explicitly** asks for a bundled in-repo example, do not add plugin source under `packages/plugins/` in this repo.
|
|
18
|
-
|
|
19
|
-
- Scaffold the plugin into a directory outside the Paperclip checkout (e.g. `~/dev/paperclip-plugins/<name>`).
|
|
20
|
-
- Install it into the running Paperclip instance by local absolute path.
|
|
21
|
-
- Edit code in the external package; let Paperclip pick up rebuilt output.
|
|
22
|
-
|
|
23
|
-
Only edit Paperclip core itself when the user asks to surface a plugin as a bundled example (`server/src/routes/plugins.ts`, in-repo example lists, docs).
|
|
24
|
-
|
|
25
|
-
## 2. Ground rules
|
|
26
|
-
|
|
27
|
-
Reference docs when you need detail:
|
|
28
|
-
|
|
29
|
-
1. `doc/plugins/PLUGIN_AUTHORING_GUIDE.md`
|
|
30
|
-
2. `packages/plugins/sdk/README.md`
|
|
31
|
-
3. `doc/plugins/PLUGIN_SPEC.md` — future-looking context only
|
|
32
|
-
|
|
33
|
-
Current runtime assumptions:
|
|
34
|
-
|
|
35
|
-
- plugin workers are trusted code
|
|
36
|
-
- plugin UI is trusted same-origin host code
|
|
37
|
-
- worker APIs are capability-gated
|
|
38
|
-
- plugin UI is not sandboxed by manifest capabilities
|
|
39
|
-
- no host-provided shared plugin UI component kit yet
|
|
40
|
-
- `ctx.assets` is not supported in the current runtime
|
|
41
|
-
|
|
42
|
-
## 3. CLI-first scaffold workflow
|
|
43
|
-
|
|
44
|
-
Use `penclip plugin init`. Do not invoke the scaffold package node entrypoint by hand unless the CLI command is unavailable in the environment.
|
|
45
|
-
|
|
46
|
-
```bash
|
|
47
|
-
penclip plugin init @acme/my-plugin --output ~/dev/paperclip-plugins
|
|
48
|
-
```
|
|
49
|
-
|
|
50
|
-
Useful flags (all optional):
|
|
51
|
-
|
|
52
|
-
- `--output <dir>` — parent directory; the command creates `<dir>/<unscoped-name>/`. Defaults to the current directory.
|
|
53
|
-
- `--template <default|connector|workspace|environment>` — starter template.
|
|
54
|
-
- `--category <connector|workspace|automation|ui|environment>` — manifest category.
|
|
55
|
-
- `--display-name <name>`, `--description <text>`, `--author <name>` — manifest metadata.
|
|
56
|
-
- `--sdk-path <path>` — snapshot the local SDK from a Paperclip checkout into `.paperclip-sdk/` (useful when developing against an unreleased SDK).
|
|
57
|
-
|
|
58
|
-
On success the command prints the exact next commands (`cd`, `pnpm install`, `pnpm dev`, `penclip plugin install <abs-path>`). Run them in order.
|
|
59
|
-
|
|
60
|
-
If `penclip` is not on PATH in your environment, fall back to:
|
|
61
|
-
|
|
62
|
-
```bash
|
|
63
|
-
pnpm --filter @penclipai/create-paperclip-plugin build
|
|
64
|
-
node packages/plugins/create-paperclip-plugin/dist/index.js @acme/plugin-name \
|
|
65
|
-
--output /absolute/path/to/plugin-repos \
|
|
66
|
-
--sdk-path /absolute/path/to/paperclip/packages/plugins/sdk
|
|
67
|
-
```
|
|
68
|
-
|
|
69
|
-
## 4. Local install + rebuild loop
|
|
70
|
-
|
|
71
|
-
In the scaffolded plugin folder:
|
|
72
|
-
|
|
73
|
-
```bash
|
|
74
|
-
pnpm install
|
|
75
|
-
pnpm dev # esbuild --watch: rebuilds dist/manifest.js, dist/worker.js, dist/ui/
|
|
76
|
-
penclip plugin install /absolute/path/to/my-plugin
|
|
77
|
-
```
|
|
78
|
-
|
|
79
|
-
Notes:
|
|
80
|
-
|
|
81
|
-
- `penclip plugin install` auto-detects local paths (absolute, `./`, `../`, `~`, or an existing relative folder) and forwards `isLocalPath: true` to the server. Pass `--local` to force local mode if the heuristic is ambiguous.
|
|
82
|
-
- Paths are resolved to absolute paths before being sent to the server.
|
|
83
|
-
- The server watches built outputs (`dist/`) for local-path plugins and restarts the plugin worker on rebuild — you do not need to reinstall after every edit.
|
|
84
|
-
- UI hot reload via the SDK dev server (`pnpm dev:ui`, port `4177`) is optional and template-dependent; only mention it if the template wires `devUiUrl` and you verified it works end to end.
|
|
85
|
-
- `--version` only applies to npm package installs. Combining it with a local path is an error.
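The detection rules in the notes above can be sketched as follows. This is an illustrative Python sketch, not the CLI's actual implementation: the function names and the `packageName` and `version` payload keys are hypothetical, and only `isLocalPath`, `--local`, and `--version` come from the notes.

```python
import os
from typing import Optional


def looks_like_local_path(spec: str) -> bool:
    """Guess whether an install spec is a local path (assumed heuristic)."""
    if os.path.isabs(spec) or spec.startswith(("./", "../", "~")):
        return True
    # A bare relative name only counts as local if the folder already exists.
    return os.path.isdir(spec)


def build_install_request(spec: str, version: Optional[str] = None,
                          force_local: bool = False) -> dict:
    """Build a hypothetical install payload; only isLocalPath is a documented name."""
    is_local = force_local or looks_like_local_path(spec)
    if is_local and version is not None:
        # Mirrors the note: combining --version with a local path is an error.
        raise ValueError("--version only applies to npm package installs")
    if is_local:
        # Resolve to an absolute path before anything is sent to the server.
        spec = os.path.abspath(os.path.expanduser(spec))
    payload = {"packageName": spec, "isLocalPath": is_local}
    if version is not None:
        payload["version"] = version
    return payload
```

Passing `force_local=True` plays the role of `--local` when a bare name would otherwise be read as an npm package.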
|
|
86
|
-
|
|
87
|
-
After install, inspect with:
|
|
88
|
-
|
|
89
|
-
```bash
|
|
90
|
-
penclip plugin list
|
|
91
|
-
penclip plugin inspect <plugin-key>
|
|
92
|
-
```
|
|
93
|
-
|
|
94
|
-
## 5. After scaffolding, sanity-check the package
|
|
95
|
-
|
|
96
|
-
Open and confirm:
|
|
97
|
-
|
|
98
|
-
- `src/manifest.ts` — declared capabilities and slots
|
|
99
|
-
- `src/worker.ts` — worker entry
|
|
100
|
-
- `src/ui/index.tsx` — UI entry (if applicable)
|
|
101
|
-
- `tests/plugin.spec.ts` — placeholder test
|
|
102
|
-
- `package.json` — `paperclipPlugin` block points at `dist/manifest.js`, `dist/worker.js`, `dist/ui/`
|
|
103
|
-
|
|
104
|
-
Make sure the plugin:
|
|
105
|
-
|
|
106
|
-
- declares only supported capabilities
|
|
107
|
-
- does not use `ctx.assets`
|
|
108
|
-
- does not import host UI component stubs
|
|
109
|
-
- keeps UI self-contained
|
|
110
|
-
- uses `routePath` only on `page` slots
|
|
111
|
-
|
|
112
|
-
## 6. Verification (run before declaring success)
|
|
113
|
-
|
|
114
|
-
From the plugin folder:
|
|
115
|
-
|
|
116
|
-
```bash
|
|
117
|
-
pnpm typecheck
|
|
118
|
-
pnpm test
|
|
119
|
-
pnpm build
|
|
120
|
-
```
|
|
121
|
-
|
|
122
|
-
If the plugin is already running under `pnpm dev`, you can keep the watcher up and run `pnpm typecheck` and `pnpm test` in a separate shell.
|
|
123
|
-
|
|
124
|
-
If you changed Paperclip SDK/host/plugin runtime code in addition to the plugin, also run the relevant Paperclip workspace checks.
|
|
125
|
-
|
|
126
|
-
## 7. Success checklist (report this back)
|
|
127
|
-
|
|
128
|
-
When you finish a local plugin task, report:
|
|
129
|
-
|
|
130
|
-
- **Scaffold path** — absolute path of the created plugin folder.
|
|
131
|
-
- **Commands run** — the exact `penclip plugin init`, `pnpm install`, `pnpm dev`, `penclip plugin install <path>` invocations (and any verification commands).
|
|
132
|
-
- **Install status** — output of `penclip plugin list` / `plugin inspect` (plugin key, version, status). Note if `status` is anything other than `ready` and include `lastError`.
|
|
133
|
-
- **Tests / build result** — `pnpm typecheck`, `pnpm test`, `pnpm build` pass/fail with the failing output if any.
|
|
134
|
-
- **Reload limitations** — call out anything that did not hot-reload (e.g. manifest changes required a reinstall, UI dev server was not wired, etc.).
|
|
135
|
-
|
|
136
|
-
If any item is missing, mark it as such — do not silently skip.
|
|
137
|
-
|
|
138
|
-
## 8. When NOT to edit Paperclip core
|
|
139
|
-
|
|
140
|
-
Do not add the plugin under `packages/plugins/` or update bundled-example wiring unless the user explicitly asks for a bundled example. Local-path installs are the supported development model; npm packages are the production deployment path.
|
|
141
|
-
|
|
142
|
-
If the user does ask for a bundled example, also update:
|
|
143
|
-
|
|
144
|
-
- `server/src/routes/plugins.ts` example list
|
|
145
|
-
- any docs that enumerate in-repo example plugins
|
|
146
|
-
|
|
147
|
-
## 9. Documentation expectations
|
|
148
|
-
|
|
149
|
-
When authoring or updating plugin docs:
|
|
150
|
-
|
|
151
|
-
- distinguish current implementation from future spec ideas
|
|
152
|
-
- be explicit about the trusted-code model
|
|
153
|
-
- do not promise host UI components or asset APIs
|
|
154
|
-
- prefer local-path development + npm-package deployment guidance over repo-local workflows
|
|
@@ -1,236 +0,0 @@
|
|
|
1
|
-
---
|
|
2
|
-
name: terminal-bench-loop
|
|
3
|
-
description: >
|
|
4
|
-
Run a single Terminal-Bench problem through Paperclip in a bounded,
|
|
5
|
-
human-in-the-loop improvement cycle until the smoke passes, the board
|
|
6
|
-
rejects the next fix, the iteration budget is exhausted, or a real
|
|
7
|
-
blocker is named. Each iteration runs a bounded smoke against an
|
|
8
|
-
isolated Paperclip App worktree, captures artifacts, diagnoses the
|
|
9
|
-
exact stop point with `/diagnose-why-work-stopped`, requests board
|
|
10
|
-
confirmation before any product fix, then reruns against the same
|
|
11
|
-
worktree. Use whenever an issue asks to "run Terminal-Bench in a
|
|
12
|
-
loop", "drive Terminal-Bench until it passes", "loop fix-git through
|
|
13
|
-
Paperclip", or otherwise points at a Terminal-Bench task and asks for
|
|
14
|
-
bounded iteration with diagnosis.
|
|
15
|
-
---
|
|
16
|
-
|
|
17
|
-
# Terminal-Bench Loop
|
|
18
|
-
|
|
19
|
-
A repeatable operating skill for driving one Terminal-Bench problem to a passing smoke through Paperclip, with explicit issue topology, bounded runs, board-gated product fixes, and worktree continuity.
|
|
20
|
-
|
|
21
|
-
This skill is **operational + diagnostic**, not engineering. It coordinates issues, artifacts, and approvals around a Terminal-Bench loop. It does not authorize code changes — every accepted product fix lands as a separate implementation child issue after a board confirmation.
|
|
22
|
-
|
|
23
|
-
Canonical execution model: read `doc/execution-semantics.md` before starting a loop or moving any loop issue. Every loop issue must rest in a state the doc allows: terminal (`done`/`cancelled`), explicitly live (active run / queued wake), explicitly waiting (`in_review` with participant/interaction/approval), or explicit recovery/blocker (`blocked` with `blockedByIssueIds` and a named owner).
|
|
24
|
-
|
|
25
|
-
## When to use
|
|
26
|
-
|
|
27
|
-
Trigger on an assignment whose title or body matches any of:
|
|
28
|
-
|
|
29
|
-
- "run Terminal-Bench in a loop", "loop \<task-name\> through Paperclip"
|
|
30
|
-
- "drive Terminal-Bench fix-git", "iterate on Terminal-Bench until it passes"
|
|
31
|
-
- "Terminal-Bench smoke loop", "bench loop", "smoke loop on \<task-name\>"
|
|
32
|
-
- An attached link to a Terminal-Bench loop parent issue, plus a request to do another iteration
|
|
33
|
-
|
|
34
|
-
Also use when the user hands you an existing top-level loop issue and asks for the next iteration, diagnosis, or rerun.
|
|
35
|
-
|
|
36
|
-
## When NOT to use
|
|
37
|
-
|
|
38
|
-
- The assignment is to build or change `paperclip-bench` itself (Harbor adapter, wrapper, telemetry). Use normal engineering flow on that repo.
|
|
39
|
-
- The assignment is to submit a benchmark result for ranking. This skill produces smoke/non-comparable runs by design — escalate full-suite or comparable runs to BenchmarkQualityManager.
|
|
40
|
-
- The assignment is a normal Paperclip product bug not surfaced by a Terminal-Bench loop. Use normal investigation.
|
|
41
|
-
- You have not been granted permission to install or assign company skills, and the asker actually wants library mutation. Hand that step to an authorized skill-library owner.
|
|
42
|
-
|
|
43
|
-
## Three invariants you must preserve
|
|
44
|
-
|
|
45
|
-
Every loop iteration and every proposed product fix must hold these three invariants together. They come from `/diagnose-why-work-stopped`, and the user has restated them throughout the liveness work:
|
|
46
|
-
|
|
47
|
-
1. **Productive work continues.** Each loop issue must always have a clear next action owner — agent, board, user, or named blocker. No silent `in_review` with nothing waiting on it.
|
|
48
|
-
2. **Only real blockers stop work.** Stops happen when something genuinely cannot proceed (board confirmation, QA, missing credentials, exhausted budget). Pseudo-stops must be detected and routed.
|
|
49
|
-
3. **No infinite loops.** Iteration count, wall-clock budget, and a board gate before product fixes are applied keep the loop bounded.
|
|
50
|
-
|
|
51
|
-
If a proposed iteration violates any of the three, drop it or rework it. State explicitly in the loop issue how each invariant is held this iteration.
|
|
52
|
-
|
|
53
|
-
## Inputs
|
|
54
|
-
|
|
55
|
-
Collect these on the top-level loop issue before iteration 1. Any input that cannot be supplied is a blocker — name the unblock owner and stop.
|
|
56
|
-
|
|
57
|
-
- **Source issue.** The Paperclip issue that asked for the loop. The loop parent links back to it.
|
|
58
|
-
- **Terminal-Bench task name.** Single-task identifier (e.g. `terminal-bench/fix-git`). Multi-task suites are out of scope for this skill.
|
|
59
|
-
- **Iteration budget.** Maximum number of iterations before the loop must stop without further fixes (typical: 3–5). Also record a per-iteration wall-clock cap.
|
|
60
|
-
- **Paperclip App worktree issue.** The implementation-side issue under the Paperclip App project whose execution workspace owns the isolated worktree. First iteration creates it; later iterations reuse it via `inheritExecutionWorkspaceFromIssueId` or equivalent.
|
|
61
|
-
- **Benchmark command.** The exact `paperclip-bench` invocation, including the `PAPERCLIPAI_CMD` (or equivalent) binding pinned to the Paperclip App worktree under test. Record verbatim on the loop issue.
|
|
62
|
-
- **Dispatch runner config.** The exact Harbor/Paperclip runner dispatch config required for the smoke to actually start a Paperclip heartbeat. For the current Harbor wrapper, record the `PAPERCLIP_HARBOR_RUNNER_CONFIG` JSON (or equivalent config file) with enough fidelity to preserve: `assignee`, `heartbeat_strategy`, `agent_adapter` / `agent_adapters`, `reuse_host_home` when local credentials are intentionally needed, and the stop budget. A bare Harbor command that creates `BEN-1` as unassigned `todo` with zero heartbeat-enabled agents is a harness/setup failure, not a valid product diagnosis.
|
|
63
|
-
- **Latest artifact root.** Filesystem or storage path under which `paperclip-bench` writes run artifacts (manifest, `results.jsonl`, Harbor raw job folders, redacted telemetry). Each iteration appends; nothing is overwritten.
|
|
64
|
-
- **Approval policy.** Who must accept a proposed product fix before implementation (default: board via `request_confirmation`; CTO if delegated; never the loop driver alone).
|
|
65
|
-
|
|
66
|
-
Record each input on the top-level loop issue (description or a dedicated `inputs` document). If any input changes mid-loop, note the change and the iteration it took effect.
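A completeness check on the recorded dispatch config might look like the sketch below. The field names `assignee`, `heartbeat_strategy`, `agent_adapter`, `agent_adapters`, and `reuse_host_home` come from the input list above; the `stop_budget` key name and every value shown are assumptions for illustration, not the wrapper's documented schema.

```python
import json

# Keys the loop issue must preserve (stop_budget is a hypothetical key name).
ALWAYS_REQUIRED = ("assignee", "heartbeat_strategy")


def missing_dispatch_fields(config: dict) -> list:
    """Return the names of dispatch fields that are absent or empty."""
    missing = [key for key in ALWAYS_REQUIRED if not config.get(key)]
    if not (config.get("agent_adapter") or config.get("agent_adapters")):
        missing.append("agent_adapter")
    if "stop_budget" not in config:
        missing.append("stop_budget")
    return missing


# Placeholder values only; record the real ones verbatim on the loop issue.
example_config = {
    "assignee": "example-agent",
    "heartbeat_strategy": "example-strategy",
    "agent_adapter": "example-adapter",
    "reuse_host_home": False,
    "stop_budget": {"max_minutes": 30},
}

# Serialized form suitable for exporting as PAPERCLIP_HARBOR_RUNNER_CONFIG.
env_value = json.dumps(example_config, sort_keys=True)
```

If `missing_dispatch_fields` returns anything, treat the input as not yet supplied and name the unblock owner instead of starting iteration 1.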
|
|
67
|
-
|
|
68
|
-
## Issue topology
|
|
69
|
-
|
|
70
|
-
The loop must be representable as a tree, not as prose in comments:
|
|
71
|
-
|
|
72
|
-
- **Top-level loop issue.** Long-lived. Holds inputs, iteration counter, current state, links to every iteration child, and the product-rule history. Rests in `in_progress` while an iteration is running, `in_review` only when a typed waiter sits directly on the loop parent (execution-policy participant, `request_confirmation` / `ask_user_questions` / `suggest_tasks` interaction, approval, or named human owner), `blocked` with `blockedByIssueIds` while a child issue is the gating work (iteration child holding the fix-proposal `request_confirmation`, or implementation, QA, or CTO review children), `done` on pass, or `cancelled` on board-rejection / budget exhaustion.
|
|
73
|
-
- **Iteration child issues.** One per iteration. Each carries: a bounded run issue (smoke), a diagnosis issue (applies `/diagnose-why-work-stopped`), a fix-proposal document with a `request_confirmation` interaction, and — only after acceptance — implementation, QA, CTO review, and rerun children. Iteration children are blocked by their predecessors so the executor wakes them in order.
|
|
74
|
-
- **Paperclip App implementation issue.** The first iteration creates a fresh Paperclip App child whose project policy spawns an isolated worktree. Every later iteration's implementation/rerun child references that same execution workspace via `inheritExecutionWorkspaceFromIssueId` so the same worktree is amended and tested.
|
|
75
|
-
|
|
76
|
-
Wire dependencies with `blockedByIssueIds`, never with prose like "blocked by X". When a blocking child reaches `done`, the executor auto-wakes the next issue in the chain.
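The ordering that `blockedByIssueIds` produces can be modeled with a small sketch. It is a toy model of the dependency chain under the assumption that an issue becomes wakeable once all of its blockers are `done`; the helper names and issue ids are hypothetical and this is not the executor's real logic.

```python
def wire_chain(issue_ids: list) -> dict:
    """Map each issue id to its blockedByIssueIds so each waits on its predecessor."""
    return {
        issue_id: ([issue_ids[index - 1]] if index > 0 else [])
        for index, issue_id in enumerate(issue_ids)
    }


def next_wakeable(chain: dict, done: set) -> list:
    """List issues that are not done and whose blockers are all done."""
    return [
        issue_id
        for issue_id, blockers in chain.items()
        if issue_id not in done and all(blocker in done for blocker in blockers)
    ]


# Hypothetical ids for one accepted fix: implementation, QA, CTO review, rerun.
chain = wire_chain(["impl", "qa", "cto-review", "rerun"])
```

Because every issue has at most one predecessor here, exactly one issue is wakeable at a time, which is what keeps the children from running in parallel.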
|
|
77
|
-
|
|
78
|
-
## Procedure
|
|
79
|
-
|
|
80
|
-
### 0. Read the current execution contract
|
|
81
|
-
|
|
82
|
-
Before opening or advancing a loop, read `doc/execution-semantics.md`. Use that document's terms intact when classifying loop-issue state: live path / waiting path / recovery path; post-run disposition; bounded continuation; productivity review; pause-hold; watchdog. Do not invent a new state.
|
|
83
|
-
|
|
84
|
-
### 1. Open or reuse the top-level loop issue
|
|
85
|
-
|
|
86
|
-
- If an existing loop issue is supplied, read it: inputs, iteration counter, last iteration's stop reason, current Paperclip App worktree pointer, latest benchmark command.
|
|
87
|
-
- If no loop issue exists, create one under the Paperclip App project (or the project the source issue points at). Title: `Terminal-Bench loop: <task-name>`. Description captures the inputs above, the iteration budget, and a link to the source issue.
|
|
88
|
-
- Verify the worktree pointer still resolves. If the recorded execution workspace was discarded (worktree pruned, project changed), the loop is blocked — name the unblock owner (CodexCoder or the Paperclip App owner) and stop.
|
|
89
|
-
|
|
90
|
-
### 2. Open the iteration child
|
|
91
|
-
|
|
92
|
-
- Increment the iteration counter on the loop issue.
|
|
93
|
-
- Create an iteration child titled `Iteration N: <task-name>`. Its description repeats the inputs and references the loop parent. Block it on the prior iteration's terminal child (if any) so the executor cannot start two iterations in parallel.
|
|
94
|
-
- If the iteration counter would exceed the budget, do not create the child. Move the loop issue to `cancelled` (budget exhausted) or `in_review` if the user must decide whether to extend the budget.
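The budget gate in this step can be sketched as a pure function. The function name, argument names, and return shape are hypothetical; the statuses and the title format come from the bullets above.

```python
def open_next_iteration(counter: int, budget: int, task_name: str,
                        user_may_extend: bool = False) -> dict:
    """Decide whether iteration counter+1 may be opened (illustrative only)."""
    next_iteration = counter + 1
    if next_iteration > budget:
        # Budget exhausted: never create the child. in_review is only valid if a
        # typed waiter (for example a question to the user) is attached as well.
        status = "in_review" if user_may_extend else "cancelled"
        return {"create_child": False, "loop_status": status}
    return {
        "create_child": True,
        "iteration": next_iteration,
        "title": f"Iteration {next_iteration}: {task_name}",
    }
```

Checking the budget before incrementing the counter means a rejected attempt never leaves the loop issue claiming an iteration that was not run.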
|
|
95
|
-
|
|
96
|
-
### 3. Run the bounded smoke
|
|
97
|
-
|
|
98
|
-
- The benchmark command must use the Paperclip App worktree under test. Set `PAPERCLIPAI_CMD` (or the equivalent command binding) to the CLI entrypoint inside that worktree. Never let the smoke run against the operator's current Paperclip checkout.
|
|
99
|
-
- The same command block must include the runner dispatch config that makes the benchmark issue actionable. For the current Harbor wrapper, export `PAPERCLIP_HARBOR_RUNNER_CONFIG` with the intended assignee, heartbeat strategy, agent adapter, credential/home mode, and stop budget. Do not treat a bare `uvx harbor run ...` as the canonical smoke if it omits the dispatch config; record that as a harness/setup miss and rerun with the recorded config.
|
|
100
|
-
- Bound the run by wall-clock and by Paperclip's run-budget controls. If the smoke would exceed the per-iteration cap, kill it and record the truncation reason.
|
|
101
|
-
- Capture, in the iteration child or a dedicated `run` document:
|
|
102
|
-
- Paperclip run id and heartbeat run ids
|
|
103
|
-
- benchmark run id, manifest, `results.jsonl` row, Harbor raw job folder
|
|
104
|
-
- dispatch config used (`PAPERCLIP_HARBOR_RUNNER_CONFIG` or equivalent), including assignee and adapter type
|
|
105
|
-
- the exact stop reason reported by the harness (pass, harness fail, verifier fail, timeout, agent gave up, infrastructure error)
|
|
106
|
-
- heartbeat-enabled and heartbeat-observed agent counts when Paperclip telemetry exports them
|
|
107
|
-
- failure taxonomy bucket (task/model, Paperclip product, harness/setup, verifier/infrastructure, security, unclear)
|
|
108
|
-
- artifact paths under the latest artifact root
|
|
109
|
-
- Label the iteration as **smoke / non-comparable**. Comparable runs are out of scope for this skill.
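One way to keep the captured fields consistent is a small record type with a self-check, sketched below. The class and attribute names are hypothetical; the stop reasons and taxonomy buckets are copied from the list above, and the record is not a `paperclip-bench` data structure.

```python
from dataclasses import dataclass, field
from typing import Optional

STOP_REASONS = {"pass", "harness fail", "verifier fail", "timeout",
                "agent gave up", "infrastructure error"}
FAILURE_BUCKETS = {"task/model", "Paperclip product", "harness/setup",
                   "verifier/infrastructure", "security", "unclear"}


@dataclass
class SmokeRunRecord:
    paperclip_run_id: str
    benchmark_run_id: str
    stop_reason: str
    failure_bucket: Optional[str] = None  # left empty only when the smoke passed
    heartbeat_run_ids: list = field(default_factory=list)
    artifact_paths: list = field(default_factory=list)
    comparable: bool = False  # this skill only produces smoke runs

    def problems(self) -> list:
        """Return human-readable gaps in the record; an empty list means complete."""
        found = []
        if self.stop_reason not in STOP_REASONS:
            found.append(f"unknown stop reason: {self.stop_reason}")
        if self.stop_reason != "pass" and self.failure_bucket not in FAILURE_BUCKETS:
            found.append("failure bucket missing or unknown")
        if not self.heartbeat_run_ids:
            found.append("no heartbeat run ids: check for a harness/setup no-dispatch")
        if not self.artifact_paths:
            found.append("no artifact paths recorded")
        if self.comparable:
            found.append("smoke runs must be labeled non-comparable")
        return found
```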
|
|
110
|
-
|
|
111
|
-
### 4. Diagnose the exact stop point
|
|
112
|
-
|
|
113
|
-
Apply the `/diagnose-why-work-stopped` pattern to the iteration's run, scoped to this loop only — do not pull in unrelated forensic boilerplate. Specifically:
|
|
114
|
-
|
|
115
|
-
- Walk the Paperclip issue tree the smoke produced under the Paperclip App worktree, node by node, and find the exact `(issue, status)` combination that stopped progress. Quote evidence: run ids, comment timestamps, status transitions.
|
|
116
|
-
- Classify every non-progressing issue in that subtree as **truly needs human/board intervention**, **agent-actionable but not currently routed**, or **already covered**.
|
|
117
|
-
- State whether the failure is task/model, Paperclip product, harness/setup, verifier/infrastructure, security, or unclear. Be explicit when evidence is inferred (e.g. cross-company API boundary blocks direct reads).
|
|
118
|
-
- If the failure is a Paperclip product gap, frame the fix as a **general product rule** stated as a contract, and check it against the three invariants above. If the rule would have blocked a recent productive run, narrow it.
|
|
119
|
-
|
|
120
|
-
Record the diagnosis on the iteration child as a `diagnosis` document. Do not propose code yet.
|
|
121
|
-
|
|
122
|
-
### 5. Decide the next move
|
|
123
|
-
|
|
124
|
-
Based on the diagnosis, the iteration ends in exactly one of these terminal-for-iteration states:
|
|
125
|
-
|
|
126
|
-
- **Pass.** Smoke verifier reports pass. Move the iteration child and the loop parent toward QA/CTO review (Step 8).
|
|
127
|
-
- **Product fix proposed.** A Paperclip product gap was identified. Write the fix proposal as a `plan` document on the iteration child, then go to Step 6.
|
|
128
|
-
- **Non-product failure with retry.** Failure is harness/setup/infrastructure or model flakiness, the iteration budget is not exhausted, and the loop driver believes a rerun without code changes has signal (e.g. transient infra). Record the rationale on the iteration child and go to Step 7 with no implementation step.
|
|
129
|
-
- **Real blocker.** Named external blocker (credentials, quota, third-party outage, security review). Move the loop issue to `blocked`, set `blockedByIssueIds` to the blocker issue (creating one if needed), and name the unblock owner. Stop.
|
|
130
|
-
- **Budget or board stop.** Iteration budget reached, or the board has rejected the next fix proposal. Move the loop issue to `cancelled` with a comment that summarizes the run history and the reason for stopping.
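The five outcomes above can be read as a decision function. The sketch below is one possible encoding: the function name, the argument names, and especially the order in which conditions are checked are assumptions, since the skill lists the outcomes without ranking them.

```python
NON_PRODUCT_BUCKETS = {"harness/setup", "verifier/infrastructure", "task/model"}


def decide_next_move(*, verifier_passed: bool, failure_bucket: str,
                     iterations_used: int, budget: int,
                     external_blocker: bool = False,
                     board_rejected: bool = False,
                     rerun_has_signal: bool = False) -> str:
    """Return exactly one terminal-for-iteration outcome (assumed precedence)."""
    if verifier_passed:
        return "pass"
    if external_blocker:
        return "real_blocker"
    if board_rejected or iterations_used >= budget:
        return "budget_or_board_stop"
    if failure_bucket == "Paperclip product":
        return "product_fix_proposed"
    if failure_bucket in NON_PRODUCT_BUCKETS and rerun_has_signal:
        return "non_product_retry"
    # Anything else (for example "security" or "unclear") is left to a human
    # decision in this sketch rather than being silently retried.
    raise ValueError(f"no automatic outcome for bucket {failure_bucket!r}")
```

Raising on the unmatched case is deliberate: a stop that fits none of the five outcomes should be named on the loop issue, not guessed at.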
|
|
131
|
-
|
|
132
|
-
### 6. Request board confirmation before any product fix
|
|
133
|
-
|
|
134
|
-
When the iteration ends in **product fix proposed**:
|
|
135
|
-
|
|
136
|
-
- Update the iteration child's `plan` document with the proposed contract, the three-invariant check, the affected Paperclip surfaces, and the phased subtasks (implementation, QA, CTO review, rerun) — but do not create those subtasks.
|
|
137
|
-
- Open the `request_confirmation` interaction on the **iteration child** (the same issue that owns the `plan` document), targeting the latest plan revision. Idempotency key: `confirmation:{iterationIssueId}:plan:{revisionId}`. Set `continuationPolicy` to `wake_assignee`.
|
|
138
|
-
- Move the **iteration child** to `in_review`. The typed waiter — the `request_confirmation` interaction — sits directly on it, so its `in_review` is healthy. Comment links the plan document and names the pending confirmation.
|
|
139
|
-
- Move the **loop parent** to `blocked` with `blockedByIssueIds: [iterationChildId]` and a comment naming the board (or whichever approver the approval policy designates) as the unblock owner. Do not move the loop parent to `in_review` here: the typed waiter lives on the iteration child, not on the parent, so the parent's wait path is the child blocker. This matches the topology rule that the loop parent only sits in `in_review` when a typed waiter is attached directly to the parent.
|
|
140
|
-
- Wait for acceptance. If the board posts a superseding comment that changes the plan, revise the document, then open a fresh confirmation tied to the new revision on the iteration child — the prior one is invalidated. The loop parent's `blockedByIssueIds` already points at the iteration child, so it does not need to change.
|
|
141
|
-
- On rejection, end the loop per the **Budget or board stop** rule; do not silently retry the same proposal.
|
|
142
|
-
- On acceptance, create the implementation, QA, CTO review, and rerun child issues with `blockedByIssueIds` wired in order, and update the loop parent's `blockedByIssueIds` to point at the new gating child (typically the implementation child) so the parent stays `blocked` against real downstream work. The implementation child must inherit the Paperclip App execution workspace (`inheritExecutionWorkspaceFromIssueId` to the worktree-owning issue) so the fix lands in the same isolated worktree the smoke ran against.
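The idempotency key format from this step, and why a revised plan needs a fresh confirmation, can be shown in a few lines. The key format and `wake_assignee` come from the bullets above; the payload key names other than `continuationPolicy` are hypothetical and do not describe the actual Paperclip API.

```python
def confirmation_idempotency_key(iteration_issue_id: str, revision_id: str) -> str:
    """Build the key in the documented confirmation:{issue}:plan:{revision} format."""
    return f"confirmation:{iteration_issue_id}:plan:{revision_id}"


def build_confirmation_request(iteration_issue_id: str, revision_id: str) -> dict:
    """Assemble an illustrative request body (field names are assumptions)."""
    return {
        "kind": "request_confirmation",
        "issueId": iteration_issue_id,  # sits on the iteration child, not the parent
        "planRevisionId": revision_id,
        "idempotencyKey": confirmation_idempotency_key(iteration_issue_id, revision_id),
        "continuationPolicy": "wake_assignee",
    }
```

Because the revision id is part of the key, retrying the same request is harmless, while a revised plan yields a different key and therefore a new confirmation.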
|
|
143
|
-
|
|
144
|
-
### 7. Rerun against the same worktree
|
|
145
|
-
|
|
146
|
-
After implementation and QA complete (or immediately, in the **non-product failure with retry** case), the rerun child runs the same `paperclip-bench` invocation with `PAPERCLIPAI_CMD` still pinned to the Paperclip App worktree under test.
|
|
147
|
-
|
|
148
|
-
- The rerun must use the same worktree the fix landed in. If the workspace was reset between iterations, the loop is invalid — open a blocker on the loop issue and stop.
|
|
149
|
-
- On completion, the rerun child becomes the next iteration's run record. If the smoke now passes, jump to Step 8. Otherwise return to Step 4 with a new iteration child (subject to the iteration budget).
|
|
150
|
-
|
|
151
|
-
### 8. Pass: QA, CTO review, close
|
|
152
|
-
|
|
153
|
-
When the smoke passes:
|
|
154
|
-
|
|
155
|
-
- Create QA and CTO review children if they are not already in the dependency chain (CTO review blocked by QA, so the chain wakes in order). Move the loop parent to `blocked` with `blockedByIssueIds` set to the QA / CTO review chain, and post a comment that names QA and CTO as the unblock owners and links the children. The loop parent stays `blocked` — not `in_review` — because the typed waiter lives on the children, not on the parent.
|
|
156
|
-
- If you instead want the loop parent itself to sit in `in_review` during this phase (for example because a board user has explicitly volunteered to drive the review), put a typed waiter directly on the parent — execution-policy participant, `request_confirmation` / `ask_user_questions` / `suggest_tasks` interaction, approval, or named human owner — and do not rely on the child chain alone. Do not combine `in_review` on the parent with QA/CTO children acting as the blocker; that is the ambiguous review shape this skill exists to prevent.
|
|
157
|
-
- QA validates artifacts (manifest, `results.jsonl`, Harbor raw job, redacted telemetry) and the rerun reproducibility against the same worktree.
|
|
158
|
-
- CTO reviews the technical scope of any product fixes that landed during the loop.
|
|
159
|
-
- On QA + CTO acceptance, close the loop issue with a board-level summary comment: task name, iteration count, stop reason (pass), worktree pointer, link to the final artifact root, and the list of accepted product fixes (each with its implementation issue id).
|
|
160
|
-
|
|
161
|
-
### 9. Stop rules
|
|
162
|
-
|
|
163
|
-
The loop **must** stop, with state explicitly recorded on the loop issue, when any of these is true:
|
|
164
|
-
|
|
165
|
-
- **Pass.** Smoke verifier reports pass and QA + CTO accept (Step 8). Loop issue → `done`.
|
|
166
|
-
- **Board rejection.** Board rejects a fix proposal and does not request a revision. Loop issue → `cancelled`. Comment names the rejected proposal and the reason.
|
|
167
|
-
- **Iteration budget reached.** Iteration counter reaches the budget without a pass. Loop issue → `cancelled` (or `in_review` if the user must decide whether to extend the budget). Never silently start iteration N+1.
|
|
168
|
-
- **Real blocker named.** External blocker (credentials, quota, infra, security, missing skill) cannot be resolved by the loop driver. Loop issue → `blocked` with `blockedByIssueIds` to the blocker issue and the unblock owner named.
|
|
169
|
-
|
|
170
|
-
A loop must never end on a prose comment alone. Every stop is a status transition with a named next-action owner.
|
|
171
|
-
|
|
172
|
-
## Worktree rule
|
|
173
|
-
|
|
174
|
-
The loop must not test whatever Paperclip checkout happens to be current for the heartbeat. It must test the same isolated Paperclip App worktree where proposed fixes are applied.
|
|
175
|
-
|
|
176
|
-
- The first iteration creates the Paperclip App implementation child; that project's git-worktree policy spawns a fresh worktree.
|
|
177
|
-
- The loop issue records the worktree-owning issue id and the workspace path (or workspace id).
|
|
178
|
-
- Every later implementation, QA, and rerun child sets `inheritExecutionWorkspaceFromIssueId` to that worktree-owning issue, so all subsequent loop work shares one workspace.
|
|
179
|
-
- The benchmark command always sets `PAPERCLIPAI_CMD` (or the equivalent command binding) to the CLI entrypoint inside that worktree, and it carries the recorded dispatch runner config (`PAPERCLIP_HARBOR_RUNNER_CONFIG` or equivalent) needed to assign the benchmark issue and start the heartbeat. The benchmark command stored on the loop issue is the source of truth — if a heartbeat needs to run the smoke from a different shell, it copies the recorded command block verbatim, not only the Harbor invocation line.
|
|
180
|
-
- If the workspace is pruned or the worktree path no longer resolves, the loop is invalid until rebuilt. Mark the loop `blocked` and name the unblock owner (typically CodexCoder or the Paperclip App owner).
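A heartbeat can check the shared-workspace requirement mechanically. The sketch below assumes each child issue is available as a dictionary carrying an `inheritExecutionWorkspaceFromIssueId` value; the function name, the `id` key, and the sample ids are hypothetical.

```python
def workspace_mismatches(children: list, worktree_issue_id: str) -> list:
    """Return ids of children that do not inherit the worktree-owning issue's workspace."""
    return [
        child["id"]
        for child in children
        if child.get("inheritExecutionWorkspaceFromIssueId") != worktree_issue_id
    ]


# Hypothetical children of one iteration; "APP-1" stands in for the worktree owner.
children = [
    {"id": "impl-2", "inheritExecutionWorkspaceFromIssueId": "APP-1"},
    {"id": "qa-2", "inheritExecutionWorkspaceFromIssueId": "APP-1"},
    {"id": "rerun-2"},  # missing the inheritance link
]
```

Any id returned here means the loop would test a different checkout than the one the fix landed in, so the child should be corrected before it runs.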
|
|
181
|
-
|
|
182
|
-
## Liveness rule
|
|
183
|
-
|
|
184
|
-
Every loop issue, at the end of every heartbeat, must rest in one of:
|
|
185
|
-
|
|
186
|
-
- **Terminal:** `done` or `cancelled`. No further action.
|
|
187
|
-
- **Explicitly live:** `in_progress` with an active run, an upcoming queued wake, or a child issue actively executing under it.
|
|
188
|
-
- **Explicitly waiting:** `in_review` with a typed waiter — execution-policy participant, `request_confirmation` / `ask_user_questions` / `suggest_tasks` interaction, approval, or a named human owner.
|
|
189
|
-
- **Explicit recovery / blocker:** `blocked` with `blockedByIssueIds` set to a real blocking issue, plus a comment naming the unblock owner and the action needed.
|
|
190
|
-
|
|
191
|
-
If a loop issue does not fit one of these on exit, the heartbeat is not done. Fix the state before exiting.
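The four resting states can be expressed as a check run at the end of a heartbeat. This is a sketch under assumptions: the issue is modeled as a plain dictionary, and every key except `status` and `blockedByIssueIds` is a hypothetical name chosen for illustration.

```python
from typing import Optional

TYPED_WAITERS = {"participant", "request_confirmation", "ask_user_questions",
                 "suggest_tasks", "approval", "human_owner"}


def liveness_violation(issue: dict) -> Optional[str]:
    """Return None if the issue rests in an allowed state, else a reason string."""
    status = issue.get("status")
    if status in ("done", "cancelled"):
        return None
    if status == "in_progress":
        if issue.get("active_run") or issue.get("queued_wake") or issue.get("child_executing"):
            return None
        return "in_progress with no active run, queued wake, or executing child"
    if status == "in_review":
        if TYPED_WAITERS & set(issue.get("waiters", [])):
            return None
        return "in_review with no typed waiter"
    if status == "blocked":
        if issue.get("blockedByIssueIds") and issue.get("unblock_owner"):
            return None
        return "blocked without blockedByIssueIds or a named unblock owner"
    return f"status {status!r} is not a resting state"
```

A non-`None` result means the heartbeat is not done: fix the state named in the reason before exiting.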
|
|
192
|
-
|
|
193
|
-
## Pitfalls
|
|
194
|
-
|
|
195
|
-
- **Running the smoke against the operator's Paperclip checkout.** The whole point of the worktree rule is that the bench tests the worktree the fix lands in. Always set `PAPERCLIPAI_CMD` and verify the path before launching the run.
|
|
196
|
-
- **Dropping the dispatch config.** A Harbor run that omits `PAPERCLIP_HARBOR_RUNNER_CONFIG` (or equivalent) may boot Paperclip and create `BEN-1`, but leave it unassigned with zero heartbeat-enabled agents. That is not a Terminal-Bench product signal. Preserve and rerun the full command block, including assignee and adapter config.
|
|
197
|
-
- **Coding before approval.** No implementation child exists until a board confirmation accepts the iteration's `plan` document. Do not push code in the diagnostic phase.
|
|
198
|
-
- **Skipping the recent-work survey.** When proposing a Paperclip product rule, check what already shipped in the affected liveness/execution area in the last few days. A rule that contradicts a recently accepted contract is rework.
|
|
199
|
-
- **Letting `in_review` mean done.** A loop or iteration child sitting in `in_review` with no participant, no interaction, no approval, and no human owner is a stop, not progress. Treat it as a liveness violation and route it.
|
|
200
|
-
- **Silent iteration N+1.** If the iteration budget is reached, never start another iteration without an explicit budget extension recorded on the loop issue.
|
|
201
|
-
- **Comparable-run drift.** This skill produces smoke runs only. If the asker wants a comparable benchmark submission, hand off to BenchmarkQualityManager and BenchmarkForensics — do not relabel a smoke as comparable.
|
|
202
|
-
- **Recursive recovery.** Stranded-work recovery that recovers its own recovery issues is the canonical infinite loop. If a diagnosis surfaces it inside the smoke's subtree, refuse to deepen and route to `/diagnose-why-work-stopped` for a product-rule fix.
|
|
203
|
-
- **Skill-library mutation.** This skill never installs, edits, or assigns company skills as part of a loop iteration. Library changes go to an authorized skill-library owner via a separate issue.
|
|
204
|
-
- **Hiding the chain.** Do not silently delete or hide failed iteration children, retracted proposals, or rejected confirmations. The audit trail is the loop's evidence.
|
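The "Silent iteration N+1" pitfall above amounts to a simple guard before opening a new iteration. The sketch below is illustrative only; the function name and its parameters are hypothetical, and how the counter, budget, and extension are actually stored on the loop issue is not specified here.

```python
def may_start_next_iteration(
    completed: int, budget: int, extension_recorded: bool
) -> bool:
    """Decide whether another iteration may start.

    Hypothetical helper: `completed` is the number of iterations already run,
    `budget` is the agreed iteration budget, and `extension_recorded` is True
    only if an explicit budget extension is recorded on the loop issue.
    """
    if completed < budget:
        # Still within budget: no extension needed.
        return True
    # Budget reached: only an explicit, recorded extension allows more work.
    return extension_recorded
```

With a budget of 3 and three iterations completed, the guard returns `False` unless an extension has been recorded.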
|
205
|
-
|
|
206
|
-
## Verification checklist (before exiting a heartbeat that touched the loop)
|
|
207
|
-
|
|
208
|
-
- [ ] All inputs are recorded on the top-level loop issue, including the exact benchmark command, `PAPERCLIPAI_CMD` binding, and dispatch runner config.
|
|
209
|
-
- [ ] Iteration counter is up to date and within budget.
|
|
210
|
-
- [ ] The Paperclip App worktree pointer still resolves, and the iteration's run/implementation/rerun children share that workspace.
|
|
211
|
-
- [ ] The smoke run is captured with run ids, manifest, `results.jsonl`, Harbor raw job folder, and stop reason.
|
|
212
|
-
- [ ] Paperclip telemetry shows the benchmark issue was assigned and a heartbeat was enabled/observed, or the iteration is explicitly classified as harness/setup no-dispatch.
|
|
213
|
-
- [ ] Diagnosis applies the `/diagnose-why-work-stopped` pattern, classifies every non-progressing issue, and checks the three invariants.
|
|
214
|
-
- [ ] No implementation child exists for an unapproved fix proposal; if one was proposed, a `request_confirmation` is open against the latest plan revision.
|
|
215
|
-
- [ ] Every loop and iteration issue rests in a terminal, explicitly-live, explicitly-waiting, or named-blocker state.
|
|
216
|
-
- [ ] The stop reason — if the loop stopped this heartbeat — is one of pass, board rejection, budget exhausted, or named real blocker.
|
|
217
|
-
- [ ] No company-skill library mutation happened in this heartbeat.
|
|
218
|
-
|
|
219
|
-
## Deterministic smoke
|
|
220
|
-
|
|
221
|
-
Run this smoke after installing or changing the skill, before treating it as operational for a live Terminal-Bench loop:
|
|
222
|
-
|
|
223
|
-
```sh
|
|
224
|
-
pnpm smoke:terminal-bench-loop-skill
|
|
225
|
-
```
|
|
226
|
-
|
|
227
|
-
The command reads the Paperclip API URL, API token, and company from `PAPERCLIP_API_URL`, `PAPERCLIP_API_KEY`, and `PAPERCLIP_COMPANY_ID`. When `PAPERCLIP_TASK_ID` is set, it attaches the smoke issues under that source issue and inherits its project/goal context. By default it cancels the short-lived smoke issues after verification; pass `-- --keep` to leave the verified `blocked` loop parent, `in_review` iteration child, and pending confirmation available for manual inspection.
|
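Because the smoke depends on three environment variables, a preflight check before invoking it can save a failed run. The following is a minimal sketch under the assumption that all three variables are required and `PAPERCLIP_TASK_ID` is optional; `missing_smoke_env` is a hypothetical helper, not part of the package.

```python
from typing import List, Mapping

# Variables the smoke command reads, per the paragraph above.
REQUIRED_SMOKE_ENV = (
    "PAPERCLIP_API_URL",
    "PAPERCLIP_API_KEY",
    "PAPERCLIP_COMPANY_ID",
)


def missing_smoke_env(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_SMOKE_ENV if not env.get(name)]
```

Calling `missing_smoke_env(os.environ)` before `pnpm smoke:terminal-bench-loop-skill` and stopping on a non-empty result keeps a misconfigured shell from creating partial smoke issues.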
|
228
|
-
|
|
229
|
-
The smoke is deterministic and intentionally non-comparable. It does not start Terminal-Bench, Harbor, an agent model, or a provider runtime. It verifies only the control-plane shape:
|
|
230
|
-
|
|
231
|
-
- local `skills/terminal-bench-loop/SKILL.md` contains the loop contract terms;
|
|
232
|
-
- a top-level loop issue can be created and updated into a blocker posture;
|
|
233
|
-
- an iteration child issue can be created under the loop parent;
|
|
234
|
-
- mocked benchmark artifact paths are recorded on a `run` document;
|
|
235
|
-
- a `diagnosis` document names the exact stop point and next-action owner;
|
|
236
|
-
- a `request_confirmation` interaction is created and the iteration child rests in `in_review` with a typed waiting path rather than silent review.
|