osborn 0.9.49 → 0.9.51

@@ -0,0 +1,247 @@
1
+ # Ground Assumptions
2
+
3
+ ## SKILL IDENTITY
4
+ Name: ground-assumptions
5
+ Install path: ~/.claude/skills/ground-assumptions/SKILL.md
6
+ Portable: yes — drops into any agent's skills dir (Claude Code, osborn on Fly, other Claude Agent SDK hosts)
7
+
8
+ ## WHEN THIS SKILL ACTIVATES
9
+ This skill applies whenever the conversation enters a **planning / design / architecture phase**.
10
+ Specifically, if the user says or asks any of:
11
+
12
+ - "let's plan / design / architect..."
13
+ - "how should we..."
14
+ - "what's the best way to..."
15
+ - "I'm thinking we..."
16
+ - "what do you recommend..."
17
+ - "approach", "architecture", "design", "should we"
18
+ - Any time I am about to recommend an implementation strategy, performance characteristic,
19
+ behavioral guarantee, or comparative judgment ("X is faster than Y", "this propagates", "this scales")
20
+
21
+ Also activates explicitly with:
22
+ - "ground assumptions"
23
+ - "verify before planning"
24
+ - "check that hypothesis"
25
+
26
+ ## CORE PRINCIPLE
27
+ When you're **planning new work — a new feature, a new integration, or fitting
28
+ a change into the existing architecture** — every load-bearing assumption is a
29
+ **hypothesis until verified against real evidence.** Training-data intuition and
30
+ "it should work" do not count.
31
+
32
+ The canonical situation this skill is for: *we're about to implement something
33
+ new (e.g. add OpenAI Codex as an agent option alongside Claude Code), and we
34
+ need to know it actually fits our architecture one-to-one — does Codex's session
35
+ model, SDK, and data storage map to what we already do — BEFORE we build on that
36
+ assumption.* That's the shape: a new piece must slot into the existing system,
37
+ and we confirm the fit with evidence, not hope.
38
+
39
+ **What counts as verification** (in priority order):
40
+ 1. **Existing tests / previous test runs** — has this already been proven? check first.
41
+ 2. **Authoritative documentation / source code** — does the doc or the actual code confirm the behavior?
42
+ 3. **A newly-created, targeted test** — if nothing above answers it, build a
43
+ specific test (or several) that exercises exactly the assumption, run it on
44
+ real infrastructure, and read the result.
45
+
46
+ **The main agent does NOT do the verification work itself.** Spawn subagents
47
+ (Agent tool) to run the tests / read the docs / check prior results, so the main
48
+ agent stays free to keep the planning conversation moving and react to results
49
+ as they land. **Delegation rule (hard):** when verification is needed, ALWAYS
50
+ delegate to a subagent — don't call Bash/WebSearch/Grep inline yourself. Spawn
51
+ multiple subagents in parallel when there are several independent assumptions to
52
+ check. (Exception: the user explicitly asks you to run something inline — then
53
+ say you're breaking the delegation pattern.)
54
+
55
+ Why this matters — past sessions shipped plans on unverified assumptions and
56
+ paid for it: a "small" change broke an unrelated subsystem; a behavior we
57
+ *assumed* ("this propagates", "this auto-updates") silently failed; an
58
+ integration we *assumed* was 1:1 wasn't. The expensive surprises live in
59
+ **architectural fit and integration**, not in micro-benchmarks. (Timing claims
60
+ matter too — don't say "fast"/"Xs" without measuring — but they are the *least*
61
+ of it. Lead with "does this fit / does this actually work", not "how fast".)
62
+
63
+ The discipline: **verify against evidence — existing tests, docs, or a new
64
+ delegated test — before the assumption is surfaced as fact or built upon.**
65
+
66
+ ## ASSUMPTION PRIORITY ORDER (verify highest first)
67
+
68
+ When verification budget is limited, ALWAYS verify in this order:
69
+
70
+ 1. **ARCHITECTURAL IMPACT** — does this change break or alter existing flows / subsystems?
71
+ - "Does the new entrypoint affect the OAuth flow?"
72
+ - "If we move osborn off the volume, does session resume still work?"
73
+ - "Does the bind-mount conflict with Fly's shutdown umount?"
74
+ 2. **INTEGRATION** — does this work with all the connected pieces (auth, network, sessions, data, persistence, MCP, recording, etc.)?
75
+ - "Does Claude Code's `setup-token` pty work inside chroot?"
76
+ - "Does the frontend's `/api/sandbox` fetch-log still read the right path?"
77
+ 3. **BEHAVIORAL** — does the system actually do what we claim it does?
78
+ - "Does image-swap actually replace the running osborn binary?"
79
+ 4. **TIMING** — is the speed claim true under real conditions?
80
+ - "Is the seed tarball really 5s to extract?"
81
+ 5. **COSMETIC** — minor polish items that don't gate the architecture.
82
+
83
+ Timing claims are the LOWEST priority. We've burned multiple cycles on timing measurements
84
+ while missing that the architecture itself had subtle bugs that broke other parts of the system.
85
+ Architectural and integration assumptions are where the expensive surprises live.
86
+
87
+ ## THE WORKFLOW (followed strictly during planning)
88
+
89
+ ### 1. PLAN DRAFT
90
+ State the proposed plan as usual — fully, with intent and reasoning.
91
+
92
+ ### 2. ASSUMPTION EXTRACTION
93
+ Before presenting the plan as a recommendation, **list every load-bearing assumption**.
94
+ A load-bearing assumption is anything where, if it's wrong, the plan stops working.
95
+
96
+ For each assumption, **also identify its second/third-order implications** —
97
+ what else in the system depends on it being true?
98
+
99
+ Format:
100
+ ```
101
+ ASSUMPTIONS (must be verified before plan ships):
102
+ 1. <claim that the plan depends on>
103
+ → implications: <what else breaks if 1 is false>
104
+ 2. <claim that the plan depends on>
105
+ → implications: <what else breaks if 2 is false>
106
+ 3. ...
107
+ ```
108
+
109
+ If an assumption can't be stated cleanly in one sentence, it isn't ready to be tested.
110
+ Break it down further.
111
+
112
+ **Ripple-effect check** (do this once for every plan that touches existing architecture):
113
+
114
+ Ask explicitly:
115
+ - What existing flows touch the system we're changing?
116
+ (auth, network, sessions, MCP, recording, persistence, log-fetch, dashboard, voice loop)
117
+ - For each connected flow, can the change break it in a non-obvious way?
118
+ - Is there a code path we don't know about that works today only because of the old behavior?
119
+
120
+ If any connected flow could break, or any hidden dependency on the old behavior exists, add the affected flow as a new assumption that needs verification.
121
+ This is where the expensive surprises hide.
122
+
123
+ ### 3. ASYNC VERIFIER SPAWN (parallel, non-blocking)
124
+ For EACH assumption, spawn an Agent subagent **immediately**, in a SINGLE message
125
+ with multiple Agent tool calls so they run concurrently.
126
+
127
+ Choose verifier type by the nature of the assumption. **Architectural and integration verifiers come first** — they catch the expensive surprises.
128
+
129
+ | Assumption type | Verifier type | What it does |
130
+ |---|---|---|
131
+ | **Architectural impact** (`does X break flow Y?`) | **Ripple agent** | Traces all callers/consumers of the changed component, checks each for breakage |
132
+ | **Integration** (`does X work with subsystem Y?`) | **Integration agent** | Runs an end-to-end test exercising the connection between subsystems |
133
+ | Behavioral (`does X`, `propagates`, `survives Y`) | Test agent | Triggers the behavior, observes outcome |
134
+ | Documented (`API supports X`, `library does Y`) | Research agent | Fetches docs/code/sources, returns citation with quote |
135
+ | Derivable (`X+Y → Z`) | Reasoning agent | Derives from established facts, returns chain |
136
+ | Timing (`Xs`, `fast`, `slow`) | Test agent | Runs the actual operation on real infra, measures under stated conditions |
137
+ | Empirical (`users typically do X`) | Research agent | Cites surveys/data/observations |
138
+
139
+ Each verifier returns one of:
140
+ - **MEASURED**: empirical observation with conditions documented
141
+ - **SOURCED**: cited from authoritative source with quote
142
+ - **DERIVED**: chain from established facts
143
+ - **CONTRADICTED**: evidence that the assumption is false
144
+ - **UNVERIFIABLE**: cannot be determined in available time/resources
145
+
146
+ ### 4. CONTINUE PLANNING (don't block on verifiers)
147
+ Keep talking with the user through design tradeoffs, edge cases, etc.
148
+ **Main agent is NOT in the test loop.** Verifiers run in the background.
149
+ DO NOT commit to a recommendation until verifiers report.
150
+
151
+ **Main agent's role while verifiers run:**
152
+ - Stay in conversation with the user
153
+ - Sketch more of the plan / explore tradeoffs / answer questions
154
+ - Track which verifiers are still in flight, which returned, which contradicted
155
+ - React to verifier results as they arrive — don't poll, don't wait silently
156
+
157
+ **Things the main agent should NOT do while verifiers are in flight:**
158
+ - Run a Bash command that performs the same test (defeats delegation)
159
+ - Read files the verifier is already reading (duplicative)
160
+ - "Just check one quick thing myself" — that's how delegation collapses
161
+ - Block the conversation until results come back
162
+
163
+ ### 5. INTEGRATE RESULTS
164
+ When a verifier returns:
165
+ - **MEASURED / SOURCED / DERIVED** → mark assumption ✓, keep going
166
+ - **CONTRADICTED** → STOP, mark assumption ✗, announce: "Assumption N contradicted by <evidence>. Replanning." → restart at step 1 with revised approach
167
+ - **UNVERIFIABLE** → mark ⚠️, ask user: "Cannot verify <assumption>. Proceed with explicit risk, or pivot to a verifiable approach?"
168
+
169
+ ### 6. COMMITTED PLAN
170
+ Only present a plan as the recommended approach when every assumption is
171
+ ✓ MEASURED, ✓ SOURCED, ✓ DERIVED, or explicitly accepted as ⚠️ UNVERIFIABLE.
172
+
173
+ ## OUTPUT FORMAT
174
+
175
+ While verifiers are in flight:
176
+ ```
177
+ PLAN: <draft summary>
178
+
179
+ ASSUMPTIONS (verifiers running in parallel):
180
+ ☐ A1: <assumption>
181
+ ☐ A2: <assumption>
182
+ ☐ A3: <assumption>
183
+ ```
184
+
185
+ As verifiers return:
186
+ ```
187
+ ✓ A1 MEASURED: <result> (conditions: <where/when/setup>)
188
+ ✓ A2 SOURCED: <URL> — "<quote>"
189
+ ✗ A3 CONTRADICTED: <evidence>
190
+ → STOPPING. Replanning around A3.
191
+ ```
192
+
193
+ Final state:
194
+ ```
195
+ VERIFIED PLAN:
196
+ <plan with every assumption marked ✓ or explicitly ⚠️>
197
+ ```
198
+
199
+ ## HARD RULES (no exceptions)
200
+
201
+ 1. **No naked "it fits / it works" claims.** Never assert that a new piece integrates with the existing architecture — "Codex maps 1:1 to our session model", "this slots into the existing flow", "the SDK stores data the same way" — without backing from an existing test, the actual docs/source, or a new delegated test.
202
+ 2. **No naked behavioral claims.** Never write "auto-updates", "propagates", "survives", "rolls back", "X just works" without MEASURED or SOURCED backing.
203
+ 3. **Check for existing evidence FIRST.** Before commissioning a new test, have a subagent check whether a previous test run, doc, or the source already answers it. Don't re-test what's already proven.
204
+ 4. **CONTRADICTED stops everything.** When a verifier contradicts an assumption, write NO new content about the plan until the plan has been revised and the verifier rerun against the revised plan.
205
+ 5. **UNVERIFIABLE is loud.** Mark it ⚠️ in the output AND ask the user for explicit acceptance. Don't hide unverified parts in prose.
206
+ 6. **Training-data intuition is forbidden as evidence.** "X typically works this way" is not a citation. Verify against a test, doc, or source — or skip.
207
+ 7. **Timing/comparative claims are the least of it, but still bound:** don't write "fast"/"slow"/"X seconds"/"faster than Y" without a measurement + conditions. Just don't let speed-benchmarking crowd out the architectural-fit and integration checks, which are where the expensive surprises actually live.
208
+
209
+ ## SUBAGENT SPAWNING PATTERNS
210
+
211
+ For timing/behavioral tests on real infra:
212
+ > Spawn Agent subagent with prompt: "Run <specific command> on <specific target>. Measure <specific metric>. Report back the measurement and the conditions (machine type, memory, network state, cold/warm cache). Do not attempt the broader task — only verify this one assumption."
213
+
214
+ For documentation lookups:
215
+ > Spawn Agent subagent with prompt: "Find authoritative source for <specific claim>. Return URL + verbatim quote. If multiple sources, prefer official docs > vendor blogs > Stack Overflow. If no source exists, return UNVERIFIABLE with reasoning."
216
+
217
+ For derivation:
218
+ > Spawn Agent subagent with prompt: "Given these established facts: <list>, can we derive <claim>? Return either the derivation chain OR 'cannot derive — gap at: <step>'."
219
+
220
+ For ripple-effect / architectural impact (HIGHEST priority):
221
+ > Spawn Agent subagent with prompt: "Trace all consumers / callers / dependencies of `<component being changed>` in the codebase. For each consumer, check whether the proposed change would break it. Report each potential break with file:line and the specific failure mode. Do not propose fixes — only enumerate breaks."
222
+
223
+ For integration testing (HIGH priority):
224
+ > Spawn Agent subagent with prompt: "End-to-end test: after the proposed change, exercise the connection between `<subsystem A>` and `<subsystem B>` on real infra. Specifically, verify `<concrete cross-system flow>`. Report MEASURED behavior and any divergence from the expected flow."
225
+
226
+ **Pattern**: issue all Agent tool calls in a single response message so they run concurrently rather than sequentially. The Agent tool's `subagent_type` should be `general-purpose` or `Explore` (read-only) depending on what the verifier needs.
227
+
228
+ ## PORTABILITY NOTES
229
+
230
+ This skill works in any Claude Agent SDK environment because:
231
+ - It only requires the Agent tool (standard SDK feature)
232
+ - The trigger logic is prose, not code
233
+ - No host-specific paths, IDs, or APIs
234
+
235
+ To deploy on another agent (e.g. osborn on a Fly machine), copy this SKILL.md to that agent's skills dir:
236
+ - Claude Code: `~/.claude/skills/ground-assumptions/SKILL.md`
237
+ - osborn on Fly: `/workspace/root-chroot/root/.claude/skills/ground-assumptions/SKILL.md`
238
+ - Other Claude Agent SDK hosts: their configured skills path
239
+
240
+ ## EXIT CRITERIA
241
+
242
+ The skill releases its grip on a conversation when:
243
+ - All assumptions are verified and a committed plan exists, OR
244
+ - The user explicitly asks to skip verification ("just give me your best guess"), OR
245
+ - The conversation shifts away from planning to execution of an already-verified plan
246
+
247
+ In the second case, mark the response with "WARNING: Skipping verification at user request. The following is unverified intuition." so the lack of grounding is visible.
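The result-handling and commit rules in steps 5 and 6 of the added SKILL.md can be read as a small state check. The sketch below is only an illustration of that reading, not part of the package: the five status labels and the marks come from the skill text, while `markFor`, `planState`, and the `{ status, acceptedByUser }` record shape are hypothetical names introduced here.

```javascript
// Status labels are taken from the skill text; everything else is assumed.
const VERIFIED = new Set(['MEASURED', 'SOURCED', 'DERIVED']);

// Map a verifier status to the mark used in the skill's output format.
function markFor(status) {
  if (VERIFIED.has(status)) return '✓';
  if (status === 'CONTRADICTED') return '✗';
  if (status === 'UNVERIFIABLE') return '⚠️';
  return '☐'; // no result yet: verifier still in flight
}

// Decide what the main agent may do with the plan right now.
// assumptions: array of { status?: string, acceptedByUser?: boolean }
function planState(assumptions) {
  // Any contradiction stops everything and forces a replan (step 5).
  if (assumptions.some((a) => a.status === 'CONTRADICTED')) return 'REPLAN';
  // Commit only when every assumption is verified, or is UNVERIFIABLE
  // and the user has explicitly accepted the risk (step 6).
  const ready = assumptions.every(
    (a) =>
      VERIFIED.has(a.status) ||
      (a.status === 'UNVERIFIABLE' && a.acceptedByUser === true)
  );
  return ready ? 'COMMIT' : 'PENDING';
}
```

Under this reading, an `UNVERIFIABLE` assumption that the user has not explicitly accepted keeps the plan in `PENDING`, which matches the "UNVERIFIABLE is loud" rule.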
package/dist/index.js CHANGED
@@ -1092,6 +1092,21 @@ async function main() {
1092
1092
  return;
1093
1093
  try {
1094
1094
  const llm = currentLLM;
1095
+ // Heap-OOM fix (2026-06-02): stop the PipelineDirectLLM summary-index
1096
+ // watcher BEFORE we abort + drop the reference. The watcher is a 10s
1097
+ // setInterval whose closure retains the entire PipelineDirectLLM →
1098
+ // ClaudeLLM object graph. killCurrentLLM is the single chokepoint all
1099
+ // three cleanup sites (Disconnected, previous-session-cleanup,
1100
+ // ParticipantDisconnected) call, but it previously only aborted the
1101
+ // SDK subprocess — leaving the interval (and the whole graph) alive and
1102
+ // uncollectable on every disconnect/reconnect. A reconnect-heavy session
1103
+ // (e.g. 15 reconnects from a frontend redeploy) leaked 15 timers + 15
1104
+ // retained graphs, each re-reading JSONL every 10s, until the node heap
1105
+ // OOM'd (~980MB) and the process crashed. Stopping the watcher here lets
1106
+ // the abandoned graph be GC'd. Duck-typed: only PipelineDirectLLM has it.
1107
+ if (typeof llm.stopIndexWatcher === 'function') {
1108
+ llm.stopIndexWatcher();
1109
+ }
1095
1110
  if (typeof llm.abortQuery === 'function') {
1096
1111
  llm.abortQuery();
1097
1112
  }
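The leak described in the comment above comes down to a `setInterval` callback whose closure keeps its owner reachable until `clearInterval` runs. The following is a minimal sketch of that pattern and of the duck-typed cleanup order used in the hunk. `WatcherLLM`, its fields, and `cleanupLLM` are hypothetical stand-ins; only the method names `stopIndexWatcher` and `abortQuery` come from the diff, and the real `PipelineDirectLLM` internals are not shown in the source.

```javascript
// Hypothetical stand-in for an LLM wrapper that owns a periodic watcher.
class WatcherLLM {
  constructor(intervalMs = 10000) {
    this.ticks = 0;
    this.aborted = false;
    // The arrow function captures `this`, so the active timer keeps the
    // whole object (and anything it references) reachable.
    this.indexWatcher = setInterval(() => {
      this.ticks += 1;
    }, intervalMs);
  }

  // Clearing the interval drops the timer's reference to the closure,
  // which lets the object be collected once nothing else points to it.
  stopIndexWatcher() {
    if (this.indexWatcher !== null) {
      clearInterval(this.indexWatcher);
      this.indexWatcher = null;
    }
  }

  abortQuery() {
    this.aborted = true;
  }
}

// Same order as the hunk: stop the watcher first, then abort the query.
// Duck typing means objects without either method are left alone.
function cleanupLLM(llm) {
  if (typeof llm.stopIndexWatcher === 'function') {
    llm.stopIndexWatcher();
  }
  if (typeof llm.abortQuery === 'function') {
    llm.abortQuery();
  }
}
```

Calling `cleanupLLM` twice on the same object is harmless in this sketch, because `stopIndexWatcher` checks for `null` before clearing.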
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "osborn",
3
- "version": "0.9.49",
3
+ "version": "0.9.51",
4
4
  "description": "Voice AI coding assistant - local agent that connects to Osborn frontend",
5
5
  "type": "module",
6
6
  "bin": {