instar 1.2.72 → 1.2.74

@@ -0,0 +1,62 @@
1
+ # Upgrade Guide — one ranked working set across the memory stores
2
+
3
+ <!-- bump: minor -->
4
+ <!-- minor = new features, new APIs, new capabilities (backwards-compatible) -->
5
+
6
+ ## What Changed
7
+
8
+ **The memory stores now feed one ranked reading list instead of three separate ones.**
9
+
10
+ Topic-intent (conversation facts + task frame), Playbook (tagged context items),
11
+ and semantic/episodic memory each had their own read path and ranking. This unifies
12
+ the **reading**: the existing working-memory assembler — already a token-budgeted
13
+ multi-source context builder — now also draws from topic-intent and Playbook, and
14
+ ranks everything together by relevance blended with a recency-decay factor (so
15
+ fresher, more-relevant context floats up; stale context sinks but isn't deleted and
16
+ re-warms on reference).
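
The blended ranking described above can be sketched as follows. This is a minimal illustration, not the shipped code: the field names on `WorkingSetItem`, the exponential decay curve, and the 7-day half-life are all assumptions, since the notes only say that relevance is blended with a recency-decay factor.

```typescript
// Hypothetical item shape; the real WorkingSetItem fields are not shown in this diff.
interface WorkingSetItem {
  source: "topic-intent" | "playbook" | "memory";
  text: string;
  relevance: number; // assumed to lie in [0, 1]
  lastTouchedMs: number; // assumed epoch-ms timestamp of the last reinforcement
}

// Assumed decay curve: exponential with a 7-day half-life (illustrative value only).
const HALF_LIFE_MS = 7 * 24 * 60 * 60 * 1000;

// Returns 1 for an item touched right now and halves for every half-life of age.
function recencyDecay(lastTouchedMs: number, nowMs: number): number {
  const ageMs = Math.max(0, nowMs - lastTouchedMs);
  return Math.pow(0.5, ageMs / HALF_LIFE_MS);
}

// Relevance multiplied by recency decay: stale items sink but never reach zero.
function blendedScore(item: WorkingSetItem, nowMs: number): number {
  return item.relevance * recencyDecay(item.lastTouchedMs, nowMs);
}

// Sorts a copy of the items, highest blended score first, across all sources.
function rankWorkingSet(items: WorkingSetItem[], nowMs: number): WorkingSetItem[] {
  return [...items].sort((a, b) => blendedScore(b, nowMs) - blendedScore(a, nowMs));
}
```

Under these assumptions, "re-warming" is just refreshing `lastTouchedMs`: the item keeps its relevance and its decay factor returns to 1, so nothing needs to be deleted or restored.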
17
+
18
+ The important design choice (ratified): this unifies the **reading, not the
19
+ storage**. The three stores keep their own backends and write paths; the assembler
20
+ is the single unified read. That makes it additive and reversible — and it's
21
+ guarded by a regression pin: **when the new sources are empty, the assembled
22
+ context is byte-for-byte what it was before.** Playbook is read straight from its
23
+ manifest files (no Python invoked in the hot assembly path), and every new source
24
+ is degrade-safe — a missing or erroring source contributes nothing and never
25
+ breaks assembly.
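
One way to picture the degrade-safe sources and the regression pin together is the sketch below. The helper names `readDegradeSafe` and `appendWorkingSet` and the rendered section layout are hypothetical; only the behavior (a failing source contributes nothing, empty sources leave the output untouched) comes from the notes above.

```typescript
// A source reader is any function that returns items; it may throw.
type SourceReader<T> = () => T[];

// Hypothetical wrapper: a missing or erroring source contributes nothing.
function readDegradeSafe<T>(read: SourceReader<T>): T[] {
  try {
    const items = read();
    return Array.isArray(items) ? items : [];
  } catch {
    return []; // swallow the error so assembly never breaks
  }
}

// Hypothetical renderer: appends a section only when there is content.
// With no lines, the base context is returned as the very same string,
// which is what a byte-for-byte regression pin would assert.
function appendWorkingSet(baseContext: string, lines: string[]): string {
  if (lines.length === 0) return baseContext;
  const body = lines.map((line) => `- ${line}`).join("\n");
  return `${baseContext}\n\n## Working Set\n${body}`; // assumed section layout
}
```

Keeping the empty case as an early return, rather than appending an empty section, is what makes the change reversible without touching existing output.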
26
+
27
+ The result shows up as a "Working Set" section in the assembled context
28
+ (`GET /session/context/:topicId`), drawing from all sources under one budget, with
29
+ per-source contribution visible in the response's `sources`.
30
+
31
+ **Evidence**: 11 new tests (8 unit — blended ranking, both source adapters,
32
+ degrade-safety, and the regression pin; 3 boot-path route tests confirming the
33
+ unified read path is alive and surfaces topic-intent in a Working Set section, and
34
+ that a ref-less topic is unchanged). The existing 49 assembler + working-memory
35
+ tests stay green (the regression pin, confirmed). `tsc` + lint clean.
36
+
37
+ Spec: `docs/specs/cwa-unify-stores.md` (approved; Claude-authored + manual review —
38
+ this is the most architectural rung, fuller multi-model review advisable, caveat
39
+ ratified). ELI16: `docs/specs/cwa-unify-stores.eli16.md`. Side-effects:
40
+ `upgrades/side-effects/cwa-unify-stores.md`.
41
+
42
+ ## What to Tell Your User
43
+
44
+ - **One sorted view of what matters**: "I used to keep 'what's relevant right now'
45
+ in three separate notebooks that didn't talk to each other. Now they feed one
46
+ ranked reading list, so when we pick something up I get a single sorted view —
47
+ freshest and most-relevant first — instead of three half-answers."
48
+
49
+ ## Summary of New Capabilities
50
+
51
+ | Capability | How to Use |
52
+ |-----------|-----------|
53
+ | Unified working set in assembled context | Automatic — `GET /session/context/:topicId` now includes a "Working Set" section drawing from topic-intent + Playbook + memory |
54
+ | Blended relevance×recency ranking across sources | Automatic (the assembler ranks all sources together) |
55
+
56
+ ## Evidence
57
+
58
+ Not a bug fix — a new capability over the existing assembler. Verified by 11 new
59
+ tests including 3 that boot the real AgentServer and confirm the assembled-context
60
+ route surfaces a topic-intent ref in a "Working Set" section, plus a regression pin
61
+ (a unit test plus the 49 existing tests, unchanged) proving the assembled output is
62
+ identical when the new sources are empty. `tsc` + lint clean.
@@ -0,0 +1,83 @@
1
+ # Side-effects review — unify the working-awareness stores (rung 2)
2
+
3
+ **Scope**: Give the working-awareness stores one ranked read path. Per the
4
+ ratified spec (`docs/specs/cwa-unify-stores.md`), this unifies the READ — extends
5
+ the existing `WorkingMemoryAssembler` to draw topic-intent refs + Playbook
6
+ manifest items into one ranked working-set section — NOT a physical store
7
+ migration. Additive, reversible, regression-pinned.
8
+
9
+ **Files touched**:
10
+ - `src/memory/WorkingSet.ts` — NEW. The `WorkingSetItem` lingua franca +
11
+ `blendedScore`/`rankWorkingSet` (relevance × recency-decay) + two read-only,
12
+ degrade-safe source adapters: `topicIntentToWorkingSet` (refs at/above tentative
13
+ → items; relevance = confidence, recency = lastReinforcedAt) and
14
+ `playbookManifestToWorkingSet` (scans `{stateDir}/playbook/**.json` manifests,
15
+ trigger/tag-gated by the query, relevance = match + usefulness, recency =
16
+ freshness — **never invokes the Python scripts**).
17
+ - `src/memory/WorkingMemoryAssembler.ts` — optional `topicIntentStore` + `stateDir`
18
+ config; `workingSet` token budget (default 500); a new "Working Set" section
19
+ appended AFTER the existing knowledge/episodes/relationships sections, gated on
20
+ the new deps + content; `assembleWorkingSet`/`renderWorkingSet`; section header.
21
+ - `src/commands/server.ts` — pass `topicIntentStore` + `stateDir` to the assembler
22
+ and broaden its construction gate to include `topicIntentStore` (so the unified
23
+ read path is available even in minimal setups).
24
+
25
+ **Under-block**: None. The new section is purely additive context; it gates
26
+ nothing. Each new source is read-only and wrapped so an error/absence contributes
27
+ nothing.
28
+
29
+ **Over-block**: None. No authority anywhere — the assembler informs context, it
30
+ doesn't decide.
31
+
32
+ **Level-of-abstraction fit**: The stores keep their backends and write paths; only
33
+ the READ is unified, via the assembler that already does token-budgeted
34
+ multi-source assembly. The new sources speak the same `WorkingSetItem` shape and
35
+ flow through the same budget + render machinery. Playbook is read at the manifest
36
+ level (a stable JSON file), not through its Python CLI — the right seam for a fast,
37
+ degrade-safe assembler.
38
+
39
+ **Signal vs authority**: N/A — pure read/ranking.
40
+
41
+ **Interactions**:
42
+ - **REGRESSION PIN (load-bearing):** the working-set section is appended after the
43
+ existing three and only when `topicIntentStore || stateDir` is configured AND a
44
+ source returns content. With the new deps absent OR their sources empty, the
45
+ assembled output is byte-for-byte unchanged — verified by (a) a dedicated unit
46
+ test, (b) the existing 26 assembler unit tests + 9 working-memory route tests +
47
+ the assembler-context route tests all still green, (c) a route test asserting a
48
+ ref-less topic yields no working-set section.
49
+ - The assembler construction gate is broadened to include `topicIntentStore` (always
50
+ present), so the assembler now constructs in more setups. This is additive —
51
+ callers in minimal setups gain a working-set context they didn't have; existing
52
+ setups' output is unchanged (pin).
53
+ - The new section draws only from REMAINING token budget after the existing
54
+ sources, so it cannot starve them.
55
+ - Playbook manifest scan is bounded (depth-limited, per-file try/catch) and
56
+ trigger-gated (empty query → no Playbook items), so it can't flood or slow
57
+ assembly.
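
The remaining-budget interaction above can be sketched like this. The function name `fitWorkingSet`, the four-characters-per-token estimate, and the parameter layout are assumptions for illustration; only the 500-token default and the "remaining budget only" rule come from this review.

```typescript
// Assumed token estimator: roughly 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: keeps ranked lines until the working-set allowance is spent.
// The allowance is the smaller of the section budget and whatever the existing
// sources left over, so the new section cannot take tokens from them.
function fitWorkingSet(
  rankedLines: string[],
  totalBudget: number,
  usedByExisting: number,
  workingSetBudget = 500,
): string[] {
  const available = Math.max(0, Math.min(workingSetBudget, totalBudget - usedByExisting));
  const kept: string[] = [];
  let spent = 0;
  for (const line of rankedLines) {
    const cost = estimateTokens(line);
    if (spent + cost > available) break; // stop at the first line that does not fit
    kept.push(line);
    spent += cost;
  }
  return kept;
}
```

For example, `fitWorkingSet(lines, 100, 100)` returns an empty array under these assumptions, because the existing sections already consumed the whole budget.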
58
+
59
+ **External surfaces**: New module `src/memory/WorkingSet.ts` (exports
60
+ `WorkingSetItem`, `blendedScore`, `rankWorkingSet`, the two adapters). New optional
61
+ assembler config fields + a `workingSet` budget. The assembled-context HTTP routes
62
+ now MAY include a "Working Set" section. No new endpoint, no config-shape change
63
+ for users (the assembler deps are wired server-side).
64
+
65
+ **Deferred (tracked)**: deeper cross-source blended re-ranking of the existing
66
+ sources (`cwa-physical-store-merge` rejected; cross-blend is implicit future work),
67
+ the Usher / mid-task re-surfacing (`cwa-usher`), capability+standards descriptors
68
+ as sources (`cwa-capability-index-context`). All in the spec's non-goals.
69
+
70
+ **Rollback cost**: Low. Drop the working-set section + the two adapter calls (or
71
+ just don't pass `topicIntentStore`/`stateDir`); the assembler returns to its
72
+ current three sources. No data migration. The regression pin guarantees the
73
+ revert is a no-op for existing output.
74
+
75
+ **Migration parity**: Additive assembler sources + budget default + the broadened
76
+ construction gate — all server-side (every agent gets it on update). No store
77
+ schema change, no hook/template/skill change, no user config change required.
78
+
79
+ **Convergence honesty**: Claude-authored + manual review; full multi-model
80
+ convergence tooling absent on host. This is the most architectural rung (touches
81
+ the shared assembler), so the regression pin is the primary safety and a fuller
82
+ multi-model review remains advisable — but the pin + the unchanged existing suites
83
+ bound the risk tightly.
@@ -0,0 +1,54 @@
1
+ # Side-Effects Review — Never a False Blocker (B17_FALSE_BLOCKER)
2
+
3
+ **Slug:** `never-a-false-blocker-standard`
4
+ **Date:** 2026-05-24
5
+ **Author:** echo
6
+ **Second-pass reviewer:** internal adversarial convergence (two reviewers) + real-LLM test-as-self
7
+
8
+ ## Summary of the change
9
+
10
+ Adds the constitution standard "Never a False Blocker" to `docs/STANDARDS-REGISTRY.md` and its structural enforcement: a new always-evaluated rule **B17_FALSE_BLOCKER** in `MessagingToneGate` (the outbound-message authority that hosts B15/B16). B17 holds an outbound message that defers a doable task to a person ("needs a human / I can't / second opinion / reverse-engineering") when the message names no genuinely-human-only item and shows no inventory of the agent's own means (computer use, terminal, send-keys, MCP). The `deferral-detector` PreToolUse hook is extended (signal-only) to prime the inventory checklist for the new excuse-shapes. The standard is also registered in `docs/INSTAR-DESIGN-PRINCIPLES-AND-LESSONS.md` (P12). B17 is the sibling of B16: B16 covers feasibility-surrender, B17 covers human-deference.
11
+
12
+ ## Decision-point inventory
13
+
14
+ - `VALID_RULES` set — **add** `'B17_FALSE_BLOCKER'`. Without this the gate's drift-detection fails-open on a legitimate B17 citation (verified: a real-LLM B17 citation is accepted, `failedOpen=false`).
15
+ - `buildPrompt()` rule section — **add** the B17 definition after B16 (always-evaluated, no precondition), including the B16/B17 de-confliction + straddle handling + citation precedence (B15>B16>B17) + the UI-interaction clarification + a worked block example.
16
+ - Response-format enumeration + two doc comments (`B1..B16`→`B1..B17`) — **modify**.
17
+ - `deferral-detector` template (`PostUpdateMigrator.getDeferralDetectorHook`) + the deployed copy — **add** `needs_human_to` / `needs_reverse_engineering` patterns and a guarded `wants_second_opinion` (suppressed when a model/agent is named, so self-fetched cross-model review is not flagged). Checklist text updated to name the agent's own means + the tiny human-only set.
18
+ - No route changes: `checkOutboundMessage` → 422 is rule-agnostic; B17 rides the existing outbound paths.
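
The guarded `wants_second_opinion` pattern from the inventory above might look roughly like the sketch below. Both regular expressions and the reviewer word list are hypothetical (the shipped detector's patterns are not shown in this diff); the sketch only illustrates the suppression rule, where naming a model or agent means the review is self-fetched and should not be flagged.

```typescript
// Hypothetical pattern: the excuse-shape being detected.
const SECOND_OPINION = /\bsecond opinion\b/i;

// Hypothetical guard list: words that suggest the agent fetched the review itself.
const NAMED_REVIEWER = /\b(claude|gpt|gemini|codex|haiku|model|agent)\b/i;

// Signal-only check: returns true when the message asks for a second opinion
// without naming a model or agent. The caller would use this to inject context,
// never to block.
function wantsSecondOpinion(message: string): boolean {
  return SECOND_OPINION.test(message) && !NAMED_REVIEWER.test(message);
}
```

Because the detector only primes a checklist, a false positive here costs a reminder rather than a held message, which is why a coarse word list is tolerable in this position.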
19
+
20
+ ## 1. Over-block
21
+
22
+ Principal risk: blocking legitimate escalations. Mitigated — severity favors false-negatives, and the allowlist explicitly passes: a password/secret only the user holds, CAPTCHA, legal/billing/payment authorization, **required approvals** (side-effects/policy-gated), **account/access grants**, **external rate-limit/quota waits**, genuine value judgments, deferrals after a named-outcome inventory, self-fetched cross-model review, and rule-discussion. Real-LLM test-as-self confirmed password escalation, value-judgment, and required-approval all PASS while the founding false-blocker BLOCKS — no false-positive introduced by the precision-tightening.
23
+
24
+ ## 2. Under-block (a real false blocker slipping through)
25
+
26
+ Two known holes, both accepted by design:
27
+ - The gate sees only message text, so a **fabricated inventory** ("I tried everything, your call") can pass — same limit as B16, stated honestly in the rule. Mitigated by requiring *named outcomes* (not bare tool names); the hollow-inventory case is a unit assertion.
28
+ - Borderline misses are acceptable per the false-negative-favoring posture. Test-as-self caught the founding case passing initially and the prompt was tightened (UI-interaction clarification + worked example) until real Haiku blocked it.
29
+
30
+ ## 3. Level-of-abstraction fit
31
+
32
+ Correct: the block authority lives inside the single outbound authority (where B15/B16 live), not in the detector. The `deferral-detector` extension is signal-only (injects `additionalContext`, never blocks). Signal-vs-authority compliant.
33
+
34
+ ## 4. Blocking authority
35
+
36
+ No new brittle authority. B17 is one more rule the existing authority may cite; the 422 plumbing and fail-open behavior are inherited unchanged.
37
+
38
+ ## 5. Interactions
39
+
40
+ B17 is always evaluated alongside B15/B16 in one LLM call — no extra calls, marginally longer prompt. De-conflicted from B16 (missing mechanism → B16; person required → B17; the straddle → B17) with explicit citation precedence B15>B16>B17 so telemetry is deterministic. Drift-detection unaffected (an invented rule id still fails open — regression test included). The detector's orphan-TODO patterns are preserved (the regenerated deployed copy carries them, so migration does not regress that prior improvement).
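
The drift-detection behavior mentioned above (an unknown rule id fails open, a registered one blocks) can be sketched as follows. The types and the `resolveVerdict` function are hypothetical, and the set here lists only the one rule id this review names; the real `VALID_RULES` holds the other rule ids as well.

```typescript
// Only B17's id appears in this review; the real set contains more entries.
const VALID_RULES = new Set<string>(["B17_FALSE_BLOCKER"]);

// Hypothetical shape of what the LLM call returns.
interface GateVerdict {
  pass: boolean;
  rule?: string;
}

// Hypothetical shape of the gate's decision.
interface GateResult {
  blocked: boolean;
  rule?: string;
  failedOpen: boolean;
}

// A block only stands when it cites a registered rule. A missing or invented
// rule id is treated as drift, and the message is let through (fail open).
function resolveVerdict(verdict: GateVerdict): GateResult {
  if (verdict.pass) return { blocked: false, failedOpen: false };
  if (!verdict.rule || !VALID_RULES.has(verdict.rule)) {
    return { blocked: false, failedOpen: true };
  }
  return { blocked: true, rule: verdict.rule, failedOpen: false };
}
```

This is why the rule id has to be added to the set alongside the prompt text: without the entry, a correct B17 citation would be indistinguishable from a hallucinated one.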
41
+
42
+ ## 6. External surfaces
43
+
44
+ None. No new endpoints, credentials, or network calls.
45
+
46
+ ## 7. Rollback cost
47
+
48
+ Low. Reverting removes the rule from the set + prompt, the detector patterns, and the doc entries; no state, no migration, no schema. An older server simply lacks the rule.
49
+
50
+ ## 8. Test evidence
51
+
52
+ - Unit (`messaging-tone-gate-b17.test.ts`, 13 tests) + integration (`telegram-reply-b17-false-blocker.test.ts`, 2 tests) green; tsc clean; smoke suite (62 files / 2371 tests) green.
53
+ - Detector behaviorally exercised: false-blocker and reverse-engineering payloads flag; self-fetched cross-model review and clean status messages do not.
54
+ - **Real-LLM test-as-self** (real `ClaudeCliIntelligenceProvider` → Haiku, in-process against the built rule, production server untouched): founding codex-trust message + the fused straddle both BLOCK with B17; password escalation, value judgment, required approval, self-fetched second opinion, and post-inventory deferral all PASS.