@kontourai/flow-agents (npm) 1.4.0 → 2.0.1


Files changed (184)

package/.github/CODEOWNERS +29 -0
package/.github/actions/trust-verify/action.yml +145 -0
package/.github/workflows/ci.yml +11 -4
package/.github/workflows/kit-gates-demo.yml +2 -2
package/.github/workflows/publish-npm.yml +10 -2
package/.github/workflows/release-please.yml +1 -1
package/.github/workflows/runtime-compat.yml +1 -1
package/.github/workflows/trust-reconcile.yml +113 -0
package/AGENTS.md +13 -0
package/CHANGELOG.md +103 -0
package/CONTRIBUTING.md +4 -4
package/README.md +1 -0
package/agents/tool-planner.json +1 -1
package/build/src/cli/init.js +242 -20
package/build/src/cli/validate-workflow-artifacts.js +19 -2
package/build/src/cli/verify.d.ts +1 -0
package/build/src/cli/verify.js +90 -0
package/build/src/cli/workflow-sidecar.d.ts +316 -8
package/build/src/cli/workflow-sidecar.js +1996 -91
package/build/src/cli.js +2 -3
package/build/src/lib/flow-resolver.d.ts +111 -0
package/build/src/lib/flow-resolver.js +308 -0
package/build/src/tools/build-universal-bundles.js +34 -22
package/build/src/tools/generate-context-map.js +3 -16
package/build/src/tools/validate-source-tree.d.ts +1 -1
package/build/src/tools/validate-source-tree.js +42 -162
package/context/contracts/artifact-contract.md +10 -0
package/context/contracts/delivery-contract.md +1 -0
package/context/contracts/review-contract.md +1 -0
package/context/contracts/verification-contract.md +2 -0
package/context/gate-awareness.md +39 -0
package/context/scripts/hooks/stop-goal-fit.js +632 -70
package/docs/adr/0001-flow-agents-consumes-flow.md +1 -1
package/docs/adr/0002-flow-kits-as-extension-unit.md +1 -1
package/docs/adr/0004-gates-expect-surface-claims.md +2 -0
package/docs/adr/0005-kubernetes-inspired-resource-contracts.md +2 -0
package/docs/adr/0007-skill-audit.md +1 -1
package/docs/adr/0009-canonical-hook-core-kit-boundary.md +95 -0
package/docs/adr/0010-workflow-trust-state-as-hachure-bundle.md +139 -0
package/docs/adr/0011-mcp-posture.md +100 -0
package/docs/adr/0012-agent-coordination-as-liveness-claims.md +119 -0
package/docs/adr/0013-context-lifecycle.md +151 -0
package/docs/adr/0014-core-vs-domain-kit-boundary.md +143 -0
package/docs/adr/0015-flow-flow-agents-boundary-reconciliation.md +120 -0
package/docs/adr/0016-three-hard-boundary-model.md +71 -0
package/docs/adr/0017-anti-gaming-trust-security-model.md +155 -0
package/docs/agent-system-guidebook.md +5 -12
package/docs/context-map.md +4 -10
package/docs/index.md +3 -2
package/docs/integrations/framework-adapter.md +19 -6
package/docs/integrations/index.md +2 -2
package/docs/north-star.md +4 -4
package/docs/operating-layers.md +3 -3
package/docs/plans/adr-0010-phase2-gate-recompute.md +55 -0
package/docs/repository-structure.md +2 -2
package/docs/skills-map.md +1 -0
package/docs/spec/runtime-hook-surface.md +62 -9
package/docs/standards-register.md +3 -3
package/docs/survey-utterance-check.md +1 -1
package/docs/trust-anchor-adoption.md +197 -0
package/docs/verifiable-trust.md +95 -0
package/docs/veritas-integration.md +2 -2
package/docs/workflow-usage-guide.md +69 -0
package/evals/acceptance/DEMO-false-completion.md +144 -0
package/evals/acceptance/demo-cast.sh +92 -0
package/evals/acceptance/demo-false-completion.sh +72 -0
package/evals/acceptance/demo-real-evidence.sh +104 -0
package/evals/acceptance/demo.tape +29 -0
package/evals/acceptance/prove-capture-teeth-declared.sh +335 -0
package/evals/acceptance/prove-capture-teeth.sh +114 -0
package/evals/acceptance/prove-teeth.sh +105 -0
package/evals/ci/antigaming-suite.sh +55 -0
package/evals/ci/run-baseline.sh +2 -0
package/evals/fixtures/flow-kit-repository/invalid-missing-extension-asset/flows/review.flow.json +26 -0
package/evals/fixtures/flow-kit-repository/invalid-missing-extension-asset/kit.json +20 -0
package/evals/fixtures/flow-kit-repository/valid-unknown-extension/flows/review.flow.json +26 -0
package/evals/fixtures/flow-kit-repository/valid-unknown-extension/kit.json +18 -0
package/evals/integration/test_builder_step_producers.sh +379 -0
package/evals/integration/test_bundle_install.sh +35 -71
package/evals/integration/test_bundle_lifecycle.sh +39 -2
package/evals/integration/test_captured_fail_reconciliation.sh +820 -0
package/evals/integration/test_checkpoint_signing.sh +489 -0
package/evals/integration/test_claim_lookup.sh +352 -0
package/evals/integration/test_command_log_fork_classification.sh +134 -0
package/evals/integration/test_command_log_integrity.sh +275 -0
package/evals/integration/test_context_map.sh +0 -2
package/evals/integration/test_dual_emit_flow_step.sh +278 -0
package/evals/integration/test_enforcer_expects_driven.sh +281 -0
package/evals/integration/test_evidence_capture_hook.sh +185 -0
package/evals/integration/test_flow_kit_repository.sh +2 -0
package/evals/integration/test_flowdef_session_activation.sh +273 -0
package/evals/integration/test_flowdef_session_history_preservation.sh +250 -0
package/evals/integration/test_gate_bypass_chain.sh +448 -0
package/evals/integration/test_gate_lockdown.sh +1137 -0
package/evals/integration/test_gate_review_inquiry_records.sh +399 -0
package/evals/integration/test_goal_fit_escape_hatch.sh +73 -0
package/evals/integration/test_goal_fit_hook.sh +69 -4
package/evals/integration/test_goal_fit_rederive.sh +263 -0
package/evals/integration/test_install_merge.sh +1176 -0
package/evals/integration/test_kit_identity_trust.sh +393 -0
package/evals/integration/test_mint_attestation.sh +373 -0
package/evals/integration/test_phase_map_and_gate_claim.sh +365 -0
package/evals/integration/test_publish_delivery.sh +269 -0
package/evals/integration/test_reconcile_soundness.sh +528 -0
package/evals/integration/test_resolvefirststep_security.sh +208 -0
package/evals/integration/test_session_resume_roundtrip.sh +286 -0
package/evals/integration/test_trust_checkpoint.sh +325 -0
package/evals/integration/test_trust_reconcile.sh +293 -0
package/evals/integration/test_verify_cli.sh +208 -0
package/evals/integration/test_workflow_sidecar_writer.sh +549 -34
package/evals/lib/node.sh +0 -6
package/evals/run.sh +47 -0
package/evals/static/test_workflow_skills.sh +6 -13
package/install.sh +0 -7
package/integrations/strands-ts/README.md +25 -15
package/integrations/veritas/flow-agents.adapter.json +1 -2
package/kits/builder/flows/build.flow.json +59 -12
package/kits/builder/kit.json +85 -15
package/kits/builder/skills/continue-work/SKILL.md +116 -0
package/kits/builder/skills/deliver/SKILL.md +36 -6
package/kits/builder/skills/design-probe/SKILL.md +28 -0
package/kits/builder/skills/execute-plan/SKILL.md +9 -1
package/kits/builder/skills/gate-review/SKILL.md +234 -0
package/kits/builder/skills/learning-review/SKILL.md +30 -0
package/kits/builder/skills/pickup-probe/SKILL.md +29 -0
package/kits/builder/skills/plan-work/SKILL.md +13 -1
package/kits/builder/skills/pull-work/SKILL.md +19 -0
package/kits/knowledge/adapters/default-store/index.js +38 -0
package/kits/knowledge/adapters/flow-runner/index.js +1620 -0
package/kits/knowledge/adapters/obsidian-store/index.js +36 -6
package/kits/knowledge/docs/store-contract.md +314 -0
package/kits/knowledge/evals/audit-freshness/suite.test.js +368 -0
package/kits/knowledge/evals/canonicalize-category/suite.test.js +383 -0
package/kits/knowledge/evals/contract-suite/suite.test.js +111 -0
package/kits/knowledge/evals/detect-contradictions/suite.test.js +324 -0
package/kits/knowledge/evals/entities/suite.test.js +40 -0
package/kits/knowledge/evals/glossary-sync/suite.test.js +416 -0
package/kits/knowledge/evals/hygiene-review/suite.test.js +396 -0
package/kits/knowledge/evals/retirement/suite.test.js +145 -0
package/kits/knowledge/flows/audit-freshness.flow.json +44 -0
package/kits/knowledge/flows/canonicalize-category.flow.json +44 -0
package/kits/knowledge/flows/detect-contradictions.flow.json +44 -0
package/kits/knowledge/flows/glossary-sync.flow.json +61 -0
package/kits/knowledge/flows/hygiene-review.flow.json +43 -0
package/kits/knowledge/kit.json +51 -1
package/package.json +6 -6
package/packaging/conformance/README.md +10 -2
package/packaging/conformance/fixtures/evidence-capture--allow-records-command.json +29 -0
package/packaging/conformance/fixtures/stop-goal-fit--block-bundle-disputed-claim.json +29 -0
package/packaging/conformance/fixtures/stop-goal-fit--block-capture-contradicts-claimed-pass.json +30 -0
package/packaging/conformance/fixtures/stop-goal-fit--block-mode.json +23 -0
package/packaging/conformance/fixtures/stop-goal-fit--off-mode.json +24 -0
package/packaging/conformance/fixtures/stop-goal-fit--warn-active-delivery.json +5 -2
package/packaging/conformance/fixtures/stop-goal-fit--warn-no-bundle.json +23 -0
package/packaging/conformance/fixtures/workflow-steering--reground-active-prompt.json +30 -0
package/packaging/conformance/fixtures/workflow-steering--reground-session-start.json +30 -0
package/packaging/conformance/run-conformance.js +1 -1
package/scripts/README.md +2 -1
package/scripts/build-universal-bundles.js +0 -1
package/scripts/ci/mint-attestation.js +221 -0
package/scripts/ci/trust-reconcile.js +545 -0
package/scripts/hooks/config-protection.js +423 -1
package/scripts/hooks/evidence-capture.js +348 -0
package/scripts/hooks/lib/liveness-read.js +113 -0
package/scripts/hooks/run-hook.js +6 -1
package/scripts/hooks/stop-goal-fit.js +1524 -79
package/scripts/hooks/workflow-steering.js +135 -5
package/scripts/install-codex-home.sh +39 -0
package/scripts/install-merge.js +330 -0
package/scripts/repair-command-log.js +115 -0
package/src/cli/init.ts +218 -20
package/src/cli/validate-workflow-artifacts.ts +18 -2
package/src/cli/verify.ts +100 -0
package/src/cli/workflow-sidecar.ts +2127 -84
package/src/cli.ts +2 -3
package/src/lib/flow-resolver.ts +369 -0
package/src/tools/build-universal-bundles.ts +34 -21
package/src/tools/generate-context-map.ts +3 -17
package/src/tools/validate-source-tree.ts +44 -104
package/build/src/tools/filter-installed-packs.d.ts +0 -2
package/build/src/tools/filter-installed-packs.js +0 -135
package/packaging/packs.json +0 -49
package/scripts/filter-installed-packs.js +0 -2
package/src/tools/filter-installed-packs.ts +0 -132

package/docs/adr/0013-context-lifecycle.md ADDED

@@ -0,0 +1,151 @@
+---
+title: "ADR 0013: Context Lifecycle — Workflow-Boundary Compaction, Freshness-Gated Reuse, and the Learning Split"
+---
+# ADR 0013: Context Lifecycle
+**Date:** 2026-06-25
+**Status:** Accepted as direction (decided with Brian Anderson, 2026-06-25). Implementation phased.
+---
+## Context
+A long agent session produces good results partly because the *conversation* holds the
+thread — the corrections, the discoveries, the accumulated repo model. But that thread is
+ephemeral: it does not survive a fresh session, and within a session it degrades the model
+(attention and reasoning fall off as context grows; cost rises). The goal is the *feeling* of
+an infinite session — let it run, or start fresh, and get very similar results — driven by the
+**durable system** (ADRs, issues, trust bundles, `state.json`, the gates, the context-map,
+skills), **not** by the chat history.
+We already have most of the substrate: durable per-task **trust bundles** keyed by
+`subjectId`; **freshness** (the liveness/duration + commit-window machinery from ADR 0012);
+the **liveness stream** indexing in-progress work; `context-map --check` for repo-structure
+freshness; and the `learning-review` loop. What is missing is the *wiring* that turns these
+into a context lifecycle — and a decision about **what may evolve where**, so the kit stays
+deterministic and reproducible.
+## Decision
+### 1. The workflow boundary is the context boundary (selective compaction)
+A workflow is a bounded unit of work; **`pull-work` (the start of a workflow) is the clean
+seam to reset context.** A new workflow rebuilds its context from durable artifacts (the work
+item, ADRs, context-map, prior bundles) — not the prior conversation. The *feel* of an
+infinite session is therefore **a seamless sequence of fresh-context workflows seamed at
+`pull-work`**, with the system carrying continuity: the model stays sharp (fresh window per
+item); the experience stays continuous (durable state).
+Compaction is **selective, not automatic.** Follow-on work that *shares a lane* with the
+current work benefits from the warm context; unrelated work does not. `pull-work` already
+reads the signal — the liveness lane / `subjectId` / dependency graph — so the rule is:
+**suggest a fresh/compacted context when the new workflow's lane is disjoint from the
+current; keep the context warm when it is a continuation.** `pull-work` *suggests*; the
+operator (or a policy) decides.
+### 2. Freshness-gated context reuse (don't rebuild what's fresh; don't trust what's stale)
+To limit rebuilding context without relying on stale information, **reuse context gated by
+freshness** — the trust primitive applied to context:
+- the **trust bundle** (per work-item, by `subjectId`) **is** the durable context of that work
+  — read it instead of re-deriving;
+- **freshness** is the **stale-guard** — reuse claims that are still fresh; re-derive only what
+  has gone stale (e.g., the commit window moved);
+- the **liveness stream** is the index of in-progress work — the entry point to *glean*;
+- `context-map --check` is the same pattern one level up (re-survey only changed structure).
+**Gleaning in-progress work** (a prior session's, or another developer's once the stream is
+shared per ADR 0012's cross-machine sink) **must respect claim *status*.** In-progress work is
+mid-flight — its claims are `proposed`/`assumed`, not `verified`. Glean the *intent* and the
+*verified* facts; treat unverified claims as provisional. This is the line between "I see what
+they're attempting" (safe) and "I'll build on their unproven conclusions" (the stale-info
+trap). The status field is the guard.
+### 3. The learning split — three buckets; the kit does not self-evolve per machine
+Self-improvement is **not one mechanism**. Conflating these would let the kit mutate its own
+behavior on each user's machine — a thousand divergent kits, the opposite of the open,
+deterministic, reproducible format the product *is*. The split:
+- **User-project knowledge** (docs, context, the project's `AGENTS.md` about *their* codebase)
+  is **data.** The learning loop that captures it **ships in the kit** as a feature, operates
+  **per-project**, and is *expected* to diverge.
+- **Kit discipline** (consume-never-fork, the vocabulary rules, "flakiness ⇒ real bug") is
+  guidance for *any* agent using the kit, so it ships in the kit — but it is **code/versioned,
+  encoded by us, uniform across users.** It does **not** self-modify on user machines; it
+  improves when *we* ship a new version.
+- **Model/agent tendencies** (e.g. sycophancy) belong to the **agent**, not the kit or the
+  project — a different home (the agent's own disciplines). Encoding a model quirk into a
+  shipped kit is a category error.
+**The invariant:** the kit (including the learning loop) is *uniform and versioned*; what it
+*learns per project* is *data* and diverges. Same app, different user data. The thing that must
+**never** diverge per machine is the kit's *behavior*.
+### 4. The self-encode mechanism — the learning loop + a confirmation gate
+Operating lessons self-encode through the existing loop, **extended to an
+`operating-agreement` claim type, with a propose→confirm gate** (zero-touch self-encoding is
+rejected — it produces a brittle, over-fit, self-contradicting rulebook):
+1. **Detect candidates** at workflow close (where `learning-review` runs): scan for *behavior*
+   signals — human pushback, redos, reverts, "you already built that," self-critique flags.
+2. **Propose** (never auto-apply): surface the candidate agreement with its motivating
+   evidence.
+3. **Confirm + distill** — the gate. A human (or a very-high-bar reviewer) accepts / rejects /
+   edits and generalizes the instance into a principle. **Without this gate, do not build it.**
+4. **Encode as a structured claim** (`subject: operating-discipline`, `evidence: [corrections]`,
+   `status`, `freshness`) — the human-readable `AGENTS.md` is a *projection* of these claims,
+   so it is queryable, versioned, and decayable, not a doc that rots.
+5. **Apply *relevant*, decay by freshness** — future sessions load agreements **filtered by
+   workflow type / lane** (not the whole rulebook), at session/workflow start; unused or
+   repeatedly-overridden agreements go stale and flag for review.
+**The escalation ladder** — `correction → advisory context → gate check → merge-readiness
+criterion` — exists in two contexts with *different drivers*, and **neither is the kit
+self-evolving**:
+- For **kit discipline**, the ladder is **our internal dogfooding dev process** (we run the kit
+  on the kit, catch what it does wrong, and the *output is a PR / ADR / version bump* — never
+  runtime self-modification).
+- For **user-project agreements**, the kit *offers* users the ability to promote *their*
+  agreements to *their* gates — **user-driven configuration of their data**, not the kit
+  autonomously adding a gate.
+## Consequences
+- **Stopping stops mattering.** Once a session's operating lessons live in the loop (or shipped
+  kit guidance) rather than the chat, a fresh session reproduces the quality — so restart is
+  free, and "infinite" is a UX property, not a context-length goal.
+- **Determinism is protected** — the kit's behavior is uniform and versioned; only per-project
+  *data* diverges.
+- **Context cost drops** — reuse-gated-by-freshness avoids rebuilding what's still valid and
+  avoids trusting what's stale; selective compaction keeps windows small without losing warm
+  context for continuations.
+- **Costs (eyes open):** (1) selective-compaction and gleaning need the lane/overlap +
+  status signals to be reliable (depends on ADR 0012 maturing); (2) the meta-learning loop
+  only automates the *second* occurrence of a lesson — a human still catches each *novel*
+  class of drift once (this is permanent, and fine); (3) the propose→confirm gate is human
+  effort (kept cheap, but not zero).
+## Alternatives Considered
+- **One long (infinite) session.** Rejected: fights model degradation and cost; the right
+  target is *session-independence* (cheap restart, near-identical results), not session-immortality.
+- **A hand-written `AGENTS.md` as the end-state.** Rejected as the *end-state*: it is the loop's
+  *seed* (initial state), not an alternative to it; left alone it rots.
+- **Zero-touch self-encoding of lessons.** Rejected: without a confirmation gate it over-fits
+  and contradicts itself; judgment must gate the rulebook.
+- **A self-evolving kit (per-machine).** Rejected hard: violates the determinism/reproducibility
+  thesis — a thousand divergent kits. Kit behavior is versioned; only project *data* diverges.
+- **Always rebuild context fresh / always reuse.** Rejected: rebuild-always is wasteful;
+  reuse-always trusts stale data. Freshness gates the choice per claim.
+## References
+- [ADR 0010](./0010-workflow-trust-state-as-hachure-bundle.md) — trust bundle as durable context.
+- [ADR 0012](./0012-agent-coordination-as-liveness-claims.md) — liveness/freshness, `subjectId`
+  correlation, the shared stream and cross-machine sink, resumption-via-durable-evidence.
+- `learning-review` skill; `gate-review` / self-critique; `context-map --check`; `pull-work`.
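
The freshness-gated reuse rule in ADR 0013 §2 above (reuse what is fresh and verified, treat unverified claims as provisional, re-derive what is stale) can be sketched in TypeScript as follows. This is an illustration only and is not taken from the package: the `Claim` shape, the `freshUntil` expiry field, and the `classifyClaim` / `gleanContext` helpers are hypothetical names chosen for the example. Only the status values `proposed`, `assumed`, and `verified` come from the ADR text.

```typescript
// Status values as named in the ADR; everything else here is hypothetical.
type ClaimStatus = "proposed" | "assumed" | "verified";

interface Claim {
  id: string;
  subjectId: string;
  status: ClaimStatus;
  // Assumption: freshness is modelled as a simple expiry timestamp (ms).
  // The real freshness machinery (commit windows, durations) is richer.
  freshUntil: number;
}

type Disposition = "reuse" | "provisional" | "rederive";

// Staleness is checked first: a stale claim is re-derived whatever its status.
// A fresh claim is reused only if verified; otherwise it is provisional.
function classifyClaim(claim: Claim, now: number): Disposition {
  if (now > claim.freshUntil) return "rederive";
  return claim.status === "verified" ? "reuse" : "provisional";
}

// Partition a work item's claims into the three dispositions, by claim id.
function gleanContext(claims: Claim[], now: number): Record<Disposition, string[]> {
  const out: Record<Disposition, string[]> = { reuse: [], provisional: [], rederive: [] };
  for (const claim of claims) {
    out[classifyClaim(claim, now)].push(claim.id);
  }
  return out;
}

const sampleClaims: Claim[] = [
  { id: "a", subjectId: "item-1", status: "verified", freshUntil: 2000 },
  { id: "b", subjectId: "item-1", status: "proposed", freshUntil: 2000 },
  { id: "c", subjectId: "item-1", status: "verified", freshUntil: 500 },
];

const gleaned = gleanContext(sampleClaims, 1000);
console.log(gleaned);
```

The design choice in this sketch is that staleness wins over status, so a `verified` claim whose window has passed is still re-derived rather than reused.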

package/docs/adr/0014-core-vs-domain-kit-boundary.md ADDED

@@ -0,0 +1,143 @@
+---
+title: "ADR 0014: Flow Agents core vs domain kits — the generic/kit boundary"
+---
+# ADR 0014: Flow Agents core vs domain kits
+**Date:** 2026-06-25
+**Status:** Proposed (decision owner: Brian Anderson). Defines the boundary; code moves are sequenced, not immediate.
+Revised after boundary review: `workflow-sidecar` confirmed **core** (the lifecycle engine, with only a few developer-leaning *defaults* to make kit-extensible); Builder Kit reframed as a **first-class pulled-out kit**; `knowledge`/`release-evidence` cited as already validating the substrate.
+> **Superseded in part by ADR 0016 (2026-06-26).** The finding that the *bespoke `workflow-sidecar` FSM* is the legitimate core engine is superseded. The core does own a lifecycle **engine**, but per ADR 0016 + 0015 + #183 that engine must be the **FlowDefinition / Resource-Contract-driven** one — the bespoke FSM is a parallel reimplementation of ADR 0005 that retires via ADR 0015's migration. This ADR's boundary *principle* (the agent-blind "dividing test", core vs domain-kit) stands; only the "keep the FSM as core / tweak defaults" conclusion is replaced.
+---
+## Context
+Flow Agents is defined (CONTEXT.md) as *"an operating layer that helps agents route natural
+user requests into the right procedures, tools, state, evidence, knowledge, and follow-ups"* —
+a **generic, domain-agnostic** layer. The README states the intended composition model:
+*"domain kits that compose this substrate — a Sales Kit…, a Research Kit…"*. `kits/` already
+holds three: `builder` (developer workflows), `knowledge` (knowledge capture/recall), and
+`release-evidence`.
+But the structure does **not** reflect this. The entire `agents/` directory is *developer*
+tooling (`tool-code-reviewer`, `tool-verifier`, `tool-worker`, `tool-planner`,
+`tool-explore-*`, `tool-playwright`), and `context/contracts/{review,verification,execution}`
+are *developer* contracts (code-review lanes; build/types/lint/test phases). These live in the
+"core" locations (`agents/`, `context/`) yet are consumed only by the **Builder Kit**
+(`kits/builder/` has no contracts or agents of its own — it consumes core). The non-developer
+`knowledge` kit does not use them.
+This surfaced while placing two universal disciplines (fail-loud/no-silent-data-loss from #160;
+"a flake is a real defect"): the only "homes" available were developer contracts, which forced
+a generic principle into a developer-specific file (#170). The boundary is **implicit**, and
+the implicit version conflates the generic operating layer with one domain kit.
+## Decision
+### 1. The boundary principle
+- **Flow Agents (core) owns generic *mechanisms*** that any domain reuses: the workflow
+  lifecycle (work-items, states, phases, transitions), the trust substrate (bundle, claims,
+  evidence, policies, freshness, status derivation via Surface), the enforcement gates
+  (goal-fit, evidence-capture, reground), liveness/coordination (ADR 0012), routing, durable
+  persistence, kit installation/runtime adapters, and the **agent operating disciplines**
+  (consume-never-fork, fail-loud, evidence-bearing, freshness-gating).
+- **A domain kit owns the domain *specifics***: its vocabulary, the concrete workflow shape,
+  the domain verification/review *criteria*, the domain *tools*, the domain *schema*, and the
+  side-effect *adapters*. Builder Kit is the developer domain kit; Sales/Research/Knowledge are
+  siblings.
+### 2. The clean test
+> **"Would a non-developer domain kit (e.g. `knowledge`, a Sales Kit) need this?"**
+> Yes → it is generic → **Flow Agents core.** No → it is developer-specific → **Builder Kit.**
+By this test today: the `tool-*` agents (worker/code-reviewer/verifier/planner), the Builder
+*skills* (plan-work/execute-plan/review-work/verify-work), the code-review lanes, and the
+build/lint/test phases are **Builder Kit**. The lifecycle **engine** (`workflow-sidecar` — it
+writes the trust bundle, advances state, records evidence/claims, emits liveness), the trust
+substrate, the gates, liveness, and the persistence/data-integrity invariants are **core**.
+The only developer lean *inside* the core engine is in its **defaults**, not its mechanism: the
+code-specific `checkKinds` (`build`/`types`/`lint`/`test`/`browser`) and some vocabulary
+(`init-plan` reads developer-ish; it just means *open a tracked work-item from its defining
+artifact* — create its state/acceptance/handoff/trust.bundle and claim it via liveness). Those
+defaults should become **kit-extensible** (and the vocab can be neutralized, e.g. `init-plan` →
+`init-work`) — a small core cleanup, not a relocation. The engine is core; it is simply not yet
+*exercised* by a non-developer kit (`knowledge`/`release-evidence` took the lighter `flows`
+path), which is a validation gap, not a sign it is Builder-shaped.
+### 3. Kits extend, never reimplement
+Domain kits **consume** the generic mechanisms; they must not fork the lifecycle, the trust
+substrate, or the gates (consume-never-fork, ADR 0008, applied to kit authors). A kit that
+needs different behavior configures or extends the generic mechanism — it does not ship a
+parallel one. This is what keeps "an open format that means the same thing everywhere" true
+across kits.
+### 4. Mixed contracts split: generic invariant (core) + domain extension (kit)
+`review-contract` and `verification-contract` are **mixed**. The split:
+- **Generic (core):** verify work meets acceptance criteria with evidence; mark `NOT_VERIFIED`
+  honestly; **fail-loud, never fail-open** (persistence that silently drops a record is data
+  loss, not a degraded mode); **nondeterminism is a defect** (an operation that can pass
+  without doing its job is a failure); review against standards with evidence-bearing,
+  severity-tagged findings; don't silently pass.
+- **Domain (Builder Kit):** the concrete phases (`build`/`types`/`lint`/`tests`/`browser`) and
+  review lanes (code quality, security scanning, architecture fit).
+The two disciplines from #170 are **generic** and belong in the core invariant layer — so
+*every* kit (knowledge persisting a note, a Sales Kit logging to a CRM) inherits them — not in
+the developer `review-contract`. **#170 is re-homed here**, not merged as-placed.
+## Consequences
+- **Protects the core value proposition.** "Generic, domain-agnostic operating layer" is the
+  moat; a core that secretly assumes code-review/build/test undermines it for the next domain
+  kit author. The clean test (§2) becomes a **standing design gate**, not a one-time cleanup.
+- **Re-homes #170** and tells us where future cross-cutting disciplines go.
+- **Builder Kit becomes a first-class, pulled-out kit** — the same shape `knowledge` already
+  has (its own `kit.json`, flows, skills, and now agents + contracts) — and an *independently
+  valuable* product, not a demo. The developer tools (`agents/`) and the developer halves of the
+  mixed contracts move into it; the generic invariants consolidate in core. This touches code the
+  Phase-4 agents are active in — **define now, move later, coordinate** (no premature big-bang).
+## Alternatives Considered
+- **Leave the boundary implicit.** Rejected: it leaks developer assumptions into the core and
+  blocks clean domain-kit authoring.
+- **Refactor the folders first, define later.** Rejected: moving `agents/` and contracts into
+  `kits/builder/` is large and conflicts with active Phase-4 work; the *definition* is cheap and
+  must lead.
+- **Fully purify the core now (extract a grand generic verification/review framework).**
+  Rejected as **speculative generality** (the consume-never-fork sibling). The *substrate*
+  (gates, Surface claims, flows) is **already proven domain-neutral** by shipping kits —
+  `knowledge` and `release-evidence` use it with no developer machinery. What is *not* yet
+  exercised by a non-developer kit is the **lifecycle engine** (`workflow-sidecar`); a
+  Sales/Research kit that actually uses `init-plan → advance-state → record-evidence → liveness`
+  is the real validator, and would surface which engine *defaults* (above) are developer-shaped.
+  Build it for a real use case, not as architecture theater. Extract only the invariants that are
+  already obviously cross-domain (data integrity, evidence honesty, nondeterminism, freshness).
+## Product weigh-in (requested)
+1. **The boundary is the moat — treat the clean test as a gate.** Every time something lands in
+   `agents/` or `context/`, ask "would the knowledge/Sales kit need it?" If no, it's a Builder
+   Kit feature wearing a core costume.
+2. **Build out a non-developer kit that uses the lifecycle engine — it is the cheapest way to
+   find the true core.** The substrate is already validated; what isn't is `workflow-sidecar`
+   under a non-developer domain. A Sales/Research kit that *reuses* `init-plan → advance-state →
+   record-evidence → liveness` will expose which engine defaults are secretly developer-shaped
+   and de-risk the refactor — *if* there is a real use case (not architecture theater).
+3. **Ship the generic disciplines to the core invariant layer regardless** — data-integrity,
+   evidence honesty, nondeterminism, freshness are cross-domain today; they should not wait on
+   the full refactor.
+## References
+- CONTEXT.md (Flow Agents definition); README (domain-kit direction).
+- ADR 0008 (consume-never-fork), 0010 (trust bundle), 0012 (liveness), 0013 (context lifecycle).
+- #170 (the two disciplines, parked pending this boundary); #160 (the fail-open data-loss).
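
The split described in ADR 0014 §2 and §4 above (core owns the generic invariant, a kit supplies its own check kinds) can be sketched in TypeScript as follows. This is a hedged illustration, not the package's API: `DomainKit`, `CheckRecord`, `recordCheck`, the `PASS` / `FAIL` status names, and the `sales` kit with its `crm-sync` kind are all hypothetical. Only `NOT_VERIFIED` and the developer check kinds `build`/`types`/`lint`/`test`/`browser` appear in the ADR text.

```typescript
// Hypothetical status set; the ADR itself only names NOT_VERIFIED.
type CheckStatus = "PASS" | "FAIL" | "NOT_VERIFIED";

// A kit declares its own check kinds instead of core hardcoding them.
interface DomainKit {
  name: string;
  checkKinds: readonly string[];
}

interface CheckRecord {
  kit: string;
  kind: string;
  status: CheckStatus;
}

// Generic (core) behaviour, independent of any domain:
// - fail loud on an unknown check kind rather than silently dropping the record;
// - report NOT_VERIFIED honestly when no evidence was supplied.
function recordCheck(
  kit: DomainKit,
  kind: string,
  evidence?: { passed: boolean }
): CheckRecord {
  if (kit.checkKinds.indexOf(kind) === -1) {
    throw new Error(`unknown check kind "${kind}" for kit "${kit.name}"`);
  }
  let status: CheckStatus = "NOT_VERIFIED";
  if (evidence !== undefined) {
    status = evidence.passed ? "PASS" : "FAIL";
  }
  return { kit: kit.name, kind, status };
}

// Developer defaults live with the Builder Kit in this sketch.
const builderKit: DomainKit = {
  name: "builder",
  checkKinds: ["build", "types", "lint", "test", "browser"],
};

// Hypothetical non-developer kit with a domain-specific check kind.
const salesKit: DomainKit = { name: "sales", checkKinds: ["crm-sync"] };

console.log(recordCheck(builderKit, "lint", { passed: true }));
console.log(recordCheck(salesKit, "crm-sync"));
```

Under these assumptions, asking the sales kit to record a `lint` check throws instead of passing quietly, which is the fail-loud behaviour the ADR assigns to the core invariant layer.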

package/docs/adr/0015-flow-flow-agents-boundary-reconciliation.md ADDED

@@ -0,0 +1,120 @@
+---
+title: "ADR 0015: Flow / Flow Agents Boundary Reconciliation"
+---
+# ADR 0015: Flow / Flow Agents Boundary Reconciliation
+**Date:** 2026-06-25
+**Status:** Accepted. Tier 0 (#175) shipped; Tier 1 (#176) closed-by-evaluation (+ a found anti-gaming fix, #196); Tier 2 (#177) **reopened and scoped** as the Resource Contract migration (the sidecar FSM IS a parallel reimplementation per ADR 0005 / #183 — see corrected Reassessment); #178/#179 are deferred cross-package work.
+**Parent issue:** #174 (umbrella)
+---
+## Context
+ADR 0001 established that Flow Agents *consumes* Flow for generic workflow enforcement
+rather than owning the enforcement kernel. The boundary is owned by ADR 0001. During
+Phase 4 of ADR 0010 (trust.bundle as sole verification artifact), a drift was found:
+`src/cli/workflow-sidecar.ts` contained a bespoke trust-bundle schema validator
+(`tryLoadHachureValidator` / `getHachureValidator` / local `validateTrustBundle`) that
+duplicated logic already owned canonically by `@kontourai/surface`.
+Surface's `validateTrustBundle` is the canonical owner at the lowest code layer:
+hachure owns the schemas, surface owns the trust computation (including validation),
+flow owns the workflow engine, flow-agents owns product adapters. The bespoke validator
+was the third copy in a three-way duplication of that ownership:
+1. Hachure's `trust-bundle.schema.json` (the schema source of truth)
+2. Surface's `validateTrustBundle` (the canonical validator, using those schemas)
+3. Flow Agents' bespoke AJV + hachure-schema-loading validator (the drift)
+A survey found that flow-agents uses approximately 1 of ~95 flow exports
+(the workflow engine / gate-expectation / run-state kernel). A parallel run-state / gate
+model is being reconciled through a tiered program (see below).
+## Decision
+### Layered ownership
+```
+hachure         — schemas (trust-bundle.schema.json, claim, evidence, policy, event)
+surface         — trust computation: validateTrustBundle, deriveClaimStatus, resolveInquiry
+flow            — workflow engine: Flow Definitions, Runs, steps, gates, transitions
+flow-agents     — product/adapters: skills, hooks, sidecar writers, runtime adapters
+```
+Flow-agents does not own trust-bundle schema validation. Surface owns it.
+### Tier 0 (this PR): consume surface's validateTrustBundle
+Replaced the bespoke `tryLoadHachureValidator` / `getHachureValidator` / local
+`validateTrustBundle` in `src/cli/workflow-sidecar.ts` with consumption of
+`@kontourai/surface`'s canonical `validateTrustBundle`.
+**Equivalence verified before swap:** surface's validator is equivalent to or stronger
+than the bespoke one. It validates the same structural constraints (required fields,
+enum values, schema shape) and adds cross-reference integrity checks (evidence → claim,
+event → claim, event → evidence) that the hachure JSON schema did not enforce. All nine
+test cases agreed; surface also rejected two additional invalid bundles (dangling
+references) that the bespoke validator accepted.
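The cross-reference checks named above (evidence → claim, event → claim, event → evidence) can be sketched as follows. This is a minimal illustration, not surface's implementation: the `Bundle` shape, the field names `claimId` and `evidenceIds`, and the function `findDanglingRefs` are all hypothetical, and the real hachure schema may differ.

```typescript
// Hypothetical, simplified bundle shape; the real hachure schema has more
// fields and may name these differently.
interface Claim { id: string }
interface Evidence { id: string; claimId: string }
interface TrustEvent { id: string; claimId: string; evidenceIds: string[] }
interface Bundle { claims: Claim[]; evidence: Evidence[]; events: TrustEvent[] }

// Returns one message per reference that points at a missing record.
function findDanglingRefs(bundle: Bundle): string[] {
  const claimIds = new Set(bundle.claims.map((c) => c.id));
  const evidenceIds = new Set(bundle.evidence.map((e) => e.id));
  const errors: string[] = [];

  for (const ev of bundle.evidence) {
    if (!claimIds.has(ev.claimId)) {
      errors.push(`evidence ${ev.id} -> missing claim ${ev.claimId}`);
    }
  }
  for (const event of bundle.events) {
    if (!claimIds.has(event.claimId)) {
      errors.push(`event ${event.id} -> missing claim ${event.claimId}`);
    }
    for (const evidenceId of event.evidenceIds) {
      if (!evidenceIds.has(evidenceId)) {
        errors.push(`event ${event.id} -> missing evidence ${evidenceId}`);
      }
    }
  }
  return errors;
}

// A reference-consistent bundle and one with two dangling references.
const consistent: Bundle = {
  claims: [{ id: "c1" }],
  evidence: [{ id: "e1", claimId: "c1" }],
  events: [{ id: "ev1", claimId: "c1", evidenceIds: ["e1"] }],
};
const dangling: Bundle = {
  claims: [{ id: "c1" }],
  evidence: [{ id: "e1", claimId: "c2" }],
  events: [{ id: "ev1", claimId: "c1", evidenceIds: ["e9"] }],
};
```

A JSON Schema can constrain the shape of each record but cannot express "this ID must exist elsewhere in the same document", which is why a check of this kind lives in validator code instead of the schema.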
+**Return shape preserved:** the public export `validateTrustBundle(bundle) →
+{ valid, errors, available }` is preserved. `available` reflects surface presence (surface
+is required per ADR 0010 Phase 4c; fail-open is maintained for diagnostic use). The
+function became `async` because surface is ESM-only and loaded via `import()`; the call
+site in `writeTrustBundle` is already async; the test inline script uses top-level await
+in ES module mode.
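The preserved return shape and the fail-open behavior can be sketched as below, under stated assumptions. The loader is injected so the sketch runs without the real package; `createValidator`, `SurfaceLike`, and the stub are hypothetical names, the signature assumed for surface's validator is a guess, and returning `valid: true` when surface is unavailable is one reading of "fail-open" that the source does not spell out.

```typescript
type ValidationResult = { valid: boolean; errors: string[]; available: boolean };

// Assumed minimal shape of the module the loader resolves to; the real
// @kontourai/surface export may differ.
type SurfaceLike = {
  validateTrustBundle: (bundle: unknown) => { valid: boolean; errors: string[] };
};

// `load` is injected; production code would pass something like
// () => import("@kontourai/surface") because the package is ESM-only.
function createValidator(load: () => Promise<SurfaceLike>) {
  let cached: SurfaceLike | null = null; // loaded once, then reused

  return async function validateTrustBundle(bundle: unknown): Promise<ValidationResult> {
    if (cached === null) {
      try {
        cached = await load();
      } catch {
        // Fail open: report that validation could not run instead of throwing.
        return { valid: true, errors: [], available: false };
      }
    }
    const { valid, errors } = cached.validateTrustBundle(bundle);
    return { valid, errors, available: true };
  };
}

// Stub standing in for surface, plus a loader that simulates a missing package.
const stubSurface: SurfaceLike = {
  validateTrustBundle: (bundle) =>
    typeof bundle === "object" && bundle !== null
      ? { valid: true, errors: [] }
      : { valid: false, errors: ["bundle must be an object"] },
};
const withSurface = createValidator(async () => stubSurface);
const withoutSurface = createValidator(async () => {
  throw new Error("surface not installed");
});
```

Callers that previously read the result synchronously would need `const result = await withSurface(bundle)`; checking `result.available` first distinguishes "validated and passed" from "validation did not run".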
+**AJV decision:** AJV and hachure schema loading are retained for `validateInquiryRecord`
+(which validates inquiry-record.schema.json — a separate schema not covered by surface's
+`validateTrustBundle`). Only the trust-bundle AJV duplication was removed.
+**normalizeSurfaceRefs advisory validation:** the inline advisory validation in
+`normalizeSurfaceRefs` (which validates referenced trust.bundle files) was updated to use
+the cached `_surfaceModule` instead of the bespoke `getHachureValidator`. Fail-open
+behavior is preserved: if surface is not yet loaded when `normalizeSurfaceRefs` runs,
+validation is skipped.
+### Tiered reconciliation program (post Tier 0)
+The broader boundary reconciliation (issue #174) is phased:
+| Tier | Issue | Scope | Outcome |
+| --- | --- | --- | --- |
+| Tier 0 | #175 | consume surface's `validateTrustBundle`; delete bespoke validator | **DONE** — the one genuine fork removed |
+| Tier 1 | #176 | gate-expectation engine: consume flow's gate evaluation kernel | **CLOSED by evaluation** — the gate already consumes Surface (`deriveClaimStatus`) for re-derivation; residual logic is product-specific gate policy. Scoping it found+fixed a real anti-gaming regression (PR #196). |
+| Tier 2 | #177 | run-state kernel → Resource Contract migration | **REOPENED AND SCOPED** — the cheap `FlowRunState` swap is still churn (original eval correct on that narrow point), but the sidecar FSM IS a parallel reimplementation of ADR 0005's Resource Contract (`state.json→WorkflowRun.status`, `acceptance.json→RunPlan.spec`, `evidence→conditions[].evidenceRefs`). The real convergence (Resource Contract + Flow Definitions, retiring the FSM, #183) is the accepted direction. Scoped as a phased migration (projection → FlowDefinition-backed advance-state → hooks → resume/evals → retire sidecars). `kits/builder/flows/build.flow.json` already exists; the FSM just doesn't consult it. |
+| promotes | #178 | promote liveness / InquiryRecord / run-hook upstream | **Deferred — cross-package** (requires `flow`/`surface` source changes; not doable from this repo). |
+| contracts | #179 | extract generic vocabulary to flow contracts | **Deferred — cross-package.** |
+### Reassessment (post Tiers 1–2) — corrected
+**An earlier version of this Reassessment was too narrow and is corrected here.** It was right that Tier 1's gate computation already consumes Surface, and that a *cheap mechanical `FlowRunState` swap* (Tier 2's original framing) would be pure churn. But it wrongly concluded the sidecar FSM is a *legitimate product-specific layer* and that the program is "essentially resolved." That misses the larger issue documented in #183:
+`workflow-sidecar.ts`'s state model — the 11 phases, 13 statuses, bespoke `advanceState` guard, and per-session `state.json`/`handoff.json`/`acceptance.json`/`current.json` — **IS a parallel reimplementation of the Kontour Resource Contract (ADR 0005)** at the product level. ADR 0005 defines `WorkflowRun`/`RunPlan`/`SelectedScope`/`Gate` as the durable record shape for exactly this information; the sidecar FSM predates ADR 0005's acceptance and was never migrated. `docs/kontour-resource-contract.md`'s Compatibility Guidance already documents the mapping (`state.json→WorkflowRun.status`, `acceptance.json→RunPlan.spec`, `evidence→conditions[].evidenceRefs`, `handoff.json→WorkflowRun.status`) — i.e. it is a pre-ADR-0005 parallel implementation, not a deliberate product layer. Notably `kits/builder/flows/build.flow.json` (a Builder FlowDefinition, 10 steps / 9 gates) **already exists** — `advance-state` simply doesn't consult it.
+**Corrected outcome:** Tier 0 done (the one Surface-layer fork removed); Tier 1 closed-by-evaluation (+ the #196 anti-gaming fix); **Tier 2 reopened and scoped** as a phased Builder→Resource-Contract/Flow-Definition migration (see #177): Phase 1 projection layer → Phase 2 FlowDefinition-backed `advance-state` → Phase 3 hooks → Phase 4 resume/evals → Phase 5 retire sidecars → Phase 6 Flow kernel (deferred to #178). Per #183, Builder and Knowledge are the **same** abstraction (Resource Contract over `WorkflowRun`/`Gate`), so this is a prerequisite for new kit authors to have a stable target, not optional cleanup — and the Builder migration must coordinate with the parallel Knowledge work (which already ships Flow Definitions).
+**Invariant that must survive migration (#183 Finding 2):** `WorkflowRun.status.conditions` are writable summaries; the gate re-derives from Hachure claims via Surface; `conditions[].evidenceRefs` cite claim IDs. **Do not fuse Resource and claim** — the separation (a friendly mutable surface over an un-gameable derived core) is the architecture.
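The Phase 1 projection layer can be pictured as a read-only mapping from the sidecar files to Resource Contract records, following the documented mapping (`state.json→WorkflowRun.status`, `acceptance.json→RunPlan.spec`, `evidence→conditions[].evidenceRefs`). Every field inside these shapes is hypothetical, as is the function name `projectSidecar`; only the file-to-record mapping comes from the text above.

```typescript
// Hypothetical shapes for the sidecar files and Resource Contract records.
interface SidecarFiles {
  state: { phase: string; status: string };           // state.json
  acceptance: { criteria: string[] };                 // acceptance.json
  evidence: { claimId: string; criterion: string }[]; // evidence entries
}
interface WorkflowRun {
  status: {
    phase: string;
    state: string;
    conditions: { type: string; evidenceRefs: string[] }[];
  };
}
interface RunPlan {
  spec: { criteria: string[] };
}

// Builds Resource Contract records from sidecar data without mutating it.
function projectSidecar(files: SidecarFiles): { run: WorkflowRun; plan: RunPlan } {
  // One condition per acceptance criterion; evidenceRefs cite claim IDs only.
  const conditions = files.acceptance.criteria.map((criterion) => ({
    type: criterion,
    evidenceRefs: files.evidence
      .filter((e) => e.criterion === criterion)
      .map((e) => e.claimId),
  }));
  return {
    run: { status: { phase: files.state.phase, state: files.state.status, conditions } },
    plan: { spec: { criteria: [...files.acceptance.criteria] } },
  };
}

const projected = projectSidecar({
  state: { phase: "verify", status: "in_progress" },
  acceptance: { criteria: ["tests-pass"] },
  evidence: [{ claimId: "claim-7", criterion: "tests-pass" }],
});
```

Because the projection only cites claim IDs and never copies claim content, it stays consistent with the invariant that Resource and claim are not fused.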
+## Consequences
+- **No bespoke trust-bundle schema validator in flow-agents.** Surface is the canonical
+  owner; flow-agents delegates.
+- **Stronger validation.** Surface also validates cross-reference integrity (dangling
+  evidence/event references) that the hachure JSON schema did not. Bundles produced by
+  `buildTrustBundle` are already reference-consistent, so no regression is expected in
+  normal operation; only malformed external inputs are newly rejected.
+- **async API.** `validateTrustBundle` is now async (returns `Promise<{valid,errors,available}>`).
+  All existing call sites are in async contexts. External consumers of the library export
+  must `await` the result.
+- **Surface availability:** surface was already REQUIRED for bundle writes per ADR 0010 4c.
+  `available: false` (fail-open) is only reachable in degraded diagnostic environments
+  (e.g. `FLOW_AGENTS_SURFACE_UNAVAILABLE=1` test seam) or if surface fails to load.
+## References
+- [ADR 0001](./0001-flow-agents-consumes-flow.md) — Flow Agents consumes Flow; boundary ownership.
+- [ADR 0010](./0010-workflow-trust-state-as-hachure-bundle.md) — trust bundle as workflow trust state; Phase 4c.
+- GitHub issue #174 (umbrella: flow/flow-agents boundary reconciliation)
+- GitHub issue #175 (Tier 0: this PR)

package/docs/adr/0016-three-hard-boundary-model.md ADDED

@@ -0,0 +1,71 @@
+---
+title: "ADR 0016: The Three-Hard-Boundary Model — a FlowDefinition-Driven, Kit-Agnostic Core"
+---
+# ADR 0016: The Three-Hard-Boundary Model — a FlowDefinition-Driven, Kit-Agnostic Core
+**Date:** 2026-06-26
+**Status:** Accepted
+**Supersedes (in part):** ADR 0014's finding that the bespoke `workflow-sidecar` FSM is the legitimate core engine.
+**Builds on:** ADR 0001 (Flow Agents consumes Flow), 0004 (gates expect trust claims), 0005 (Resource Contract), 0007 (Flow/Skill/Kit/Tool), 0009 (canonical hook core/kit boundary), 0015 (Flow↔Flow-Agents reconciliation); synthesis input #183.
+---
+## Context
+The boundary between **flow**, **flow-agents**, and **flow-agent-kits** is defined across many ADRs (0001, 0007, 0009, 0014, 0015) but never as one model, and the pieces have drifted:
+1. **A real contradiction.** ADR 0014 (Proposed) calls the bespoke `workflow-sidecar` FSM "confirmed core — the legitimate lifecycle engine" needing "a small cleanup, not a relocation." ADR 0015 (Accepted) calls the same FSM "a parallel reimplementation of ADR 0005's Resource Contract" to be retired via a phased migration. Both were written 2026-06-25; they cannot both stand as written.
+2. **A load-bearing gap.** No ADR states that the core's gate enforcer and lifecycle driver must be **driven by the active kit's FlowDefinition** rather than hardcode a claim taxonomy. The consequence is live in the code: `scripts/hooks/stop-goal-fit.js` enforces a hardcoded generic taxonomy (`workflow.check.*`, `workflow.critique.review`, `workflow.acceptance.criterion`) with **zero** references to any FlowDefinition, while the kits' FlowDefinitions declare a different, per-step vocabulary (`builder.verify.tests`, `knowledge.ingest.capture`). ADR 0009 narrowed the de-coupling rule to skill/template names and explicitly *blessed* the hardcoded `workflow.*` taxonomy as "core" — sanctioning the very coupling this ADR forbids.
+3. **Unresolved ownership.** Claim-taxonomy ownership (generic kinds vs kit-namespaced types) and the cardinality/lifetime parameterization (#183) are decided nowhere binding.
+We own the full stack (hachure → surface → flow → flow-agents → kits). The boundaries should be **hard**, and the abstractions should let the *demoed* use cases — the Builder delivery workflow and the Knowledge hygiene workflows — run on the same machinery from their FlowDefinitions.
+## Decision
+### 1. Three hard boundaries (one named model)
+- **flow** — the **domain-agnostic workflow engine.** Owns the FlowDefinition schema (steps, gates, `expects[]`, `route_back_policy`), gate *evaluation* (`evaluateGate` over expectations, re-derived from the trust layer), transition validation, run-state (`FlowRunState`/Resource Contract run model), route-back, and Flow Reports. It knows nothing about any kit or claim vocabulary; it operates on *whatever a FlowDefinition declares*.
+- **flow-agents (core)** — the **kit-agnostic execution of a flow inside an agent harness.** Owns the lifecycle *driver* (`advance-state`), the gate *enforcer* (the Stop hook), evidence capture, the trust.bundle producer, the Resource/sidecar projection, the runtime adapters (claude/codex/…), and session machinery. It executes **any** kit's FlowDefinition. "**Core**" in this repo means exactly this layer; align other ADRs' usage to it.
+- **flow-agent-kits** — the **domains** (builder, knowledge, and future Sales/Research). Each kit **declares**: its FlowDefinition(s) (steps + gates + `expects[]`), its skills/agents (the claim *producers*), and its domain store/adapters. The kit **declares**; the core **executes**. A kit never re-implements the engine or the enforcer.
+### 2. Abstraction A (load-bearing) — the core is FlowDefinition-driven
+The core gate enforcer and lifecycle driver **MUST be driven by the active kit's FlowDefinition.** The enforcer evaluates the claim expectations the kit's FlowDefinition `expects[]` declares for the current gate (re-deriving status via the trust layer); the lifecycle driver validates transitions and reads `route_back_policy` from the FlowDefinition. **The core MUST NOT hardcode a claim taxonomy, step graph, or route-back limit.**
+The hardcoded `workflow.*` taxonomy in the current `stop-goal-fit.js` and the hardcoded `>= 3` / `phase==="learning"` rules in `advance-state` are **violations to remediate**, not "core contract." (This corrects ADR 0009 §3, which reclassified skill/template names but left the hardcoded taxonomy in place.)
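A FlowDefinition-driven enforcer and route-back check might look like the sketch below. It is an illustration under assumptions, not the flow or flow-agents API: only `expects` and `route_back_policy` are named in this ADR, while `accepted`, `max_route_backs`, the status strings, and both function names are hypothetical.

```typescript
// Hypothetical, simplified FlowDefinition shape. Only `expects` and
// `route_back_policy` are named in the ADR; every other field is an assumption.
interface Expectation {
  claim: string;      // kit-namespaced claim type, e.g. "builder.verify.tests"
  accepted: string[]; // statuses that satisfy the gate (assumed field)
}
interface Gate {
  id: string;
  expects: Expectation[];
}
interface FlowDefinition {
  gates: Gate[];
  route_back_policy: { max_route_backs: number }; // assumed field name
}

// Returns the claim types the gate still waits on. `derivedStatus` stands in
// for statuses re-derived through the trust layer, never self-reported ones.
function unmetExpectations(
  flow: FlowDefinition,
  gateId: string,
  derivedStatus: Map<string, string>,
): string[] {
  const gate = flow.gates.find((g) => g.id === gateId);
  if (gate === undefined) {
    throw new Error(`gate ${gateId} is not declared in the FlowDefinition`);
  }
  return gate.expects
    .filter((e) => !e.accepted.includes(derivedStatus.get(e.claim) ?? ""))
    .map((e) => e.claim);
}

// The route-back limit comes from the FlowDefinition, not a literal in core.
function canRouteBack(flow: FlowDefinition, routeBacksSoFar: number): boolean {
  return routeBacksSoFar < flow.route_back_policy.max_route_backs;
}

const builderFlow: FlowDefinition = {
  gates: [
    { id: "verify", expects: [{ claim: "builder.verify.tests", accepted: ["verified"] }] },
  ],
  route_back_policy: { max_route_backs: 3 },
};
```

Nothing in either function mentions `workflow.*`, `builder.*`, or a numeric limit; swapping in a Knowledge FlowDefinition changes the behavior without touching the code, which is the property Abstraction A asks for.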
+### 3. Abstraction B — claim-taxonomy ownership
+The **kit's FlowDefinition is authoritative** for which claims each gate expects. The core derives the generic *kind* of an expectation (a check, a critique, an acceptance, etc.) from the FlowDefinition's expectation metadata; it does not pattern-match a hardcoded namespace. Generic claim **kinds** are flow/core vocabulary; the **binding** of kinds to steps + accepted statuses is the kit's FlowDefinition. (Reconciles ADR 0004's kit-namespaced examples with the core enforcer.)
+### 4. Abstraction C — cardinality & lifetime are kit parameters, not new engines
+Builder and Knowledge are the **same** model at two settings of two parameters (per #183): **cardinality** (Builder = one work-item subject; Knowledge = many records — `SelectedScope` already says "one or many") and **lifetime** (run-scoped vs durable Resources via `ownerReferences`). New kits set these parameters over the same core + FlowDefinition machinery; they do not author new lifecycle engines.
+### 5. Abstraction D — Resource/claim separation is the architecture (invariant)
+Per ADR 0005 + #183 Finding 2: the **Resource Contract** (`WorkflowRun`/`RunPlan`/`status.conditions`) is the run/state model; the bespoke `workflow-sidecar` FSM is a parallel reimplementation that **retires** via ADR 0015's phased migration. Conditions are **writable summaries**; the gate **re-derives** truth from Hachure claims via Surface; `conditions[].evidenceRefs` cite claim IDs. **Resource and claim must not be fused** — the friendly mutable surface over the un-gameable derived core is the whole point.
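The separation between a writable condition and a re-derived result can be shown with a small sketch. The shapes, the status values, and the function `rederive` are hypothetical; in the real stack the claim statuses would come from Surface's derivation over Hachure claims, not from a hand-built map.

```typescript
// Hypothetical shapes; status values and field names other than
// `evidenceRefs` are assumptions for illustration.
type ClaimStatus = "verified" | "rejected" | "pending";
interface Condition {
  type: string;
  status: "True" | "False"; // writable summary, never trusted by the gate
  evidenceRefs: string[];   // claim IDs
}

// Re-derives whether a condition holds from claim statuses alone; the
// condition's own `status` field is deliberately ignored.
function rederive(condition: Condition, claimStatus: Map<string, ClaimStatus>): boolean {
  return (
    condition.evidenceRefs.length > 0 &&
    condition.evidenceRefs.every((ref) => claimStatus.get(ref) === "verified")
  );
}

const claims = new Map<string, ClaimStatus>([
  ["claim-1", "verified"],
  ["claim-2", "rejected"],
]);
// Both conditions claim "True"; only the first is backed by a verified claim.
const honest: Condition = { type: "TestsPass", status: "True", evidenceRefs: ["claim-1"] };
const tampered: Condition = { type: "ReviewDone", status: "True", evidenceRefs: ["claim-2"] };
```

Writing `status: "True"` into a condition changes what a reader sees but not what the gate decides, which is the anti-gaming property the invariant protects.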
+### 6. Resolution of the 0014↔0015 contradiction
+Both were partly right. The core **owns a lifecycle engine** (0014) — but that engine is the **FlowDefinition / Resource-Contract-driven** one defined here, **not** the bespoke FSM, which retires (0015). ADR 0014's "keep the FSM as core, tweak defaults" finding is superseded by this ADR; its boundary *principle* (the agent-blind "dividing test") stands.
+## Consequences
+- A clear target for the ADR-0015 migration: each phase moves a core mechanism from hardcoded behavior to FlowDefinition-driven behavior, within these boundaries.
+- The first remediation of Abstraction A is the gate enforcer: `stop-goal-fit` should evaluate the active kit's FlowDefinition `expects[]` (still re-deriving via Surface — the anti-gaming property is unchanged, only the *source of expectations* moves).
+- New kits become "write a FlowDefinition + skills + (optionally) a store adapter" — no engine work.
+- **P-d (dual-emit shadow retired):** FlowDefinition-driven sessions now emit ONLY the declared `builder.*` (or kit-namespaced) claim per gate; the `-legacy` `workflow.*` shadow is removed. The no-flow `workflow.*` primary path in `buildTrustBundle` is the LEGITIMATE home for standalone primitive use (not scaffolding) and is preserved unchanged. Fully removing the no-flow path would require forcing primitives through a default/minimal FlowDefinition, which is a separate future decision outside this cleanup.
+- Terminology: "core" is fixed to the flow-agents kit-agnostic execution layer; 0001/0009/0014/0015 usage aligns to it.
+## References
+- ADR 0001, 0004, 0005, 0007, 0008, 0009, 0014, 0015
+- `docs/kontour-resource-contract.md` (Compatibility Guidance)
+- Issue #183 (synthesis input — "not a new decision"); #174 (boundary umbrella); #177 (the migration)
+- `scripts/hooks/stop-goal-fit.js` (the Abstraction-A violation to remediate); `kits/builder/flows/build.flow.json`, `kits/knowledge/flows/*.flow.json` (kit FlowDefinitions)