@tangle-network/agent-runtime 0.8.0 → 0.11.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@tangle-network/agent-runtime",
- "version": "0.8.0",
+ "version": "0.11.0",
  "description": "Reusable runtime lifecycle for domain-specific agents.",
  "homepage": "https://github.com/tangle-network/agent-runtime#readme",
  "repository": {
@@ -18,18 +18,42 @@
  "types": "./dist/index.d.ts",
  "import": "./dist/index.js",
  "default": "./dist/index.js"
+ },
+ "./platform": {
+ "types": "./dist/platform.d.ts",
+ "import": "./dist/platform.js",
+ "default": "./dist/platform.js"
+ },
+ "./analyst-loop": {
+ "types": "./dist/analyst-loop.d.ts",
+ "import": "./dist/analyst-loop.js",
+ "default": "./dist/analyst-loop.js"
+ },
+ "./agent": {
+ "types": "./dist/agent.d.ts",
+ "import": "./dist/agent.js",
+ "default": "./dist/agent.js"
  }
  },
  "files": [
  "dist",
- "README.md",
- "docs"
+ "README.md"
  ],
  "publishConfig": {
  "access": "public"
  },
+ "scripts": {
+ "build": "tsup",
+ "dev": "tsup --watch",
+ "prepare": "tsup",
+ "test": "vitest run",
+ "test:watch": "vitest",
+ "lint": "biome check src tests examples",
+ "lint:fix": "biome check --write src tests examples",
+ "typecheck": "tsc --noEmit"
+ },
  "dependencies": {
- "@tangle-network/agent-eval": "^0.24.0"
+ "@tangle-network/agent-eval": "^0.30.0"
  },
  "devDependencies": {
  "@biomejs/biome": "^2.4.0",
@@ -39,20 +63,21 @@
  "typescript": "^5.7.0",
  "vitest": "^3.0.0"
  },
+ "pnpm": {
+ "minimumReleaseAge": 4320,
+ "minimumReleaseAgeExclude": [
+ "@tangle-network/agent-eval"
+ ],
+ "onlyBuiltDependencies": [
+ "esbuild"
+ ]
+ },
  "engines": {
  "node": ">=20"
  },
  "license": "MIT",
+ "packageManager": "pnpm@10.28.0",
  "peerDependencies": {
  "@tangle-network/sandbox": "0.1.2"
- },
- "scripts": {
- "build": "tsup",
- "dev": "tsup --watch",
- "test": "vitest run",
- "test:watch": "vitest",
- "lint": "biome check src tests examples",
- "lint:fix": "biome check --write src tests examples",
- "typecheck": "tsc --noEmit"
  }
- }
+ }
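The `exports` hunk above adds three subpath entries (`./platform`, `./analyst-loop`, `./agent`). As a rough illustration of what that means for consumers, the sketch below maps an import specifier to the file listed for a given condition. `resolveSubpath` is a hypothetical helper written for this note, not part of the package, and real Node.js resolution handles more cases (patterns, nested conditions, the root entry). What each subpath actually exports is not visible in this diff.

```typescript
// Subpath entries copied from the 0.11.0 package.json shown in the diff.
type ConditionMap = { types: string; import: string; default: string };

const exportsMap: Partial<Record<string, ConditionMap>> = {
  "./platform": {
    types: "./dist/platform.d.ts",
    import: "./dist/platform.js",
    default: "./dist/platform.js",
  },
  "./analyst-loop": {
    types: "./dist/analyst-loop.d.ts",
    import: "./dist/analyst-loop.js",
    default: "./dist/analyst-loop.js",
  },
  "./agent": {
    types: "./dist/agent.d.ts",
    import: "./dist/agent.js",
    default: "./dist/agent.js",
  },
};

const PACKAGE_NAME = "@tangle-network/agent-runtime";

// Hypothetical helper: returns the target file for a subpath specifier,
// or undefined when the specifier is not one of the entries above.
function resolveSubpath(
  specifier: string,
  condition: keyof ConditionMap = "import",
): string | undefined {
  if (!specifier.startsWith(`${PACKAGE_NAME}/`)) return undefined;
  const subpath = `.${specifier.slice(PACKAGE_NAME.length)}`;
  return exportsMap[subpath]?.[condition];
}
```

Under this sketch, `@tangle-network/agent-runtime/platform` resolves to `./dist/platform.js`, while a subpath that is not listed resolves to nothing, which matches how an `exports` map restricts deep imports.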
@@ -1,165 +0,0 @@
- # Domain Agent Runtime Integration Issues
-
- GitHub issue creation was blocked in this environment: the GitHub connector returns 404 for the private repos and `gh` is not authenticated. These are the issue drafts to open once repository credentials are available.
-
- ## tax-agent
-
- Title: Integrate tax evals with agent-runtime knowledge readiness
-
- `@tangle-network/agent-runtime@0.2.0` provides the shared task harness for domain agents: `runAgentTask`, typed runtime events, `KnowledgeRequirement`, readiness scoring, question/acquisition preflight hooks, and integration with `@tangle-network/agent-eval@0.20.0`.
-
- First local implementation:
-
- - Bump root `@tangle-network/agent-eval` to `0.20.0`.
- - Add root `@tangle-network/agent-runtime` `^0.2.0`.
- - Bump `server` `@tangle-network/agent-eval` to `0.20.0`.
- - Add `tests/eval/lib/agent-runtime.ts` with `buildTaxKnowledgeRequirements()` and `runTaxAgentTask()`.
- - Add `tests/eval/lib/agent-runtime.test.ts` covering missing-context blocking and ready-context execution.
-
- Remaining changes:
-
- - Wire the runtime helper into `tests/eval/lib/deterministic-tax-workflow-runner.ts`, `tests/eval/harness/run-product-eval.ts`, `tests/eval/harness/run-workspace-eval.ts`, `tests/eval/optimize.ts`, `tests/eval/lib/agent-eval-runtime.ts`, and `tests/eval/lib/trace-sync.ts`.
- - Canonicalize tax requirements for taxpayer facts, filing year, jurisdiction, source documents, workflow tools, entity classification, elections, accounting method, book/tax reconciliation, payroll, withholding, estimated payments, credits, nexus, apportionment, OCR confidence, current authority, and user-confirmed assumptions.
- - Persist runtime events into traces: `task_start`, `readiness_start/end`, `questions_start/end`, `acquisition_start/end`, `control_start/end`, and `task_end`.
- - Report readiness score, blocking gaps, acquisition mode, evidence IDs, blocked-before-execution status, and optimizer responsible surface.
- - Classify missing documents, missing jurisdiction, stale authority, bad retrieval, insufficient evidence, contradictory evidence, missing credentials, and ambiguous taxpayer intent as knowledge failures rather than prompt failures.
- - Feed gaps into multi-shot optimization as `knowledge-requirements`, `data-acquisition`, `retrieval-policy`, or `user-question-policy`.
- - Add tests for missing taxpayer facts, missing documents, missing jurisdiction, stale authority, and fully ready execution.
-
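The removed draft above refers to readiness scoring and blocking gaps without showing their shape. A minimal sketch of the idea follows, assuming a requirement is either satisfied or not and may be marked blocking. `RequirementSketch`, `ReadinessSketch`, and `scoreReadiness` are hypothetical names invented here; the real `KnowledgeRequirement` type and scoring rules in `agent-runtime` may differ.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface RequirementSketch {
  id: string;
  blocking: boolean;
  satisfied: boolean;
}

interface ReadinessSketch {
  score: number; // fraction of requirements satisfied, in [0, 1]
  blockingGaps: string[]; // ids of unsatisfied blocking requirements
  status: "ready" | "caveat" | "blocked";
}

// Assumed rule: any unsatisfied blocking requirement blocks execution;
// unsatisfied non-blocking requirements only downgrade to "caveat".
function scoreReadiness(requirements: RequirementSketch[]): ReadinessSketch {
  const satisfied = requirements.filter((r) => r.satisfied).length;
  const score = requirements.length === 0 ? 1 : satisfied / requirements.length;
  const blockingGaps = requirements
    .filter((r) => r.blocking && !r.satisfied)
    .map((r) => r.id);
  const status =
    blockingGaps.length > 0 ? "blocked" : score < 1 ? "caveat" : "ready";
  return { score, blockingGaps, status };
}
```

A tax task missing a blocking `filing-year` requirement would be reported as `blocked` before any agent execution, which is the behavior the draft asks the tests to cover.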
- ## legal-agent
-
- Title: Integrate legal evals with agent-runtime knowledge readiness
-
- First local implementation:
-
- - Bump `@tangle-network/agent-eval` dev dependency to `^0.20.0`.
- - Add `@tangle-network/agent-runtime` dev dependency `^0.2.0`.
- - Add `tests/eval/lib/agent-runtime.ts` with `buildLegalKnowledgeRequirements()` and `runLegalAgentTask()`.
- - Add `tests/eval/lib/agent-runtime.test.ts` covering missing matter context and ready execution.
-
- Remaining changes:
-
- - Wire `runLegalAgentTask()` into `tests/eval/harness/run-product-eval.ts`, `tests/eval/harness/run-dual-agent-eval.ts`, `tests/eval/optimize.ts`, `tests/eval/lib/agent-eval-runtime.ts`, and `tests/eval/lib/trace-sync.ts`.
- - Canonicalize legal requirements for matter facts, parties, dates, jurisdiction, forum, venue, governing law, current authority, uploaded matter documents, knowledge-base search, client credentials, regulated industry constraints, deadlines, and user-approved assumptions.
- - Enforce knowledge-base retrieval before drafting or review, with retrieval evidence saved into the runtime bundle.
- - Persist runtime readiness metadata and events into trace sync and eval reports.
- - Classify missing facts, missing authority, stale law, bad retrieval, insufficient evidence, contradictory authority, missing credentials, and ambiguous user intent as knowledge failures.
- - Extend optimization ASI rows so legal reports recommend data acquisition or authority refresh before prompt rewrites.
- - Add tests for missing jurisdiction, stale authority, missing knowledge-base results, contradictory authority, and ready document review.
-
- ## gtm-agent
-
- Title: Integrate GTM evals with agent-runtime knowledge readiness
-
- First local implementation:
-
- - Bump `@tangle-network/agent-eval` to `0.20.0`.
- - Add `@tangle-network/agent-runtime` `^0.2.0`.
- - Add `eval/agent-runtime.ts` with `buildGtmKnowledgeRequirements()` and `runGtmAgentTask()`.
- - Add `tests/agent-runtime.test.ts` covering missing company/connector context and ready execution.
-
- Remaining changes:
-
- - Wire `runGtmAgentTask()` into `eval/business-owner/live-flow.ts`, `eval/discover-brief/run.ts`, `eval/run.ts`, `eval/agent-eval-traces.ts`, and optimizer flows.
- - Canonicalize GTM requirements for company profile, product/offer, ICP/personas, channel history, CRM state, campaign history, analytics, performance metrics, positioning, competitors, integrations, credentials, and workspace vault knowledge.
- - Turn `knowledge/wiki/company.md`, `products.md`, `channels.md`, and `open-questions.md` into runtime evidence inputs.
- - Persist readiness events into existing trace helpers and reports.
- - Classify missing company data, missing market data, stale metrics, missing credentials, bad connector retrieval, insufficient evidence, contradictory positioning, and ambiguous business-owner intent as knowledge failures.
- - Feed knowledge gaps into optimization as data-acquisition/retrieval-policy/user-question-policy issues.
- - Add tests for missing ICP, missing product context, missing connectors, stale metrics, and ready discover-brief execution.
-
- ## creative-agent
-
- Title: Integrate creative evals with agent-runtime knowledge readiness
-
- First local implementation:
-
- - Bump `@tangle-network/agent-eval` to `^0.20.0`.
- - Add `@tangle-network/agent-runtime` `^0.2.0`.
- - Add `eval/control/agent-runtime.ts` with `buildCreativeKnowledgeRequirements()` and `runCreativeAgentTask()`.
- - Add `tests/agent-runtime.test.ts` covering missing creative intent/rights and ready execution.
-
- Remaining changes:
-
- - Wire `runCreativeAgentTask()` into `eval/control/creative-onboarding.ts`, `eval/control/creative-workflow-optimization.ts`, `eval/control/creative-multishot-optimization.ts`, `eval/e2e/creative-product-harness.ts`, and trace/report code.
- - Canonicalize creative requirements for intent, audience, taste thesis, references, dislikes, brand system, source rights, generated asset policy, deliverable specs, channel constraints, localization, approval policy, feedback source, and revision budget.
- - Convert onboarding answers and approval feedback into runtime evidence and reusable knowledge requirements.
- - Persist readiness events in control/e2e traces and reports.
- - Classify missing creative intent, missing taste signal, missing brand assets, missing rights, stale source policy, bad asset retrieval, insufficient evidence, contradictory feedback, and ambiguous user intent as knowledge failures.
- - Feed approval/readiness gaps into multi-shot optimization as data or policy surfaces before prompt changes.
- - Add tests for missing intent, missing rights, missing approval policy, contradictory feedback, and ready workflow execution.
-
- ## blueprint-agent
-
- Title: Integrate Blueprint benchmarks with agent-runtime knowledge readiness
-
- No implementation was requested for this pass. The repo should adopt the runtime boundary before deeper report polish so benchmark failures separate missing task-world knowledge from coding-agent failures.
-
- Remaining changes:
-
- - Add `@tangle-network/agent-runtime` to the workspace package(s) that own benchmark execution.
- - Keep existing benchmark database/report terms stable if they still say `vertical`; map to runtime `domain` metadata instead of forcing a broad rename.
- - Define Blueprint requirements for task brief, repo/source checkout, package manager, framework/language, build command, test command, sandbox availability, runtime environment, credentials/secrets, Tangle Blueprint SDK docs, template/plugin docs, deploy target, and benchmark scoring contract.
- - Wrap the benchmark product path with `runAgentTask` so readiness is scored before agent execution.
- - Persist runtime events into agent-eval traces and `bench-report`.
- - Add readiness sections to markdown/HTML/JSON reports: requirement table, missing evidence, acquisition plan, blocked-before-execution runs, and readiness deltas by run/version.
- - Map failures to `knowledge_readiness_blocked`, `missing_codebase_context`, `missing_runtime_context`, `missing_credentials`, `stale_external_data`, `bad_retrieval`, `insufficient_evidence`, `contradictory_evidence`, `reasoning_error`, `tool_selection_error`, `sandbox_failure`, and `budget_exceeded`.
- - Extend holdout/promotion gates so a prompt/topology change is not promoted when observed gain is actually due to different context acquisition.
- - Add tests for missing build command, missing SDK docs, missing sandbox, stale template docs, and fully ready benchmark execution.
-
- ## agent-builder
-
- Title: Integrate agent-builder with agent-runtime knowledge readiness
-
- `agent-builder` is the meta-platform for creating, testing, deploying, researching, and monetizing domain-specific agents. It already has the strongest production loop among the app repos: Forge builder sims, per-agent scenarios, feedback trajectories, canaries, auto-research, multi-shot optimization, KB optimization, version history, sandbox execution, marketplace publishing, and Playwright-to-agent-eval reporting.
-
- The missing boundary is that generated agents and Forge runs do not yet have a first-class `agent-runtime` preflight. Failures caused by missing build spec context, missing user/business/domain data, unavailable integrations, missing secrets, stale KB evidence, sandbox/runtime gaps, or bad retrieval can still be absorbed as prompt/config failures.
-
- Package updates:
-
- - Upgrade `@tangle-network/agent-eval` from `^0.19.1` to `^0.20.0`.
- - Upgrade `@tangle-network/agent-knowledge` from `^1.0.0` to `^1.1.0`.
- - Add `@tangle-network/agent-runtime` `^0.2.0`.
-
- Required changes:
-
- - Add a server-side runtime module: `src/lib/.server/runtime/agent-builder-runtime.ts`, `src/lib/.server/runtime/requirements.ts`, and `src/lib/.server/runtime/events.ts`.
- - Expose helpers: `buildForgeKnowledgeRequirements(input)`, `buildPublishedAgentKnowledgeRequirements(input)`, `runForgeAgentTask(input)`, `runPublishedAgentTask(input)`, and `runtimeEventsToTraceMetadata(events)`.
- - Define Forge build requirements for creator intent, target user, domain/category, BuildSpec completeness and approval, expected artifact, success criteria, tools, integrations, APIs, credentials, secrets, sandbox availability, runtime image, repository/scaffold structure, generated agent config, pricing/marketplace/governance constraints, and user-approved assumptions.
- - Define generated-agent requirements for domain-specific user facts, company/business/product context, regulatory freshness, connected integrations, secrets, vault/knowledge pages, source freshness, scenario fixtures, and fallback policy.
- - Persist generated-agent runtime metadata with agent config/version metadata so forks and marketplace consumers inherit the right contract.
- - Wire `runAgentTask` into Forge builder sims: `src/lib/.server/eval/forge-builder-sim.ts`, `src/routes/api.agents.eval.builder-sim.ts`, and `src/routes/api.admin.builder-sim.run.ts`.
- - Wire scenario and eval runs through runtime: `src/routes/api.agents.$agentId.scenarios.run.ts`, `src/routes/api.agents.$agentId.eval.simulate.ts`, `src/routes/api.agents.$agentId.eval.refine.ts`, and `src/routes/api.agents.$agentId.eval.ts`.
- - Add readiness checks or metadata to production chat/sandbox chat: `src/routes/api.agents.$agentId.chat.ts`, `src/routes/api.agents.$agentId.sandbox.chat.ts`, and `src/routes/api.v1.agents.$slug.chat.completions.ts`.
- - Bridge `agent-knowledge@1.1.0` readiness with `src/routes/api.agents.$agentId.knowledge.index.ts`, `src/routes/api.agents.$agentId.knowledge.search.ts`, `src/routes/api.agents.$agentId.knowledge.discover.ts`, `src/routes/api.agents.$agentId.knowledge.write-blocks.ts`, and `src/lib/.server/kb/optimization.ts`.
- - Persist runtime events into trace/report surfaces: `src/lib/.server/eval/trace-store-d1.ts`, `src/lib/.server/eval/session.ts`, `src/lib/.server/eval/run-record-store.ts`, `src/lib/.server/eval/run-record-fields.ts`, and `e2e/reporters/agent-eval-reporter.ts`.
- - Preserve `task_start`, `readiness_start/end`, `questions_start/end`, `acquisition_start/end`, `control_start/end`, and `task_end`.
- - Report readiness score, blocking gaps, acquisition mode, evidence IDs, user questions, runtime status, and blocked-before-execution status.
- - Map failures to `knowledge_readiness_blocked`, `missing_user_data`, `missing_domain_data`, `missing_codebase_context`, `missing_runtime_context`, `missing_credentials`, `stale_external_data`, `bad_retrieval`, `insufficient_evidence`, `contradictory_evidence`, and `ambiguous_user_intent`.
- - Update `src/lib/.server/eval/failure-inspector.ts`, `src/lib/.server/eval/multi-shot-adapter.ts`, `src/lib/.server/eval/heuristic-researcher.ts`, `src/lib/.server/eval/auto-research-runner.ts`, and `src/lib/.server/eval/canary-cron.ts`.
- - Extend ASI responsible surfaces beyond `agent.config.systemPrompt`: `knowledge-requirements`, `data-acquisition`, `retrieval-policy`, `user-question-policy`, `runtime-environment`, `integration-policy`, `sandbox-policy`, and `agent.config.systemPrompt`.
- - Ensure auto-research does not mutate prompts when the dominant failure is missing KB pages, missing BuildSpec fields, missing secrets, unavailable sandbox, or stale evidence.
- - Persist runtime contracts across versions, publish, and forks: `src/lib/.server/versions.ts`, `src/routes/api.agents.$agentId.versions.ts`, `src/routes/api.agents.$agentId.versions.$versionId.revert.ts`, `src/routes/api.agents.$agentId.fork.ts`, `src/routes/api.agents.$agentId.fork.apply-update.ts`, and `src/routes/api.agents.$agentId.publish.ts`.
- - Add workbench/admin UI visibility for readiness on eval pages, research cycle pages, knowledge pages, chat/workbench blockers, and marketplace/published agent caveats. Do not expose private/secret requirement details to public consumers.
-
- Tests:
-
- - Forge sim blocks when BuildSpec or intent is incomplete.
- - Forge sim blocks or asks when required integration credentials are missing.
- - Scenario run records readiness metadata.
- - KB optimization reports no pages / too few labels as knowledge readiness failures.
- - Multi-shot ASI points to data-acquisition/retrieval-policy instead of `agent.config.systemPrompt` for missing knowledge.
- - Fork preserves runtime contract.
- - Public chat hides private/secret requirement details.
- - E2E reporter can include runtime readiness metadata.
-
- Acceptance criteria:
-
- - `agent-builder` depends on `agent-runtime@^0.2.0`, `agent-eval@^0.20.0`, and `agent-knowledge@^1.1.0`.
- - Forge builder sims run through `runAgentTask` or a thin typed wrapper.
- - Per-agent scenario/eval runs attach `KnowledgeReadinessReport` before execution.
- - KB search/optimization feeds readiness into runtime and traces.
- - Missing BuildSpec, missing credentials, missing KB pages, stale evidence, bad retrieval, and sandbox unavailability are classified as knowledge/runtime failures.
- - Multi-shot optimization can recommend acquisition/retrieval/user-question/runtime-policy changes instead of always mutating prompts.
- - Runtime requirements persist across config versions, publish, and fork flows.
- - Workbench/admin reports show readiness score, gaps, acquisition plan, runtime status, and blocked-before-execution runs.
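The removed draft lists failure codes and ASI responsible surfaces, and asks that prompts not be mutated when the dominant failure is a knowledge or runtime gap. One way that routing could look is sketched below. The failure codes and surface names are taken from the draft, but the specific code-to-surface mapping and the helper names (`responsibleSurface`, `dominantFailure`, `shouldMutatePrompt`) are assumptions made for illustration.

```typescript
type ResponsibleSurface =
  | "knowledge-requirements"
  | "data-acquisition"
  | "retrieval-policy"
  | "user-question-policy"
  | "runtime-environment"
  | "integration-policy"
  | "sandbox-policy"
  | "agent.config.systemPrompt";

// Assumed mapping for illustration; the draft does not pin codes to surfaces.
const SURFACE_BY_FAILURE: Partial<Record<string, ResponsibleSurface>> = {
  knowledge_readiness_blocked: "knowledge-requirements",
  missing_user_data: "user-question-policy",
  missing_domain_data: "data-acquisition",
  missing_codebase_context: "data-acquisition",
  missing_runtime_context: "runtime-environment",
  missing_credentials: "integration-policy",
  stale_external_data: "data-acquisition",
  bad_retrieval: "retrieval-policy",
  insufficient_evidence: "data-acquisition",
  contradictory_evidence: "retrieval-policy",
  ambiguous_user_intent: "user-question-policy",
};

// Unmapped failures (for example reasoning errors) fall back to the prompt.
function responsibleSurface(failure: string): ResponsibleSurface {
  return SURFACE_BY_FAILURE[failure] ?? "agent.config.systemPrompt";
}

// Most frequent failure code; ties go to the code that reached the count first.
function dominantFailure(failures: string[]): string | undefined {
  const counts = new Map<string, number>();
  let best: string | undefined;
  let bestCount = 0;
  for (const failure of failures) {
    const count = (counts.get(failure) ?? 0) + 1;
    counts.set(failure, count);
    if (count > bestCount) {
      best = failure;
      bestCount = count;
    }
  }
  return best;
}

// Only allow prompt mutation when the dominant failure points at the prompt.
function shouldMutatePrompt(failures: string[]): boolean {
  const top = dominantFailure(failures);
  return top !== undefined && responsibleSurface(top) === "agent.config.systemPrompt";
}
```

With this sketch, a run dominated by `bad_retrieval` is routed to `retrieval-policy` and leaves the system prompt untouched.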
@@ -1,326 +0,0 @@
- # Product Runtime Kernel
-
- Status: complete. Implemented in `@tangle-network/agent-runtime@0.5.0`;
- validated, documented, and hardened through `0.5.2`.
-
- This document is the completion record for the production runtime kernel: what
- it is for, what is done, how it was validated, what is intentionally outside the
- public package, and how product repos should adopt it.
-
- ## Purpose
-
- `agent-runtime` exists to make agent execution consistent across products and
- eval harnesses. It should own the contract for:
-
- - readiness gating before execution;
- - session create/resume for long-running coding harnesses;
- - backend-agnostic streaming;
- - sanitized product/eval telemetry;
- - durable evidence that can feed reports, failure classification, and
- optimization.
-
- It should not be a decorative event logger around unrelated product code. If a
- product route still calls a backend directly, hand-rolls SSE, and only emits
- `start/end`, it is not getting the full value.
-
- ## Runtime Flow
-
- ```txt
- TaskSpec
- -> knowledge readiness
- -> optional ask/acquire/refresh
- -> readiness decision
- -> session create/resume
- -> execution backend stream
- -> normalized RuntimeStreamEvent
- -> sanitized SSE / persisted session event history
- -> final task status
- ```
-
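The flow above can be read as a readiness gate in front of a streamed backend. The sketch below shows that ordering with an async generator. It is not the package's `runAgentTaskStream`: the option names, the event payload shapes, and the `"blocked"` final status are assumptions, and only the event type names (`readiness_end`, `session_created`, `session_resumed`, `text_delta`, `final`) come from the removed document.

```typescript
// Hypothetical, trimmed-down event shapes for illustration.
type FlowEventSketch =
  | { type: "readiness_end"; status: "ready" | "blocked" }
  | { type: "session_created" | "session_resumed"; sessionId: string }
  | { type: "text_delta"; text: string }
  | { type: "final"; status: "completed" | "blocked"; text: string };

interface FlowOptionsSketch {
  checkReadiness: () => "ready" | "blocked";
  sessionId: string;
  resume?: boolean;
  backend: (sessionId: string) => AsyncIterable<string>;
}

async function* runFlowSketch(
  options: FlowOptionsSketch,
): AsyncGenerator<FlowEventSketch> {
  // 1. Readiness is decided before the backend is touched.
  const status = options.checkReadiness();
  yield { type: "readiness_end", status };
  if (status === "blocked") {
    yield { type: "final", status: "blocked", text: "" };
    return;
  }

  // 2. Create or resume the session.
  yield {
    type: options.resume ? "session_resumed" : "session_created",
    sessionId: options.sessionId,
  };

  // 3. Normalize backend chunks into text deltas and accumulate the result.
  let text = "";
  for await (const chunk of options.backend(options.sessionId)) {
    text += chunk;
    yield { type: "text_delta", text: chunk };
  }

  // 4. Report the final task status.
  yield { type: "final", status: "completed", text };
}
```

The property worth testing in a product route is the one the document calls out: when readiness is blocked, the backend function is never invoked.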
- ## Definition of Done
-
- The kernel is complete when these are true:
-
- - A product route can call one runtime entry point, `runAgentTaskStream`, rather
- than hand-rolling readiness + backend execution + SSE framing.
- - A coding harness can continue an existing workspace by passing `sessionId` and
- `resume: true`.
- - A backend can be swapped without changing product stream consumers.
- - A failed backend emits structured failure events and gets a `stop()` callback
- when available.
- - All UI/report telemetry has a safe sanitized representation by default.
- - Eval and optimization systems can distinguish missing context/runtime failure
- from prompt/model reasoning failure.
-
- All kernel-side criteria are satisfied in `0.5.2`. Durable storage and UI
- rollout are product adoption tasks, not core package blockers.
-
- Completion verdict: passed. There are no open kernel blockers in this document.
-
- ## Completed API Surface
-
- ### Execution
-
- - `runAgentTaskStream(options)`
- - Applies readiness before backend execution.
- - Emits `task_start`, `readiness_start`, `readiness_end`.
- - Stops before backend execution when blocking gaps remain.
- - Creates or resumes a backend session.
- - Normalizes backend output into `RuntimeStreamEvent`.
- - Emits `backend_start`, `backend_end`, `task_end`, and `final`.
- - Records backend stream events into an optional `RuntimeSessionStore`.
- - Calls `backend.stop(session, reason)` on stream failure when a backend
- supplies the hook.
-
- - `runAgentTask(options)`
- - Existing control-loop path for eval-oriented agents.
- - Still useful for deterministic eval/optimization harnesses that model
- observe/validate/decide/act directly.
-
- ### Stream Contract
-
- - `RuntimeStreamEvent`
- - Readiness: `readiness_start`, `readiness_end`.
- - Context collection: `questions_start`, `questions_end`,
- `acquisition_start`, `acquisition_end`.
- - Session: `session_created`, `session_resumed`.
- - Backend lifecycle: `backend_start`, `backend_end`, `backend_error`.
- - Product stream: `text_delta`, `reasoning_delta`, `tool_call`,
- `tool_result`, `artifact`.
- - Completion: `task_end`, `final`.
-
- ### Sessions
-
- - `RuntimeSession`
- - Stable `id`, backend kind, status, timestamps, optional `resumeToken`, and
- metadata.
-
- - `RuntimeSessionStore`
- - Minimal persistence contract: `get`, `put`, `appendEvent`, `listEvents`.
- - Product repos should back this with D1/Postgres/Redis/etc. for real resume.
-
- - `InMemoryRuntimeSessionStore`
- - Useful for tests, local demos, and short-lived worker processes.
- - Not durable enough for production resume by itself.
-
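The store contract above names four operations: `get`, `put`, `appendEvent`, and `listEvents`. A minimal in-memory sketch of that contract is shown below. The method names come from the removed document; the argument and return types are assumptions, the session shape is simplified, and the methods are synchronous here for brevity where a real store backed by D1/Postgres/Redis would almost certainly be asynchronous.

```typescript
// Hypothetical, simplified shapes; the real RuntimeSession and store types may differ.
interface SessionSketch {
  id: string;
  backend: string;
  status: "active" | "completed" | "failed";
  resumeToken?: string;
}

interface StoredEventSketch {
  type: string;
}

class MemorySessionStoreSketch {
  private readonly sessions = new Map<string, SessionSketch>();
  private readonly events = new Map<string, StoredEventSketch[]>();

  get(id: string): SessionSketch | undefined {
    return this.sessions.get(id);
  }

  put(session: SessionSketch): void {
    this.sessions.set(session.id, session);
  }

  // Appends to the per-session history, creating it on first use.
  appendEvent(sessionId: string, event: StoredEventSketch): void {
    const history = this.events.get(sessionId) ?? [];
    history.push(event);
    this.events.set(sessionId, history);
  }

  // Returns a copy so callers cannot mutate the stored history.
  listEvents(sessionId: string): StoredEventSketch[] {
    return [...(this.events.get(sessionId) ?? [])];
  }
}
```

Because everything lives in process memory, state is lost on restart, which is why the document treats an in-memory store as unsuitable for production resume on its own.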
- ### Backend Abstraction
-
- - `AgentExecutionBackend`
- - `start`, `resume`, `stream`, optional `stop`.
- - SDK-agnostic: the package owns the contract, callers own concrete clients
- and auth.
-
- - `createIterableBackend`
- - Escape hatch for custom harnesses, browser agents, and test doubles.
-
- - `createSandboxPromptBackend`
- - Wraps sandbox/sidecar clients that expose `streamPrompt`.
- - Supports caller-provided session IDs and resume via backend `resume`.
- - Maps common sandbox events to `text_delta`, `tool_call`, and `tool_result`.
-
- - `createOpenAICompatibleBackend`
- - Wraps TCloud/OpenAI-compatible `/chat/completions` streaming APIs.
- - Normalizes streamed content deltas into `text_delta`.
- - Also covers [cli-bridge](https://github.com/drewstone/cli-bridge) and any
- other OpenAI-compatible HTTP gateway — point `baseUrl` at the bridge's
- `/v1` and use a `<harness>/<model>` string as `model`.
-
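The backend contract includes an optional `stop` hook, and the document states that a stream failure calls it, emits `backend_error`, and ends with a failed `final` that preserves partial text. The sketch below shows one way to wire that behavior. `BackendSketch` is a reduced, hypothetical stand-in for `AgentExecutionBackend` (it omits `start` and `resume`, and passes a session ID where the real hook takes a session), and `streamWithStop` is an illustrative name.

```typescript
// Hypothetical reduced backend contract: only the parts this sketch needs.
interface BackendSketch {
  stream(sessionId: string): AsyncIterable<string>;
  stop?(sessionId: string, reason: string): Promise<void> | void;
}

type BackendEventSketch =
  | { type: "text_delta"; text: string }
  | { type: "backend_error"; message: string }
  | { type: "final"; status: "completed" | "failed"; text: string };

async function* streamWithStop(
  backend: BackendSketch,
  sessionId: string,
): AsyncGenerator<BackendEventSketch> {
  let text = "";
  try {
    for await (const chunk of backend.stream(sessionId)) {
      text += chunk;
      yield { type: "text_delta", text: chunk };
    }
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    // Give the backend a chance to release resources, if it supplies the hook.
    await backend.stop?.(sessionId, message);
    yield { type: "backend_error", message };
    // Text received before the failure is kept on the failed final event.
    yield { type: "final", status: "failed", text };
    return;
  }
  yield { type: "final", status: "completed", text };
}
```

Consumers see the same event sequence regardless of which backend failed, which is what allows a backend to be swapped without changing stream consumers.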
- ### Sanitization and SSE
-
- - `sanitizeRuntimeStreamEvent(event, options)`
- - Redacts task inputs, user answers, control payloads, metadata, artifact
- URIs, and evidence IDs by default.
- - Reveals payloads only through explicit diagnostic options.
-
- - `runtimeStreamServerSentEvent(event, options)`
- - Encodes any sanitized runtime stream event as SSE.
- - Prevents every product route from hand-rolling inconsistent framing.
-
- - Existing helpers remain:
- - `sanitizeAgentRuntimeEvent`
- - `createRuntimeEventCollector`
- - `readinessServerSentEvent`
- - `encodeServerSentEvent`
-
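The two ideas in this section, redact by default and frame SSE consistently, can be sketched together. The code below is an illustration rather than the package's `sanitizeRuntimeStreamEvent` or `encodeServerSentEvent`: the function names, the event shape, and the `"[redacted]"` placeholder are assumptions. The `includeControlPayloads` option name is taken from the validation notes in the removed document.

```typescript
// Hypothetical tool-call event shape for illustration.
interface ToolCallEventSketch {
  type: "tool_call";
  name: string;
  payload: unknown;
}

// Payloads are hidden unless the caller explicitly opts in.
function sanitizeSketch(
  event: ToolCallEventSketch,
  options: { includeControlPayloads?: boolean } = {},
): ToolCallEventSketch {
  if (options.includeControlPayloads) return event;
  return { type: event.type, name: event.name, payload: "[redacted]" };
}

// Encodes one SSE frame. Newlines in control fields (event name, id) are
// replaced so a crafted value cannot inject extra SSE fields.
function encodeSseSketch(eventName: string, data: object, id?: string): string {
  const clean = (value: string) => value.replace(/[\r\n]+/g, " ");
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${clean(id)}`);
  lines.push(`event: ${clean(eventName)}`);
  // JSON.stringify escapes newlines inside strings, so data stays on one line.
  lines.push(`data: ${JSON.stringify(data)}`);
  return `${lines.join("\n")}\n\n`;
}
```

A route would sanitize first and encode second, so that anything written to the response has already passed through redaction.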
- ## Validation Matrix
-
- Implemented test coverage in `tests/runtime.test.ts`:
-
- - Ready task runs through the existing control lifecycle.
- - Missing blocking knowledge stops before action.
- - Knowledge question/acquisition hooks refresh readiness before control.
- - Sanitized runtime telemetry redacts secrets by default.
- - Readiness decisions return stable `ready`, `blocked`, and `caveat` states.
- - SSE encoding strips unsafe control-field newlines.
- - Readiness SSE payloads use sanitized reports.
- - `runAgentTaskStream` blocks backend execution when readiness is missing.
- - Streaming backend creates a session, persists events, and resumes by
- `sessionId`.
- - Sanitized tool-call stream events hide payloads by default and reveal them
- only with `includeControlPayloads`.
- - Sandbox prompt events map to text/tool runtime stream events.
- - OpenAI-compatible streaming chat completions parse token deltas and produce a
- final completed event.
- - Knowledge question preflight emits exactly one `questions_end`.
- - CLI bridge streams parse NDJSON events and include session/message payloads in
- bridge requests.
- - Backend stream failure calls `backend.stop`, emits `backend_error`, and
- returns a failed `final` event with partial text preserved.
-
- Release verification:
-
- - `pnpm test`
- - `pnpm typecheck`
- - `pnpm build`
- - Published to npm as `@tangle-network/agent-runtime@0.5.0`.
- - Documentation validation published in `@tangle-network/agent-runtime@0.5.1`.
- - Hardening validation published in `@tangle-network/agent-runtime@0.5.2`.
-
- ## Completion Scorecard
-
- | Area | Status | Evidence |
- | --- | --- | --- |
- | Readiness gate | Complete | `runAgentTaskStream` blocks before backend execution when readiness is blocked. |
- | Stream contract | Complete | `RuntimeStreamEvent` covers readiness, session, backend, text, reasoning, tool, artifact, error, task end, final. |
- | Session resume contract | Complete | `RuntimeSession`, `RuntimeSessionStore`, `session_created`, `session_resumed`, `resumeToken`. |
- | Backend abstraction | Complete | `AgentExecutionBackend` with `start`, `resume`, `stream`, optional `stop`. |
- | Sandbox adapter | Complete | `createSandboxPromptBackend`; product proof in `agent-builder` PR #61. |
- | TCloud/OpenAI-compatible adapter | Complete | `createOpenAICompatibleBackend`; tested with streamed chat completions. Also serves cli-bridge (OpenAI-compatible) and any HTTP gateway. |
- | SSE framing | Complete | `runtimeStreamServerSentEvent`, newline-safe SSE encoder. |
- | Sanitization | Complete | Default redaction for task inputs, answers, payloads, metadata, URIs, evidence IDs. |
- | Failure handling | Complete | Backend exceptions produce `backend_error`, failed `task_end`, failed `final`, and call `stop` when supplied. |
- | Durable persistence | Contract complete, product-owned | `RuntimeSessionStore` interface exists; product repos must provide D1/Postgres/Redis implementations. |
- | UI rollout | Product-owned | Runtime emits stable events; product UIs decide rendering. |
-
- ## Completion Boundaries
-
- The package is done when the reusable contract is complete. The package is not
- responsible for product-specific state, credentials, databases, or UX. Those are
- adoption responsibilities.
-
- ### Complete in `agent-runtime`
-
- - Public task and stream contracts.
- - Readiness-gated streamed execution.
- - Session create/resume contract.
- - Backend abstraction and adapter factories.
- - Safe stream sanitization.
- - SSE encoding.
- - Failure normalization and backend stop hook.
- - Unit tests for the contract and shipped adapters.
- - NPM package publication.
-
- ### Not Part of the Public Kernel
-
- - Product database migrations.
- - Product-specific session persistence.
- - Product-specific auth, secrets, billing, and rate limits.
- - UI components for resume/readiness.
- - Domain-specific knowledge requirements and tool policies.
- - Concrete private SDK client construction.
-
- These are not deferred kernel tasks. They are downstream integration tasks.
-
- ## Critique
-
- The runtime kernel is now materially useful, but it is not magic. The most
- important limitations are deliberate:
-
- - It does not construct TCloud, sandbox, or CLI bridge clients. Product repos
- own credentials and client lifecycle.
- - It does not persist sessions durably unless a product supplies a durable
- `RuntimeSessionStore`.
- - It does not enforce all budgets/approvals/tool policies by itself yet. Those
- still live in product adapters or `agent-eval` control loops.
- - It does not guarantee backend resume works if the underlying backend cannot
- resume. It passes stable session IDs/resume tokens and records history; the
- backend must honor them.
- - It does not replace domain-specific wrappers. Tax/legal/GTM/creative still
- need their own requirements, tools, prompts, and report semantics.
-
- These constraints are correct for a public package. The core should define the
- contract and provide high-quality adapters, not absorb private product code.
-
- The main remaining architectural risk is misuse: product teams can still bypass
- the kernel and directly call sandbox/TCloud/CLI streams. Reviews should treat
- new hand-rolled readiness + stream loops as a smell unless the route has a
- specific reason to avoid runtime normalization.
-
- ## Downstream Adoption Checklist
-
- For product routes:
-
- - Replace direct sandbox/CLI/TCloud stream loops with `runAgentTaskStream`.
- - Forward `runtimeStreamServerSentEvent(event)` to UI.
- - Preserve legacy UI events only as compatibility shims.
- - Store `RuntimeSession` and `RuntimeStreamEvent[]` in the product database.
- - Pass `sessionId` and `resume: true` for continuation.
- - Persist `final.status`, readiness decision, and backend kind in run records.
- - Assert in tests that blocked readiness does not call the backend.
-
- For coding harnesses:
-
- - Use `createSandboxPromptBackend`, `createOpenAICompatibleBackend` (also
- covers cli-bridge and other OpenAI-compatible HTTP gateways), or a custom
- `AgentExecutionBackend`.
- - Require a stable `sessionId` for any long-running workspace.
- - Surface `session_resumed` in telemetry so product/debug views can distinguish
- continuation from a fresh run.
- - Treat missing session state as a recoverable backend/runtime failure, not a
- prompt failure.
- - Implement `stop(session, reason)` for expensive or long-lived backends.
-
- For eval and optimization:
-
- - Attach readiness decisions and stream session metadata to `RunRecord.raw`.
- - Classify missing knowledge/runtime/session failures separately from prompt or
- reasoning failures.
- - Do not optimize prompts when dominant failures are missing context, bad
- retrieval, missing credentials, or broken backend resume.
- - Add report slices by `backend`, `session_resumed`, `backend_error`, and
- `readiness_end.decision.status`.
-
- ## Completed Downstream Proof
-
- `agent-builder` has a product-path proof in PR #61:
-
- - Bumps `@tangle-network/agent-runtime` to `^0.5.0`.
- - Routes sandbox chat through `runAgentTaskStream`.
- - Uses `createSandboxPromptBackend`.
- - Emits sanitized runtime stream SSE.
- - Adds runtime session IDs to the compatibility `done` event.
-
- That validates the package against a real sandbox-backed product route, not only
- unit tests.
-
- ## Review Notes
-
- Validation found and fixed two issues before marking this complete:
-
- - The control-loop preflight path needed explicit coverage that
- `questions_end` is emitted exactly once.
- - The CLI bridge parser claim needed hardening. `0.5.2` tested NDJSON bridge
- streams instead of only SSE-style `data:` frames. `0.6.0` then removed the
- bespoke `createCliBridgeBackend` entirely after confirming cli-bridge is
- purely OpenAI-compatible at `/v1/chat/completions`; consumers now use
- `createOpenAICompatibleBackend`.
-
- The doc now matches shipped behavior.
-
- ## Downstream Rollout Plan
-
- This is downstream adoption work, not missing kernel work:
-
- 1. Add durable `RuntimeSessionStore` implementations in product repos.
- 2. Convert CLI bridge routes/harnesses to `createOpenAICompatibleBackend`
- pointed at the bridge's `/v1` URL (cli-bridge is OpenAI-compatible).
- 3. Convert simple TCloud chat routes to `createOpenAICompatibleBackend` where
- useful.
- 4. Store runtime stream events in product trace/run-record tables.
- 5. Add UI affordances for session resume/continuation and readiness blockers.
- 6. Extend failure classifiers to consume `RuntimeStreamEvent` evidence directly.
-
- The kernel is complete and ready for broad adoption. The next value comes from
- removing bespoke product stream loops and using the same runtime contract across
- product routes, coding harnesses, CLI bridge runs, evals, and optimization
- reports.