@a5c-ai/kradle 5.0.1-staging.3abdf9534c25

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (295)
  1. package/Dockerfile +31 -0
  2. package/README.md +187 -0
  3. package/bin/kradle-demo.mjs +23 -0
  4. package/bin/kradle-server.mjs +14 -0
  5. package/dist/kradle-controller-ui.json +3482 -0
  6. package/dist/kradle-lifecycle.json +201 -0
  7. package/dist/kradle-runtime-snapshot.json +3125 -0
  8. package/dist/kradle-summary.json +724 -0
  9. package/docs/README.md +61 -0
  10. package/docs/agents/README.md +83 -0
  11. package/docs/agents/acceptance-test-matrix.md +193 -0
  12. package/docs/agents/agent-mux-adapter-contract.md +167 -0
  13. package/docs/agents/agent-mux-source-map.md +310 -0
  14. package/docs/agents/agent-run-memory-import-spec.md +256 -0
  15. package/docs/agents/agent-stack-management-spec.md +421 -0
  16. package/docs/agents/api-contract-spec.md +309 -0
  17. package/docs/agents/artifacts-writeback-spec.md +145 -0
  18. package/docs/agents/chart-packaging-spec.md +128 -0
  19. package/docs/agents/ci-orchestration-spec.md +140 -0
  20. package/docs/agents/context-assembly-spec.md +219 -0
  21. package/docs/agents/controller-reconciliation-spec.md +255 -0
  22. package/docs/agents/crd-schema-spec.md +315 -0
  23. package/docs/agents/decision-log-open-questions.md +169 -0
  24. package/docs/agents/developer-implementation-checklist.md +329 -0
  25. package/docs/agents/dispatching-design.md +262 -0
  26. package/docs/agents/gaps-agent-mux-to-kradle-crds.md +298 -0
  27. package/docs/agents/glossary.md +66 -0
  28. package/docs/agents/implementation-blueprint.md +324 -0
  29. package/docs/agents/implementation-rollout-slices.md +251 -0
  30. package/docs/agents/memory-context-integration-spec.md +194 -0
  31. package/docs/agents/memory-ontology-schema-spec.md +253 -0
  32. package/docs/agents/memory-operations-runbook.md +121 -0
  33. package/docs/agents/mvp-vertical-slice-spec.md +146 -0
  34. package/docs/agents/observability-audit-spec.md +265 -0
  35. package/docs/agents/operator-runbook.md +174 -0
  36. package/docs/agents/org-memory-api-payload-examples.md +333 -0
  37. package/docs/agents/org-memory-controller-sequence-spec.md +181 -0
  38. package/docs/agents/org-memory-e2e-fixture-plan.md +161 -0
  39. package/docs/agents/org-memory-ui-implementation-map.md +114 -0
  40. package/docs/agents/org-memory-vertical-slice-spec.md +168 -0
  41. package/docs/agents/org-resource-model-delta-spec.md +111 -0
  42. package/docs/agents/org-route-resource-model-spec.md +183 -0
  43. package/docs/agents/org-scoping-namespace-spec.md +114 -0
  44. package/docs/agents/rbac-secrets-management-spec.md +406 -0
  45. package/docs/agents/repository-page-integration-spec.md +255 -0
  46. package/docs/agents/resource-contract-examples.md +808 -0
  47. package/docs/agents/resource-relationship-map.md +190 -0
  48. package/docs/agents/security-threat-model.md +188 -0
  49. package/docs/agents/shared-memory-company-brain-spec.md +358 -0
  50. package/docs/agents/storage-migration-spec.md +168 -0
  51. package/docs/agents/subagent-orchestration-spec.md +152 -0
  52. package/docs/agents/system-overview.md +88 -0
  53. package/docs/agents/tools-mcp-skills-spec.md +189 -0
  54. package/docs/agents/traceability-matrix.md +79 -0
  55. package/docs/agents/ui-flow-spec.md +211 -0
  56. package/docs/agents/ui-ux-system-spec.md +426 -0
  57. package/docs/agents/workspace-lifecycle-spec.md +166 -0
  58. package/docs/architecture-spec.md +78 -0
  59. package/docs/architecture-v2.md +2759 -0
  60. package/docs/components/control-plane.md +78 -0
  61. package/docs/components/data-plane.md +69 -0
  62. package/docs/components/hooks-events.md +67 -0
  63. package/docs/components/identity-rbac-policy.md +73 -0
  64. package/docs/components/kubevela-oam.md +70 -0
  65. package/docs/components/operations-publishing.md +81 -0
  66. package/docs/components/runners-ci.md +66 -0
  67. package/docs/components/web-ui.md +94 -0
  68. package/docs/crd-behaviors-and-relationships.md +3926 -0
  69. package/docs/external/README.md +47 -0
  70. package/docs/external/bidirectional-sync-design.md +134 -0
  71. package/docs/external/cicd-interface.md +64 -0
  72. package/docs/external/external-backend-controllers.md +170 -0
  73. package/docs/external/external-backend-crds.md +234 -0
  74. package/docs/external/external-backend-ui-spec.md +151 -0
  75. package/docs/external/external-backend-ux-flows.md +115 -0
  76. package/docs/external/external-object-mapping.md +125 -0
  77. package/docs/external/git-forge-interface.md +68 -0
  78. package/docs/external/github-integration-design.md +151 -0
  79. package/docs/external/issue-tracking-interface.md +66 -0
  80. package/docs/external/provider-capability-manifests.md +204 -0
  81. package/docs/external/provider-catalog.md +139 -0
  82. package/docs/external/provider-rollout-testing.md +78 -0
  83. package/docs/external/research-results.md +48 -0
  84. package/docs/external/security-auth-permissions.md +81 -0
  85. package/docs/external/sync-state-machines.md +108 -0
  86. package/docs/external/unified-external-backend-model.md +107 -0
  87. package/docs/external/user-facing-changes.md +67 -0
  88. package/docs/gaps.md +161 -0
  89. package/docs/install.md +94 -0
  90. package/docs/integration-and-design-decisions.md +1530 -0
  91. package/docs/kradle-design.md +334 -0
  92. package/docs/local-minikube.md +55 -0
  93. package/docs/ontology/README.md +32 -0
  94. package/docs/ontology/bounded-contexts.md +29 -0
  95. package/docs/ontology/events-and-hooks.md +32 -0
  96. package/docs/ontology/oam-kubevela.md +32 -0
  97. package/docs/ontology/operations-and-release.md +25 -0
  98. package/docs/ontology/personas-and-actors.md +32 -0
  99. package/docs/ontology/policies-and-invariants.md +33 -0
  100. package/docs/ontology/problem-space.md +30 -0
  101. package/docs/ontology/resource-contracts.md +40 -0
  102. package/docs/ontology/resource-taxonomy.md +42 -0
  103. package/docs/ontology/runners-and-ci.md +29 -0
  104. package/docs/ontology/solution-space.md +24 -0
  105. package/docs/ontology/storage-and-data-boundaries.md +29 -0
  106. package/docs/ontology/validation-matrix.md +24 -0
  107. package/docs/ontology/web-ui-excellent-flows.md +32 -0
  108. package/docs/ontology/workflows.md +39 -0
  109. package/docs/ontology/world.md +35 -0
  110. package/docs/openapi.yaml +1291 -0
  111. package/docs/product-requirements.md +62 -0
  112. package/docs/requirements-v2.md +235 -0
  113. package/docs/roadmap-mvp.md +87 -0
  114. package/docs/sdk-api-reference.md +1108 -0
  115. package/docs/system-requirements.md +90 -0
  116. package/docs/system-spec-v2.md +1230 -0
  117. package/docs/tests/README.md +53 -0
  118. package/docs/tests/agent-qa-plan.md +63 -0
  119. package/docs/tests/browser-ui-tests.md +62 -0
  120. package/docs/tests/ci-quality-gates.md +48 -0
  121. package/docs/tests/coverage-model.md +64 -0
  122. package/docs/tests/e2e-scenario-tests.md +53 -0
  123. package/docs/tests/fixtures-test-data.md +63 -0
  124. package/docs/tests/observability-reliability-tests.md +54 -0
  125. package/docs/tests/product-test-matrix.md +145 -0
  126. package/docs/tests/qa-adoption-roadmap.md +130 -0
  127. package/docs/tests/qa-automation-plan.md +101 -0
  128. package/docs/tests/security-compliance-tests.md +57 -0
  129. package/docs/tests/test-framework-tools.md +88 -0
  130. package/docs/tests/test-suite-layout.md +121 -0
  131. package/docs/tests/unit-integration-tests.md +48 -0
  132. package/docs/todo-kyverno +714 -0
  133. package/docs/todos.md +4 -0
  134. package/docs/user-stories.md +78 -0
  135. package/docs/web-console-spec.md +533 -0
  136. package/examples/minikube-demo.yaml +190 -0
  137. package/examples/oam-application.yaml +23 -0
  138. package/examples/policy-kyverno-pr-title.yaml +18 -0
  139. package/package.json +66 -0
  140. package/scripts/build.mjs +29 -0
  141. package/scripts/setup-minikube.mjs +65 -0
  142. package/scripts/smoke.mjs +37 -0
  143. package/scripts/validate-doc-coverage.mjs +152 -0
  144. package/scripts/validate-package.mjs +95 -0
  145. package/scripts/validate-ui.mjs +305 -0
  146. package/src/agent-adapter-controller.js +169 -0
  147. package/src/agent-approval-controller.js +170 -0
  148. package/src/agent-context-bundles.js +242 -0
  149. package/src/agent-dispatch-controller.js +549 -0
  150. package/src/agent-gateway-config-controller.js +147 -0
  151. package/src/agent-identity-migration.js +115 -0
  152. package/src/agent-memory-controller.js +357 -0
  153. package/src/agent-memory-import.js +327 -0
  154. package/src/agent-memory-query.js +292 -0
  155. package/src/agent-memory-repository-source-controller.js +255 -0
  156. package/src/agent-mux-client.js +589 -0
  157. package/src/agent-permission-review.js +250 -0
  158. package/src/agent-persona-controller.js +135 -0
  159. package/src/agent-project-controller.js +117 -0
  160. package/src/agent-prompt-composition.js +55 -0
  161. package/src/agent-provider-config-controller.js +151 -0
  162. package/src/agent-secret-config-grant-controller.js +282 -0
  163. package/src/agent-session-transcript-controller.js +189 -0
  164. package/src/agent-stack-controller.js +421 -0
  165. package/src/agent-subagent-controller.js +160 -0
  166. package/src/agent-transport-binding-controller.js +121 -0
  167. package/src/agent-trigger-controller.js +387 -0
  168. package/src/agent-workspace-controller.js +702 -0
  169. package/src/agent-writeback-controller.js +302 -0
  170. package/src/api-controller.js +621 -0
  171. package/src/argocd-gitops.js +43 -0
  172. package/src/artifact-registry-controller.js +542 -0
  173. package/src/assistant-runtime.js +284 -0
  174. package/src/async-controller.js +207 -0
  175. package/src/audit-controller.js +191 -0
  176. package/src/auth.js +310 -0
  177. package/src/component-catalog.js +41 -0
  178. package/src/control-plane.js +136 -0
  179. package/src/controller-client.js +112 -0
  180. package/src/controller-ui.js +620 -0
  181. package/src/data-plane.js +179 -0
  182. package/src/event-bus.js +397 -0
  183. package/src/external/conflict-controller.js +225 -0
  184. package/src/external/github/auth.js +96 -0
  185. package/src/external/github/cicd.js +180 -0
  186. package/src/external/github/git-forge.js +240 -0
  187. package/src/external/github/index.js +144 -0
  188. package/src/external/github/issue-tracking.js +163 -0
  189. package/src/external/provider-adapter.js +161 -0
  190. package/src/external/provider-resource-factory.js +221 -0
  191. package/src/external/sync-controller.js +235 -0
  192. package/src/external/webhook-controller.js +144 -0
  193. package/src/external/write-controller.js +283 -0
  194. package/src/gitea-backend.js +131 -0
  195. package/src/gitea-service.js +173 -0
  196. package/src/handoff.js +98 -0
  197. package/src/health-probes.js +134 -0
  198. package/src/hooks-events.js +63 -0
  199. package/src/hooks-lifecycle.js +117 -0
  200. package/src/http-server.js +409 -0
  201. package/src/identity-policy.js +86 -0
  202. package/src/index.js +71 -0
  203. package/src/jitsi-agent-bridge.js +141 -0
  204. package/src/jitsi-meeting-controller.js +291 -0
  205. package/src/jitsi-sync-controller.js +198 -0
  206. package/src/kradle-inference-service-controller.js +246 -0
  207. package/src/kubernetes-controller-async.js +531 -0
  208. package/src/kubernetes-controller.js +904 -0
  209. package/src/kubernetes-resource-gateway.js +48 -0
  210. package/src/model-route-controller.js +364 -0
  211. package/src/notification-controller.js +178 -0
  212. package/src/operations.js +112 -0
  213. package/src/org-scoping.js +5 -0
  214. package/src/resource-model.js +282 -0
  215. package/src/runner-controller.js +272 -0
  216. package/src/runners-ci.js +48 -0
  217. package/src/runtime.js +196 -0
  218. package/src/snapshot-cache.js +157 -0
  219. package/src/virtual-model-controller.js +538 -0
  220. package/src/virtual-model-hook-bridge.js +200 -0
  221. package/src/web-ui.js +40 -0
  222. package/tests/agent-adapter-controller.test.js +361 -0
  223. package/tests/agent-approval-controller.test.js +173 -0
  224. package/tests/agent-context-bundles.test.js +278 -0
  225. package/tests/agent-dispatch-controller.test.js +679 -0
  226. package/tests/agent-gateway-config-controller.test.js +386 -0
  227. package/tests/agent-identity-migration.test.js +87 -0
  228. package/tests/agent-memory-controller.test.js +461 -0
  229. package/tests/agent-memory-import-snapshot.test.js +477 -0
  230. package/tests/agent-memory-query.test.js +404 -0
  231. package/tests/agent-memory-repository-source.test.js +514 -0
  232. package/tests/agent-mux-client.test.js +389 -0
  233. package/tests/agent-mux-integration.test.js +971 -0
  234. package/tests/agent-permission-review-v2.test.js +317 -0
  235. package/tests/agent-permission-review.test.js +209 -0
  236. package/tests/agent-persona-controller.test.js +127 -0
  237. package/tests/agent-project-controller.test.js +302 -0
  238. package/tests/agent-prompt-composition.test.js +76 -0
  239. package/tests/agent-provider-config-controller.test.js +376 -0
  240. package/tests/agent-resources.test.js +303 -0
  241. package/tests/agent-secret-config-grant.test.js +231 -0
  242. package/tests/agent-session-transcript-controller.test.js +499 -0
  243. package/tests/agent-stack-controller.test.js +283 -0
  244. package/tests/agent-subagent-controller.test.js +201 -0
  245. package/tests/agent-transport-binding-controller.test.js +294 -0
  246. package/tests/agent-trigger-controller.test.js +271 -0
  247. package/tests/agent-trigger-routes.test.js +190 -0
  248. package/tests/agent-trigger-sources.test.js +245 -0
  249. package/tests/agent-workspace-controller.test.js +181 -0
  250. package/tests/agent-writeback.test.js +292 -0
  251. package/tests/approval-persistence.test.js +171 -0
  252. package/tests/artifact-registry.test.js +511 -0
  253. package/tests/assistant-runtime.test.js +506 -0
  254. package/tests/async-controller.test.js +252 -0
  255. package/tests/audit-controller.test.js +227 -0
  256. package/tests/codespace-controller.test.js +318 -0
  257. package/tests/controller-client.test.js +133 -0
  258. package/tests/deployment.test.js +527 -0
  259. package/tests/e2e/lifecycle.test.js +120 -0
  260. package/tests/event-bus-integration.test.js +355 -0
  261. package/tests/external-github-forge.test.js +560 -0
  262. package/tests/external-github-issues-cicd.test.js +520 -0
  263. package/tests/external-integration.test.js +470 -0
  264. package/tests/external-persistence.test.js +415 -0
  265. package/tests/external-provider-adapter.test.js +365 -0
  266. package/tests/external-resource-model.test.js +223 -0
  267. package/tests/external-webhook-sync.test.js +287 -0
  268. package/tests/external-write-conflict.test.js +353 -0
  269. package/tests/gitea-service.test.js +253 -0
  270. package/tests/health-check-real.test.js +165 -0
  271. package/tests/health-probes.test.js +90 -0
  272. package/tests/hooks-lifecycle.test.js +364 -0
  273. package/tests/integration/full-flow.test.js +266 -0
  274. package/tests/jitsi-agent-bridge.test.js +119 -0
  275. package/tests/jitsi-helm-integration.test.js +77 -0
  276. package/tests/jitsi-meeting-controller.test.js +170 -0
  277. package/tests/jitsi-resource-model.test.js +73 -0
  278. package/tests/jitsi-sync-controller.test.js +112 -0
  279. package/tests/kradle-inference-service.test.js +689 -0
  280. package/tests/kradle.test.js +779 -0
  281. package/tests/memory-search-wiring.test.js +270 -0
  282. package/tests/model-route-controller.test.js +733 -0
  283. package/tests/notification-controller.test.js +196 -0
  284. package/tests/notification-integration.test.js +179 -0
  285. package/tests/org-scoping.test.js +687 -0
  286. package/tests/runner-controller.test.js +327 -0
  287. package/tests/runner-integration.test.js +231 -0
  288. package/tests/session-cookie-hmac.test.js +151 -0
  289. package/tests/snapshot-performance.test.js +315 -0
  290. package/tests/sse-events.test.js +107 -0
  291. package/tests/virtual-model-controller.test.js +877 -0
  292. package/tests/virtual-model-hook-bridge.test.js +384 -0
  293. package/tests/webhook-trigger.test.js +198 -0
  294. package/tests/workspace-volumes.test.js +312 -0
  295. package/tests/writeback-persistence.test.js +207 -0
@@ -0,0 +1,1530 @@
1
+ # Integration & Design Decisions
2
+
3
+ Supplementary specification covering external dependencies, scope boundaries,
4
+ architectural trade-offs, and system nuances for the Kradle project.
5
+
6
+ ---
7
+
8
+ ## 1. External Dependencies & Integration Points
9
+
10
+ ### 1.1 Agent-Mux Dependency
11
+
12
+ #### What Kradle Imports from @a5c-ai/agent-mux
13
+
14
+ Kradle does not import agent-mux as an npm dependency. Instead, it communicates
15
+ with the agent-mux gateway over HTTP. The integration surface lives entirely
16
+ within `core/src/agent-mux-client.js`, which provides:
17
+
18
+ - **queryCapabilities(adapter)** -- GET `/api/v1/agents/{adapter}/capabilities`
19
+ - **launchSession({stack, contextBundle, permissionSnapshot, workspace})** -- POST `/api/v1/sessions`
20
+ - **getSessionStatus(sessionId)** -- GET `/api/v1/sessions/{sessionId}`
21
+ - **subscribeToEvents(runId, callback)** -- GET `/api/v1/runs/{runId}/events` (SSE)
22
+ - **reconcileTranscript(sessionId, events, options)** -- local data transformation
23
+
24
+ The boundary declaration (`AGENT_MUX_CLIENT_BOUNDARY`) explicitly states:
25
+ - Owns: gateway HTTP calls, SSE event streaming, transcript reconciliation
26
+ - Delegates to: resource-model (for creating AgentSessionTranscript resources)
27
+ - Must not own: secret values, permission review, resource persistence
28
+
29
+ #### How agent-mux-client.js Connects to the Gateway
30
+
31
+ The client connects over HTTP(S) only, using the Node.js built-in `node:http` and
32
+ `node:https` modules, with no external fetch or HTTP client dependencies. The internal
33
+ `httpRequest()` helper performs raw `transport.request()` calls with JSON serialization.
34
+
35
+ Connection parameters:
36
+ - `gateway` string -- full base URL (e.g. `http://agent-mux-gateway:8080`)
37
+ - `enabled` boolean -- client methods return `null` early when disabled
38
+ - Timeout: 30s default per request
39
+ - Protocol: auto-detected from URL scheme (http:// vs https://)
40
+
41
+ SSE streaming uses a persistent HTTP connection with:
42
+ - Reconnection via exponential backoff (1s, 2s, 4s... capped at 30s)
43
+ - Backoff reset on successful connection establishment
44
+ - Graceful abort via returned `{ abort }` handle
45
+ - Buffer-based SSE parsing (splits on `\n\n`, extracts `data:` lines)
46
+
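The reconnection schedule and buffer-based parsing described above can be sketched as follows. This is a minimal illustration, not the code in `agent-mux-client.js`: the helper names `backoffDelayMs` and `parseSseChunk` are hypothetical, and joining multiple `data:` lines with a newline is an assumption borrowed from the general SSE convention.

```javascript
// Hypothetical helper: delay before reconnect attempt `attempt` (0-based).
// Doubles from baseMs (1s, 2s, 4s, ...) and is capped at capMs (30s).
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Hypothetical helper: append a chunk to the pending buffer, split on the
// blank-line separator, and return the completed events plus the remainder.
function parseSseChunk(buffer, chunk) {
  const parts = (buffer + chunk).split('\n\n');
  const rest = parts.pop(); // trailing partial event stays buffered
  const events = [];
  for (const block of parts) {
    const data = block
      .split('\n')
      .filter((line) => line.startsWith('data:'))
      .map((line) => line.slice(5).trimStart())
      .join('\n');
    if (data) events.push(data);
  }
  return { events, rest };
}
```

A caller would reset the attempt counter to 0 once a connection is established, matching the backoff-reset behavior listed above.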
47
+ #### What Works WITHOUT Agent-Mux
48
+
49
+ The following subsystems are fully operational without agent-mux:
50
+
51
+ 1. **Resource CRUD** -- All 76 resource kinds can be created, listed, updated, deleted via kubectl
52
+ 2. **Web Console** -- All 57 pages render; agent dispatch pages show "gateway unavailable" state
53
+ 3. **Auth & Sessions** -- OAuth login, cookie sessions, delegated identity all functional
54
+ 4. **Project Management** -- KradleProject, Issue, PullRequest lifecycle fully local
55
+ 5. **Memory System** -- AgentMemoryRepository, Source, Ontology, Query, Import all operate on CRDs
56
+ 6. **Workspace Provisioning** -- KradleWorkspace PVC specs, git ops, codespace generation
57
+ 7. **External Backends** -- Provider registration, webhook delivery, sync events
58
+ 8. **Audit Logging** -- Audit controller records all mutations regardless of agent-mux state
59
+ 9. **Policy Engine** -- Kyverno integration, PolicyProfile, PolicyBinding
60
+ 10. **Runner Pool Management** -- RunnerPool specs, scheduling policies
61
+ 11. **Notification System** -- Resource-change notifications via event bus
62
+ 12. **MCP Server** -- All 14 tools operational; `kradle_dispatch_agent` returns error status
63
+
64
+ #### What REQUIRES Agent-Mux
65
+
66
+ The following operations fail gracefully (return null/error) without a running gateway:
67
+
68
+ 1. **Real Session Creation** -- `launchSession()` returns null; AgentSession resource created but status stays `Pending`
69
+ 2. **Event Streaming** -- `subscribeToEvents()` keeps retrying indefinitely, with the reconnect backoff capped at 30s between attempts
70
+ 3. **Transcript Reconciliation** -- No events to reconcile; AgentSessionTranscript never moves to `Reconciled` phase
71
+ 4. **Adapter Capability Discovery** -- `queryCapabilities()` returns null; stack readiness condition degraded
72
+ 5. **Live Session Status** -- `getSessionStatus()` returns null; UI shows "status unavailable"
73
+ 6. **Cost Calculation** -- Token usage comes from SSE events; without them, cost fields remain zero
74
+
75
+ #### Gateway URL Configuration
76
+
77
+ The gateway URL flows through the system as follows:
78
+
79
+ ```
80
+ Helm values.yaml:
81
+   agents:
82
+     agentMux:
83
+       enabled: false    # Must be true for integration
84
+       gateway: ""       # Full URL to agent-mux gateway
85
+
86
+ --> Rendered into Deployment env:
87
+       KRADLE_AGENT_MUX_ENABLED=true
88
+       KRADLE_AGENT_MUX_GATEWAY=http://agent-mux-gateway:8080
89
+
90
+ --> Read by API controller initialization:
91
+       createAgentMuxClient({
92
+         gateway: process.env.KRADLE_AGENT_MUX_GATEWAY || '',
93
+         enabled: process.env.KRADLE_AGENT_MUX_ENABLED === 'true'
94
+       })
95
+
96
+ --> Persisted in AgentGatewayConfig CRD:
97
+       spec.gatewayUrl (for UI display and runtime reconfiguration)
98
+ ```
99
+
100
+ The web container does NOT communicate with agent-mux directly. All agent
101
+ operations route through the API container's `/api/agents/*` endpoints, which
102
+ delegate to the mux client instance.
103
+
104
+ #### Runtime Availability Check
105
+
106
+ ```javascript
107
+ client.isAvailable() // returns: enabled && !!gateway
108
+ ```
109
+
110
+ Every client method checks `isAvailable()` first and returns `null` immediately
111
+ when the gateway is not configured. This prevents connection errors from
112
+ propagating into the resource reconciliation loops.
113
+
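A minimal sketch of this guard, assuming a simplified client shape. Only the `isAvailable()` expression and the two environment variable names come from the text above; `createClientSketch`, `clientOptionsFromEnv`, and the injected `request` function are hypothetical stand-ins for illustration.

```javascript
// Hypothetical, simplified factory. `request` is an injected function that
// would perform the HTTP call; it is never invoked when unavailable.
function createClientSketch({ gateway = '', enabled = false, request } = {}) {
  const isAvailable = () => enabled && !!gateway;
  return {
    isAvailable,
    async getSessionStatus(sessionId) {
      if (!isAvailable()) return null; // short-circuit: no network call
      return request('GET', `${gateway}/api/v1/sessions/${encodeURIComponent(sessionId)}`);
    },
  };
}

// Mirrors the env parsing shown in the configuration flow: the flag must be
// the literal string 'true', and a missing gateway becomes the empty string.
function clientOptionsFromEnv(env) {
  return {
    gateway: env.KRADLE_AGENT_MUX_GATEWAY || '',
    enabled: env.KRADLE_AGENT_MUX_ENABLED === 'true',
  };
}
```

Because the guard returns `null` rather than throwing, callers have to treat `null` as "gateway not configured" and keep it distinct from a real response.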
114
+ ---
115
+
116
+ ### 1.2 Transport-Mux Dependency
117
+
118
+ #### How Transport-Mux Handles Protocol Translation
119
+
120
+ Transport-mux is an external component (not bundled with kradle) that provides
121
+ protocol translation between different agent communication channels:
122
+
123
+ - **stdio** -- stdin/stdout JSON-RPC for local CLI agents
124
+ - **http** -- REST/SSE for cloud-hosted agents
125
+ - **websocket** -- bidirectional streaming for persistent connections
126
+ - **unix** -- Unix domain socket for same-host agents
127
+
128
+ The transport-mux runtime sits between the agent-mux gateway and the actual
129
+ agent process, handling message framing, connection lifecycle, and protocol
130
+ negotiation.
131
+
132
+ #### What Kradle's AgentTransportBinding Models
133
+
134
+ The `AgentTransportBinding` CRD captures connection configuration:
135
+
136
+ ```yaml
137
+ spec:
138
+   adapterRef: "claude-adapter-v1"
138
+   endpoint: "wss://agent.example.com/ws"
139
+   protocol: "websocket"    # One of: stdio, http, websocket, unix
140
+   reconnectPolicy:
141
+     maxRetries: 3
142
+     backoffMs: 1000
143
+     maxBackoffMs: 30000
144
+   auth:
145
+     type: "bearer"
146
+     secretRef: "agent-token-secret"
147
+   healthCheck:
148
+     endpoint: "/health"
149
+     intervalMs: 30000
151
+ ```
152
+
153
+ The controller (`agent-transport-binding-controller.js`) provides:
154
+ - **Validation** -- Ensures required fields present, protocol is one of 4 valid types
155
+ - **Connection Status Tracking** -- Reads `status.connectionStatus` (set externally)
156
+ - **Reconnect Policy Resolution** -- Merges spec overrides with defaults (3 retries, 1s-30s backoff)
157
+
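The validation and reconnect-policy resolution could look roughly like the sketch below. The four protocols and the default values are taken from the text above. The function names and the exact set of required fields (`adapterRef`, `endpoint`, `protocol`) are assumptions, not the controller's confirmed API.

```javascript
const SUPPORTED_PROTOCOLS = ['stdio', 'http', 'websocket', 'unix'];
const DEFAULT_RECONNECT_POLICY = { maxRetries: 3, backoffMs: 1000, maxBackoffMs: 30000 };

// Hypothetical validator: collects errors instead of throwing.
function validateBindingSpec(spec = {}) {
  const errors = [];
  // Assumption: these three fields are the required ones.
  for (const field of ['adapterRef', 'endpoint', 'protocol']) {
    if (!spec[field]) errors.push(`missing required field: ${field}`);
  }
  if (spec.protocol && !SUPPORTED_PROTOCOLS.includes(spec.protocol)) {
    errors.push(`unsupported protocol: ${spec.protocol}`);
  }
  return { valid: errors.length === 0, errors };
}

// Spec overrides win over defaults; unspecified keys fall back to defaults.
function resolveReconnectPolicy(spec = {}) {
  return { ...DEFAULT_RECONNECT_POLICY, ...(spec.reconnectPolicy || {}) };
}
```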
158
+ #### Gap: Kradle Validates Transport but Never Activates the Runtime
159
+
160
+ The transport binding controller is purely declarative. It:
161
+ - Validates the spec shape
162
+ - Reports connection status from the resource's status field
163
+ - Returns supported protocols list
164
+
165
+ It does NOT:
166
+ - Open actual TCP/WebSocket connections
167
+ - Start stdio processes
168
+ - Perform health checks (despite modeling healthCheck in spec)
169
+ - Activate the transport-mux runtime component
170
+ - Signal transport-mux to register a new binding
171
+
172
+ The intent is that transport-mux watches AgentTransportBinding resources via
173
+ Kubernetes watch and self-reconciles. Kradle's role is to persist the desired
174
+ state and present it in the UI.
175
+
176
+ ---
177
+
178
+ ### 1.3 Hooks-Mux Dependency
179
+
180
+ #### What Hooks-Mux Provides
181
+
182
+ The hooks-mux system (external to kradle) provides lifecycle event dispatching
183
+ for agent runs:
184
+
185
+ - `RUN_CREATED` -- Agent dispatch initiated
186
+ - `STEP_STARTED` -- Agent begins a tool use or reasoning step
187
+ - `STEP_COMPLETED` -- Agent finishes a step
188
+ - `APPROVAL_REQUESTED` -- Agent needs human gate
189
+ - `APPROVAL_GRANTED` / `APPROVAL_DENIED` -- Gate resolved
190
+ - `RUN_COMPLETED` -- Agent run finished (success/failure/timeout)
191
+ - `RUN_CANCELLED` -- Agent run externally terminated
192
+
193
+ These events flow to registered webhook subscribers, trigger rules, and
194
+ the notification system.
195
+
196
+ #### What Kradle's Event Bus Does Instead
197
+
198
+ Kradle has its own in-process event bus (`core/src/event-bus.js`) that provides:
199
+
200
+ ```javascript
201
+ const bus = createEventBus();
202
+ bus.subscribe(fn);
203
+ bus.emit({ type: 'resource-change', kind, name, operation, timestamp });
204
+ bus.emitResourceChange('Repository', 'my-repo', 'apply');
205
+ ```
206
+
207
+ This bus handles:
208
+ - Real-time resource change notifications to SSE clients (web console live updates)
209
+ - Cache invalidation signals
210
+ - UI refresh triggers
211
+
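For orientation, an in-process bus with the three calls shown above might be as small as the following sketch. It is not the contents of `event-bus.js`: the name `createEventBusSketch`, the unsubscribe handle returned by `subscribe`, and the ISO-string timestamp are all assumptions.

```javascript
// Hypothetical minimal bus: synchronous fan-out to in-memory subscribers.
function createEventBusSketch() {
  const subscribers = new Set();

  function emit(event) {
    for (const fn of subscribers) fn(event);
  }

  return {
    subscribe(fn) {
      subscribers.add(fn);
      return () => subscribers.delete(fn); // unsubscribe handle (assumption)
    },
    emit,
    // Convenience wrapper that builds the resource-change event shape.
    emitResourceChange(kind, name, operation) {
      emit({
        type: 'resource-change',
        kind,
        name,
        operation,
        timestamp: new Date().toISOString(),
      });
    },
  };
}
```

A design like this keeps delivery synchronous and in memory, which fits the stated use for SSE fan-out and cache invalidation but provides no durability or retries.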
212
+ The event bus is limited to **resource-change events only**. It does not model:
213
+ - Agent lifecycle events (run created/completed)
214
+ - Step-level granularity (tool use, reasoning)
215
+ - Cross-service event routing
216
+ - Durable event delivery with retries
217
+
218
+ #### Gap: No HookDispatcher Integration
219
+
220
+ Kradle has a `WebhookBus` class (`core/src/hooks-events.js`) that handles
221
+ outbound webhook delivery for resource events, but it is NOT connected to
222
+ hooks-mux lifecycle events. Specifically:
223
+
224
+ - `WebhookBus.deliver()` creates `WebhookDelivery` resources for webhook subscribers
225
+ - It does NOT receive or forward agent-lifecycle events from hooks-mux
226
+ - `AgentTriggerRule` resources define event-to-stack routing but the trigger
227
+ evaluation is purely resource-driven (no real-time hook stream)
228
+ - `AgentTriggerExecution` records are created by the trigger controller but
229
+ the trigger is evaluated on resource watch events, not on hooks-mux push
230
+
231
+ The missing integration:
232
+ 1. No hooks-mux webhook receiver endpoint in kradle's HTTP server
233
+ 2. No translation layer from hooks-mux event format to kradle event bus
234
+ 3. No lifecycle event emission when AgentDispatchRun status changes
235
+ 4. No step-level event tracking (only run-level status)
236
+
237
+ ---
238
+
239
+ ### 1.4 Babysitter-SDK Dependency
240
+
241
+ #### The .a5c/processes Pattern
242
+
243
+ Babysitter-SDK uses a file-based process definition pattern:
244
+
245
+ ```
246
+ .a5c/
247
+   processes/
248
+     my-process.json    # Process definition
249
+   runs/
250
+     <runId>/
251
+       journal.json     # Event log
252
+       state.json       # Current state
253
+       effects/         # Effect outputs
254
+ ```
255
+
256
+ Kradle processes (if any) would follow this same pattern. However, kradle does
257
+ not currently define babysitter processes for its own operations. The
258
+ integration point is on the **import** side -- kradle can ingest babysitter
259
+ run artifacts into its memory system.
260
+
261
+ #### How AgentRunMemoryImport Connects to Babysitter Journal Parsing
262
+
263
+ The `agent-memory-import.js` module provides:
264
+
265
+ ```javascript
266
+ parseJournalForImport(journal)
267
+ // Returns: { summary, keyEvents, effectSummary }
268
+ ```
269
+
270
+ This function:
271
+ 1. Accepts a babysitter `.a5c` journal array (raw event objects)
272
+ 2. Extracts structural metadata only (no raw task content, no effect payloads)
273
+ 3. Produces a summary with: runId, processId, eventCount, durationMs, runStatus
274
+ 4. Extracts key events: run_start, task_completed (with effect kind/result), breakpoint, run_end
275
+ 5. Computes effect summary: successCount, failureCount, effectKinds array
276
+
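The five steps above can be sketched as follows. The return shape `{ summary, keyEvents, effectSummary }` and the listed field names come from the text. The journal event fields used here (`type`, `runId`, `processId`, `timestamp`, `status`, `effect.kind`, `effect.status`) are hypothetical, since the actual journal schema is not shown, and `parseJournalSketch` is an illustrative name rather than the module's function.

```javascript
const KEY_EVENT_TYPES = new Set(['run_start', 'task_completed', 'breakpoint', 'run_end']);

// Hypothetical parser: keeps structural metadata only, never effect payloads.
function parseJournalSketch(journal) {
  const events = Array.isArray(journal) ? journal : [];
  const first = events[0] || {};
  const last = events[events.length - 1] || {};
  const end = events.find((e) => e.type === 'run_end');

  const summary = {
    runId: first.runId ?? null,
    processId: first.processId ?? null,
    eventCount: events.length,
    durationMs: events.length ? Date.parse(last.timestamp) - Date.parse(first.timestamp) : 0,
    runStatus: end ? (end.status ?? 'completed') : 'unknown',
  };

  const keyEvents = events
    .filter((e) => KEY_EVENT_TYPES.has(e.type))
    .map((e) => ({
      type: e.type,
      timestamp: e.timestamp,
      // Only effect kind and result are copied; payloads are dropped.
      ...(e.effect ? { effectKind: e.effect.kind, effectResult: e.effect.status } : {}),
    }));

  const completed = events.filter((e) => e.type === 'task_completed' && e.effect);
  const effectSummary = {
    successCount: completed.filter((e) => e.effect.status === 'ok').length,
    failureCount: completed.filter((e) => e.effect.status !== 'ok').length,
    effectKinds: [...new Set(completed.map((e) => e.effect.kind))],
  };

  return { summary, keyEvents, effectSummary };
}
```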
277
+ The `AgentRunMemoryImport` CRD then stores this parsed summary alongside:
278
+ - `memoryRepository` reference (where to store)
279
+ - `source` information (which babysitter run)
280
+ - `include` filters (what to import)
281
+ - Review and redaction status
282
+
283
+ #### The Orchestration Boundary
284
+
285
+ ```
286
+ Babysitter-SDK                     Kradle
287
+ ------------------------------------------------------------------------
288
+ Process definition         <-->    (not used)
289
+ Run lifecycle              <-->    AgentDispatchRun (mirrors status)
290
+ Journal events             --->    AgentRunMemoryImport (parsed summary)
291
+ Session binding            <-->    AgentSession (Kradle projection)
292
+ Hook-driven continuation   <-->    AgentApproval (approval gates)
293
+ ```
294
+
295
+ - **Babysitter owns**: run lifecycle (create, iterate, complete), journal/state, session binding
296
+ - **Kradle owns**: resource state, memory ingestion, audit trail, approval workflows
297
+ - The boundary is clean: babysitter does execution, kradle does desired-state persistence
298
+
299
+ ---
300
+
301
+ ### 1.5 Atlas Dependency
302
+
303
+ #### How the Stack Builder Queries Atlas Graph
304
+
305
+ The web console's stack builder uses Atlas as a knowledge source for populating
306
+ layer options. The query path is:
307
+
308
+ ```
309
+ Browser (stack-builder-graph.jsx)
310
+   --> /api/atlas/search (Next.js route handler)
311
+     --> fetch(ATLAS_BASE_URL + /api/v1/search?q=...&kind=...&limit=...)
312
+     --> fetch(ATLAS_BASE_URL + /api/v1/kinds/{kind}?limit=...)
313
+ ```
314
+
315
+ The proxy route (`web/app/api/atlas/search/route.js`) handles:
316
+ - **Browse mode**: fetches instances by kind (no search query needed)
317
+ - **Search mode**: full-text search via Atlas's Fuse.js-based endpoint
318
+ - **Multi-kind search**: parallel queries per kind, merged and deduplicated by id
319
+
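The multi-kind merge can be illustrated with a short sketch. It assumes each per-kind query resolves to an array of hits carrying an `id` field and that the first occurrence of an id wins. Both are assumptions, as are the names `mergeHitsById`, `searchKinds`, and the injected `searchOneKind` function.

```javascript
// Hypothetical merge step: flatten per-kind results, keep one hit per id.
function mergeHitsById(resultsPerKind) {
  const seen = new Map();
  for (const hits of resultsPerKind) {
    for (const hit of hits) {
      if (!seen.has(hit.id)) seen.set(hit.id, hit); // first occurrence wins (assumption)
    }
  }
  const hits = [...seen.values()];
  return { total: hits.length, hits };
}

// Hypothetical driver: one query per kind, issued in parallel, then merged.
// `searchOneKind(query, kind)` is injected and must resolve to an array of hits.
async function searchKinds(searchOneKind, query, kinds) {
  const perKind = await Promise.all(kinds.map((kind) => searchOneKind(query, kind)));
  return mergeHitsById(perKind);
}
```

The `{ total, hits }` shape matches the response shape the proxy route is described as returning elsewhere in this document.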
320
+ The SDK also provides a direct client (`sdk/src/atlas-graph-client.js`):
321
+ - `fetchAtlasRecordsByKinds(atlasBaseUrl, kinds, options)` -- browse by NodeKind
322
+ - `searchAtlasGraph(atlasBaseUrl, query, options)` -- full-text with optional kind filter
323
+
324
+ #### The 11 Stack Layers + 4 Composition Facets
325
+
326
+ **STACK_LAYERS** (11 layers, each with associated Atlas NodeKinds):
327
+
328
+ | # | Layer | Atlas NodeKinds |
329
+ |---|-------|-----------------|
330
+ | 1 | Model | ModelFamily, ModelVersion, SessionModel |
331
+ | 2 | Provider | Provider, ModelProviderProduct, ModelProviderVersion |
332
+ | 3 | Transport | TransportProtocol, ModelTransportProtocol |
333
+ | 4 | Agent Core | AgentCoreImpl, Capability, CapabilitySupport |
334
+ | 5 | Agent Runtime | AgentProduct, AgentRuntimeImpl, AgentVersion, Subagent |
335
+ | 6 | Agent Platform | AgentPlatformImpl, Platform, PlatformService |
336
+ | 7 | Workspace | Workspace, Project, SharedContextSpec |
337
+ | 8 | Execution | Workflow, LibraryProcess, Phase, HookSurface |
338
+ | 9 | Sandbox | PermissionMode, DeploymentTarget |
339
+ | 10 | Interaction | Tool, ToolDescriptor, ToolServer, PluginArtifact, MCPPrompt |
340
+ | 11 | Presentation | AgentUIImpl, Page, APIEndpoint, Presentation |
341
+
342
+ **COMPOSITION_FACETS** (4 cross-cutting concerns):
343
+
344
+ | Facet | Atlas NodeKinds |
345
+ |-------|-----------------|
346
+ | Roles and Teams | Role, Responsibility, OrgUnit, AgentTeam |
347
+ | Skills and Capabilities | Skill, LibrarySkill, SkillArea, Capability |
348
+ | Evaluation and Governance | Benchmark, TestSet, EvalRun |
349
+ | Environment and Data | StackPart, VectorStore, MemoryStore |
350
+
351
+ #### Atlas Node Kinds Mapped to Stack Builder Layers
352
+
353
+ Each stack builder layer queries Atlas for specific NodeKinds. The mapping
354
+ (`atlasKinds` array per layer) drives which records appear as selectable
355
+ options in the UI. Users compose an AgentStack by picking one or more
356
+ records from each layer.
357
+
358
+ The resolution path:
359
+ 1. User opens stack builder page
360
+ 2. UI requests `/api/atlas/search?mode=browse&kinds=ModelFamily,ModelVersion`
361
+ 3. Route handler fetches from Atlas API
362
+ 4. Results rendered as selectable cards/chips in the layer panel
363
+ 5. User selections flow into AgentStack spec fields
364
+
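Step 2 of the resolution path can be sketched as building the browse URL from a layer's `atlasKinds` array. The `id` and `label` fields and the `browseUrlForLayer` helper are assumptions made for illustration; only `atlasKinds` and the URL shape come from the text above.

```javascript
// Illustrative subset of STACK_LAYERS; the real array has 11 entries.
const STACK_LAYERS = [
  { id: 'model', label: 'Model', atlasKinds: ['ModelFamily', 'ModelVersion', 'SessionModel'] },
  { id: 'sandbox', label: 'Sandbox', atlasKinds: ['PermissionMode', 'DeploymentTarget'] },
];

// Hypothetical helper: encodes each kind separately and joins with literal
// commas, matching the `kinds=ModelFamily,ModelVersion` form shown above.
function browseUrlForLayer(layer) {
  const kinds = layer.atlasKinds.map((kind) => encodeURIComponent(kind)).join(',');
  return `/api/atlas/search?mode=browse&kinds=${kinds}`;
}

console.log(browseUrlForLayer(STACK_LAYERS[1]));
```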
365
+ #### What Happens When Atlas Is Unavailable
366
+
367
+ When Atlas is unreachable:
368
+ - Browse queries return empty arrays `[]` (no crash)
369
+ - Search queries return `{ total: 0, hits: [] }`
370
+ - The proxy route returns `{ total: 0, hits: [], error: <message> }` with status 502
371
+ - Stack builder layers show "No options available" state
372
+ - Users can still manually type stack configuration without Atlas suggestions
373
+ - The `ATLAS_BASE_URL` env var defaults to `https://atlas-staging.a5c.ai`
374
+
375
+ ---
376
+
377
+ ## 2. Scope Boundaries
378
+
379
+ ### 2.1 What Kradle Owns
380
+
381
+ #### Kubernetes CRD Resource Model (76 Kinds)
382
+
383
+ Kradle defines and manages 76 resource kinds across two storage tiers:
384
+
385
+ **Config kinds (44, etcd-stored):**
386
+ Organization, OrgNamespaceBinding, User, Team, Invite, IdentityMapping,
387
+ AuthProvider, Repository, SSHKey, RepositoryPermission, WebhookSubscription,
388
+ RefPolicy, BranchProtection, PolicyProfile, PolicyTemplate, PolicyBinding,
389
+ PolicyExceptionRequest, RunnerPool, View, Selector, AgentStack, AgentSubagent,
390
+ AgentToolProfile, AgentMcpServer, AgentSkill, AgentTriggerRule, AgentContextLabel,
391
+ KradleWorkspacePolicy, AgentServiceAccount, AgentRoleBinding, AgentSecretGrant,
392
+ AgentConfigGrant, AgentAdapter, AgentTransportBinding, AgentProviderConfig,
393
+ KradleProject, AgentGatewayConfig, AgentMemoryRepository, AgentMemorySource,
394
+ AgentMemoryOntology, AgentMemoryAssociation, KradleWorkspace,
395
+ ExternalBackendProvider, ExternalBackendBinding, ExternalBackendSyncPolicy,
396
+ ExternalProviderCapabilityManifest
397
+
398
+ **Aggregated kinds (32, postgres-stored):**
399
+ PullRequest, Issue, Review, Pipeline, Job, WebhookDelivery, AgentDispatchRun,
400
+ AgentDispatchAttempt, AgentSession, AgentContextBundle, KradleArtifact,
401
+ AgentApproval, AgentTriggerExecution, AgentCapabilityRequirement,
402
+ WorkItemSessionLink, WorkItemWorkspaceLink, AgentSessionTranscript,
403
+ AgentSessionAttachment, KradleWorkspaceRuntime, AgentMemorySnapshot,
404
+ AgentMemoryQuery, AgentMemoryUpdate, AgentRunMemoryImport,
405
+ ExternalWebhookDelivery, ExternalSyncEvent, ExternalSyncState,
406
+ ExternalWriteIntent, ExternalSyncConflict, ExternalObjectLink
407
+
408
+ #### Resource CRUD via kubectl
409
+
410
+ All resource operations use `spawnSync('kubectl', ...)` or async `spawn('kubectl', ...)`:
411
+ - `kubectl get <resource> -n <namespace> -o json`
412
+ - `kubectl apply -f - -o json` (with JSON manifest piped to stdin)
413
+ - `kubectl delete <resource> <name> -n <namespace>`
414
+ - `kubectl get <resource> --watch -o json` (for live updates)
415
+
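The four command shapes above map onto argument arrays roughly as follows. This is a sketch under assumptions: `kubectlArgs` is a hypothetical helper, and the `agentstacks` resource name in the usage line is an assumed plural.

```javascript
// Hypothetical helper: returns the argv array for each operation listed above.
function kubectlArgs(operation, { resource, name, namespace } = {}) {
  switch (operation) {
    case 'list':
      return ['get', resource, '-n', namespace, '-o', 'json'];
    case 'apply':
      // The manifest itself is not an argument; it is piped to stdin.
      return ['apply', '-f', '-', '-o', 'json'];
    case 'delete':
      return ['delete', resource, name, '-n', namespace];
    case 'watch':
      return ['get', resource, '--watch', '-o', 'json'];
    default:
      throw new Error(`Unsupported operation: ${operation}`);
  }
}

console.log(kubectlArgs('list', { resource: 'agentstacks', namespace: 'kradle-org-default' }).join(' '));
```

With `child_process.spawnSync`, the stdin manifest for `apply` would go in the `input` option, for example `spawnSync('kubectl', kubectlArgs('apply'), { input: manifestJson, encoding: 'utf8' })`.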
416
+ #### Web Console (57+ Pages)
417
+
418
+ Seven page modules:
419
+ 1. **agent** -- Stacks, dispatch, sessions, triggers, memory, projects, adapters, workspaces
420
+ 2. **repo** -- Repositories, pull requests, issues, reviews, pipelines
421
+ 3. **manage** -- Organizations, users, teams, invites, identity mappings
422
+ 4. **settings** -- Auth providers, webhooks, runners, policies, views
423
+ 5. **external** -- Backend providers, bindings, sync policies, conflicts
424
+ 6. **lib/kradle-ui** -- Shared UI components (tables, forms, modals, badges)
425
+ 7. **lib/page-frame** -- Layout shell, navigation, breadcrumbs
426
+
427
+ #### Auth (OAuth, Session Cookies, Middleware)
428
+
429
+ - GitHub OAuth (authorization code flow)
430
+ - Workspace SSO (OIDC authorization code flow)
431
+ - Delegated identity (proxy headers: x-forwarded-user/groups/email)
432
+ - Local development bypass (auto-login for localhost)
433
+ - HMAC-SHA256 signed session cookies
434
+ - `parseSessionCookie` / `createSessionCookie` with timing-safe verification
435
+
436
+ #### Workspace Provisioning
437
+
438
+ - `KradleWorkspace` CRD: PVC specs, volume lifecycle, repository binding
439
+ - `KradleWorkspacePolicy` CRD: trust tiers, cleanup retention, provisioning mode
440
+ - `KradleWorkspaceRuntime` CRD: process status, environment, preview URLs
441
+ - Git worktree integration specs (branch/commit binding)
442
+ - Runner mount specifications
443
+
444
+ #### Memory System
445
+
446
+ - `AgentMemoryRepository`: org-level git pointer with layout profile and index policy
447
+ - `AgentMemorySource`: read policy for paths/kinds per repository, team, stack, or trigger
448
+ - `AgentMemoryOntology`: ontology with nodeKinds, edgeKinds, controlled vocabulary
449
+ - `AgentMemoryQuery`: graph/grep retrieval with ranking metadata
450
+ - `AgentRunMemoryImport`: babysitter run ingestion with redaction and review
451
+ - `AgentMemorySnapshot`: dispatch-time pin with resolved commit and digests
452
+ - `AgentMemoryUpdate`: reviewable proposed mutations with branch and validation
453
+ - `AgentMemoryAssociation`: bridge records linking memory to Kradle resources
454
+
455
+ #### External Backend Abstraction
456
+
457
+ - `ExternalBackendProvider`: registration (type, endpoint, auth, capability discovery)
458
+ - `ExternalBackendBinding`: binding to org with credential reference and sync scope
459
+ - `ExternalBackendSyncPolicy`: interval, conflict resolution, field mapping, retry policy
460
+ - `ExternalProviderCapabilityManifest`: discovered API capabilities
461
+ - `ExternalWebhookDelivery`: inbound webhook processing
462
+ - `ExternalSyncEvent`: discrete sync event with dedupe/ordering
463
+ - `ExternalSyncState`: current sync phase per resource
464
+ - `ExternalWriteIntent`: queued write-back with approval state
465
+ - `ExternalSyncConflict`: detected conflicts with resolution outcomes
466
+ - `ExternalObjectLink`: stable local-to-external ID mapping
467
+
468
+ #### Notification System
469
+
470
+ - `globalEventBus` singleton for in-process pub/sub
471
+ - SSE endpoint for real-time web console updates
472
+ - `emitResourceChange(kind, name, operation)` on every mutation
473
+ - Listener registration/deregistration for per-connection subscriptions
474
+
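A minimal sketch of the in-process bus follows. The `ResourceEventBus` class, the event object's fields, and `toSseFrame` are hypothetical; only the `globalEventBus` name and the `emitResourceChange(kind, name, operation)` signature come from the list above.

```javascript
// Hypothetical in-process pub/sub bus with per-connection subscriptions.
class ResourceEventBus {
  #listeners = new Set();

  // Registers a listener and returns a function that deregisters it.
  subscribe(listener) {
    this.#listeners.add(listener);
    return () => this.#listeners.delete(listener);
  }

  emitResourceChange(kind, name, operation) {
    const event = { kind, name, operation, at: Date.now() };
    for (const listener of this.#listeners) listener(event);
  }
}

// Formats one event as a Server-Sent Events frame ("data: ...\n\n").
const toSseFrame = (event) => `data: ${JSON.stringify(event)}\n\n`;

const globalEventBus = new ResourceEventBus();
const received = [];
const unsubscribe = globalEventBus.subscribe((event) => {
  received.push(`${event.operation} ${event.kind}/${event.name}`);
});

globalEventBus.emitResourceChange('AgentStack', 'demo', 'apply');
unsubscribe(); // e.g. when the SSE connection closes
globalEventBus.emitResourceChange('AgentStack', 'demo', 'delete');
console.log(received); // ['apply AgentStack/demo']
```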
475
+ #### Runner Pool Management
476
+
477
+ - `RunnerPool` CRD: capacity (warmReplicas, maxReplicas), cache policy, trust boundary
478
+ - Scheduling policy specs (not execution)
479
+ - Runner identity binding via AgentServiceAccount
480
+
481
+ #### Audit Logging
482
+
483
+ - Audit controller records all resource mutations
484
+ - Queryable via MCP tool (`kradle_audit_query`)
485
+ - Correlation IDs on all snapshot fetches
486
+
487
+ #### MCP Server (14 Tools)
488
+
489
+ Exposed via stdio (`kradle mcp`):
490
+ - `kradle_snapshot`, `kradle_list_resources`, `kradle_get_resource`
491
+ - `kradle_apply_resource`, `kradle_delete_resource`, `kradle_search`
492
+ - `kradle_list_stacks`, `kradle_create_stack`, `kradle_dispatch_agent`
493
+ - `kradle_list_secrets`, `kradle_create_secret`
494
+ - `kradle_sync_external`, `kradle_resolve_conflict`, `kradle_audit_query`
495
+
496
+ ---
497
+
498
+ ### 2.2 What Agent-Mux Owns
499
+
500
+ #### Actual Agent Spawning and Management
501
+
502
+ Agent-mux is responsible for:
503
+ - Instantiating agent processes (Claude, Codex, Gemini, etc.)
504
+ - Managing process lifecycle (start, monitor, terminate)
505
+ - Resource isolation between concurrent agent sessions
506
+ - Process cleanup on timeout or cancellation
507
+
508
+ #### Session Lifecycle
509
+
510
+ - Create session from stack parameters (model, prompt, tools, workspace)
511
+ - Stream events from running session to subscribers
512
+ - Terminate sessions (graceful and forced)
513
+ - Track session state transitions (Pending, Running, Completed, Failed, Cancelled)
514
+
515
+ #### Adapter Registry
516
+
517
+ Agent-mux maintains the runtime adapter registry:
518
+ - Claude adapter (Anthropic API)
519
+ - Codex adapter (OpenAI API)
520
+ - Gemini adapter (Google API)
521
+ - Pi adapter (Inflection API)
522
+ - Custom adapters via plugin system
523
+
524
+ Each adapter implements: capabilities query, session creation, event streaming.
525
+
526
+ #### Transport Codec
527
+
528
+ - Message format translation between kradle's JSON and adapter-native formats
529
+ - Streaming frame encoding (SSE, WebSocket, stdio JSON-RPC)
530
+ - Binary attachment handling
531
+ - Compression negotiation
532
+
533
+ #### Provider Client Instantiation
534
+
535
+ - API key retrieval and injection
536
+ - Base URL resolution per provider
537
+ - Rate limiting enforcement per provider/model combination
538
+ - Retry logic for transient provider failures
539
+
540
+ #### Real-Time Event Streaming
541
+
542
+ - SSE server for `/api/v1/runs/{runId}/events`
543
+ - Event buffering and replay for reconnecting clients
544
+ - Multi-subscriber fanout (multiple UI tabs)
545
+ - Connection keep-alive and heartbeat
546
+
547
+ #### Token Counting and Cost Calculation
548
+
549
+ - Per-message input/output token counting
550
+ - Model-specific pricing application
551
+ - Cumulative cost tracking per session/run
552
+ - Cost breakdown in event payloads (`event.usage.inputTokens`, `event.usage.outputTokens`)
553
+
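Cumulative cost tracking can be sketched from the `event.usage` fields named above. The pricing table, its values, and the `accumulateCost` helper are hypothetical and do not reflect real provider prices.

```javascript
// Hypothetical prices in USD per million tokens; not real provider pricing.
const PRICING = {
  'example-model': { inputPerMTok: 2, outputPerMTok: 10 },
};

// Sums token usage across events and applies the model's price once at the end.
function accumulateCost(events, model, pricing = PRICING) {
  const price = pricing[model];
  if (!price) throw new Error(`No pricing for model: ${model}`);
  const totals = { inputTokens: 0, outputTokens: 0, costUsd: 0 };
  for (const event of events) {
    totals.inputTokens += event.usage?.inputTokens ?? 0; // events without usage count as 0
    totals.outputTokens += event.usage?.outputTokens ?? 0;
  }
  totals.costUsd =
    (totals.inputTokens * price.inputPerMTok + totals.outputTokens * price.outputPerMTok) / 1_000_000;
  return totals;
}

const runTotals = accumulateCost(
  [
    { usage: { inputTokens: 1000, outputTokens: 500 } },
    { usage: { inputTokens: 2000, outputTokens: 0 } },
    {}, // an event with no usage payload
  ],
  'example-model',
);
console.log(runTotals);
```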
554
+ ---
555
+
556
+ ### 2.3 What Babysitter-SDK Owns
557
+
558
+ #### Process Definition and Orchestration
559
+
560
+ - `.a5c/processes/*.json` file format and schema
561
+ - Task graph resolution (dependencies, parallelism)
562
+ - Effect system (what a task produces)
563
+ - Breakpoint system (human gates in execution flow)
564
+
565
+ #### Run Lifecycle
566
+
567
+ - Run creation with process binding
568
+ - Task iteration (next task selection, execution, completion)
569
+ - Run completion detection (all tasks done, failure threshold)
570
+ - Run cancellation and cleanup
571
+
572
+ #### Journal and State Management
573
+
574
+ - Append-only journal (`journal.json`) with typed events
575
+ - Mutable state snapshot (`state.json`) for resumption
576
+ - Effect output persistence (`effects/`)
577
+ - Run metadata (timing, status, error details)
578
+
579
+ #### Session Binding
580
+
581
+ - Binding an agent session to a babysitter run
582
+ - Session-to-task mapping (which session handles which task)
583
+ - Multi-session orchestration (parallel tasks)
584
+
585
+ #### Hook-Driven Continuation
586
+
587
+ - Pre/post task hooks
588
+ - Breakpoint evaluation and resolution
589
+ - External trigger integration (webhook → resume)
590
+ - Timeout-based auto-continuation
591
+
592
+ ---
593
+
594
+ ### 2.4 The Gap Zone
595
+
596
+ The gap zone defines areas where kradle manages the **resource declaration** but
597
+ does not perform the **runtime execution**. This is by design -- kradle is the
598
+ "desired state" layer; execution is delegated to specialized runtimes.
599
+
600
+ #### AgentStack Exists but Isn't Resolved into a Running Adapter
601
+
602
+ - `AgentStack` CRD captures: baseAgent, adapter, model, prompt, tools, MCP servers, skills
603
+ - The stack controller reconciles readiness conditions (capability resolution, MCP health)
604
+ - But it never actually calls agent-mux to instantiate the adapter
605
+ - The gap: creating an AgentStack does not start anything
606
+
607
+ #### AgentDispatchRun Created but Agent Not Spawned by Kradle
608
+
609
+ - `AgentDispatchRun` CRD captures: repository, sourceRefs, agentStack, taskKind
610
+ - Status tracks: Queued, Running, Completed, Failed
611
+ - The dispatch controller creates the resource and validates the spec
612
+ - But the actual agent spawn happens in agent-mux (triggered by external controller)
613
+ - The gap: dispatch resource exists immediately, execution happens later
614
+
615
+ #### KradleWorkspace Generates Pod Specs but Doesn't Execute Them
616
+
617
+ - `KradleWorkspace` CRD captures: repository, volumeSpec, mount paths
618
+ - `KradleWorkspacePolicy` defines provisioning rules and trust tiers
619
+ - The workspace controller generates PVC manifests and mount specifications
620
+ - But it does not create the actual PVC or pod (that's the workspace-provisioner)
621
+ - The gap: workspace spec is declarative; provisioning is separate
622
+
623
+ #### RunnerPool Generates Schedules but Doesn't Provision Runners
624
+
625
+ - `RunnerPool` CRD captures: warmReplicas, maxReplicas, cache policy
626
+ - The runner controller validates pool specs and computes scheduling hints
627
+ - But it does not create actual runner pods or scale deployments
628
+ - The gap: pool definition is intent; ARC (Actions Runner Controller) or
629
+ similar actually provisions the runners
630
+
631
+ #### The Intent
632
+
633
+ This pattern is intentional Kubernetes-native architecture:
634
+ 1. Kradle manages CRD resources as desired state
635
+ 2. Specialized operators/controllers watch these resources
636
+ 3. Operators reconcile desired state into actual state
637
+ 4. Status fields reflect observed state back into CRDs
638
+ 5. Kradle's UI reads status to show current state
639
+
640
+ This separation enables:
641
+ - Independent scaling of control plane vs execution plane
642
+ - Pluggable execution backends (swap agent-mux implementation)
643
+ - GitOps-compatible declarative management
644
+ - Clear audit trail (every intent is a persisted resource)
645
+
646
+ ---
647
+
648
+ ## 3. Architectural Choices & Trade-offs
649
+
650
+ ### 3.1 Why CRD-First (vs Database-First)
651
+
652
+ **Decision:** All state stored as Kubernetes custom resources (CRDs for config
653
+ kinds, aggregated API server for data-plane kinds).
654
+
655
+ **Rationale:**
656
+ - GitOps-compatible: resources can be managed via Argo CD, Flux, or plain kubectl
657
+ - kubectl-native: operators and admins use familiar tooling
658
+ - Declarative: desired state vs imperative mutations
659
+ - No external DB dependency for control plane (etcd comes with K8s)
660
+ - Built-in RBAC: Kubernetes RBAC applies to all resource operations
661
+ - Watch support: built-in change notification for controllers
662
+ - Namespace isolation: multi-tenancy via namespace-per-org
663
+
664
+ **Trade-off:**
665
+ - Slower than direct database queries (kubectl spawnSync overhead)
666
+ - etcd size limits (~1.5MB per resource, cluster-wide storage cap)
667
+ - No complex queries (no JOIN, no WHERE with multiple conditions)
668
+ - No full-text search (client-side filtering only)
669
+ - Pagination via continue tokens (not offset-based)
670
+
671
+ **Mitigation:**
672
+ - Snapshot cache (30s TTL) for dashboard queries
673
+ - Per-org caching reduces repeated cross-namespace queries
674
+ - Background async snapshot refresh (stale-while-revalidate)
675
+ - Aggregated API server pattern for high-volume data (PullRequest, Pipeline, etc.)
676
+ - `getPartialSnapshot()` for pages that only need a subset of kinds
677
+
678
+ **When It Breaks:**
679
+ - Large clusters with 1000+ resources per kind (kubectl list becomes slow)
680
+ - Complex joins needed (e.g. "all runs for repositories owned by team X")
681
+ - Full-text search across resource content (need external search index)
682
+ - High-frequency writes (etcd write throughput is ~10K/s cluster-wide)
683
+ - Time-series data (audit logs, metrics -- needs separate store)
684
+
685
+ ---
686
+
687
+ ### 3.2 Why kubectl spawnSync (vs K8s client-go or @kubernetes/client-node)
688
+
689
+ **Decision:** Shell out to kubectl binary for all Kubernetes operations using
690
+ Node.js `child_process.spawnSync()` and `child_process.spawn()`.
691
+
692
+ **Rationale:**
693
+ - Zero npm dependencies for K8s operations (no `@kubernetes/client-node`)
694
+ - Works with any kubeconfig (user's local config, service account, EKS/GKE auth plugins)
695
+ - kubectl handles all auth complexity (OIDC, exec-based plugins, certificates)
696
+ - No Node.js K8s client compatibility bugs
697
+ - kubectl is battle-tested and tracks K8s API changes (within its supported one-minor-version skew)
698
+ - Debugging: can reproduce any operation by copying the kubectl command
699
+
700
+ **Trade-off:**
701
+ - `spawnSync` blocks the event loop, so a process handles only one kubectl call at a time
702
+ - Cold starts are slow (kubectl binary startup + API server TLS handshake)
703
+ - No watch support in sync mode (separate spawn needed)
704
+ - Process spawn overhead (~20-50ms per call)
705
+ - Max buffer limits on large responses (configurable via `KRADLE_KUBECTL_MAX_BUFFER_BYTES`)
706
+
707
+ **Mitigation:**
708
+ - `kubernetes-controller-async.js` uses `spawn()` + `Promise.all()` for parallel queries
709
+ - Snapshot cache means most page loads skip kubectl entirely
710
+ - `getPartialSnapshot()` queries only needed kinds (not all 76)
711
+ - `KRADLE_KUBECTL_TIMEOUT_MS` (default 3s) prevents hung processes
712
+ - `runKubectlAsync()` for non-blocking operations in the async controller
713
+ - In-cluster detection auto-adds `--server`, `--token`, `--certificate-authority`
714
+ flags (no kubeconfig file needed in-cluster)
715
+
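The in-cluster detection could look roughly like this. It is a sketch, not the actual implementation: `inClusterFlags` is a hypothetical helper, and it assumes the caller has already read the service account token from the standard mount path. The environment variables and file locations are the standard ones Kubernetes injects into pods.

```javascript
// Standard service account mount directory inside a pod.
const SERVICE_ACCOUNT_DIR = '/var/run/secrets/kubernetes.io/serviceaccount';

// Hypothetical helper: returns extra kubectl flags when running in-cluster,
// or an empty array when the pod environment is not detected.
// Assumes an IPv4 address or hostname (IPv6 hosts would need brackets).
function inClusterFlags(env, token) {
  const host = env.KUBERNETES_SERVICE_HOST;
  const port = env.KUBERNETES_SERVICE_PORT;
  if (!host || !port || !token) return [];
  return [
    `--server=https://${host}:${port}`,
    `--token=${token}`,
    `--certificate-authority=${SERVICE_ACCOUNT_DIR}/ca.crt`,
  ];
}

const flags = inClusterFlags(
  { KUBERNETES_SERVICE_HOST: '10.0.0.1', KUBERNETES_SERVICE_PORT: '443' },
  'example-token',
);
console.log(flags);
```

Outside a cluster the helper returns no flags, so kubectl falls back to the ambient kubeconfig.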
716
+ **Future Options:**
717
+ - Could add in-cluster HTTP client using service account token at
718
+ `/var/run/secrets/kubernetes.io/serviceaccount/token`
719
+ - Could use `@kubernetes/client-node` for watch-only operations
720
+ - Could implement a sidecar proxy that exposes a local HTTP API
721
+
722
+ ---
723
+
724
+ ### 3.3 Why Next.js App Router (vs Pages Router or Remix)
725
+
726
+ **Decision:** Next.js 16 with App Router, React 19 server components.
727
+
728
+ **Rationale:**
729
+ - Server-side rendering for dashboard pages (no client-side waterfall)
730
+ - Streaming responses for progressive rendering
731
+ - Parallel data loading (multiple server components fetch independently)
732
+ - File-based routing matches the 7-module page structure
733
+ - React Server Components reduce client bundle size
734
+ - Built-in API routes for proxy endpoints (Atlas, auth callbacks)
735
+
736
+ **Trade-off:**
737
+ - Complex server/client boundary (`'use client'` directive management)
738
+ - Cannot pass functions or non-serializable props from server to client components
739
+ - Larger initial bundle than SPA alternatives
740
+ - Build time increases with page count
741
+ - Turbopack compatibility issues in Docker builds (fallback to webpack)
742
+
743
+ **Mitigation:**
744
+ - Clear module split: `lib/pages/` (server), `components/` (client where needed)
745
+ - `'use client'` only on interactive components (forms, drag-drop, modals)
746
+ - SDK resolveAlias for monorepo imports (relative path workaround)
747
+ - Standalone output mode reduces Docker image size
748
+ - `export const metadata` pattern for static head content
749
+
750
+ **Gotcha:**
751
+ - `export const metadata` cannot coexist with barrel re-exports
752
+ (must be in the page file itself, not re-exported from index)
753
+ - `export const dynamic = 'force-dynamic'` required on proxy routes (Atlas, auth)
754
+ - `process.env` access works differently in server components vs route handlers
755
+
756
+ ---
757
+
758
+ ### 3.4 Why Pure ESM JavaScript (vs TypeScript)
759
+
760
+ **Decision:** Zero TypeScript across all kradle packages. Pure `.js`/`.jsx` with
761
+ JSDoc annotations for type information.
762
+
763
+ **Rationale:**
764
+ - No build step for core package (run directly with `node`)
765
+ - Instant startup (no compilation delay in development)
766
+ - Simpler debugging (source maps not needed, line numbers match)
767
+ - No `tsconfig.json` complexity (path aliases, module resolution, strict modes)
768
+ - JSDoc types are optional and incremental (add where valuable)
769
+ - Core package uses only Node.js built-in modules (no bundler needed)
770
+
771
+ **Trade-off:**
772
+ - No compile-time type safety
773
+ - JSDoc is more verbose than TypeScript type annotations
774
+ - IDE IntelliSense weaker for complex types (generics, discriminated unions)
775
+ - Refactoring tools less reliable without type information
776
+ - No `enum` or `interface` -- must use `@typedef` or constants
777
+
778
+ **Mitigation:**
779
+ - Comprehensive test suite (1440+ tests across core, SDK, CLI, web)
780
+ - Controller boundary declarations (BOUNDARY objects) enforce contracts at runtime
781
+ - `validateResource()` checks required fields at runtime
782
+ - Every controller factory function documents its API via JSDoc
783
+ - Resource model has `requiredSpec` arrays that enforce schema at create/apply time
784
+
785
+ ---
786
+
787
+ ### 3.5 Why Stale-While-Revalidate (vs K8s Watch Streams)
788
+
789
+ **Decision:** 30s TTL cache with stale-while-revalidate pattern for all
790
+ Kubernetes resource reads.
791
+
792
+ **Rationale:**
793
+ - Simple implementation (Map-based cache, no persistent connections)
794
+ - Works with kubectl (no long-lived watch connections needed)
795
+ - Handles cold starts gracefully (first request blocks, subsequent requests use the cache)
796
+ - Predictable memory usage (one snapshot per org)
797
+ - No reconnection logic needed for reads
798
+
799
+ **Trade-off:**
800
+ - 30s staleness window (UI may show outdated data)
801
+ - Cache miss on first request after deploy or cache clear
802
+ - Background revalidation fires even when no one is watching
803
+ - Multiple simultaneous requests may all miss cache (thundering herd)
804
+
805
+ **Mitigation:**
806
+ - `clearSnapshotCache()` called on every write (apply/delete) -- immediate consistency for mutations
807
+ - Per-org isolation: cache key includes organization parameter
808
+ - `revalidating` flag prevents duplicate background fetches
809
+ - `staleMs` threshold (5x TTL = 150s) before forcing blocking revalidation
810
+ - Configurable via `KRADLE_SNAPSHOT_CACHE_TTL_MS` env var
811
+
812
+ **Future:**
813
+ - `watchResourceChanges()` is implemented in `kubernetes-controller-async.js`
814
+ - Watches key kinds (Organization, AgentStack, AgentSession) and clears cache on change
815
+ - Not yet wired to the web layer HTTP server
816
+ - Could enable near-real-time UI updates without polling
817
+
818
+ ---
819
+
820
+ ### 3.6 Why SDK Re-export Layer (vs Direct Imports)
821
+
822
+ **Decision:** `@a5c-ai/kradle-sdk` re-exports from core, and web/CLI import
823
+ from the SDK rather than directly from core internals.
824
+
825
+ **Rationale:**
826
+ - Decouples web/CLI from internal core file paths
827
+ - Single import target for consumers (`import { ... } from '@a5c-ai/kradle-sdk'`)
828
+ - SDK can expose a stable API surface while core refactors internally
829
+ - SDK adds web-specific helpers (UI model mappers, auth wrappers)
830
+ - Clear dependency direction: web -> SDK -> core
831
+
832
+ **Trade-off:**
833
+ - Extra level of indirection for simple re-exports
834
+ - Turbopack/webpack need `resolveAlias` configuration for monorepo resolution
835
+ - Circular dependency risk if SDK imports from web accidentally
836
+ - Version coupling (SDK must update when core export shapes change)
837
+
838
+ **Gotcha:**
839
+ - Turbopack requires **relative path** in resolveAlias (not absolute path)
840
+ - The alias target is `'../sdk/src/index.js'` from the web package root
841
+ - Monorepo workspace root must be correctly set in `next.config.js`
842
+ - A wrong path does not fail the build; the module-not-found error only surfaces at runtime
843
+
844
+ ---
845
+
846
+ ### 3.7 Why x-kubernetes-preserve-unknown-fields (vs Strict Schemas)
847
+
848
+ **Decision:** All CRD specs use `x-kubernetes-preserve-unknown-fields: true`
849
+ allowing arbitrary additional fields.
850
+
851
+ **Rationale:**
852
+ - Rapid iteration: UI can add new spec fields without CRD redeployment
853
+ - Forward-compatible: older CRD versions accept newer resource manifests
854
+ - No validation failures during development cycles
855
+ - Reduces coupling between Helm chart releases and feature development
856
+ - Enables spec exploration (prototype fields in UI, formalize later)
857
+
858
+ **Trade-off:**
859
+ - No server-side validation of field names or types
860
+ - Typos in spec field names silently accepted (e.g. `organisationRef` vs `organizationRef`)
861
+ - No OpenAPI schema generation for CRD fields
862
+ - `kubectl explain` shows no field documentation
863
+ - etcd stores whatever is submitted (no normalization)
864
+
865
+ **Mitigation:**
866
+ - Client-side validation in controllers (`validate*` functions)
867
+ - `requiredSpec` arrays in RESOURCE_DEFINITIONS enforce mandatory fields at apply time
868
+ - Test suite covers all valid/invalid field combinations
869
+ - Controller boundary declarations document expected spec shapes
870
+ - UI forms constrain input to valid fields
871
+
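The `requiredSpec` check can be sketched as below, assuming a simplified definition table. The field list for `AgentStack` and the return shape are illustrative assumptions, not the real `RESOURCE_DEFINITIONS` or the real `validateResource()` signature.

```javascript
// Hypothetical subset; the required field names are illustrative.
const RESOURCE_DEFINITIONS = {
  AgentStack: { requiredSpec: ['organizationRef', 'adapter', 'model'] },
};

// Reports every required spec field that is missing or empty.
function validateResource(resource, definitions = RESOURCE_DEFINITIONS) {
  const definition = definitions[resource?.kind];
  if (!definition) return { valid: false, errors: [`unknown kind: ${resource?.kind}`] };
  const spec = resource.spec ?? {};
  const errors = definition.requiredSpec
    .filter((field) => spec[field] === undefined || spec[field] === null || spec[field] === '')
    .map((field) => `spec.${field} is required`);
  return { valid: errors.length === 0, errors };
}

// The API server would accept this typo silently; the runtime check catches it.
const result = validateResource({
  kind: 'AgentStack',
  metadata: { name: 'demo' },
  spec: { organisationRef: 'acme', adapter: 'claude', model: 'example-model' },
});
console.log(result.errors); // ['spec.organizationRef is required']
```

A check like this only catches missing required fields. A misspelled optional field would still pass, which is why the trade-off above remains even with mitigation.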
872
+ **Note:** Helm only installs CRDs on `helm install`, not `helm upgrade`.
873
+ Explicit `kubectl apply -f packages/kradle/charts/crds/ --server-side --force-conflicts`
874
+ is needed in CI before helm upgrade to update CRD definitions.
875
+
876
+ ---
877
+
878
+ ### 3.8 Why Single-Container-Per-Role (api + controllers + web + webhook-worker)
879
+
880
+ **Decision:** 4 deployment containers (roles), each running a different entry
881
+ command from the same or related images.
882
+
883
+ **Actual Layout:**
884
+ | Role | Image | Entry Command | Port |
885
+ |------|-------|--------------|------|
886
+ | api | kradle-controller | `node src/http-server.js` | 3080 |
887
+ | controllers | kradle-controller | `node src/control-plane.js` | - |
888
+ | web | kradle-web | `node .next/standalone/server.js` | 3000 |
889
+ | webhook-worker | kradle-controller | `node src/external/webhook-controller.js` | - |
890
+
891
+ **Rationale:**
892
+ - Separation of concerns (API serving vs background reconciliation vs UI)
893
+ - Independent scaling (web can scale to 3 replicas while api stays at 1)
894
+ - Failure isolation (webhook worker crash doesn't affect UI)
895
+ - Resource tuning (web needs more memory for SSR, api needs more CPU)
896
+
897
+ **Trade-off:**
898
+ - More pods consuming cluster resources
899
+ - More complexity in Helm chart (4 Deployments, 4 Services)
900
+ - Internal service discovery needed (web → api via `KRADLE_CONTROLLER_URL`)
901
+ - Shared code must be in core package (duplicated in controller image)
902
+
903
+ **Actual State:**
904
+ - api and controllers share the `kradle-controller` image (same codebase, different entrypoint)
905
+ - web uses the `kradle-web` image (Next.js standalone build)
906
+ - webhook-worker is architecturally separate but shares the controller image
907
+
908
+ ---
909
+
910
+ ## 4. System Nuances & Gotchas
911
+
912
+ ### 4.1 Namespace Discovery Fallback Chain
913
+
914
+ The system must determine which Kubernetes namespaces to query for org-scoped
915
+ resources. The resolution follows a priority chain:
916
+
917
+ **Step 1:** Check Organization resources in platform namespace
918
+ ```javascript
919
+ organizations.map(org => org.spec?.namespaceName || org.metadata?.labels?.['kradle.a5c.ai/namespace'])
920
+ ```
921
+ If Organization resources exist, derive namespaces from their `spec.namespaceName` field.
922
+
923
+ **Step 2:** Check OrgNamespaceBinding resources
924
+ ```javascript
925
+ bindings.map(binding => binding.spec?.namespace || binding.metadata?.labels?.['kradle.a5c.ai/namespace'])
926
+ ```
927
+ Bindings explicitly declare the namespace for each org.
928
+
929
+ **Step 3:** Environment variable fallback
930
+ ```javascript
931
+ if (process.env.KRADLE_ADMIN_ORG) fallbackOrgs.add(orgNamespaceName(process.env.KRADLE_ADMIN_ORG));
932
+ fallbackOrgs.add(orgNamespaceName(process.env.KRADLE_ORG || 'default'));
933
+ // Result: ['kradle-org-admin', 'kradle-org-default']
934
+ ```
935
+
936
+ **Step 4:** Last resort
937
+ ```javascript
938
+ return [KRADLE_PLATFORM_NAMESPACE]; // 'kradle-system'
939
+ ```
940
+
941
+ **WHY this chain exists:** Fresh deployments have no Organization resource yet
942
+ (it's created on first admin login), but the UI needs to list resources in
943
+ an org namespace. The fallback ensures the system bootstraps correctly.
944
+
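Put together, the four steps might be combined as in the sketch below. It assumes that the first step yielding any namespace wins, that `orgNamespaceName` prefixes the org name with `kradle-org-` (inferred from the example result in Step 3), and that the environment is passed in as a plain object. `discoverNamespaces` is a hypothetical name.

```javascript
const KRADLE_PLATFORM_NAMESPACE = 'kradle-system';
const NAMESPACE_LABEL = 'kradle.a5c.ai/namespace';

// Assumption: naming inferred from the example result in Step 3.
const orgNamespaceName = (org) => `kradle-org-${org}`;

// Drops empty values and duplicates while preserving order.
const unique = (values) => [...new Set(values.filter(Boolean))];

function discoverNamespaces({ organizations = [], bindings = [], env = {} } = {}) {
  // Step 1: Organization resources.
  const fromOrgs = unique(
    organizations.map((org) => org.spec?.namespaceName || org.metadata?.labels?.[NAMESPACE_LABEL]),
  );
  if (fromOrgs.length > 0) return fromOrgs;

  // Step 2: OrgNamespaceBinding resources.
  const fromBindings = unique(
    bindings.map((binding) => binding.spec?.namespace || binding.metadata?.labels?.[NAMESPACE_LABEL]),
  );
  if (fromBindings.length > 0) return fromBindings;

  // Step 3: environment variable fallback.
  const fallbackOrgs = new Set();
  if (env.KRADLE_ADMIN_ORG) fallbackOrgs.add(orgNamespaceName(env.KRADLE_ADMIN_ORG));
  fallbackOrgs.add(orgNamespaceName(env.KRADLE_ORG || 'default'));
  if (fallbackOrgs.size > 0) return [...fallbackOrgs];

  // Step 4: last resort (defensive; Step 3 always yields a name in this sketch).
  return [KRADLE_PLATFORM_NAMESPACE];
}

console.log(discoverNamespaces({ env: { KRADLE_ADMIN_ORG: 'admin' } }));
```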
945
+ **Edge cases:**
946
+ - Multiple orgs: all discovered namespaces are queried (flat merge, no hierarchy)
947
+ - Namespace doesn't exist yet: kubectl returns empty list (no error with `--ignore-not-found`)
948
+ - Stale bindings: namespace listed but org deleted → empty results (harmless)
949
+
950
+ ---
951
+
952
+ ### 4.2 KRADLE_CONTROLLER_URL Indirection
953
+
954
+ **Architecture:**
955
+ ```
956
+ Browser → Web Container (Next.js) → API Container (HTTP server)
957
+
958
+ kubectl → K8s API
959
+ ```
960
+
961
+ **How it works:**
962
+ - Web container has `KRADLE_CONTROLLER_URL` env var pointing to api's internal K8s Service URL
963
+ (e.g. `http://kradle-api.kradle-system.svc.cluster.local:80`)
964
+ - Web does not run kubectl for resource operations (no kubeconfig mounted in web container); the auth callback is the one exception
965
+ - All resource operations go through fetch() to the api container
966
+
967
+ **If api container is down:**
968
+ - Web's server-side data fetching returns clean error model
969
+ - Pages render with error state (not a crash/500)
970
+ - No kubectl fallback from web container (by design)
971
+
972
+ **If api returns degraded data:**
973
+ - Web may probe local snapshot for comparison (modelResourceScore heuristic)
974
+ - Used to detect api container serving stale cache vs fresh data
975
+ - Not a correctness requirement, just a freshness indicator
976
+
977
+ **Why not direct kubectl from web?**
978
+ - Security: web container is publicly exposed (ingress), kubectl access is dangerous
979
+ - Image size: web image carries kubectl only for the auth callback, not for general resource access
980
+ - Separation: web handles presentation, api handles data operations
981
+ - Exception: `registerLoginProfile()` in auth callback does use kubectl (web image
982
+ includes kubectl for this single operation -- registers User/IdentityMapping on login)
983
+
984
+ ---
985
+
986
+ ### 4.3 Cache + Write Interaction
987
+
988
+ **Write path (mutation):**
989
+ ```
990
+ applyResource(resource) / deleteResource(kind, name)
991
+ → kubectl apply / delete
992
+ → clearSnapshotCache() // Invalidate ALL cached data
993
+ → globalEventBus.emitResourceChange(kind, name, 'apply'|'delete')
994
+ → SSE clients receive notification
995
+ → Next page load fetches fresh data from kubectl
996
+ ```
997
+
998
+ **Read path (query):**
999
+ ```
1000
+ Page server component calls controller
1001
+ → staleWhileRevalidate(org, revalidateFn)
1002
+ → If cache fresh (< 30s): return immediately
1003
+ → If cache stale (30s-150s): return stale, refresh in background
1004
+ → If cache too old (> 150s) or missing: block on fresh fetch
1005
+ ```
1006
+
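The read path above can be sketched as a small Map-based cache. This is an illustration under assumptions, not the actual `staleWhileRevalidate` implementation: `classify`, `createSnapshotCache`, and the entry shape are hypothetical, and the clock is injectable to keep the logic testable.

```javascript
const TTL_MS = 30_000;
const STALE_FACTOR = 5; // stale window ends at 5 x TTL = 150s

// Classifies a cache entry by age, following the three branches above.
function classify(entry, now, ttlMs = TTL_MS) {
  if (!entry) return 'missing';
  const age = now - entry.fetchedAt;
  if (age < ttlMs) return 'fresh';
  if (age <= ttlMs * STALE_FACTOR) return 'stale';
  return 'expired';
}

function createSnapshotCache({ ttlMs = TTL_MS, now = Date.now } = {}) {
  const entries = new Map(); // one entry per org

  return {
    async read(org, fetchSnapshot) {
      const entry = entries.get(org);
      const state = classify(entry, now(), ttlMs);
      if (state === 'fresh') return entry.data;
      if (state === 'stale') {
        // Serve stale data; start at most one background refresh per entry.
        if (!entry.revalidating) {
          entry.revalidating = true;
          fetchSnapshot(org)
            .then((data) => entries.set(org, { data, fetchedAt: now(), revalidating: false }))
            .catch(() => { entry.revalidating = false; });
        }
        return entry.data;
      }
      // Missing or too old: block on a fresh fetch.
      const data = await fetchSnapshot(org);
      entries.set(org, { data, fetchedAt: now(), revalidating: false });
      return data;
    },
    // Global invalidation, as done after every write.
    clear() { entries.clear(); },
  };
}
```

In this sketch a failed background refresh simply resets the flag, so the next stale read retries.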
1007
+ **Key behaviors:**
1008
+ - `clearSnapshotCache()` clears ALL orgs (global invalidation)
1009
+ - `clearOrgCache(org)` clears single org (surgical invalidation)
1010
+ - Per-org cache isolation: different orgs don't interfere
1011
+ - `revalidating` flag prevents duplicate background refreshes (only one runs at a time)
1012
+ - Write + immediate read: always gets fresh data (cache cleared on write)
1013
+
1014
+ **Race condition:**
1015
+ If two users write simultaneously:
1016
+ 1. User A writes → cache cleared
1017
+ 2. User B writes → cache cleared (already empty)
1018
+ 3. User A reads → fresh fetch, sets cache
1019
+ 4. User B reads → gets User A's fresh data (which includes both writes if kubectl returned both)
1020
+
1021
+ This is safe because kubectl always returns the latest server state.
1022
+
1023
+ ---
1024
+
1025
+ ### 4.4 Auth Cookie Security
1026
+
1027
+ **Cookie creation:**
1028
+ ```javascript
1029
+ // Payload: base64url(JSON({ provider, subject, user }))
1030
+ // With secret: payload.hmac-sha256(payload, secret) → base64url
1031
+ // Without secret: payload only (plain base64url, no signature)
1032
+ ```
1033
+
1034
+ **Verification matrix:**
1035
+
1036
+ | Cookie State | Secret Configured | Result |
1037
+ |-------------|-------------------|--------|
1038
+ | Signed | Yes | Verify HMAC, constant-time compare |
1039
+ | Signed | No | Reject (can't verify) |
1040
+ | Unsigned | Yes | Reject (could be tampered) |
1041
+ | Unsigned | No | Accept (backward compatible) |
1042
+
1043
+ **Security properties:**
1044
+ - HMAC-SHA256 signing ONLY when `KRADLE_SESSION_SECRET` env var is set
1045
+ - Without secret: cookie is unsigned base64url, so anyone can forge it (development only)
1046
+ - Constant-time comparison via `crypto.timingSafeEqual` (prevents timing attacks)
1047
+ - HttpOnly flag (no JavaScript access)
1048
+ - SameSite=Lax (prevents CSRF from cross-origin POST)
1049
+ - No Secure flag by default (set at ingress/proxy level)
1050
+
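The verification matrix above can be expressed in a few lines of Node.js. This is a sketch under assumptions: `createSessionCookie` and `sign` are illustrative names, and the real `parseSessionCookie` may differ in signature and error handling. Only `crypto.createHmac` and `crypto.timingSafeEqual` are known Node.js APIs here.

```javascript
const crypto = require('node:crypto');

// HMAC-SHA256 over the base64url payload, itself encoded as base64url.
function sign(payload, secret) {
  return crypto.createHmac('sha256', secret).update(payload).digest('base64url');
}

// Hypothetical helper: builds "payload.signature" when a secret is configured.
function createSessionCookie(session, secret) {
  const payload = Buffer.from(JSON.stringify(session)).toString('base64url');
  return secret ? `${payload}.${sign(payload, secret)}` : payload;
}

// Returns the session object, or null for every rejected row of the matrix.
function parseSessionCookie(cookie, secret) {
  if (typeof cookie !== 'string' || cookie.length === 0) return null;
  const [payload, signature, ...rest] = cookie.split('.');
  if (rest.length > 0 || !payload) return null;
  const signed = signature !== undefined;
  // Signed cookie without secret, or unsigned cookie with secret: reject.
  if (signed !== Boolean(secret)) return null;
  if (signed) {
    const expected = Buffer.from(sign(payload, secret));
    const actual = Buffer.from(signature);
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    if (expected.length !== actual.length) return null;
    if (!crypto.timingSafeEqual(expected, actual)) return null;
  }
  try {
    return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
  } catch {
    return null; // malformed payload is treated like a tampered cookie
  }
}
```

The base64url alphabet never contains `.`, which is why a single dot can safely separate payload and signature.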
1051
+ **Tampered cookie handling:**
1052
+ - Invalid HMAC → `parseSessionCookie` returns `null`
1053
+ - null session → middleware rejects request
1054
+ - Rejection → 307 redirect to `/login`
1055
+ - No error message exposed (prevents oracle attacks)
1056
+
1057
+ ---
1058
+
1059
+ ### 4.5 CRD Lifecycle in CI
1060
+
1061
+ **Problem:** Helm's CRD handling has a well-known limitation:
1062
+ - `helm install` -- applies CRDs from the `crds/` directory
1063
+ - `helm upgrade` -- does NOT update CRDs (by design, to prevent data loss)
1064
+ - `helm uninstall` -- does NOT delete CRDs (by design, to prevent data loss)
1065
+
1066
+ **CI workaround:**
1067
+ ```bash
1068
+ # Before helm upgrade, explicitly apply CRDs
1069
+ kubectl apply -f packages/kradle/charts/crds/ --server-side --force-conflicts
1070
+ ```
1071
+
1072
+ `--server-side` enables server-side apply (handles field ownership correctly).
1073
+ `--force-conflicts` resolves ownership conflicts (Helm vs kubectl managers).
1074
+
1075
+ **Implications for development:**
1076
+ - Adding a new field to an existing CRD: no CRD redeployment needed (preserve-unknown-fields)
1077
+ - Adding a new CRD kind: must deploy the CRD yaml file before resources can be created
1078
+ - Removing a field from CRD: preserve-unknown-fields means old resources still have the field
1079
+ - Changing field type in CRD: no validation exists, so no conflict (but client code may break)
1080
+
1081
+ **Best practices:**
1082
+ - Always add new kinds in the same PR that adds the CRD yaml
1083
+ - CI pipeline runs CRD apply before helm upgrade
1084
+ - Never rename CRD group/version/plural (breaks all existing resources)
1085
+ - Use annotations to mark deprecated fields (spec.deprecated.fieldName: "reason")
1086
+
1087
+ ---
1088
+
1089
+ ### 4.6 Org-Scoped vs Platform-Scoped Resources
1090
+
1091
+ **Platform-scoped resources** (exist in kradle-system namespace only):
1092
+ - `Organization` -- represents an org identity
1093
+ - `OrgNamespaceBinding` -- binds org to a namespace
1094
+
1095
+ These are special because they exist "above" org namespaces -- they define
1096
+ the org structure itself.
1097
+
1098
+ **Org-scoped resources** (exist in kradle-org-{slug} namespaces):
1099
+ - All other 74 resource kinds
1100
+ - Always have `spec.organizationRef` field
1101
+ - Namespace derived from org: `kradle-org-${normalizeOrgSlug(org)}`
1102
+
1103
+ **Enforcement:**
1104
+ ```javascript
1105
+ // In withOrgScope():
1106
+ if (resource.metadata?.namespace && resource.metadata.namespace !== namespace) {
1107
+ throw new Error(`namespace ${resource.metadata.namespace} does not match organization ${org}`);
1108
+ }
1109
+ ```
1110
+
1111
+ Cross-org denial: `applyResource()` calls `withOrgScope()` which rejects any
1112
+ resource whose explicit namespace conflicts with its `organizationRef`.
1113
+
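For context, a fuller version of the scoping check might look like the sketch below. The body of `normalizeOrgSlug` and the decision to stamp `spec.organizationRef` onto the returned resource are assumptions made for illustration; only the namespace pattern and the error message come from the text above.

```javascript
// Assumed normalization: lowercase, non-alphanumerics collapsed to dashes.
function normalizeOrgSlug(org) {
  return String(org)
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Returns a copy of the resource pinned to the org's namespace,
// or throws if the resource explicitly names a different namespace.
function withOrgScope(resource, org) {
  const namespace = `kradle-org-${normalizeOrgSlug(org)}`;
  if (resource.metadata?.namespace && resource.metadata.namespace !== namespace) {
    throw new Error(`namespace ${resource.metadata.namespace} does not match organization ${org}`);
  }
  return {
    ...resource,
    metadata: { ...resource.metadata, namespace },
    spec: { ...resource.spec, organizationRef: org }, // assumption: ref is stamped here
  };
}
```

A resource with no explicit namespace is accepted and defaulted; only an explicit mismatch is rejected.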
1114
+ **KRADLE_RESOURCES array has `platformScoped: true` flag:**
1115
+ - Platform-scoped: only queried from `KRADLE_PLATFORM_NAMESPACE` (kradle-system)
1116
+ - Org-scoped: queried from all discovered org namespaces
1117
+
1118
+ **Multi-org queries:**
1119
+ Snapshot fetches resources from ALL org namespaces. The flattened result includes
1120
+ resources from all orgs. UI filters by `spec.organizationRef` for the current org view.
1121
+
1122
+ ---
1123
+
1124
+ ### 4.7 Web Container Architecture
1125
+
1126
+ **Dockerfile structure (multi-stage):**
1127
+ ```dockerfile
1128
+ # Stage 1: deps - install node_modules
1129
+ FROM node:20 AS deps
+ WORKDIR /app
1130
+ COPY package*.json ./
1131
+ RUN npm ci
1132
+
1133
+ # Stage 2: build - Next.js production build
1134
+ FROM node:20 AS build
+ WORKDIR /app
1135
+ COPY --from=deps /app/node_modules ./node_modules
1136
+ COPY . .
1137
+ RUN npm run build
1138
+
1139
+ # Stage 3: runtime - minimal production image
1140
+ FROM node:20-slim AS runtime
+ WORKDIR /app
1141
+ COPY --from=build /app/.next/standalone ./
1142
+ COPY --from=build /app/.next/static ./.next/static
1143
+ COPY --from=build /app/public ./public
1144
+ # kubectl for auth callback
1145
+ COPY --from=bitnami/kubectl:latest /opt/bitnami/kubectl/bin/kubectl /usr/local/bin/
1146
+ ```
1147
+
1148
+ **Build uses webpack (not turbopack) for Docker:**
1149
+ - Turbopack has issues with Docker layer caching
1150
+ - Webpack is more predictable in CI environments
1151
+ - `--webpack` flag in `next build` command
1152
+
1153
+ **Runtime image includes kubectl:**
1154
+ - Needed for `registerLoginProfile()` during auth callback
1155
+ - Called once per user login (not hot path)
1156
+ - Could be removed if auth moved fully to api container
1157
+
1158
+ **SDK resolution via turbopack resolveAlias:**
1159
+ ```javascript
1160
+ // next.config.js
1161
+ experimental: {
1162
+ turbopack: {
1163
+ resolveAlias: {
1164
+ '@a5c-ai/kradle-sdk': '../sdk/src/index.js' // MUST be relative
1165
+ }
1166
+ }
1167
+ }
1168
+ ```
1169
+
1170
+ **Standalone output mode:**
1171
+ - `.next/standalone/` contains the full Node.js app plus only the traced dependencies it needs (no separate `npm install` at runtime)
1172
+ - Reduces Docker image from ~1GB to ~150MB
1173
+ - Entry: `node .next/standalone/server.js`
1174
+ - Static assets served from `.next/static/` (can be CDN-fronted)
1175
+
1176
+ ---
1177
+
1178
+ ## 5. Integration Gaps (Known, Documented)
1179
+
1180
+ The following integration gap categories track areas where kradle has resource
1181
+ definitions and controller logic but previously lacked (or still lacks) runtime
1182
+ integration. Items marked **RESOLVED** have been implemented in the K8s Job
1183
+ dispatch architecture.
1184
+
1185
+ ### Gap 1: Session Lifecycle Sync — RESOLVED
1186
+
1187
+ **Previous state:**
1188
+ Created `AgentSession` resources with `spec.agentMuxSessionId`. Status depended
1189
+ on mux client responses; if mux was unavailable, status stayed `Pending`.
1190
+
1191
+ **Resolution:**
1192
+ Agent pods now POST directly to the Kradle callback endpoint
1193
+ (`POST /api/orgs/{org}/agents/runs/{name}/callback`) on completion.
1194
+ `persistSessionEvent()` applies the result to `AgentSession` and `AgentDispatchRun`
1195
+ in a single atomic update. No polling or mux webhook receiver needed.
1196
+
1197
+ ---
1198
+
1199
+ ### Gap 2: Adapter Capability Caching
1200
+
1201
+ **What kradle does now:**
1202
+ `queryCapabilities(adapter)` called during stack reconciliation. Result is used
1203
+ once and discarded (no caching of adapter capabilities).
1204
+
1205
+ **What it should do:**
1206
+ Cache adapter capabilities with TTL, invalidate on adapter CRD changes.
1207
+ Reduces mux API calls during frequent stack reconciliation cycles.
1208
+
1209
+ **What blocks it:**
1210
+ No cache layer between mux client and stack controller. Low priority since
1211
+ mux is usually fast and capabilities rarely change.
1212
+
1213
+ **Estimated effort:** 0.5 day (add to snapshot cache pattern)
1214
+
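A TTL cache of the kind described for this gap could be as small as the sketch below. Everything here is hypothetical: `createCapabilityCache`, the `adapter.name` key, and the 60s default TTL are illustrative choices, and `queryCapabilities` is passed in rather than imported because its real signature is not shown.

```javascript
// Hypothetical TTL cache wrapped around an injected queryCapabilities function.
function createCapabilityCache(queryCapabilities, ttlMs = 60_000, now = Date.now) {
  const entries = new Map(); // adapter name -> { value, expiresAt }
  return {
    // Returns cached capabilities while the entry is within its TTL.
    async get(adapter) {
      const hit = entries.get(adapter.name);
      if (hit && hit.expiresAt > now()) return hit.value;
      const value = await queryCapabilities(adapter);
      entries.set(adapter.name, { value, expiresAt: now() + ttlMs });
      return value;
    },
    // Call this when the adapter CRD changes.
    invalidate(adapterName) {
      entries.delete(adapterName);
    },
  };
}
```

Wiring `invalidate` to the existing resource-change event for adapter kinds would cover the "invalidate on adapter CRD changes" requirement without a second cache layer.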
1215
+ ---
1216
+
1217
+ ### Gap 3: Transport Resolution and Codec Injection — RESOLVED
1218
+
1219
+ **Previous state:**
1220
+ Validated `AgentTransportBinding` spec, but never activated the transport runtime
1221
+ or injected settings into agent processes.
1222
+
1223
+ **Resolution:**
1224
+ `resolveTransport(stack, resources)` reads the `AgentTransportBinding` referenced
1225
+ by the stack's adapter and injects `AGENT_MUX_TRANSPORT` and `TRANSPORT_MUX_CODEC`
1226
+ as environment variables directly into the `batch/v1` Job spec via `createAgentJob()`.
1227
+ Agent pods receive transport configuration at startup without any manual wiring.
1228
+
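To make the injection concrete, here is a rough sketch of a Job manifest carrying the two transport variables. `buildAgentJobManifest` stands in for `createAgentJob()`, whose real signature is not shown; the container name, `backoffLimit`, and the parameter names are assumptions. The `agent-{runName}` naming follows the convention described elsewhere in this document.

```javascript
// Hypothetical stand-in for createAgentJob(): returns a batch/v1 Job manifest.
function buildAgentJobManifest({ runName, namespace, image, transport, codec }) {
  return {
    apiVersion: 'batch/v1',
    kind: 'Job',
    metadata: { name: `agent-${runName}`, namespace },
    spec: {
      backoffLimit: 0, // assumption: one attempt per dispatch run
      template: {
        spec: {
          restartPolicy: 'Never',
          containers: [
            {
              name: 'agent', // assumed container name
              image,
              env: [
                // Values resolved from the AgentTransportBinding
                { name: 'AGENT_MUX_TRANSPORT', value: transport },
                { name: 'TRANSPORT_MUX_CODEC', value: codec },
              ],
            },
          ],
        },
      },
    },
  };
}
```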
1229
+ ---
1230
+
1231
+ ### Gap 4: Lifecycle Event Emission — RESOLVED
1232
+
1233
+ **Previous state:**
1234
+ `globalEventBus.emitResourceChange()` fired on resource CRUD operations only.
1235
+ Purely internal, resource-level granularity. No agent lifecycle events.
1236
+
1237
+ **Resolution:**
1238
+ `createHooksLifecycleEmitter(bus)` emits 9 structured lifecycle events
1239
+ (RUN_CREATED, RUN_QUEUED, RUN_STARTED, STEP_STARTED, STEP_COMPLETED,
1240
+ APPROVAL_REQUESTED, APPROVAL_GRANTED, APPROVAL_DENIED, RUN_COMPLETED/RUN_FAILED)
1241
+ at the correct dispatch lifecycle points. Events are forwarded to registered
1242
+ `WebhookSubscription` endpoints via the existing webhook delivery system.
1243
+
1244
+ ---
1245
+
1246
+ ### Gap 5: Cost Aggregation — RESOLVED
1247
+
1248
+ **Previous state:**
1249
+ `reconcileTranscript()` summed token usage from SSE events. Cost fields were zero
1250
+ when no events were available.
1251
+
1252
+ **Resolution:**
1253
+ `checkBudget()` + `estimateCost()` enforce budget before dispatch. The agent pod's
1254
+ callback payload includes `costUsd` (actual incurred cost). `persistSessionEvent()`
1255
+ records this against `AgentDispatchRun.status.costUsd`. A hard runtime ceiling is enforced
1256
+ at the infrastructure level via `activeDeadlineSeconds` on the Job spec, so a run's
1257
+ spend stays bounded by its time limit even if the callback is never delivered.
1258
+
1259
+ ---
1260
+
1261
+ ### Gap 6: Real-Time Session Streaming to UI
1262
+
1263
+ **What kradle does now:**
1264
+ Web console shows session status from cached snapshot (30s staleness).
1265
+ SSE endpoint only emits resource-change events (kind/name/operation).
1266
+
1267
+ **What it should do:**
1268
+ Proxy or relay agent step events to the web console for live session
1269
+ viewing (token-by-token streaming).
1270
+
1271
+ **What blocks it:**
1272
+ Web container cannot reach agent pods directly. API container would need to relay
1273
+ agent SSE or WebSocket events to its own SSE endpoint. Significant complexity for
1274
+ multi-subscriber fanout with back-pressure.
1275
+
1276
+ **Estimated effort:** 5-8 days (SSE relay, subscriber management, back-pressure)
1277
+
1278
+ ---
1279
+
1280
+ ### Gap 7: Approval Gate Integration
1281
+
1282
+ **What kradle does now:**
1283
+ `AgentApproval` resources created with spec describing the gate (action, requestedBy).
1284
+ Status updated manually (via UI form or API call). `APPROVAL_REQUESTED` hooks event
1285
+ is now emitted when the gate is created.
1286
+
1287
+ **What it should do:**
1288
+ When approval is granted/denied, automatically resume or cancel the blocked agent Job.
1289
+ Currently approval resolution updates the AgentDispatchRun phase but does not signal
1290
+ the suspended Job pod to continue.
1291
+
1292
+ **What blocks it:**
1293
+ Agent Job pods are not designed to pause mid-execution awaiting approval. The approval
1294
+ gate must be enforced at dispatch time (before Job creation), not within a running pod.
1295
+
1296
+ **Estimated effort:** 1-2 days (pre-dispatch approval gate enforcement tightening)
1297
+
1298
+ ---
1299
+
1300
+ ### Gap 8: Context Bundle Delivery
1301
+
1302
+ **What kradle does now:**
1303
+ `AgentContextBundle` resources store prompt/context snapshots with digest.
1304
+ Created during dispatch, immutable after creation. Bundle digest is stored in
1305
+ the Job's env vars.
1306
+
1307
+ **What it should do:**
1308
+ Deliver the full bundle payload (attachments, provenance, redaction manifest) to
1309
+ the agent pod at startup, not just the digest.
1310
+
1311
+ **What blocks it:**
1312
+ Bundle payloads may be too large for env var injection. Need a signed URL or
1313
+ in-cluster object store reference that the pod can fetch at startup.
1314
+
1315
+ **Estimated effort:** 1-2 days (signed URL generation or object store integration)
1316
+
1317
+ ---
1318
+
1319
+ ### Gap 9: Workspace Mount Coordination — RESOLVED
1320
+
1321
+ **Previous state:**
1322
+ `KradleWorkspace` spec defined volume mounts. `launchSession()` passed
1323
+ `workspace.mountPath` to mux. Assumed workspace was pre-provisioned.
1324
+
1325
+ **Resolution:**
1326
+ The dispatch flow now verifies workspace phase before Job creation.
1327
+ `findReusableWorkspace()` returns only `Ready` workspaces. If none found,
1328
+ `createWorkspace()` provisions a new PVC. The Job is only submitted after the
1329
+ workspace PVC is `Bound`. The PVC is mounted at `/workspace` in the Job pod spec
1330
+ via `getMountSpec()`.
1331
+
1332
+ ---
1333
+
1334
+ ### Gap 10: Trigger-to-Dispatch Pipeline — RESOLVED
1335
+
1336
+ **Previous state:**
1337
+ `AgentTriggerRule` created `AgentDispatchRun` on match, but dispatch run creation
1338
+ was the terminal action — no agent was actually launched.
1339
+
1340
+ **Resolution:**
1341
+ `createManualDispatch()` now continues through the full dispatch flow after creating
1342
+ the run resource: it runs `checkBudget()`, calls `createAgentJob()`, and submits the
1343
+ Job to Kubernetes via `submitAgentJob()`. The trigger-to-execution pipeline is now
1344
+ complete end-to-end.
1345
+
1346
+ ---
1347
+
1348
+ ### Gap 11: Multi-Session Orchestration
1349
+
1350
+ **What kradle does now:**
1351
+ `AgentSubagent` defines child-agent roles with task kinds and tool subsets.
1352
+ Stack reconciliation resolves subagent references.
1353
+
1354
+ **What it should do:**
1355
+ Orchestrate multiple concurrent K8s Jobs (one per subagent) for a single
1356
+ dispatch run. Track progress, handle dependencies between subagent tasks.
1357
+
1358
+ **What blocks it:**
1359
+ No multi-Job coordinator. Would need significant orchestration logic
1360
+ (task graph, dependency resolution, failure handling, rollback).
1361
+
1362
+ **Estimated effort:** 10-15 days (orchestrator, state machine, failure handling)
1363
+
1364
+ ---
1365
+
1366
+ ### Gap 12: Memory Query at Dispatch Time
1367
+
1368
+ **What kradle does now:**
1369
+ `AgentMemorySnapshot` pins memory state at dispatch time. `AgentMemoryQuery`
1370
+ records retrieval requests. Both are CRD resources. A snapshot is created during
1371
+ dispatch if an `AgentMemoryRepository` exists.
1372
+
1373
+ **What it should do:**
1374
+ Automatically inject the resolved memory snapshot content into the context bundle
1375
+ delivered to the agent Job as a mounted file or env var.
1376
+
1377
+ **What blocks it:**
1378
+ Memory content may be large (full git-backed repository). Need efficient delivery
1379
+ mechanism (in-cluster object store or read-only PVC mount).
1380
+
1381
+ **Estimated effort:** 2-3 days (snapshot injection into Job spec)
1382
+
1383
+ ---
1384
+
1385
+ ### Gap 13: Provider Config Resolution
1386
+
1387
+ **What kradle does now:**
1388
+ `AgentProviderConfig` stores API base URLs, auth types, and model rate tables.
1389
+ `checkBudget()` and `estimateCost()` use these rate tables for budget enforcement.
1390
+
1391
+ **What it should do:**
1392
+ Resolve provider config at Job creation time and pass credential references to
1393
+ the agent pod via env var secret refs (`valueFrom.secretKeyRef`).
1394
+
1395
+ **What blocks it:**
1396
+ Security boundary -- kradle should not pass raw API keys in Job env vars.
1397
+ Need `secretKeyRef` protocol (reference existing K8s Secret by name/key).
1398
+
1399
+ **Estimated effort:** 1-2 days (secretKeyRef injection in createAgentJob)
1400
+
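The env entry this gap calls for has a fixed shape in the Kubernetes container spec. The helper below is a sketch; `credentialEnv` and the example names in the usage note are hypothetical, while the `valueFrom.secretKeyRef` structure is the standard Kubernetes `EnvVar` form.

```javascript
// Builds a container env entry that references a Secret key instead of
// embedding the raw credential in the Job spec.
function credentialEnv(envName, secretName, secretKey) {
  if (!envName || !secretName || !secretKey) {
    throw new TypeError('envName, secretName and secretKey are required');
  }
  return {
    name: envName,
    valueFrom: { secretKeyRef: { name: secretName, key: secretKey } },
  };
}
```

For example, `credentialEnv('PROVIDER_API_KEY', 'provider-creds', 'apiKey')` yields an entry that can be appended to the agent container's `env` array; the key itself never appears in the Job manifest, only in the referenced Secret.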
1401
+ ---
1402
+
1403
+ ### Summary of Integration Gaps
1404
+
1405
+ | # | Gap | Status | Effort | Priority |
1406
+ |---|-----|--------|--------|----------|
1407
+ | 1 | Session Lifecycle Sync | **RESOLVED** (callback endpoint) | — | — |
1408
+ | 2 | Adapter Capability Caching | Open | 0.5d | Low |
1409
+ | 3 | Transport Resolution and Codec Injection | **RESOLVED** (env var injection) | — | — |
1410
+ | 4 | Lifecycle Event Emission | **RESOLVED** (9 hooks events) | — | — |
1411
+ | 5 | Cost Aggregation | **RESOLVED** (checkBudget + deadline) | — | — |
1412
+ | 6 | Real-Time Session Streaming | Open | 5-8d | High |
1413
+ | 7 | Approval Gate Integration | Partial (pre-dispatch) | 1-2d | Medium |
1414
+ | 8 | Context Bundle Delivery | Open | 1-2d | Medium |
1415
+ | 9 | Workspace Mount Coordination | **RESOLVED** (PVC mount in Job) | — | — |
1416
+ | 10 | Trigger-to-Dispatch Pipeline | **RESOLVED** (K8s Job submission) | — | — |
1417
+ | 11 | Multi-Session Orchestration | Open | 10-15d | Low |
1418
+ | 12 | Memory Query at Dispatch Time | Partial (snapshot created) | 2-3d | Medium |
1419
+ | 13 | Provider Config Resolution | Partial (rates used, creds pending) | 1-2d | Medium |
1420
+
1421
+ **Remaining open effort:** ~20-33 developer-days (down from 35-60)
1422
+
1423
+ **Resolved critical path:** Gap 10 (Trigger-to-Dispatch) is complete. The basic
1424
+ dispatch-to-completion lifecycle (Gaps 1, 9, 10) is now end-to-end operational
1425
+ via K8s Jobs. The largest remaining gaps (6, 11) are enhancements and do not block it.
1426
+
1427
+ ---
1428
+
1429
+ ## 6. Architectural Choice: Why K8s Jobs (vs DaemonSet, Deployment, Raw Pod)
1430
+
1431
+ ### Decision
1432
+
1433
+ Agent execution uses `batch/v1` Jobs (not DaemonSets, Deployments, or raw Pods).
1434
+
1435
+ ### Rationale
1436
+
1437
+ | Option | Why Rejected |
1438
+ |--------|-------------|
1439
+ | **Raw Pod** | No automatic restart semantics; pod evictions or node failures lose the run. Pod lifecycle not tracked by K8s controller. Manual cleanup required. |
1440
+ | **Deployment** | Designed for long-lived services, not one-shot tasks. Scales by replica count, not by task. Restart policy conflicts with agent "run once to completion" semantics. |
1441
+ | **DaemonSet** | Runs one pod per node — not suitable for per-run isolation. Cannot express per-run resource limits or deadlines. |
1442
+ | **StatefulSet** | Designed for ordered, persistent services. Overkill for ephemeral agent runs. |
1443
+ | **batch/v1 Job (chosen)** | Native support for `completionMode`, `backoffLimit`, `activeDeadlineSeconds`. Finished Jobs are garbage-collected automatically when `ttlSecondsAfterFinished` is set. Pod failure is surfaced as Job failure. Integrates with K8s scheduler for resource-aware placement. Works with cluster autoscaler for node provisioning. |
1444
+
1445
+ ### Why `activeDeadlineSeconds` for Budget Enforcement
1446
+
1447
+ Budget enforcement requires a hard ceiling that survives process crashes, pod
1448
+ restarts, and network partitions. `activeDeadlineSeconds` is enforced by the
1449
+ K8s Job controller at the infrastructure level — even if the agent pod loses
1450
+ connectivity to Kradle, Kubernetes will terminate the pod when the deadline is
1451
+ exceeded and mark the Job as Failed. This gives Kradle a guaranteed out-of-band
1452
+ termination path independent of the callback mechanism.
1453
+
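Since `activeDeadlineSeconds` is a time limit rather than a cost limit, some conversion from budget to seconds is needed. The sketch below shows one possible mapping and is purely an assumption: the document does not say how Kradle derives the deadline, and `deadlineSecondsForBudget`, the per-minute burn rate, and the one-hour cap are all hypothetical.

```javascript
// Hypothetical mapping from a USD budget to a Job deadline in seconds.
// estimatedUsdPerMinute is an assumed worst-case burn rate for the run.
function deadlineSecondsForBudget(budgetUsd, estimatedUsdPerMinute, maxSeconds = 3600) {
  if (!(budgetUsd > 0) || !(estimatedUsdPerMinute > 0)) {
    throw new RangeError('budgetUsd and estimatedUsdPerMinute must be positive');
  }
  const seconds = Math.floor((budgetUsd / estimatedUsdPerMinute) * 60);
  // activeDeadlineSeconds must be a positive integer; also apply a global cap.
  return Math.max(1, Math.min(seconds, maxSeconds));
}
```

With a 2 USD budget and an assumed burn rate of 0.5 USD per minute, the run would get 240 seconds; the result would be placed in `spec.activeDeadlineSeconds` of the Job.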
1454
+ ### Why One Job per Dispatch Run (vs Shared Runner Pool)
1455
+
1456
+ - **Isolation:** Each run gets its own process namespace, filesystem, and network
1457
+ policy. A bug in one agent cannot corrupt another run's workspace.
1458
+ - **Resource accounting:** Job resource requests/limits are per-run, enabling
1459
+ accurate cost tracking and scheduler placement decisions.
1460
+ - **Auditability:** K8s Job name matches dispatch run name (`agent-{runName}`),
1461
+ making cross-referencing between Kradle resources and cluster logs trivial.
1462
+ - **Simplicity:** No need for a long-lived runner daemon to multiplex tasks; K8s
1463
+ handles pod lifecycle, log collection, and cleanup.
1464
+
1465
+ ### Tradeoffs
1466
+
1467
+ - **Cold start latency:** Each run incurs image pull + pod scheduling time (~5-30s
1468
+ depending on cluster). RunnerPool warm replicas can mitigate this for CI Jobs,
1469
+ but agent Jobs currently pay the full cold start cost.
1470
+ - **Cluster resource pressure:** Many concurrent dispatch runs create many pods.
1471
+ Cluster autoscaler must be configured to scale node groups for agent workloads.
1472
+ - **Log retention:** K8s deletes finished Jobs and their pods once `ttlSecondsAfterFinished` elapses. Kradle must ship
1473
+ logs to an external store (or use `persistSessionEvent` artifact logging) before
1474
+ the TTL expires.
1475
+
1476
+
1477
+ ---
1478
+
1479
+ ## KServe Integration: CRD Wrapper Architecture
1480
+
1481
+ **Decision:** Wrap KServe `InferenceService` and `ServingRuntime` CRDs via Kradle resources (`KradleInferenceService`, `KradleServingRuntime`) rather than exposing KServe CRDs directly to Kradle users.
1482
+
1483
+ **Rationale:**
1484
+ - Kradle resources follow the standard Kradle CRD model (`apiVersion: kradle.a5c.ai/v1alpha1`), keeping a consistent API surface for all platform resources
1485
+ - The wrapper validates model formats, generates correct KServe manifests, and abstracts away KServe-specific versioning (`serving.kserve.io/v1beta1`)
1486
+ - `toProviderConfig()` creates a seamless bridge to `AgentProviderConfig`, allowing agent stacks to reference on-cluster models without knowing they are backed by KServe
1487
+ - Enables future backend swap (e.g., Seldon, Triton standalone) without changing consumer API
1488
+
1489
+ **Alternative considered:** Direct KServe CRD exposure via kubectl pass-through. Rejected because it breaks the Kradle CRD model and makes agent integration harder.
1490
+
1491
+ ---
1492
+
1493
+ ## Artifact Registry: Internal + External Dual-Mode Design
1494
+
1495
+ **Decision:** Support both internal storage (Kradle-managed etcd/S3/GCS/Azure Blob) and external integration (GitHub Packages, etc.) within the same `ArtifactRegistry` resource via `storageBackend` and `externalIntegration` fields.
1496
+
1497
+ **Rationale:**
1498
+ - Teams often have existing external registries (npm, PyPI, GitHub Packages) they need to access; forcing migration to internal storage would block adoption
1499
+ - Internal storage (especially S3/GCS) provides full control over retention, access policy, and cost; external-only would lose that control
1500
+ - Mirror mode syncs published versions to external backends, enabling gradual migration or dual-publish workflows
1501
+ - A single resource kind covering both modes reduces API surface complexity
1502
+
1503
+ **Alternative considered:** External-only (proxy to existing registries). Rejected because Kradle needs to enforce access policies and track download analytics that external registries do not expose.
1504
+
1505
+ ---
1506
+
1507
+ ## Assistant Runtime: In-Process vs. K8s Job
1508
+
1509
+ **Decision:** Implement the assistant as an in-process runtime (`assistant-runtime.js`) using the Anthropic API directly, rather than dispatching K8s Jobs.
1510
+
1511
+ **Rationale:**
1512
+ - Assistant chat requires low latency (< 1s to first token); K8s Job startup overhead (5-30s) is unacceptable for interactive chat
1513
+ - Session state (message history) must survive across requests within a session; in-process `globalThis` store achieves this trivially without distributed state management
1514
+ - Streaming (SSE) is native to the Anthropic API and the Next.js runtime; routing through K8s Jobs would require complex proxying
1515
+ - Assistant workloads are lightweight (API calls, no GPU) and do not benefit from the isolation that K8s Jobs provide for agent workloads
1516
+
1517
+ **Alternative considered:** K8s Job dispatch (same as agent dispatch). Rejected due to latency, streaming complexity, and session state management overhead.
1518
+
1519
+ **Trade-off accepted:** In-process runtime means the assistant does not get the resource isolation and budget enforcement of K8s Jobs. This is acceptable because the assistant uses a shared Anthropic API key with org-level rate limiting.
1520
+
1521
+ ---
1522
+
1523
+ ## Integration Gap Status Update
1524
+
1525
+ | Gap | Status | Notes |
1526
+ |-----|--------|-------|
1527
+ | KServe inference | **Implemented** | KradleInferenceService + KradleServingRuntime controllers, 7 API routes, web console pages |
1528
+ | Artifact registry | **Implemented** | 5 resource kinds, 6 API routes, web console pages |
1529
+ | Assistant runtime | **Implemented** | In-process runtime, SSE streaming, session management, 5 API routes |
1530
+ | Auth on mutating routes | **Implemented** | `withAuth` applied to all POST/DELETE/PUT handlers; the HMAC-verified webhook routes and the agent callback are intentionally exempt from `withAuth` |