@a5c-ai/krate 5.0.1-staging.69cb593ea → 5.0.1-staging.6be34ee2a

@@ -0,0 +1,1444 @@
1
+ # Integration & Design Decisions
2
+
3
+ Supplementary specification covering external dependencies, scope boundaries,
4
+ architectural trade-offs, and system nuances for the Krate project.
5
+
6
+ ---
7
+
8
+ ## 1. External Dependencies & Integration Points
9
+
10
+ ### 1.1 Agent-Mux Dependency
11
+
12
+ #### What Krate Imports from @a5c-ai/agent-mux
13
+
14
+ Krate does not import agent-mux as an npm dependency. Instead, it communicates
15
+ with the agent-mux gateway over HTTP. The integration surface lives entirely
16
+ within `core/src/agent-mux-client.js`, which provides:
17
+
18
+ - **queryCapabilities(adapter)** -- GET `/api/v1/agents/{adapter}/capabilities`
19
+ - **launchSession({stack, contextBundle, permissionSnapshot, workspace})** -- POST `/api/v1/sessions`
20
+ - **getSessionStatus(sessionId)** -- GET `/api/v1/sessions/{sessionId}`
21
+ - **subscribeToEvents(runId, callback)** -- GET `/api/v1/runs/{runId}/events` (SSE)
22
+ - **reconcileTranscript(sessionId, events, options)** -- local data transformation
23
+
24
+ The boundary declaration (`AGENT_MUX_CLIENT_BOUNDARY`) explicitly states:
25
+ - Owns: gateway HTTP calls, SSE event streaming, transcript reconciliation
26
+ - Delegates to: resource-model (for creating AgentSessionTranscript resources)
27
+ - Must not own: secret values, permission review, resource persistence
28
+
29
+ #### How agent-mux-client.js Connects to the Gateway
30
+
31
+ Connection is HTTP-only, using the Node.js built-ins `node:http` / `node:https`,
32
+ with no external fetch or HTTP client dependencies. The internal `httpRequest()`
33
+ helper performs raw `transport.request()` calls with JSON serialization.
34
+
35
+ Connection parameters:
36
+ - `gateway` string -- full base URL (e.g. `http://agent-mux-gateway:8080`)
37
+ - `enabled` boolean -- client methods return `null` early when disabled
38
+ - Timeout: 30s default per request
39
+ - Protocol: auto-detected from URL scheme (http:// vs https://)
40
+
41
+ SSE streaming uses a persistent HTTP connection with:
42
+ - Reconnection via exponential backoff (1s, 2s, 4s... capped at 30s)
43
+ - Backoff reset on successful connection establishment
44
+ - Graceful abort via returned `{ abort }` handle
45
+ - Buffer-based SSE parsing (splits on `\n\n`, extracts `data:` lines)
46
+
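The reconnect schedule and the buffer-based SSE parsing described above can be sketched as follows. This is a minimal illustration, not the code in `agent-mux-client.js`: the names `backoffDelayMs` and `createSseParser` are hypothetical, and the sketch assumes every `data:` payload is JSON.

```javascript
// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, ... capped at 30s.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Returns a push(chunk) function that buffers raw text and emits complete events.
function createSseParser(onEvent) {
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    const frames = buffer.split('\n\n');
    buffer = frames.pop(); // the trailing piece may be an incomplete frame
    for (const frame of frames) {
      const data = frame
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        .map((line) => line.slice('data:'.length).trimStart())
        .join('\n');
      if (data) onEvent(JSON.parse(data)); // assumption: payloads are JSON
    }
  };
}
```

Keeping the unfinished tail in `buffer` is what lets a frame that arrives split across two network chunks be parsed once its terminating blank line shows up.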
47
+ #### What Works WITHOUT Agent-Mux
48
+
49
+ The following subsystems are fully operational without agent-mux:
50
+
51
+ 1. **Resource CRUD** -- All 76 resource kinds can be created, listed, updated, deleted via kubectl
52
+ 2. **Web Console** -- All 57 pages render; agent dispatch pages show "gateway unavailable" state
53
+ 3. **Auth & Sessions** -- OAuth login, cookie sessions, delegated identity all functional
54
+ 4. **Project Management** -- KrateProject, Issue, PullRequest lifecycle fully local
55
+ 5. **Memory System** -- AgentMemoryRepository, Source, Ontology, Query, Import all operate on CRDs
56
+ 6. **Workspace Provisioning** -- KrateWorkspace PVC specs, git ops, codespace generation
57
+ 7. **External Backends** -- Provider registration, webhook delivery, sync events
58
+ 8. **Audit Logging** -- Audit controller records all mutations regardless of agent-mux state
59
+ 9. **Policy Engine** -- Kyverno integration, PolicyProfile, PolicyBinding
60
+ 10. **Runner Pool Management** -- RunnerPool specs, scheduling policies
61
+ 11. **Notification System** -- Resource-change notifications via event bus
62
+ 12. **MCP Server** -- All 14 tools operational; `krate_dispatch_agent` returns error status
63
+
64
+ #### What REQUIRES Agent-Mux
65
+
66
+ The following operations fail gracefully (return null/error) without a running gateway:
67
+
68
+ 1. **Real Session Creation** -- `launchSession()` returns null; AgentSession resource created but status stays `Pending`
69
+ 2. **Event Streaming** -- `subscribeToEvents()` keeps retrying indefinitely, with the reconnect interval capped at 30s
70
+ 3. **Transcript Reconciliation** -- No events to reconcile; AgentSessionTranscript never moves to `Reconciled` phase
71
+ 4. **Adapter Capability Discovery** -- `queryCapabilities()` returns null; stack readiness condition degraded
72
+ 5. **Live Session Status** -- `getSessionStatus()` returns null; UI shows "status unavailable"
73
+ 6. **Cost Calculation** -- Token usage comes from SSE events; without them, cost fields remain zero
74
+
75
+ #### Gateway URL Configuration
76
+
77
+ The gateway URL flows through the system as follows:
78
+
79
+ ```
80
+ Helm values.yaml:
+   agents:
+     agentMux:
+       enabled: false    # Must be true for integration
+       gateway: ""       # Full URL to agent-mux gateway
+
+ --> Rendered into Deployment env:
+   KRATE_AGENT_MUX_ENABLED=true
+   KRATE_AGENT_MUX_GATEWAY=http://agent-mux-gateway:8080
+
+ --> Read by API controller initialization:
+   createAgentMuxClient({
+     gateway: process.env.KRATE_AGENT_MUX_GATEWAY || '',
+     enabled: process.env.KRATE_AGENT_MUX_ENABLED === 'true'
+   })
+
+ --> Persisted in AgentGatewayConfig CRD:
+   spec.gatewayUrl (for UI display and runtime reconfiguration)
98
+ ```
99
+
100
+ The web container does NOT communicate with agent-mux directly. All agent
101
+ operations route through the API container's `/api/agents/*` endpoints, which
102
+ delegate to the mux client instance.
103
+
104
+ #### Runtime Availability Check
105
+
106
+ ```javascript
107
+ client.isAvailable() // returns: enabled && !!gateway
108
+ ```
109
+
110
+ Every client method checks `isAvailable()` first and returns `null` immediately
111
+ when the gateway is not configured. This prevents connection errors from
112
+ propagating into the resource reconciliation loops.
113
+
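The guard pattern can be sketched as a small wrapper applied to every method. This is an illustrative sketch only: `createAgentMuxClientSketch` and the injected `request` transport are hypothetical names, used so the example runs without a network.

```javascript
// `request(method, url)` is an injected transport (hypothetical).
function createAgentMuxClientSketch({ gateway = '', enabled = false, request }) {
  const isAvailable = () => enabled && !!gateway;

  // Wrap a method so it short-circuits to null when the gateway is not configured.
  const guarded = (fn) => (...args) => (isAvailable() ? fn(...args) : null);

  return {
    isAvailable,
    queryCapabilities: guarded((adapter) =>
      request('GET', `${gateway}/api/v1/agents/${encodeURIComponent(adapter)}/capabilities`)),
    getSessionStatus: guarded((sessionId) =>
      request('GET', `${gateway}/api/v1/sessions/${encodeURIComponent(sessionId)}`)),
  };
}
```

Because the check happens before any transport call, a reconciliation loop that receives `null` can treat it as "no data yet" rather than having to catch connection errors.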
114
+ ---
115
+
116
+ ### 1.2 Transport-Mux Dependency
117
+
118
+ #### How Transport-Mux Handles Protocol Translation
119
+
120
+ Transport-mux is an external component (not bundled with krate) that provides
121
+ protocol translation between different agent communication channels:
122
+
123
+ - **stdio** -- stdin/stdout JSON-RPC for local CLI agents
124
+ - **http** -- REST/SSE for cloud-hosted agents
125
+ - **websocket** -- bidirectional streaming for persistent connections
126
+ - **unix** -- Unix domain socket for same-host agents
127
+
128
+ The transport-mux runtime sits between the agent-mux gateway and the actual
129
+ agent process, handling message framing, connection lifecycle, and protocol
130
+ negotiation.
131
+
132
+ #### What Krate's AgentTransportBinding Models
133
+
134
+ The `AgentTransportBinding` CRD captures connection configuration:
135
+
136
+ ```yaml
137
+ spec:
+   adapterRef: "claude-adapter-v1"
+   endpoint: "wss://agent.example.com/ws"
+   protocol: "websocket"        # One of: stdio, http, websocket, unix
+   reconnectPolicy:
+     maxRetries: 3
+     backoffMs: 1000
+     maxBackoffMs: 30000
+   auth:
+     type: "bearer"
+     secretRef: "agent-token-secret"
+   healthCheck:
+     endpoint: "/health"
+     intervalMs: 30000
151
+ ```
152
+
153
+ The controller (`agent-transport-binding-controller.js`) provides:
154
+ - **Validation** -- Ensures required fields present, protocol is one of 4 valid types
155
+ - **Connection Status Tracking** -- Reads `status.connectionStatus` (set externally)
156
+ - **Reconnect Policy Resolution** -- Merges spec overrides with defaults (3 retries, 1s-30s backoff)
157
+
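A rough sketch of the validation and reconnect-policy resolution the controller is described as performing. The function names are hypothetical, and the set of required fields (`adapterRef`, `endpoint`, `protocol`) is an assumption inferred from the example spec, not confirmed by the controller source.

```javascript
const SUPPORTED_PROTOCOLS = ['stdio', 'http', 'websocket', 'unix'];
const DEFAULT_RECONNECT_POLICY = { maxRetries: 3, backoffMs: 1000, maxBackoffMs: 30000 };

// Purely declarative check: inspects the spec shape, opens no connections.
function validateBindingSpec(spec = {}) {
  const errors = [];
  for (const field of ['adapterRef', 'endpoint', 'protocol']) { // assumed required
    if (!spec[field]) errors.push(`spec.${field} is required`);
  }
  if (spec.protocol && !SUPPORTED_PROTOCOLS.includes(spec.protocol)) {
    errors.push(`spec.protocol must be one of: ${SUPPORTED_PROTOCOLS.join(', ')}`);
  }
  return { valid: errors.length === 0, errors };
}

// Spec overrides win; anything omitted falls back to the defaults.
function resolveReconnectPolicy(spec = {}) {
  return { ...DEFAULT_RECONNECT_POLICY, ...(spec.reconnectPolicy || {}) };
}
```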
158
+ #### Gap: Krate Validates Transport but Never Activates the Runtime
159
+
160
+ The transport binding controller is purely declarative. It:
161
+ - Validates the spec shape
162
+ - Reports connection status from the resource's status field
163
+ - Returns supported protocols list
164
+
165
+ It does NOT:
166
+ - Open actual TCP/WebSocket connections
167
+ - Start stdio processes
168
+ - Perform health checks (despite modeling healthCheck in spec)
169
+ - Activate the transport-mux runtime component
170
+ - Signal transport-mux to register a new binding
171
+
172
+ The intent is that transport-mux watches AgentTransportBinding resources via
173
+ Kubernetes watch and self-reconciles. Krate's role is to persist the desired
174
+ state and present it in the UI.
175
+
176
+ ---
177
+
178
+ ### 1.3 Hooks-Mux Dependency
179
+
180
+ #### What Hooks-Mux Provides
181
+
182
+ The hooks-mux system (external to krate) provides lifecycle event dispatching
183
+ for agent runs:
184
+
185
+ - `RUN_CREATED` -- Agent dispatch initiated
186
+ - `STEP_STARTED` -- Agent begins a tool use or reasoning step
187
+ - `STEP_COMPLETED` -- Agent finishes a step
188
+ - `APPROVAL_REQUESTED` -- Agent needs human gate
189
+ - `APPROVAL_GRANTED` / `APPROVAL_DENIED` -- Gate resolved
190
+ - `RUN_COMPLETED` -- Agent run finished (success/failure/timeout)
191
+ - `RUN_CANCELLED` -- Agent run externally terminated
192
+
193
+ These events flow to registered webhook subscribers, trigger rules, and
194
+ the notification system.
195
+
196
+ #### What Krate's Event Bus Does Instead
197
+
198
+ Krate has its own in-process event bus (`core/src/event-bus.js`) that provides:
199
+
200
+ ```javascript
201
+ const bus = createEventBus();
202
+ bus.subscribe(fn);
203
+ bus.emit({ type: 'resource-change', kind, name, operation, timestamp });
204
+ bus.emitResourceChange('Repository', 'my-repo', 'apply');
205
+ ```
206
+
207
+ This bus handles:
208
+ - Real-time resource change notifications to SSE clients (web console live updates)
209
+ - Cache invalidation signals
210
+ - UI refresh triggers
211
+
212
+ The event bus is limited to **resource-change events only**. It does not model:
213
+ - Agent lifecycle events (run created/completed)
214
+ - Step-level granularity (tool use, reasoning)
215
+ - Cross-service event routing
216
+ - Durable event delivery with retries
217
+
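To make the scope of the in-process bus concrete, here is a minimal sketch that mirrors the `subscribe` / `emit` / `emitResourceChange` surface shown above. It is not the contents of `core/src/event-bus.js`; the returned unsubscribe handle and the ISO timestamp format are assumptions.

```javascript
function createEventBusSketch() {
  const listeners = new Set();
  const bus = {
    subscribe(fn) {
      listeners.add(fn);
      return () => listeners.delete(fn); // unsubscribe handle (assumption)
    },
    emit(event) {
      for (const fn of listeners) {
        try {
          fn(event);
        } catch {
          // a failing listener must not block delivery to the others
        }
      }
    },
    emitResourceChange(kind, name, operation) {
      bus.emit({
        type: 'resource-change',
        kind,
        name,
        operation,
        timestamp: new Date().toISOString(),
      });
    },
  };
  return bus;
}
```

Delivery here is synchronous and in-memory, which is why a bus of this shape cannot offer durable delivery, retries, or cross-service routing.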
218
+ #### Gap: No HookDispatcher Integration
219
+
220
+ Krate has a `WebhookBus` class (`core/src/hooks-events.js`) that handles
221
+ outbound webhook delivery for resource events, but it is NOT connected to
222
+ hooks-mux lifecycle events. Specifically:
223
+
224
+ - `WebhookBus.deliver()` creates `WebhookDelivery` resources for webhook subscribers
225
+ - It does NOT receive or forward agent-lifecycle events from hooks-mux
226
+ - `AgentTriggerRule` resources define event-to-stack routing but the trigger
227
+ evaluation is purely resource-driven (no real-time hook stream)
228
+ - `AgentTriggerExecution` records are created by the trigger controller but
229
+ the trigger is evaluated on resource watch events, not on hooks-mux push
230
+
231
+ The missing integration:
232
+ 1. No hooks-mux webhook receiver endpoint in krate's HTTP server
233
+ 2. No translation layer from hooks-mux event format to krate event bus
234
+ 3. No lifecycle event emission when AgentDispatchRun status changes
235
+ 4. No step-level event tracking (only run-level status)
236
+
237
+ ---
238
+
239
+ ### 1.4 Babysitter-SDK Dependency
240
+
241
+ #### The .a5c/processes Pattern
242
+
243
+ Babysitter-SDK uses a file-based process definition pattern:
244
+
245
+ ```
246
+ .a5c/
+   processes/
+     my-process.json    # Process definition
+   runs/
+     <runId>/
+       journal.json     # Event log
+       state.json       # Current state
+       effects/         # Effect outputs
254
+ ```
255
+
256
+ Krate processes (if any) would follow this same pattern. However, krate does
257
+ not currently define babysitter processes for its own operations. The
258
+ integration point is on the **import** side -- krate can ingest babysitter
259
+ run artifacts into its memory system.
260
+
261
+ #### How AgentRunMemoryImport Connects to Babysitter Journal Parsing
262
+
263
+ The `agent-memory-import.js` module provides:
264
+
265
+ ```javascript
266
+ parseJournalForImport(journal)
267
+ // Returns: { summary, keyEvents, effectSummary }
268
+ ```
269
+
270
+ This function:
271
+ 1. Accepts a babysitter `.a5c` journal array (raw event objects)
272
+ 2. Extracts structural metadata only (no raw task content, no effect payloads)
273
+ 3. Produces a summary with: runId, processId, eventCount, durationMs, runStatus
274
+ 4. Extracts key events: run_start, task_completed (with effect kind/result), breakpoint, run_end
275
+ 5. Computes effect summary: successCount, failureCount, effectKinds array
276
+
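The five steps above could look roughly like the sketch below. The journal event shape is not specified in this document, so every field name on the input events (`type`, `timestamp`, `runId`, `processId`, `status`, `effectKind`, `result`) is an assumption, as is the function name `parseJournalSketch`.

```javascript
const KEY_EVENT_TYPES = new Set(['run_start', 'task_completed', 'breakpoint', 'run_end']);

function parseJournalSketch(journal = []) {
  const first = journal[0] || {};
  const last = journal[journal.length - 1] || {};
  const start = journal.find((e) => e.type === 'run_start') || first;
  const end = journal.find((e) => e.type === 'run_end');
  const completed = journal.filter((e) => e.type === 'task_completed');

  const summary = {
    runId: start.runId ?? null,
    processId: start.processId ?? null,
    eventCount: journal.length,
    durationMs: journal.length
      ? Date.parse(last.timestamp) - Date.parse(first.timestamp)
      : 0,
    runStatus: end ? (end.status ?? 'completed') : 'incomplete',
  };

  // Structural metadata only: raw task content and effect payloads are dropped.
  const keyEvents = journal
    .filter((e) => KEY_EVENT_TYPES.has(e.type))
    .map((e) => ({
      type: e.type,
      timestamp: e.timestamp,
      ...(e.type === 'task_completed' ? { effectKind: e.effectKind, result: e.result } : {}),
    }));

  const effectSummary = {
    successCount: completed.filter((e) => e.result === 'success').length,
    failureCount: completed.filter((e) => e.result === 'failure').length,
    effectKinds: [...new Set(completed.map((e) => e.effectKind))],
  };

  return { summary, keyEvents, effectSummary };
}
```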
277
+ The `AgentRunMemoryImport` CRD then stores this parsed summary alongside:
278
+ - `memoryRepository` reference (where to store)
279
+ - `source` information (which babysitter run)
280
+ - `include` filters (what to import)
281
+ - Review and redaction status
282
+
283
+ #### The Orchestration Boundary
284
+
285
+ ```
286
+ Babysitter-SDK Krate
287
+ -------------------------------------------
288
+ Process definition <--> (not used)
289
+ Run lifecycle <--> AgentDispatchRun (mirrors status)
290
+ Journal events ---> AgentRunMemoryImport (parsed summary)
291
+ Session binding <--> AgentSession (Krate projection)
292
+ Hook-driven continuation <--> AgentApproval (approval gates)
293
+ ```
294
+
295
+ - **Babysitter owns**: run lifecycle (create, iterate, complete), journal/state, session binding
296
+ - **Krate owns**: resource state, memory ingestion, audit trail, approval workflows
297
+ - The boundary is clean: babysitter does execution, krate does desired-state persistence
298
+
299
+ ---
300
+
301
+ ### 1.5 Atlas Dependency
302
+
303
+ #### How the Stack Builder Queries Atlas Graph
304
+
305
+ The web console's stack builder uses Atlas as a knowledge source for populating
306
+ layer options. The query path is:
307
+
308
+ ```
309
+ Browser (stack-builder-graph.jsx)
310
+ --> /api/atlas/search (Next.js route handler)
311
+ --> fetch(ATLAS_BASE_URL + /api/v1/search?q=...&kind=...&limit=...)
312
+ --> fetch(ATLAS_BASE_URL + /api/v1/kinds/{kind}?limit=...)
313
+ ```
314
+
315
+ The proxy route (`web/app/api/atlas/search/route.js`) handles:
316
+ - **Browse mode**: fetches instances by kind (no search query needed)
317
+ - **Search mode**: full-text search via Atlas's Fuse.js-based endpoint
318
+ - **Multi-kind search**: parallel queries per kind, merged and deduplicated by id
319
+
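As an illustration of the search and multi-kind modes, the sketch below builds an upstream search URL and merges per-kind results, keeping the first hit seen for each `id`. The helper names and the default `limit` of 20 are assumptions, not taken from `route.js`.

```javascript
// Builds ATLAS_BASE_URL + /api/v1/search?q=...&kind=...&limit=...
// Note: a path prefix on atlasBaseUrl would be dropped by this form of `new URL`.
function buildAtlasSearchUrl(atlasBaseUrl, { query, kind, limit = 20 }) {
  const url = new URL('/api/v1/search', atlasBaseUrl);
  url.searchParams.set('q', query);
  if (kind) url.searchParams.set('kind', kind);
  url.searchParams.set('limit', String(limit));
  return url.toString();
}

// resultsPerKind: one array of hits per kind, e.g. gathered with Promise.all.
function mergeHitsById(resultsPerKind) {
  const byId = new Map();
  for (const hits of resultsPerKind) {
    for (const hit of hits) {
      if (!byId.has(hit.id)) byId.set(hit.id, hit);
    }
  }
  const hits = [...byId.values()];
  return { total: hits.length, hits };
}
```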
320
+ The SDK also provides a direct client (`sdk/src/atlas-graph-client.js`):
321
+ - `fetchAtlasRecordsByKinds(atlasBaseUrl, kinds, options)` -- browse by NodeKind
322
+ - `searchAtlasGraph(atlasBaseUrl, query, options)` -- full-text with optional kind filter
323
+
324
+ #### The 11 Stack Layers + 4 Composition Facets
325
+
326
+ **STACK_LAYERS** (11 layers, each with associated Atlas NodeKinds):
327
+
328
+ | # | Layer | Atlas NodeKinds |
329
+ |---|-------|-----------------|
330
+ | 1 | Model | ModelFamily, ModelVersion, SessionModel |
331
+ | 2 | Provider | Provider, ModelProviderProduct, ModelProviderVersion |
332
+ | 3 | Transport | TransportProtocol, ModelTransportProtocol |
333
+ | 4 | Agent Core | AgentCoreImpl, Capability, CapabilitySupport |
334
+ | 5 | Agent Runtime | AgentProduct, AgentRuntimeImpl, AgentVersion, Subagent |
335
+ | 6 | Agent Platform | AgentPlatformImpl, Platform, PlatformService |
336
+ | 7 | Workspace | Workspace, Project, SharedContextSpec |
337
+ | 8 | Execution | Workflow, LibraryProcess, Phase, HookSurface |
338
+ | 9 | Sandbox | PermissionMode, DeploymentTarget |
339
+ | 10 | Interaction | Tool, ToolDescriptor, ToolServer, PluginArtifact, MCPPrompt |
340
+ | 11 | Presentation | AgentUIImpl, Page, APIEndpoint, Presentation |
341
+
342
+ **COMPOSITION_FACETS** (4 cross-cutting concerns):
343
+
344
+ | Facet | Atlas NodeKinds |
345
+ |-------|-----------------|
346
+ | Roles and Teams | Role, Responsibility, OrgUnit, AgentTeam |
347
+ | Skills and Capabilities | Skill, LibrarySkill, SkillArea, Capability |
348
+ | Evaluation and Governance | Benchmark, TestSet, EvalRun |
349
+ | Environment and Data | StackPart, VectorStore, MemoryStore |
350
+
351
+ #### Atlas Node Kinds Mapped to Stack Builder Layers
352
+
353
+ Each stack builder layer queries Atlas for specific NodeKinds. The mapping
354
+ (`atlasKinds` array per layer) drives which records appear as selectable
355
+ options in the UI. Users compose an AgentStack by picking one or more
356
+ records from each layer.
357
+
358
+ The resolution path:
359
+ 1. User opens stack builder page
360
+ 2. UI requests `/api/atlas/search?mode=browse&kinds=ModelFamily,ModelVersion`
361
+ 3. Route handler fetches from Atlas API
362
+ 4. Results rendered as selectable cards/chips in the layer panel
363
+ 5. User selections flow into AgentStack spec fields
364
+
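Step 2 of the resolution path can be sketched as a lookup from a layer's `atlasKinds` array to a browse URL. The two sample layers copy their kinds from the table above; the `id` field and the helper name `browseUrlForLayer` are hypothetical.

```javascript
// Two layers copied from the STACK_LAYERS table, for illustration only.
const STACK_LAYERS_SAMPLE = [
  { id: 'model', label: 'Model', atlasKinds: ['ModelFamily', 'ModelVersion', 'SessionModel'] },
  { id: 'transport', label: 'Transport', atlasKinds: ['TransportProtocol', 'ModelTransportProtocol'] },
];

// Encode each kind separately, then join with literal commas as in the example URL.
function browseUrlForLayer(layer) {
  const kinds = layer.atlasKinds.map((kind) => encodeURIComponent(kind)).join(',');
  return `/api/atlas/search?mode=browse&kinds=${kinds}`;
}
```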
365
+ #### What Happens When Atlas Is Unavailable
366
+
367
+ When Atlas is unreachable:
368
+ - Browse queries return empty arrays `[]` (no crash)
369
+ - Search queries return `{ total: 0, hits: [] }`
370
+ - The proxy route returns `{ total: 0, hits: [], error: <message> }` with status 502
371
+ - Stack builder layers show "No options available" state
372
+ - Users can still manually type stack configuration without Atlas suggestions
373
+ - The `ATLAS_BASE_URL` env var defaults to `https://atlas-staging.a5c.ai`
374
+
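The degraded response shapes listed above can be expressed as two small helpers. This is a sketch under the assumption that the route catches the upstream error and maps it to a body; the helper names are hypothetical.

```javascript
// Shape the proxy route is described as returning when Atlas cannot be reached.
function atlasUnavailableResponse(error) {
  const message = error instanceof Error ? error.message : String(error);
  return { status: 502, body: { total: 0, hits: [], error: message } };
}

// Browse mode degrades to an empty array instead of throwing.
function safeBrowseResult(records) {
  return Array.isArray(records) ? records : [];
}
```

Returning a well-formed empty result alongside the 502 lets the stack builder render its "No options available" state without special-casing the error path.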
375
+ ---
376
+
377
+ ## 2. Scope Boundaries
378
+
379
+ ### 2.1 What Krate Owns
380
+
381
+ #### Kubernetes CRD Resource Model (76 Kinds)
382
+
383
+ Krate defines and manages 76 resource kinds across two storage tiers:
384
+
385
+ **Config kinds (44, etcd-stored):**
386
+ Organization, OrgNamespaceBinding, User, Team, Invite, IdentityMapping,
387
+ AuthProvider, Repository, SSHKey, RepositoryPermission, WebhookSubscription,
388
+ RefPolicy, BranchProtection, PolicyProfile, PolicyTemplate, PolicyBinding,
389
+ PolicyExceptionRequest, RunnerPool, View, Selector, AgentStack, AgentSubagent,
390
+ AgentToolProfile, AgentMcpServer, AgentSkill, AgentTriggerRule, AgentContextLabel,
391
+ KrateWorkspacePolicy, AgentServiceAccount, AgentRoleBinding, AgentSecretGrant,
392
+ AgentConfigGrant, AgentAdapter, AgentTransportBinding, AgentProviderConfig,
393
+ KrateProject, AgentGatewayConfig, AgentMemoryRepository, AgentMemorySource,
394
+ AgentMemoryOntology, AgentMemoryAssociation, KrateWorkspace,
395
+ ExternalBackendProvider, ExternalBackendBinding, ExternalBackendSyncPolicy,
396
+ ExternalProviderCapabilityManifest
397
+
398
+ **Aggregated kinds (32, postgres-stored):**
399
+ PullRequest, Issue, Review, Pipeline, Job, WebhookDelivery, AgentDispatchRun,
400
+ AgentDispatchAttempt, AgentSession, AgentContextBundle, KrateArtifact,
401
+ AgentApproval, AgentTriggerExecution, AgentCapabilityRequirement,
402
+ WorkItemSessionLink, WorkItemWorkspaceLink, AgentSessionTranscript,
403
+ AgentSessionAttachment, KrateWorkspaceRuntime, AgentMemorySnapshot,
404
+ AgentMemoryQuery, AgentMemoryUpdate, AgentRunMemoryImport,
405
+ ExternalWebhookDelivery, ExternalSyncEvent, ExternalSyncState,
406
+ ExternalWriteIntent, ExternalSyncConflict, ExternalObjectLink
407
+
408
+ #### Resource CRUD via kubectl
409
+
410
+ All resource operations use `spawnSync('kubectl', ...)` or async `spawn('kubectl', ...)`:
411
+ - `kubectl get <resource> -n <namespace> -o json`
412
+ - `kubectl apply -f - -o json` (with JSON manifest piped to stdin)
413
+ - `kubectl delete <resource> <name> -n <namespace>`
414
+ - `kubectl get <resource> --watch -o json` (for live updates)
415
+
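The four command shapes above map naturally onto argument arrays passed to `spawnSync`. The sketch below is an illustration, not Krate's controller code: `kubectlArgs` and `runKubectlSync` are hypothetical names, and the 3s default simply echoes the `KRATE_KUBECTL_TIMEOUT_MS` default mentioned in this document.

```javascript
const { spawnSync } = require('node:child_process');

// Argument builders for the four operations listed above.
const kubectlArgs = {
  get: (resource, namespace) => ['get', resource, '-n', namespace, '-o', 'json'],
  apply: () => ['apply', '-f', '-', '-o', 'json'],
  delete: (resource, name, namespace) => ['delete', resource, name, '-n', namespace],
  watch: (resource) => ['get', resource, '--watch', '-o', 'json'],
};

// Runs kubectl synchronously and returns raw stdout; the caller parses JSON.
function runKubectlSync(args, { input, timeoutMs = 3000 } = {}) {
  const result = spawnSync('kubectl', args, { input, encoding: 'utf8', timeout: timeoutMs });
  if (result.error) throw result.error; // e.g. kubectl missing, or timeout hit
  if (result.status !== 0) {
    throw new Error(result.stderr || `kubectl exited with status ${result.status}`);
  }
  return result.stdout;
}

// Usage (requires a reachable cluster, so it is not executed here):
// runKubectlSync(kubectlArgs.apply(), { input: JSON.stringify(manifest) });
```

Passing arguments as an array rather than a shell string means resource names and namespaces are never interpreted by a shell.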
416
+ #### Web Console (57+ Pages)
417
+
418
+ Seven page modules:
419
+ 1. **agent** -- Stacks, dispatch, sessions, triggers, memory, projects, adapters, workspaces
420
+ 2. **repo** -- Repositories, pull requests, issues, reviews, pipelines
421
+ 3. **manage** -- Organizations, users, teams, invites, identity mappings
422
+ 4. **settings** -- Auth providers, webhooks, runners, policies, views
423
+ 5. **external** -- Backend providers, bindings, sync policies, conflicts
424
+ 6. **lib/krate-ui** -- Shared UI components (tables, forms, modals, badges)
425
+ 7. **lib/page-frame** -- Layout shell, navigation, breadcrumbs
426
+
427
+ #### Auth (OAuth, Session Cookies, Middleware)
428
+
429
+ - GitHub OAuth (authorization code flow)
430
+ - Workspace SSO (OIDC authorization code flow)
431
+ - Delegated identity (proxy headers: x-forwarded-user/groups/email)
432
+ - Local development bypass (auto-login for localhost)
433
+ - HMAC-SHA256 signed session cookies
434
+ - `parseSessionCookie` / `createSessionCookie` with timing-safe verification
435
+
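A minimal sketch of HMAC-SHA256 cookie signing with timing-safe verification, using only `node:crypto`. The `payload.signature` cookie layout and the `...Sketch` function names are assumptions; the real `createSessionCookie` / `parseSessionCookie` may encode sessions differently.

```javascript
const crypto = require('node:crypto');

function signPayload(payload, secret) {
  return crypto.createHmac('sha256', secret).update(payload).digest('base64url');
}

function createSessionCookieSketch(session, secret) {
  const payload = Buffer.from(JSON.stringify(session)).toString('base64url');
  return `${payload}.${signPayload(payload, secret)}`;
}

function parseSessionCookieSketch(cookie, secret) {
  const [payload, signature] = String(cookie).split('.');
  if (!payload || !signature) return null;

  const expected = Buffer.from(signPayload(payload, secret));
  const actual = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== actual.length || !crypto.timingSafeEqual(expected, actual)) {
    return null;
  }
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
}
```

The signature is verified before the payload is decoded, so tampered cookies are rejected without ever being parsed.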
436
+ #### Workspace Provisioning
437
+
438
+ - `KrateWorkspace` CRD: PVC specs, volume lifecycle, repository binding
439
+ - `KrateWorkspacePolicy` CRD: trust tiers, cleanup retention, provisioning mode
440
+ - `KrateWorkspaceRuntime` CRD: process status, environment, preview URLs
441
+ - Git worktree integration specs (branch/commit binding)
442
+ - Runner mount specifications
443
+
444
+ #### Memory System
445
+
446
+ - `AgentMemoryRepository`: org-level git pointer with layout profile and index policy
447
+ - `AgentMemorySource`: read policy for paths/kinds per repository, team, stack, or trigger
448
+ - `AgentMemoryOntology`: ontology with nodeKinds, edgeKinds, controlled vocabulary
449
+ - `AgentMemoryQuery`: graph/grep retrieval with ranking metadata
450
+ - `AgentRunMemoryImport`: babysitter run ingestion with redaction and review
451
+ - `AgentMemorySnapshot`: dispatch-time pin with resolved commit and digests
452
+ - `AgentMemoryUpdate`: reviewable proposed mutations with branch and validation
453
+ - `AgentMemoryAssociation`: bridge records linking memory to Krate resources
454
+
455
+ #### External Backend Abstraction
456
+
457
+ - `ExternalBackendProvider`: registration (type, endpoint, auth, capability discovery)
458
+ - `ExternalBackendBinding`: binding to org with credential reference and sync scope
459
+ - `ExternalBackendSyncPolicy`: interval, conflict resolution, field mapping, retry policy
460
+ - `ExternalProviderCapabilityManifest`: discovered API capabilities
461
+ - `ExternalWebhookDelivery`: inbound webhook processing
462
+ - `ExternalSyncEvent`: discrete sync event with dedupe/ordering
463
+ - `ExternalSyncState`: current sync phase per resource
464
+ - `ExternalWriteIntent`: queued write-back with approval state
465
+ - `ExternalSyncConflict`: detected conflicts with resolution outcomes
466
+ - `ExternalObjectLink`: stable local-to-external ID mapping
467
+
468
+ #### Notification System
469
+
470
+ - `globalEventBus` singleton for in-process pub/sub
471
+ - SSE endpoint for real-time web console updates
472
+ - `emitResourceChange(kind, name, operation)` on every mutation
473
+ - Listener registration/deregistration for per-connection subscriptions
474
+
475
+ #### Runner Pool Management
476
+
477
+ - `RunnerPool` CRD: capacity (warmReplicas, maxReplicas), cache policy, trust boundary
478
+ - Scheduling policy specs (not execution)
479
+ - Runner identity binding via AgentServiceAccount
480
+
481
+ #### Audit Logging
482
+
483
+ - Audit controller records all resource mutations
484
+ - Queryable via MCP tool (`krate_audit_query`)
485
+ - Correlation IDs on all snapshot fetches
486
+
487
+ #### MCP Server (14 Tools)
488
+
489
+ Exposed via stdio (`krate mcp`):
490
+ - `krate_snapshot`, `krate_list_resources`, `krate_get_resource`
491
+ - `krate_apply_resource`, `krate_delete_resource`, `krate_search`
492
+ - `krate_list_stacks`, `krate_create_stack`, `krate_dispatch_agent`
493
+ - `krate_list_secrets`, `krate_create_secret`
494
+ - `krate_sync_external`, `krate_resolve_conflict`, `krate_audit_query`
495
+
496
+ ---
497
+
498
+ ### 2.2 What Agent-Mux Owns
499
+
500
+ #### Actual Agent Spawning and Management
501
+
502
+ Agent-mux is responsible for:
503
+ - Instantiating agent processes (Claude, Codex, Gemini, etc.)
504
+ - Managing process lifecycle (start, monitor, terminate)
505
+ - Resource isolation between concurrent agent sessions
506
+ - Process cleanup on timeout or cancellation
507
+
508
+ #### Session Lifecycle
509
+
510
+ - Create session from stack parameters (model, prompt, tools, workspace)
511
+ - Stream events from running session to subscribers
512
+ - Terminate sessions (graceful and forced)
513
+ - Track session state transitions (Pending, Running, Completed, Failed, Cancelled)
514
+
515
+ #### Adapter Registry
516
+
517
+ Agent-mux maintains the runtime adapter registry:
518
+ - Claude adapter (Anthropic API)
519
+ - Codex adapter (OpenAI API)
520
+ - Gemini adapter (Google API)
521
+ - Pi adapter (Inflection API)
522
+ - Custom adapters via plugin system
523
+
524
+ Each adapter implements: capabilities query, session creation, event streaming.
525
+
526
+ #### Transport Codec
527
+
528
+ - Message format translation between krate's JSON and adapter-native formats
529
+ - Streaming frame encoding (SSE, WebSocket, stdio JSON-RPC)
530
+ - Binary attachment handling
531
+ - Compression negotiation
532
+
533
+ #### Provider Client Instantiation
534
+
535
+ - API key retrieval and injection
536
+ - Base URL resolution per provider
537
+ - Rate limiting enforcement per provider/model combination
538
+ - Retry logic for transient provider failures
539
+
540
+ #### Real-Time Event Streaming
541
+
542
+ - SSE server for `/api/v1/runs/{runId}/events`
543
+ - Event buffering and replay for reconnecting clients
544
+ - Multi-subscriber fanout (multiple UI tabs)
545
+ - Connection keep-alive and heartbeat
546
+
547
+ #### Token Counting and Cost Calculation
548
+
549
+ - Per-message input/output token counting
550
+ - Model-specific pricing application
551
+ - Cumulative cost tracking per session/run
552
+ - Cost breakdown in event payloads (`event.usage.inputTokens`, `event.usage.outputTokens`)
553
+
554
+ ---
555
+
556
+ ### 2.3 What Babysitter-SDK Owns
557
+
558
+ #### Process Definition and Orchestration
559
+
560
+ - `.a5c/processes/*.json` file format and schema
561
+ - Task graph resolution (dependencies, parallelism)
562
+ - Effect system (what a task produces)
563
+ - Breakpoint system (human gates in execution flow)
564
+
565
+ #### Run Lifecycle
566
+
567
+ - Run creation with process binding
568
+ - Task iteration (next task selection, execution, completion)
569
+ - Run completion detection (all tasks done, failure threshold)
570
+ - Run cancellation and cleanup
571
+
572
+ #### Journal and State Management
573
+
574
+ - Append-only journal (`journal.json`) with typed events
575
+ - Mutable state snapshot (`state.json`) for resumption
576
+ - Effect output persistence (`effects/`)
577
+ - Run metadata (timing, status, error details)
578
+
579
+ #### Session Binding
580
+
581
+ - Binding an agent session to a babysitter run
582
+ - Session-to-task mapping (which session handles which task)
583
+ - Multi-session orchestration (parallel tasks)
584
+
585
+ #### Hook-Driven Continuation
586
+
587
+ - Pre/post task hooks
588
+ - Breakpoint evaluation and resolution
589
+ - External trigger integration (webhook → resume)
590
+ - Timeout-based auto-continuation
591
+
592
+ ---
593
+
594
+ ### 2.4 The Gap Zone
595
+
596
+ The gap zone defines areas where krate manages the **resource declaration** but
597
+ does not perform the **runtime execution**. This is by design -- krate is the
598
+ "desired state" layer; execution is delegated to specialized runtimes.
599
+
600
+ #### AgentStack Exists but Isn't Resolved into a Running Adapter
601
+
602
+ - `AgentStack` CRD captures: baseAgent, adapter, model, prompt, tools, MCP servers, skills
603
+ - The stack controller reconciles readiness conditions (capability resolution, MCP health)
604
+ - But it never actually calls agent-mux to instantiate the adapter
605
+ - The gap: creating an AgentStack does not start anything
606
+
607
+ #### AgentDispatchRun Created but Agent Not Spawned by Krate
608
+
609
+ - `AgentDispatchRun` CRD captures: repository, sourceRefs, agentStack, taskKind
610
+ - Status tracks: Queued, Running, Completed, Failed
611
+ - The dispatch controller creates the resource and validates the spec
612
+ - But the actual agent spawn happens in agent-mux (triggered by an external controller)
613
+ - The gap: dispatch resource exists immediately, execution happens later
614
+
615
+ #### KrateWorkspace Generates Pod Specs but Doesn't Execute Them
616
+
617
+ - `KrateWorkspace` CRD captures: repository, volumeSpec, mount paths
618
+ - `KrateWorkspacePolicy` defines provisioning rules and trust tiers
619
+ - The workspace controller generates PVC manifests and mount specifications
620
+ - But it does not create the actual PVC or pod (that's the workspace-provisioner)
621
+ - The gap: workspace spec is declarative; provisioning is separate
622
+
623
+ #### RunnerPool Generates Schedules but Doesn't Provision Runners
624
+
625
+ - `RunnerPool` CRD captures: warmReplicas, maxReplicas, cache policy
626
+ - The runner controller validates pool specs and computes scheduling hints
627
+ - But it does not create actual runner pods or scale deployments
628
+ - The gap: pool definition is intent; ARC (Actions Runner Controller) or
629
+ similar actually provisions the runners
630
+
631
+ #### The Intent
632
+
633
+ This pattern is intentional Kubernetes-native architecture:
634
+ 1. Krate manages CRD resources as desired state
635
+ 2. Specialized operators/controllers watch these resources
636
+ 3. Operators reconcile desired state into actual state
637
+ 4. Status fields reflect observed state back into CRDs
638
+ 5. Krate's UI reads status to show current state
639
+
640
+ This separation enables:
641
+ - Independent scaling of control plane vs execution plane
642
+ - Pluggable execution backends (swap agent-mux implementation)
643
+ - GitOps-compatible declarative management
644
+ - Clear audit trail (every intent is a persisted resource)
645
+
646
+ ---
647
+
648
+ ## 3. Architectural Choices & Trade-offs
649
+
650
+ ### 3.1 Why CRD-First (vs Database-First)
651
+
652
+ **Decision:** All state stored as Kubernetes custom resources (CRDs for config
653
+ kinds, aggregated API server for data-plane kinds).
654
+
655
+ **Rationale:**
656
+ - GitOps-compatible: resources can be managed via Argo CD, Flux, or plain kubectl
657
+ - kubectl-native: operators and admins use familiar tooling
658
+ - Declarative: desired state vs imperative mutations
659
+ - No external DB dependency for control plane (etcd comes with K8s)
660
+ - Built-in RBAC: Kubernetes RBAC applies to all resource operations
661
+ - Watch support: built-in change notification for controllers
662
+ - Namespace isolation: multi-tenancy via namespace-per-org
663
+
664
+ **Trade-off:**
665
+ - Slower than direct database queries (kubectl spawnSync overhead)
666
+ - etcd size limits (~1.5MB per resource, cluster-wide storage cap)
667
+ - No complex queries (no JOINs; server-side filtering is limited to label and field selectors)
668
+ - No full-text search (client-side filtering only)
669
+ - Pagination via continue tokens (not offset-based)
670
+
671
+ **Mitigation:**
672
+ - Snapshot cache (30s TTL) for dashboard queries
673
+ - Per-org caching reduces repeated cross-namespace queries
674
+ - Background async snapshot refresh (stale-while-revalidate)
675
+ - Aggregated API server pattern for high-volume data (PullRequest, Pipeline, etc.)
676
+ - `getPartialSnapshot()` for pages that only need a subset of kinds
677
+
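The snapshot cache with stale-while-revalidate behaviour might be sketched as below. This is an assumption-laden illustration rather than Krate's cache: `createSnapshotCache`, the injected `refresh` loader, and the injectable `now` clock are all hypothetical.

```javascript
function createSnapshotCache({ ttlMs = 30000, refresh, now = Date.now }) {
  let entry = null;    // { value, fetchedAt }
  let inFlight = null; // de-duplicates concurrent refreshes

  function refreshInBackground() {
    if (inFlight) return;
    // The executor runs synchronously; a throw from refresh() becomes a rejection.
    inFlight = new Promise((resolve) => resolve(refresh()))
      .then((value) => { entry = { value, fetchedAt: now() }; })
      .catch(() => { /* keep serving the stale entry on failure */ })
      .finally(() => { inFlight = null; });
  }

  return {
    set(value) {
      entry = { value, fetchedAt: now() };
    },
    // Never blocks: returns whatever is cached (or null) and refreshes if needed.
    get() {
      if (!entry || now() - entry.fetchedAt > ttlMs) refreshInBackground();
      return entry ? entry.value : null;
    },
  };
}
```

A page load after the TTL expires still gets the stale snapshot immediately; only the following request sees the refreshed data.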
678
+ **When It Breaks:**
679
+ - Large clusters with 1000+ resources per kind (listing via `kubectl get` becomes slow)
680
+ - Complex joins needed (e.g. "all runs for repositories owned by team X")
681
+ - Full-text search across resource content (need external search index)
682
+ - High-frequency writes (etcd write throughput is ~10K/s cluster-wide)
683
+ - Time-series data (audit logs, metrics -- needs separate store)
684
+
685
+ ---
686
+
687
+ ### 3.2 Why kubectl spawnSync (vs K8s client-go or @kubernetes/client-node)
688
+
689
+ **Decision:** Shell out to kubectl binary for all Kubernetes operations using
690
+ Node.js `child_process.spawnSync()` and `child_process.spawn()`.
691
+
692
+ **Rationale:**
693
+ - Zero npm dependencies for K8s operations (no `@kubernetes/client-node`)
694
+ - Works with any kubeconfig (user's local config, service account, EKS/GKE auth plugins)
695
+ - kubectl handles all auth complexity (OIDC, exec-based plugins, certificates)
696
+ - No Node.js K8s client compatibility bugs
697
+ - kubectl is battle-tested and tracks K8s API changes (within its supported version skew)
698
+ - Debugging: can reproduce any operation by copying the kubectl command
699
+
700
+ **Trade-off:**
701
+ - `spawnSync` blocks the event loop (the process handles one kubectl call at a time)
702
+ - Cold starts are slow (kubectl binary startup + API server TLS handshake)
703
+ - No watch support in sync mode (separate spawn needed)
704
+ - Process spawn overhead (~20-50ms per call)
705
+ - Max buffer limits on large responses (configurable via `KRATE_KUBECTL_MAX_BUFFER_BYTES`)
706
+
707
+ **Mitigation:**
708
+ - `kubernetes-controller-async.js` uses `spawn()` + `Promise.all()` for parallel queries
709
+ - Snapshot cache means most page loads skip kubectl entirely
710
+ - `getPartialSnapshot()` queries only needed kinds (not all 76)
711
+ - `KRATE_KUBECTL_TIMEOUT_MS` (default 3s) prevents hung processes
712
+ - `runKubectlAsync()` for non-blocking operations in the async controller
713
+ - In-cluster detection auto-adds `--server`, `--token`, `--certificate-authority`
714
+ flags (no kubeconfig file needed in-cluster)
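
The in-cluster detection in the last bullet can be pictured as a small pure function. This is a sketch under assumptions: the function names are hypothetical, and only the flag names, the `KUBERNETES_SERVICE_HOST` / `KUBERNETES_SERVICE_PORT` variables, and the service account directory are standard Kubernetes conventions.

```javascript
// Sketch only: function and parameter names are hypothetical.
// The path and env vars are the standard ones Kubernetes provides to pods.
const SA_DIR = '/var/run/secrets/kubernetes.io/serviceaccount';

function inClusterKubectlArgs(env, token) {
  const host = env.KUBERNETES_SERVICE_HOST;
  const port = env.KUBERNETES_SERVICE_PORT;
  // Outside a cluster these are unset: add nothing and let kubectl
  // fall back to the user's kubeconfig.
  if (!host || !port || !token) return [];
  // IPv6 hosts need brackets inside a URL.
  const hostPart = host.includes(':') ? `[${host}]` : host;
  return [
    `--server=https://${hostPart}:${port}`,
    `--token=${token}`,
    `--certificate-authority=${SA_DIR}/ca.crt`,
  ];
}

// Builds the argument vector for a JSON list of one kind in one namespace.
function buildGetArgs(kind, namespace, env, token) {
  return [...inClusterKubectlArgs(env, token), 'get', kind, '-n', namespace, '-o', 'json'];
}
```

Keeping the flag construction separate from the spawn call makes it testable without a cluster, and the resulting array can be logged to reproduce an operation by hand.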
715
+
716
+ **Future Options:**
717
+ - Could add in-cluster HTTP client using service account token at
718
+ `/var/run/secrets/kubernetes.io/serviceaccount/token`
719
+ - Could use `@kubernetes/client-node` for watch-only operations
720
+ - Could implement a sidecar proxy that exposes a local HTTP API
721
+
722
+ ---
723
+
724
+ ### 3.3 Why Next.js App Router (vs Pages Router or Remix)
725
+
726
+ **Decision:** Next.js 16 with App Router, React 19 server components.
727
+
728
+ **Rationale:**
729
+ - Server-side rendering for dashboard pages (no client-side waterfall)
730
+ - Streaming responses for progressive rendering
731
+ - Parallel data loading (multiple server components fetch independently)
732
+ - File-based routing matches the 7-module page structure
733
+ - React Server Components reduce client bundle size
734
+ - Built-in API routes for proxy endpoints (Atlas, auth callbacks)
735
+
736
+ **Trade-off:**
737
+ - Complex server/client boundary (`'use client'` directive management)
738
+ - Cannot pass functions or non-serializable props from server to client components
739
+ - Larger initial bundle than SPA alternatives
740
+ - Build time increases with page count
741
+ - Turbopack compatibility issues in Docker builds (fallback to webpack)
742
+
743
+ **Mitigation:**
744
+ - Clear module split: `lib/pages/` (server), `components/` (client where needed)
745
+ - `'use client'` only on interactive components (forms, drag-drop, modals)
746
+ - SDK resolveAlias for monorepo imports (relative path workaround)
747
+ - Standalone output mode reduces Docker image size
748
+ - `export const metadata` pattern for static head content
749
+
750
+ **Gotcha:**
751
+ - `export const metadata` cannot coexist with barrel re-exports
752
+ (must be in the page file itself, not re-exported from index)
753
+ - `export const dynamic = 'force-dynamic'` required on proxy routes (Atlas, auth)
754
+ - `process.env` access works differently in server components vs route handlers
755
+
756
+ ---
757
+
758
+ ### 3.4 Why Pure ESM JavaScript (vs TypeScript)
759
+
760
+ **Decision:** Zero TypeScript across all krate packages. Pure `.js`/`.jsx` with
761
+ JSDoc annotations for type information.
762
+
763
+ **Rationale:**
764
+ - No build step for core package (run directly with `node`)
765
+ - Instant startup (no compilation delay in development)
766
+ - Simpler debugging (source maps not needed, line numbers match)
767
+ - No `tsconfig.json` complexity (path aliases, module resolution, strict modes)
768
+ - JSDoc types are optional and incremental (add where valuable)
769
+ - Core package uses only Node.js built-in modules (no bundler needed)
770
+
771
+ **Trade-off:**
772
+ - No compile-time type safety
773
+ - JSDoc is more verbose than TypeScript type annotations
774
+ - IDE IntelliSense weaker for complex types (generics, discriminated unions)
775
+ - Refactoring tools less reliable without type information
776
+ - No `enum` or `interface` -- must use `@typedef` or constants
777
+
778
+ **Mitigation:**
779
+ - Comprehensive test suite (1440+ tests across core, SDK, CLI, web)
780
+ - Controller boundary declarations (BOUNDARY objects) enforce contracts at runtime
781
+ - `validateResource()` checks required fields at runtime
782
+ - Every controller factory function documents its API via JSDoc
783
+ - Resource model has `requiredSpec` arrays that enforce schema at create/apply time
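
To illustrate how JSDoc annotations and `requiredSpec` arrays can stand in for compile-time checks, here is a minimal sketch. The `RESOURCE_DEFINITIONS` excerpt, the required fields chosen for `AgentSession`, and the error format are assumptions for illustration; the real `validateResource()` may differ.

```javascript
/**
 * @typedef {Object} KrateResource
 * @property {string} kind
 * @property {{ name: string, namespace?: string }} metadata
 * @property {Record<string, unknown>} [spec]
 */

// Hypothetical excerpt; the real RESOURCE_DEFINITIONS table is not shown here.
const RESOURCE_DEFINITIONS = {
  AgentSession: { requiredSpec: ['organizationRef', 'agentMuxSessionId'] },
};

/**
 * @param {KrateResource} resource
 * @returns {string[]} human-readable problems; empty when the resource is valid
 */
function validateResource(resource) {
  const definition = RESOURCE_DEFINITIONS[resource?.kind];
  if (!definition) return [`unknown kind: ${resource?.kind}`];
  const errors = [];
  if (!resource.metadata?.name) errors.push('metadata.name is required');
  for (const field of definition.requiredSpec) {
    const value = resource.spec?.[field];
    if (value === undefined || value === null || value === '') {
      errors.push(`spec.${field} is required`);
    }
  }
  return errors;
}
```

Returning a list of problems rather than throwing on the first one lets a form or CLI report every missing field in a single pass.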
784
+
785
+ ---
786
+
787
+ ### 3.5 Why Stale-While-Revalidate (vs K8s Watch Streams)
788
+
789
+ **Decision:** 30s TTL cache with stale-while-revalidate pattern for all
790
+ Kubernetes resource reads.
791
+
792
+ **Rationale:**
793
+ - Simple implementation (Map-based cache, no persistent connections)
794
+ - Works with kubectl (no long-lived watch connections needed)
795
+ - Handles cold starts gracefully (first request blocks, subsequent use cache)
796
+ - Predictable memory usage (one snapshot per org)
797
+ - No reconnection logic needed for reads
798
+
799
+ **Trade-off:**
800
+ - 30s staleness window (UI may show outdated data)
801
+ - Cache miss on first request after deploy or cache clear
802
+ - Background revalidation fires even when no one is watching
803
+ - Multiple simultaneous requests may all miss cache (thundering herd)
804
+
805
+ **Mitigation:**
806
+ - `clearSnapshotCache()` called on every write (apply/delete) -- immediate consistency for mutations
807
+ - Per-org isolation: cache key includes organization parameter
808
+ - `revalidating` flag prevents duplicate background fetches
809
+ - `staleMs` threshold (5x TTL = 150s) before forcing blocking revalidation
810
+ - Configurable via `KRATE_SNAPSHOT_CACHE_TTL_MS` env var
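
The three cache states (fresh, stale, too old) and the `revalidating` flag can be sketched as follows. This is an illustrative reconstruction from the description above, not Krate's implementation; the injectable `now` clock and the exact entry shape are assumptions.

```javascript
// Sketch only: names and structure are illustrative, not Krate's code.
const TTL_MS = 30_000;        // fresh window
const STALE_MS = 5 * TTL_MS;  // 150s: beyond this, callers block on a refetch
const cache = new Map();      // org -> { value, fetchedAt, revalidating }

async function staleWhileRevalidate(org, revalidateFn, now = Date.now) {
  const entry = cache.get(org);
  const age = entry ? now() - entry.fetchedAt : Infinity;

  // Fresh: serve from cache.
  if (entry && age < TTL_MS) return entry.value;

  // Stale but usable: serve the old value and refresh in the background.
  if (entry && age < STALE_MS) {
    if (!entry.revalidating) {
      entry.revalidating = true; // at most one background refresh per org
      revalidateFn(org)
        .then((value) => cache.set(org, { value, fetchedAt: now(), revalidating: false }))
        .catch(() => { entry.revalidating = false; }); // keep serving stale data
    }
    return entry.value;
  }

  // Missing or too old: block on a fresh fetch.
  const value = await revalidateFn(org);
  cache.set(org, { value, fetchedAt: now(), revalidating: false });
  return value;
}

function clearSnapshotCache() {
  cache.clear();
}
```

Note that this sketch only deduplicates background refreshes. Concurrent callers that all hit the blocking branch each run their own fetch, which is the cold-start thundering herd listed under the trade-offs.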
811
+
812
+ **Future:**
813
+ - `watchResourceChanges()` is implemented in `kubernetes-controller-async.js`
814
+ - Watches key kinds (Organization, AgentStack, AgentSession) and clears cache on change
815
+ - Not yet wired to the web layer HTTP server
816
+ - Could enable near-real-time UI updates without polling
817
+
818
+ ---
819
+
820
+ ### 3.6 Why SDK Re-export Layer (vs Direct Imports)
821
+
822
+ **Decision:** `@a5c-ai/krate-sdk` re-exports from core, and web/CLI import
823
+ from the SDK rather than directly from core internals.
824
+
825
+ **Rationale:**
826
+ - Decouples web/CLI from internal core file paths
827
+ - Single import target for consumers (`import { ... } from '@a5c-ai/krate-sdk'`)
828
+ - SDK can expose a stable API surface while core refactors internally
829
+ - SDK adds web-specific helpers (UI model mappers, auth wrappers)
830
+ - Clear dependency direction: web -> SDK -> core
831
+
832
+ **Trade-off:**
833
+ - Extra level of indirection for simple re-exports
834
+ - Turbopack/webpack need `resolveAlias` configuration for monorepo resolution
835
+ - Circular dependency risk if SDK imports from web accidentally
836
+ - Version coupling (SDK must update when core export shapes change)
837
+
838
+ **Gotcha:**
839
+ - Turbopack requires **relative path** in resolveAlias (not absolute path)
840
+ - The alias target is `'../sdk/src/index.js'` from the web package root
841
+ - Monorepo workspace root must be correctly set in `next.config.js`
842
+ - A wrong path does not fail the build; it surfaces only at runtime as a module-not-found error
843
+
844
+ ---
845
+
846
+ ### 3.7 Why x-kubernetes-preserve-unknown-fields (vs Strict Schemas)
847
+
848
+ **Decision:** All CRD specs use `x-kubernetes-preserve-unknown-fields: true`
849
+ allowing arbitrary additional fields.
850
+
851
+ **Rationale:**
852
+ - Rapid iteration: UI can add new spec fields without CRD redeployment
853
+ - Forward-compatible: older CRD versions accept newer resource manifests
854
+ - No validation failures during development cycles
855
+ - Reduces coupling between Helm chart releases and feature development
856
+ - Enables spec exploration (prototype fields in UI, formalize later)
857
+
858
+ **Trade-off:**
859
+ - No server-side validation of field names or types
860
+ - Typos in spec field names silently accepted (e.g. `organisationRef` vs `organizationRef`)
861
+ - No OpenAPI schema generation for CRD fields
862
+ - `kubectl explain` shows no field documentation
863
+ - etcd stores whatever is submitted (no normalization)
864
+
865
+ **Mitigation:**
866
+ - Client-side validation in controllers (`validate*` functions)
867
+ - `requiredSpec` arrays in RESOURCE_DEFINITIONS enforce mandatory fields at apply time
868
+ - Test suite covers all valid/invalid field combinations
869
+ - Controller boundary declarations document expected spec shapes
870
+ - UI forms constrain input to valid fields
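
Since the API server accepts misspelled fields, a client-side check can at least flag them. The sketch below is one possible shape for such a check, not an existing Krate function: `KNOWN_SPEC_FIELDS` is a hypothetical allow-list, and the edit-distance threshold of 2 is an arbitrary choice.

```javascript
// Sketch only: KNOWN_SPEC_FIELDS is a hypothetical allow-list.
const KNOWN_SPEC_FIELDS = {
  AgentSession: ['organizationRef', 'agentMuxSessionId'],
};

// Levenshtein edit distance, used to suggest the closest known field.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    let diagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j += 1) {
      const above = row[j];
      row[j] = Math.min(
        above + 1,                                  // deletion
        row[j - 1] + 1,                             // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Returns one entry per spec field that is not on the allow-list,
// with a suggested replacement when a known field is a near match.
function findUnknownSpecFields(resource) {
  const known = KNOWN_SPEC_FIELDS[resource.kind] ?? [];
  return Object.keys(resource.spec ?? {})
    .filter((field) => !known.includes(field))
    .map((field) => {
      const closest = known
        .map((candidate) => ({ candidate, distance: editDistance(field, candidate) }))
        .sort((x, y) => x.distance - y.distance)[0];
      const suggestion = closest && closest.distance <= 2 ? closest.candidate : null;
      return { field, suggestion };
    });
}
```

A check like this is best surfaced as a warning rather than a rejection, so it does not undo the forward-compatibility that motivated the decision.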
871
+
872
+ **Note:** Helm only installs CRDs on `helm install`, not `helm upgrade`.
873
+ Explicit `kubectl apply -f packages/krate/charts/crds/ --server-side --force-conflicts`
874
+ is needed in CI before helm upgrade to update CRD definitions.
875
+
876
+ ---
877
+
878
+ ### 3.8 Why Single-Container-Per-Role (api + controllers + web + webhook-worker)
879
+
880
+ **Decision:** 4 deployment containers (roles), each running a different entry
881
+ command from the same or related images.
882
+
883
+ **Actual Layout:**
884
+ | Role | Image | Entry Command | Port |
885
+ |------|-------|--------------|------|
886
+ | api | krate-controller | `node src/http-server.js` | 3080 |
887
+ | controllers | krate-controller | `node src/control-plane.js` | - |
888
+ | web | krate-web | `node .next/standalone/server.js` | 3000 |
889
+ | webhook-worker | krate-controller | `node src/external/webhook-controller.js` | - |
890
+
891
+ **Rationale:**
892
+ - Separation of concerns (API serving vs background reconciliation vs UI)
893
+ - Independent scaling (web can scale to 3 replicas while api stays at 1)
894
+ - Failure isolation (webhook worker crash doesn't affect UI)
895
+ - Resource tuning (web needs more memory for SSR, api needs more CPU)
896
+
897
+ **Trade-off:**
898
+ - More pods consuming cluster resources
899
+ - More complexity in Helm chart (4 Deployments, 4 Services)
900
+ - Internal service discovery needed (web → api via `KRATE_CONTROLLER_URL`)
901
+ - Shared code must be in core package (duplicated in controller image)
902
+
903
+ **Actual State:**
904
+ - api and controllers share the `krate-controller` image (same codebase, different entrypoint)
905
+ - web uses the `krate-web` image (Next.js standalone build)
906
+ - webhook-worker is architecturally separate but shares the controller image
907
+
908
+ ---
909
+
910
+ ## 4. System Nuances & Gotchas
911
+
912
+ ### 4.1 Namespace Discovery Fallback Chain
913
+
914
+ The system must determine which Kubernetes namespaces to query for org-scoped
915
+ resources. The resolution follows a priority chain:
916
+
917
+ **Step 1:** Check Organization resources in platform namespace
918
+ ```javascript
919
+ organizations.map(org => org.spec?.namespaceName || org.metadata?.labels?.['krate.a5c.ai/namespace'])
920
+ ```
921
+ If Organization resources exist, derive namespaces from their `spec.namespaceName` field (falling back to the `krate.a5c.ai/namespace` label).
922
+
923
+ **Step 2:** Check OrgNamespaceBinding resources
924
+ ```javascript
925
+ bindings.map(binding => binding.spec?.namespace || binding.metadata?.labels?.['krate.a5c.ai/namespace'])
926
+ ```
927
+ Bindings explicitly declare the namespace for each org.
928
+
929
+ **Step 3:** Environment variable fallback
930
+ ```javascript
931
+ if (process.env.KRATE_ADMIN_ORG) fallbackOrgs.add(orgNamespaceName(process.env.KRATE_ADMIN_ORG));
932
+ fallbackOrgs.add(orgNamespaceName(process.env.KRATE_ORG || 'default'));
933
+ // e.g. KRATE_ADMIN_ORG=admin, KRATE_ORG unset → ['krate-org-admin', 'krate-org-default']
934
+ ```
935
+
936
+ **Step 4:** Last resort
937
+ ```javascript
938
+ return [KRATE_PLATFORM_NAMESPACE]; // 'krate-system'
939
+ ```
940
+
941
+ **Why this chain exists:** Fresh deployments have no Organization resource yet
942
+ (it's created on first admin login), but the UI needs to list resources in
943
+ an org namespace. The fallback ensures the system bootstraps correctly.
944
+
945
+ **Edge cases:**
946
+ - Multiple orgs: all discovered namespaces are queried (flat merge, no hierarchy)
947
+ - Namespace doesn't exist yet: kubectl returns empty list (no error with `--ignore-not-found`)
948
+ - Stale bindings: namespace listed but org deleted → empty results (harmless)
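
Put together, the four steps can be sketched as a single function. The field names and chain order come from the steps above; the slug rule in `normalizeOrgSlug`, the argument shape, and the assumption that each step is consulted only when the previous one yields nothing are illustrative guesses.

```javascript
// Sketch only: helper bodies are assumptions; only the chain order and
// the field names come from the steps described above.
const KRATE_PLATFORM_NAMESPACE = 'krate-system';
const NAMESPACE_LABEL = 'krate.a5c.ai/namespace';

// Hypothetical slug rule: lowercase, non-alphanumerics collapsed to '-'.
const normalizeOrgSlug = (org) =>
  String(org).toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
const orgNamespaceName = (org) => `krate-org-${normalizeOrgSlug(org)}`;

const unique = (values) => [...new Set(values.filter(Boolean))];

function discoverNamespaces({ organizations = [], bindings = [], env = {} }) {
  // Step 1: Organization resources in the platform namespace.
  const fromOrgs = unique(organizations.map(
    (org) => org.spec?.namespaceName || org.metadata?.labels?.[NAMESPACE_LABEL],
  ));
  if (fromOrgs.length > 0) return fromOrgs;

  // Step 2: OrgNamespaceBinding resources.
  const fromBindings = unique(bindings.map(
    (binding) => binding.spec?.namespace || binding.metadata?.labels?.[NAMESPACE_LABEL],
  ));
  if (fromBindings.length > 0) return fromBindings;

  // Step 3: environment variables, with 'default' as the fallback org.
  const fallbackOrgs = new Set();
  if (env.KRATE_ADMIN_ORG) fallbackOrgs.add(orgNamespaceName(env.KRATE_ADMIN_ORG));
  fallbackOrgs.add(orgNamespaceName(env.KRATE_ORG || 'default'));
  const namespaces = [...fallbackOrgs];

  // Step 4: platform namespace, only reachable if step 3 yields nothing.
  return namespaces.length > 0 ? namespaces : [KRATE_PLATFORM_NAMESPACE];
}
```

Passing `env` in rather than reading `process.env` inside the function keeps the chain easy to exercise in tests for each bootstrap scenario.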
949
+
950
+ ---
951
+
952
+ ### 4.2 KRATE_CONTROLLER_URL Indirection
953
+
954
+ **Architecture:**
955
+ ```
956
+ Browser → Web Container (Next.js) → API Container (HTTP server)
957
+
958
+ API Container (HTTP server) → kubectl → K8s API
959
+ ```
960
+
961
+ **How it works:**
962
+ - Web container has `KRATE_CONTROLLER_URL` env var pointing to api's internal K8s Service URL
963
+ (e.g. `http://krate-api.krate-system.svc.cluster.local:80`)
964
+ - Web does not run kubectl for resource operations (no kubeconfig mounted in web container); the auth callback noted below is the only exception
965
+ - All resource operations go through fetch() to the api container
966
+
967
+ **If api container is down:**
968
+ - Web's server-side data fetching returns clean error model
969
+ - Pages render with error state (not a crash/500)
970
+ - No kubectl fallback from web container (by design)
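
A "clean error model" here means the fetch helper never throws into the page render. A minimal sketch of that idea, with a hypothetical function name, result shape, and timeout default (none of these are confirmed by the source):

```javascript
// Sketch only: function name, result shape, and timeout are hypothetical.
async function fetchFromController(path, { baseUrl, fetchImpl = fetch, timeoutMs = 3000 } = {}) {
  if (!baseUrl) return { ok: false, error: 'KRATE_CONTROLLER_URL is not set', data: null };
  try {
    const response = await fetchImpl(new URL(path, baseUrl), {
      signal: AbortSignal.timeout(timeoutMs), // do not hang the page render
    });
    if (!response.ok) {
      return { ok: false, error: `controller responded ${response.status}`, data: null };
    }
    return { ok: true, error: null, data: await response.json() };
  } catch (err) {
    // Network failure, DNS error, or timeout: report it instead of throwing.
    return { ok: false, error: `controller unreachable: ${err.message}`, data: null };
  }
}
```

A server component could call this with `baseUrl: process.env.KRATE_CONTROLLER_URL` and branch on `ok` to render either the data or an error panel.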
971
+
972
+ **If api returns degraded data:**
973
+ - Web may probe local snapshot for comparison (modelResourceScore heuristic)
974
+ - Used to detect api container serving stale cache vs fresh data
975
+ - Not a correctness requirement, just a freshness indicator
976
+
977
+ **Why not direct kubectl from web?**
978
+ - Security: web container is publicly exposed (ingress), kubectl access is dangerous
979
+ - Image size: the web image ships kubectl solely for the auth callback, not as a general data path
980
+ - Separation: web handles presentation, api handles data operations
981
+ - Exception: `registerLoginProfile()` in auth callback does use kubectl (web image
982
+ includes kubectl for this single operation -- registers User/IdentityMapping on login)
983
+
984
+ ---
985
+
986
+ ### 4.3 Cache + Write Interaction
987
+
988
+ **Write path (mutation):**
989
+ ```
990
+ applyResource(resource) / deleteResource(kind, name)
991
+ → kubectl apply / delete
992
+ → clearSnapshotCache() // Invalidate ALL cached data
993
+ → globalEventBus.emitResourceChange(kind, name, 'apply'|'delete')
994
+ → SSE clients receive notification
995
+ → Next page load fetches fresh data from kubectl
996
+ ```
997
+
998
+ **Read path (query):**
999
+ ```
1000
+ Page server component calls controller
1001
+ → staleWhileRevalidate(org, revalidateFn)
1002
+ → If cache fresh (< 30s): return immediately
1003
+ → If cache stale (30s-150s): return stale, refresh in background
1004
+ → If cache too old (> 150s) or missing: block on fresh fetch
1005
+ ```
1006
+
1007
+ **Key behaviors:**
1008
+ - `clearSnapshotCache()` clears ALL orgs (global invalidation)
1009
+ - `clearOrgCache(org)` clears single org (surgical invalidation)
1010
+ - Per-org cache isolation: different orgs don't interfere
1011
+ - `revalidating` flag prevents thundering herd (only one background refresh)
1012
+ - Write + immediate read: always gets fresh data (cache cleared on write)
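
The ordering in the write path (persist, then invalidate, then notify) can be sketched like this. `runKubectl` and the listener set are hypothetical stand-ins for the real kubectl wrapper and `globalEventBus`.

```javascript
// Sketch only: runKubectl and the listener API are hypothetical stand-ins.
const snapshotCache = new Map(); // org -> cached snapshot
const listeners = new Set();     // SSE writers or other subscribers

const clearSnapshotCache = () => snapshotCache.clear();   // global invalidation
const clearOrgCache = (org) => snapshotCache.delete(org); // surgical invalidation

function emitResourceChange(kind, name, operation) {
  for (const listener of listeners) listener({ kind, name, operation });
}

async function applyResource(resource, runKubectl) {
  // 1. Persist first; if kubectl rejects, nothing changed and nothing
  //    needs to be invalidated.
  await runKubectl(['apply', '-f', '-'], JSON.stringify(resource));
  // 2. Invalidate so the next read goes back to the cluster.
  clearSnapshotCache();
  // 3. Notify only after the cache is cleared, so a client that reloads
  //    on the event is not served the pre-write snapshot.
  emitResourceChange(resource.kind, resource.metadata.name, 'apply');
}
```

Swapping `clearSnapshotCache()` for `clearOrgCache(org)` in step 2 would narrow the invalidation to the affected org, at the cost of having to know which org a write belongs to.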
1013
+
1014
+ **Race condition:**
1015
+ If two users write simultaneously:
1016
+ 1. User A writes → cache cleared
1017
+ 2. User B writes → cache cleared (already empty)
1018
+ 3. User A reads → fresh fetch, sets cache
1019
+ 4. User B reads → gets User A's fresh data (which includes both writes if kubectl returned both)
1020
+
1021
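+ This is safe when the read in step 3 starts after both writes have finished, because kubectl returns the latest server state. A fetch already in flight when a write lands can still repopulate the cache with pre-write data, which is then served until the next write or until the entry ages past the TTL.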
1022
+
1023
+ ---
1024
+
1025
+ ### 4.4 Auth Cookie Security
1026
+
1027
+ **Cookie creation:**
1028
+ ```javascript
1029
+ // Payload: base64url(JSON({ provider, subject, user }))
1030
+ // With secret: payload + '.' + base64url(HMAC-SHA256(payload, secret))
1031
+ // Without secret: payload only (plain base64url, no signature)
1032
+ ```
1033
+
1034
+ **Verification matrix:**
1035
+
1036
+ | Cookie State | Secret Configured | Result |
1037
+ |-------------|-------------------|--------|
1038
+ | Signed | Yes | Verify HMAC, constant-time compare |
1039
+ | Signed | No | Reject (can't verify) |
1040
+ | Unsigned | Yes | Reject (could be tampered) |
1041
+ | Unsigned | No | Accept (backward compatible) |
1042
+
1043
+ **Security properties:**
1044
+ - HMAC-SHA256 signing ONLY when `KRATE_SESSION_SECRET` env var is set
1045
+ - Without secret: cookie is plain base64url with no integrity protection, so anyone can forge it (development only)
1046
+ - Constant-time comparison via `crypto.timingSafeEqual` (prevents timing attacks)
1047
+ - HttpOnly flag (no JavaScript access)
1048
+ - SameSite=Lax (prevents CSRF from cross-origin POST)
1049
+ - No Secure flag by default (set at ingress/proxy level)
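
The verification matrix can be expressed compactly in code. The sketch below follows the cookie layout described above (payload, optionally followed by `.` and an HMAC) and uses only Node's built-in `node:crypto`. `createSessionCookie` is a hypothetical name, and this version of `parseSessionCookie` is a reconstruction from the matrix, not the project's source.

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sketch only: a reconstruction from the verification matrix, not Krate's code.
const sign = (payload, secret) =>
  createHmac('sha256', secret).update(payload).digest('base64url');

function createSessionCookie(session, secret) {
  const payload = Buffer.from(JSON.stringify(session)).toString('base64url');
  return secret ? `${payload}.${sign(payload, secret)}` : payload;
}

function parseSessionCookie(cookie, secret) {
  if (typeof cookie !== 'string' || cookie.length === 0) return null;
  // base64url never contains '.', so the separator is unambiguous.
  const [payload, signature, ...rest] = cookie.split('.');
  if (rest.length > 0) return null;
  const isSigned = signature !== undefined;
  // Rejects both mismatched rows: signed without secret, unsigned with secret.
  if (isSigned !== Boolean(secret)) return null;
  if (isSigned) {
    const expected = Buffer.from(sign(payload, secret));
    const received = Buffer.from(signature);
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    if (expected.length !== received.length || !timingSafeEqual(expected, received)) {
      return null;
    }
  }
  try {
    return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
  } catch {
    return null; // malformed payload: treat like any other invalid cookie
  }
}
```

Every failure path returns `null`, which matches the behavior described under tampered cookie handling: the caller cannot tell a bad signature from a malformed payload.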
1050
+
1051
+ **Tampered cookie handling:**
1052
+ - Invalid HMAC → `parseSessionCookie` returns `null`
1053
+ - null session → middleware rejects request
1054
+ - Rejection → 307 redirect to `/login`
1055
+ - No error message exposed (prevents oracle attacks)
1056
+
1057
+ ---
1058
+
1059
+ ### 4.5 CRD Lifecycle in CI
1060
+
1061
+ **Problem:** Helm's CRD handling has a well-known limitation:
1062
+ - `helm install` -- applies CRDs from the `crds/` directory
1063
+ - `helm upgrade` -- does NOT update CRDs (by design, to prevent data loss)
1064
+ - `helm uninstall` -- does NOT delete CRDs (by design, to prevent data loss)
1065
+
1066
+ **CI workaround:**
1067
+ ```bash
1068
+ # Before helm upgrade, explicitly apply CRDs
1069
+ kubectl apply -f packages/krate/charts/crds/ --server-side --force-conflicts
1070
+ ```
1071
+
1072
+ `--server-side` enables server-side apply (handles field ownership correctly).
1073
+ `--force-conflicts` resolves ownership conflicts (Helm vs kubectl managers).
1074
+
1075
+ **Implications for development:**
1076
+ - Adding a new field to an existing CRD: no CRD redeployment needed (preserve-unknown-fields)
1077
+ - Adding a new CRD kind: must deploy the CRD yaml file before resources can be created
1078
+ - Removing a field from CRD: preserve-unknown-fields means old resources still have the field
1079
+ - Changing field type in CRD: no validation exists, so no conflict (but client code may break)
1080
+
1081
+ **Best practices:**
1082
+ - Always add new kinds in the same PR that adds the CRD yaml
1083
+ - CI pipeline runs CRD apply before helm upgrade
1084
+ - Never rename CRD group/version/plural (breaks all existing resources)
1085
+ - Mark deprecated fields explicitly in the resource (e.g. `spec.deprecated.fieldName: "reason"`)
1086
+
1087
+ ---
1088
+
1089
+ ### 4.6 Org-Scoped vs Platform-Scoped Resources
1090
+
1091
+ **Platform-scoped resources** (exist in krate-system namespace only):
1092
+ - `Organization` -- represents an org identity
1093
+ - `OrgNamespaceBinding` -- binds org to a namespace
1094
+
1095
+ These are special because they exist "above" org namespaces -- they define
1096
+ the org structure itself.
1097
+
1098
+ **Org-scoped resources** (exist in krate-org-{slug} namespaces):
1099
+ - All other 74 resource kinds
1100
+ - Always have `spec.organizationRef` field
1101
+ - Namespace derived from org: `krate-org-${normalizeOrgSlug(org)}`
1102
+
1103
+ **Enforcement:**
1104
+ ```javascript
1105
+ // In withOrgScope():
1106
+ if (resource.metadata?.namespace && resource.metadata.namespace !== namespace) {
1107
+ throw new Error(`namespace ${resource.metadata.namespace} does not match organization ${org}`);
1108
+ }
1109
+ ```
1110
+
1111
+ Cross-org denial: `applyResource()` calls `withOrgScope()` which rejects any
1112
+ resource whose explicit namespace conflicts with its `organizationRef`.
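
For context, here is how the check above might sit inside a complete `withOrgScope()`. The slug rule, the missing-`organizationRef` error, and the decision to return a copy are assumptions; only the namespace pattern and the mismatch error come from the text.

```javascript
// Sketch only: normalizeOrgSlug's rule and the return shape are assumptions.
const normalizeOrgSlug = (org) =>
  String(org).toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');

function withOrgScope(resource) {
  const org = resource.spec?.organizationRef;
  if (!org) throw new Error(`${resource.kind} requires spec.organizationRef`);
  const namespace = `krate-org-${normalizeOrgSlug(org)}`;
  if (resource.metadata?.namespace && resource.metadata.namespace !== namespace) {
    throw new Error(`namespace ${resource.metadata.namespace} does not match organization ${org}`);
  }
  // Return a copy with the derived namespace filled in; the input is not mutated.
  return { ...resource, metadata: { ...resource.metadata, namespace } };
}
```

Deriving the namespace from `organizationRef` rather than trusting `metadata.namespace` means a caller can omit the namespace entirely and still land in the right place.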
1113
+
1114
+ **Platform-scoped entries in the `KRATE_RESOURCES` array carry `platformScoped: true`:**
1115
+ - Platform-scoped: only queried from `KRATE_PLATFORM_NAMESPACE` (krate-system)
1116
+ - Org-scoped: queried from all discovered org namespaces
1117
+
1118
+ **Multi-org queries:**
1119
+ Snapshot fetches resources from ALL org namespaces. The flattened result includes
1120
+ resources from all orgs. UI filters by `spec.organizationRef` for the current org view.
1121
+
1122
+ ---
1123
+
1124
+ ### 4.7 Web Container Architecture
1125
+
1126
+ **Dockerfile structure (multi-stage):**
1127
+ ```dockerfile
1128
+ # Stage 1: deps - install node_modules
1129
+ FROM node:20 AS deps
+ WORKDIR /app
1130
+ COPY package*.json ./
1131
+ RUN npm ci
1132
+
1133
+ # Stage 2: build - Next.js production build
1134
+ FROM node:20 AS build
+ WORKDIR /app
1135
+ COPY --from=deps /app/node_modules ./node_modules
1136
+ COPY . .
1137
+ RUN npm run build
1138
+
1139
+ # Stage 3: runtime - minimal production image
1140
+ FROM node:20-slim AS runtime
+ WORKDIR /app
1141
+ COPY --from=build /app/.next/standalone ./
1142
+ COPY --from=build /app/.next/static ./.next/static
1143
+ COPY --from=build /app/public ./public
1144
+ # kubectl for auth callback
1145
+ COPY --from=bitnami/kubectl:latest /opt/bitnami/kubectl/bin/kubectl /usr/local/bin/
1146
+ ```
1147
+
1148
+ **Build uses webpack (not turbopack) for Docker:**
1149
+ - Turbopack has issues with Docker layer caching
1150
+ - Webpack is more predictable in CI environments
1151
+ - `--webpack` flag in `next build` command
1152
+
1153
+ **Runtime image includes kubectl:**
1154
+ - Needed for `registerLoginProfile()` during auth callback
1155
+ - Called once per user login (not hot path)
1156
+ - Could be removed if auth moved fully to api container
1157
+
1158
+ **SDK resolution via turbopack resolveAlias:**
1159
+ ```javascript
1160
+ // next.config.js
1161
+ turbopack: {
1162
+   resolveAlias: {
1163
+     '@a5c-ai/krate-sdk': '../sdk/src/index.js', // MUST be relative
1164
+   },
1165
+ },
1168
+ ```
1169
+
1170
+ **Standalone output mode:**
1171
+ - `.next/standalone/` contains full Node.js app (no node_modules needed at runtime)
1172
+ - Reduces Docker image from ~1GB to ~150MB
1173
+ - Entry: `node .next/standalone/server.js`
1174
+ - Static assets served from `.next/static/` (can be CDN-fronted)
1175
+
1176
+ ---
1177
+
1178
+ ## 5. Integration Gaps (Known, Documented)
1179
+
1180
+ The following 13 agent-mux integration gap categories represent areas where
1181
+ krate has resource definitions and controller logic but lacks runtime integration
1182
+ with the agent-mux ecosystem.
1183
+
1184
+ ### Gap 1: Session Lifecycle Sync
1185
+
1186
+ **What krate does now:**
1187
+ Creates `AgentSession` resources with `spec.agentMuxSessionId`. Status is set
1188
+ based on mux client responses. If mux is unavailable, status stays `Pending`.
1189
+
1190
+ **What it should do:**
1191
+ Bidirectional sync -- mux session state changes should update krate's AgentSession
1192
+ status in real-time. Currently only krate → mux (launch), no mux → krate (status updates).
1193
+
1194
+ **What blocks it:**
1195
+ No webhook receiver or watch mechanism for mux-initiated status changes.
1196
+ Would need either a callback URL registered during launch or a polling loop.
1197
+
1198
+ **Estimated effort:** 2-3 days (webhook receiver + reconciliation loop)
1199
+
1200
+ ---
1201
+
1202
+ ### Gap 2: Adapter Capability Caching
1203
+
1204
+ **What krate does now:**
1205
+ `queryCapabilities(adapter)` called during stack reconciliation. Result is used
1206
+ once and discarded (no caching of adapter capabilities).
1207
+
1208
+ **What it should do:**
1209
+ Cache adapter capabilities with TTL, invalidate on adapter CRD changes.
1210
+ Reduces mux API calls during frequent stack reconciliation cycles.
1211
+
1212
+ **What blocks it:**
1213
+ No cache layer between mux client and stack controller. Low priority since
1214
+ mux is usually fast and capabilities rarely change.
1215
+
1216
+ **Estimated effort:** 0.5 day (add to snapshot cache pattern)
1217
+
1218
+ ---
1219
+
1220
+ ### Gap 3: Transport Binding Activation
1221
+
1222
+ **What krate does now:**
1223
+ Validates `AgentTransportBinding` spec, reports connection status from
1224
+ resource status field (externally set).
1225
+
1226
+ **What it should do:**
1227
+ Signal transport-mux to activate/deactivate bindings when they are created/deleted.
1228
+ Currently transport-mux must independently discover binding changes.
1229
+
1230
+ **What blocks it:**
1231
+ No communication channel to transport-mux. Could be solved with K8s watch
1232
+ on the transport-mux side, or webhook from krate on binding mutation.
1233
+
1234
+ **Estimated effort:** 1-2 days (webhook or watch-based notification)
1235
+
1236
+ ---
1237
+
1238
+ ### Gap 4: Lifecycle Event Emission
1239
+
1240
+ **What krate does now:**
1241
+ `globalEventBus.emitResourceChange()` fires on resource CRUD operations.
1242
+ Purely internal, resource-level granularity.
1243
+
1244
+ **What it should do:**
1245
+ Emit hooks-mux-compatible lifecycle events (RUN_CREATED, STEP_STARTED, etc.)
1246
+ when agent-related resources change status.
1247
+
1248
+ **What blocks it:**
1249
+ No hooks-mux event format adapter. Need to define mapping from resource status
1250
+ transitions to lifecycle event types.
1251
+
1252
+ **Estimated effort:** 3-5 days (event mapper + hooks-mux client + delivery tracking)
1253
+
1254
+ ---
1255
+
1256
+ ### Gap 5: Cost Aggregation
1257
+
1258
+ **What krate does now:**
1259
+ `reconcileTranscript()` sums token usage from SSE events into a single
1260
+ AgentSessionTranscript. Cost fields are zero when no events available.
1261
+
1262
+ **What it should do:**
1263
+ Aggregate cost across multiple sessions in a dispatch run. Roll up to
1264
+ organization level for billing/budgeting.
1265
+
1266
+ **What blocks it:**
1267
+ No cost aggregation controller. Need periodic job that sums transcript costs
1268
+ per run, per org, per time window.
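
The roll-up itself is simple once transcripts can be listed. A sketch of the summing step, in which every field name (`dispatchRunRef`, `status.usage.*`, `costUsd`) is a hypothetical placeholder for whatever the transcript resource actually records:

```javascript
// Sketch only: all field names below are hypothetical placeholders.
function aggregateCosts(transcripts) {
  const byRun = new Map();
  const byOrg = new Map();
  const add = (map, key, usage) => {
    const total = map.get(key) ?? { inputTokens: 0, outputTokens: 0, costUsd: 0, sessions: 0 };
    total.inputTokens += usage.inputTokens ?? 0;
    total.outputTokens += usage.outputTokens ?? 0;
    total.costUsd += usage.costUsd ?? 0;
    total.sessions += 1;
    map.set(key, total);
  };
  for (const transcript of transcripts) {
    const usage = transcript.status?.usage ?? {};
    add(byRun, transcript.spec?.dispatchRunRef ?? 'unknown', usage);
    add(byOrg, transcript.spec?.organizationRef ?? 'unknown', usage);
  }
  return { byRun, byOrg };
}
```

A time-window dimension would need a timestamp on each transcript and a bucketing key; that is left out here because the source does not say which timestamp the transcript carries.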
1269
+
1270
+ **Estimated effort:** 2-3 days (aggregation controller + UI dashboard)
1271
+
1272
+ ---
1273
+
1274
+ ### Gap 6: Real-Time Session Streaming to UI
1275
+
1276
+ **What krate does now:**
1277
+ Web console shows session status from cached snapshot (30s staleness).
1278
+ SSE endpoint only emits resource-change events (kind/name/operation).
1279
+
1280
+ **What it should do:**
1281
+ Proxy or relay agent-mux SSE events to the web console for live session
1282
+ viewing (token-by-token streaming).
1283
+
1284
+ **What blocks it:**
1285
+ Web container cannot reach mux directly. API container would need to relay
1286
+ mux SSE to its own SSE endpoint. Significant complexity for multi-subscriber
1287
+ fanout with back-pressure.
1288
+
1289
+ **Estimated effort:** 5-8 days (SSE relay, subscriber management, back-pressure)
1290
+
1291
+ ---
1292
+
1293
+ ### Gap 7: Approval Gate Integration
1294
+
1295
+ **What krate does now:**
1296
+ `AgentApproval` resources created with spec describing the gate (action, requestedBy).
1297
+ Status updated manually (via UI form or API call).
1298
+
1299
+ **What it should do:**
1300
+ When approval is granted/denied, signal back to agent-mux to resume/cancel
1301
+ the blocked session. Currently approval resolution doesn't trigger mux action.
1302
+
1303
+ **What blocks it:**
1304
+ No callback mechanism from krate to mux on approval state change. Mux would
1305
+ need to poll AgentApproval status or register a webhook.
1306
+
1307
+ **Estimated effort:** 2-3 days (approval webhook + mux integration)
1308
+
1309
+ ---
1310
+
1311
+ ### Gap 8: Context Bundle Delivery
1312
+
1313
+ **What krate does now:**
1314
+ `AgentContextBundle` resources store prompt/context snapshots with digest.
1315
+ Created during dispatch, immutable after creation.
1316
+
1317
+ **What it should do:**
1318
+ Deliver the context bundle payload to agent-mux during session launch.
1319
+ Currently `launchSession()` passes `contextBundle.prompt` and `contextBundle.systemPrompt`
1320
+ but not the full bundle (attachments, provenance, redaction manifest).
1321
+
1322
+ **What blocks it:**
1323
+ Mux API may not support full bundle format. Need API contract alignment
1324
+ between krate's AgentContextBundle spec and mux's session creation payload.
1325
+
1326
+ **Estimated effort:** 1-2 days (payload mapping + mux API extension)
1327
+
1328
+ ---
1329
+
1330
+ ### Gap 9: Workspace Mount Coordination
1331
+
1332
+ **What krate does now:**
1333
+ `KrateWorkspace` spec defines volume mounts and repository bindings.
1334
+ `launchSession()` passes `workspace.mountPath` to mux.
1335
+
1336
+ **What it should do:**
1337
+ Coordinate actual workspace provisioning (PVC creation, git clone, branch checkout)
1338
+ before signaling mux to start. Currently assumes workspace is pre-provisioned.
1339
+
1340
+ **What blocks it:**
1341
+ Workspace provisioning is a separate controller's responsibility. Need
1342
+ orchestration to ensure workspace is ready before session starts.
1343
+
1344
+ **Estimated effort:** 3-5 days (provisioning coordination, readiness gates)
1345
+
1346
+ ---
1347
+
1348
+ ### Gap 10: Trigger-to-Dispatch Pipeline
1349
+
1350
+ **What krate does now:**
1351
+ `AgentTriggerRule` defines event-to-stack routing. `AgentTriggerExecution`
1352
+ records evaluation decisions. Creates `AgentDispatchRun` on match.
1353
+
1354
+ **What it should do:**
1355
+ After creating AgentDispatchRun, automatically progress to session launch
1356
+ via agent-mux. Currently dispatch run creation is the terminal action.
1357
+
1358
+ **What blocks it:**
1359
+ No "dispatch controller" that watches AgentDispatchRun resources in Queued
1360
+ state and calls `launchSession()`. This is the critical missing piece.
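
One reconcile pass of such a dispatch controller might look like the sketch below. The argument shape of `launchSession()` follows the agent-mux client description earlier in this document; the resource field names, the `Running` phase, the `sessionId` return field, and the injected `listRuns` / `updateStatus` helpers are all hypothetical.

```javascript
// Sketch only: field names, phases other than Queued, and the injected
// helpers are hypothetical.
async function reconcileQueuedRuns({ listRuns, launchSession, updateStatus }) {
  const runs = await listRuns();
  const queued = runs.filter((run) => run.status?.phase === 'Queued');
  const results = [];
  for (const run of queued) {
    try {
      const session = await launchSession({
        stack: run.spec.stackRef,
        contextBundle: run.spec.contextBundle,
        permissionSnapshot: run.spec.permissionSnapshot,
        workspace: run.spec.workspace,
      });
      await updateStatus(run, { phase: 'Running', agentMuxSessionId: session.sessionId });
      results.push({ name: run.metadata.name, launched: true });
    } catch (err) {
      // Leave the run Queued so the next reconcile pass retries it.
      await updateStatus(run, { phase: 'Queued', lastError: err.message });
      results.push({ name: run.metadata.name, launched: false });
    }
  }
  return results;
}
```

Processing runs sequentially keeps the sketch simple; a real controller would likely also need a retry limit so a permanently failing run does not stay Queued forever.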
1361
+
1362
+ **Estimated effort:** 3-5 days (dispatch reconciliation controller)
1363
+
1364
+ ---
1365
+
1366
+ ### Gap 11: Multi-Session Orchestration
1367
+
1368
+ **What krate does now:**
1369
+ `AgentSubagent` defines child-agent roles with task kinds and tool subsets.
1370
+ Stack reconciliation resolves subagent references.
1371
+
1372
+ **What it should do:**
1373
+ Orchestrate multiple concurrent mux sessions (one per subagent) for a single
1374
+ dispatch run. Track progress, handle dependencies between subagent tasks.
1375
+
1376
+ **What blocks it:**
1377
+ No multi-session coordinator. Would need significant orchestration logic
1378
+ (task graph, dependency resolution, failure handling, rollback).
1379
+
1380
+ **Estimated effort:** 10-15 days (orchestrator, state machine, failure handling)
1381
+
1382
+ ---
1383
+
1384
+ ### Gap 12: Memory Query at Dispatch Time
1385
+
1386
+ **What krate does now:**
1387
+ `AgentMemorySnapshot` pins memory state at dispatch time. `AgentMemoryQuery`
1388
+ records retrieval requests. Both are CRD resources.
1389
+
1390
+ **What it should do:**
1391
+ At dispatch time, automatically query memory repositories, create snapshot,
1392
+ include relevant memory in the context bundle delivered to mux.
1393
+
1394
+ **What blocks it:**
1395
+ No automated memory query pipeline. Currently snapshots/queries are created
1396
+ manually or by external automation. Need dispatch-time hook.
1397
+
1398
+ **Estimated effort:** 3-5 days (dispatch hook, query execution, bundle injection)
1399
+
1400
+ ---
1401
+
1402
+ ### Gap 13: Provider Config Resolution
1403
+
1404
+ **What krate does now:**
1405
+ `AgentProviderConfig` stores API base URLs, auth types, rate limits per provider.
1406
+ `AgentGatewayConfig` stores gateway connection settings.
1407
+
1408
+ **What it should do:**
1409
+ Resolve provider config at session launch time and pass credentials/settings
1410
+ to mux. Currently mux must independently manage provider credentials.
1411
+
1412
+ **What blocks it:**
1413
+ Security boundary -- krate should not pass raw API keys to mux (secret
1414
+ handling is mux's responsibility). Need credential reference protocol
1415
+ (e.g. "use secret X from namespace Y").
1416
+
1417
+ **Estimated effort:** 2-3 days (credential reference protocol, secret ref forwarding)
1418
+
1419
+ ---
1420
+
1421
+ ### Summary of Integration Gaps
1422
+
1423
+ | # | Gap | Effort | Priority |
1424
+ |---|-----|--------|----------|
1425
+ | 1 | Session Lifecycle Sync | 2-3d | High |
1426
+ | 2 | Adapter Capability Caching | 0.5d | Low |
1427
+ | 3 | Transport Binding Activation | 1-2d | Medium |
1428
+ | 4 | Lifecycle Event Emission | 3-5d | Medium |
1429
+ | 5 | Cost Aggregation | 2-3d | Medium |
1430
+ | 6 | Real-Time Session Streaming | 5-8d | High |
1431
+ | 7 | Approval Gate Integration | 2-3d | High |
1432
+ | 8 | Context Bundle Delivery | 1-2d | High |
1433
+ | 9 | Workspace Mount Coordination | 3-5d | Medium |
1434
+ | 10 | Trigger-to-Dispatch Pipeline | 3-5d | Critical |
1435
+ | 11 | Multi-Session Orchestration | 10-15d | Low |
1436
+ | 12 | Memory Query at Dispatch Time | 3-5d | Medium |
1437
+ | 13 | Provider Config Resolution | 2-3d | Medium |
1438
+
1439
+ **Total estimated integration effort:** roughly 38-60 developer-days (the per-gap ranges sum to 37.5-59.5)
1440
+
1441
+ **Critical path:** Gap 10 (Trigger-to-Dispatch) is the minimal viable
1442
+ integration -- without it, no trigger rule can automatically launch an agent.
1443
+ Gaps 1, 7, and 8 are next priority as they complete the basic dispatch-to-completion
1444
+ lifecycle.