@a5c-ai/kradle 5.0.1-staging.3abdf9534c25
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/Dockerfile +31 -0
- package/README.md +187 -0
- package/bin/kradle-demo.mjs +23 -0
- package/bin/kradle-server.mjs +14 -0
- package/dist/kradle-controller-ui.json +3482 -0
- package/dist/kradle-lifecycle.json +201 -0
- package/dist/kradle-runtime-snapshot.json +3125 -0
- package/dist/kradle-summary.json +724 -0
- package/docs/README.md +61 -0
- package/docs/agents/README.md +83 -0
- package/docs/agents/acceptance-test-matrix.md +193 -0
- package/docs/agents/agent-mux-adapter-contract.md +167 -0
- package/docs/agents/agent-mux-source-map.md +310 -0
- package/docs/agents/agent-run-memory-import-spec.md +256 -0
- package/docs/agents/agent-stack-management-spec.md +421 -0
- package/docs/agents/api-contract-spec.md +309 -0
- package/docs/agents/artifacts-writeback-spec.md +145 -0
- package/docs/agents/chart-packaging-spec.md +128 -0
- package/docs/agents/ci-orchestration-spec.md +140 -0
- package/docs/agents/context-assembly-spec.md +219 -0
- package/docs/agents/controller-reconciliation-spec.md +255 -0
- package/docs/agents/crd-schema-spec.md +315 -0
- package/docs/agents/decision-log-open-questions.md +169 -0
- package/docs/agents/developer-implementation-checklist.md +329 -0
- package/docs/agents/dispatching-design.md +262 -0
- package/docs/agents/gaps-agent-mux-to-kradle-crds.md +298 -0
- package/docs/agents/glossary.md +66 -0
- package/docs/agents/implementation-blueprint.md +324 -0
- package/docs/agents/implementation-rollout-slices.md +251 -0
- package/docs/agents/memory-context-integration-spec.md +194 -0
- package/docs/agents/memory-ontology-schema-spec.md +253 -0
- package/docs/agents/memory-operations-runbook.md +121 -0
- package/docs/agents/mvp-vertical-slice-spec.md +146 -0
- package/docs/agents/observability-audit-spec.md +265 -0
- package/docs/agents/operator-runbook.md +174 -0
- package/docs/agents/org-memory-api-payload-examples.md +333 -0
- package/docs/agents/org-memory-controller-sequence-spec.md +181 -0
- package/docs/agents/org-memory-e2e-fixture-plan.md +161 -0
- package/docs/agents/org-memory-ui-implementation-map.md +114 -0
- package/docs/agents/org-memory-vertical-slice-spec.md +168 -0
- package/docs/agents/org-resource-model-delta-spec.md +111 -0
- package/docs/agents/org-route-resource-model-spec.md +183 -0
- package/docs/agents/org-scoping-namespace-spec.md +114 -0
- package/docs/agents/rbac-secrets-management-spec.md +406 -0
- package/docs/agents/repository-page-integration-spec.md +255 -0
- package/docs/agents/resource-contract-examples.md +808 -0
- package/docs/agents/resource-relationship-map.md +190 -0
- package/docs/agents/security-threat-model.md +188 -0
- package/docs/agents/shared-memory-company-brain-spec.md +358 -0
- package/docs/agents/storage-migration-spec.md +168 -0
- package/docs/agents/subagent-orchestration-spec.md +152 -0
- package/docs/agents/system-overview.md +88 -0
- package/docs/agents/tools-mcp-skills-spec.md +189 -0
- package/docs/agents/traceability-matrix.md +79 -0
- package/docs/agents/ui-flow-spec.md +211 -0
- package/docs/agents/ui-ux-system-spec.md +426 -0
- package/docs/agents/workspace-lifecycle-spec.md +166 -0
- package/docs/architecture-spec.md +78 -0
- package/docs/architecture-v2.md +2759 -0
- package/docs/components/control-plane.md +78 -0
- package/docs/components/data-plane.md +69 -0
- package/docs/components/hooks-events.md +67 -0
- package/docs/components/identity-rbac-policy.md +73 -0
- package/docs/components/kubevela-oam.md +70 -0
- package/docs/components/operations-publishing.md +81 -0
- package/docs/components/runners-ci.md +66 -0
- package/docs/components/web-ui.md +94 -0
- package/docs/crd-behaviors-and-relationships.md +3926 -0
- package/docs/external/README.md +47 -0
- package/docs/external/bidirectional-sync-design.md +134 -0
- package/docs/external/cicd-interface.md +64 -0
- package/docs/external/external-backend-controllers.md +170 -0
- package/docs/external/external-backend-crds.md +234 -0
- package/docs/external/external-backend-ui-spec.md +151 -0
- package/docs/external/external-backend-ux-flows.md +115 -0
- package/docs/external/external-object-mapping.md +125 -0
- package/docs/external/git-forge-interface.md +68 -0
- package/docs/external/github-integration-design.md +151 -0
- package/docs/external/issue-tracking-interface.md +66 -0
- package/docs/external/provider-capability-manifests.md +204 -0
- package/docs/external/provider-catalog.md +139 -0
- package/docs/external/provider-rollout-testing.md +78 -0
- package/docs/external/research-results.md +48 -0
- package/docs/external/security-auth-permissions.md +81 -0
- package/docs/external/sync-state-machines.md +108 -0
- package/docs/external/unified-external-backend-model.md +107 -0
- package/docs/external/user-facing-changes.md +67 -0
- package/docs/gaps.md +161 -0
- package/docs/install.md +94 -0
- package/docs/integration-and-design-decisions.md +1530 -0
- package/docs/kradle-design.md +334 -0
- package/docs/local-minikube.md +55 -0
- package/docs/ontology/README.md +32 -0
- package/docs/ontology/bounded-contexts.md +29 -0
- package/docs/ontology/events-and-hooks.md +32 -0
- package/docs/ontology/oam-kubevela.md +32 -0
- package/docs/ontology/operations-and-release.md +25 -0
- package/docs/ontology/personas-and-actors.md +32 -0
- package/docs/ontology/policies-and-invariants.md +33 -0
- package/docs/ontology/problem-space.md +30 -0
- package/docs/ontology/resource-contracts.md +40 -0
- package/docs/ontology/resource-taxonomy.md +42 -0
- package/docs/ontology/runners-and-ci.md +29 -0
- package/docs/ontology/solution-space.md +24 -0
- package/docs/ontology/storage-and-data-boundaries.md +29 -0
- package/docs/ontology/validation-matrix.md +24 -0
- package/docs/ontology/web-ui-excellent-flows.md +32 -0
- package/docs/ontology/workflows.md +39 -0
- package/docs/ontology/world.md +35 -0
- package/docs/openapi.yaml +1291 -0
- package/docs/product-requirements.md +62 -0
- package/docs/requirements-v2.md +235 -0
- package/docs/roadmap-mvp.md +87 -0
- package/docs/sdk-api-reference.md +1108 -0
- package/docs/system-requirements.md +90 -0
- package/docs/system-spec-v2.md +1230 -0
- package/docs/tests/README.md +53 -0
- package/docs/tests/agent-qa-plan.md +63 -0
- package/docs/tests/browser-ui-tests.md +62 -0
- package/docs/tests/ci-quality-gates.md +48 -0
- package/docs/tests/coverage-model.md +64 -0
- package/docs/tests/e2e-scenario-tests.md +53 -0
- package/docs/tests/fixtures-test-data.md +63 -0
- package/docs/tests/observability-reliability-tests.md +54 -0
- package/docs/tests/product-test-matrix.md +145 -0
- package/docs/tests/qa-adoption-roadmap.md +130 -0
- package/docs/tests/qa-automation-plan.md +101 -0
- package/docs/tests/security-compliance-tests.md +57 -0
- package/docs/tests/test-framework-tools.md +88 -0
- package/docs/tests/test-suite-layout.md +121 -0
- package/docs/tests/unit-integration-tests.md +48 -0
- package/docs/todo-kyverno +714 -0
- package/docs/todos.md +4 -0
- package/docs/user-stories.md +78 -0
- package/docs/web-console-spec.md +533 -0
- package/examples/minikube-demo.yaml +190 -0
- package/examples/oam-application.yaml +23 -0
- package/examples/policy-kyverno-pr-title.yaml +18 -0
- package/package.json +66 -0
- package/scripts/build.mjs +29 -0
- package/scripts/setup-minikube.mjs +65 -0
- package/scripts/smoke.mjs +37 -0
- package/scripts/validate-doc-coverage.mjs +152 -0
- package/scripts/validate-package.mjs +95 -0
- package/scripts/validate-ui.mjs +305 -0
- package/src/agent-adapter-controller.js +169 -0
- package/src/agent-approval-controller.js +170 -0
- package/src/agent-context-bundles.js +242 -0
- package/src/agent-dispatch-controller.js +549 -0
- package/src/agent-gateway-config-controller.js +147 -0
- package/src/agent-identity-migration.js +115 -0
- package/src/agent-memory-controller.js +357 -0
- package/src/agent-memory-import.js +327 -0
- package/src/agent-memory-query.js +292 -0
- package/src/agent-memory-repository-source-controller.js +255 -0
- package/src/agent-mux-client.js +589 -0
- package/src/agent-permission-review.js +250 -0
- package/src/agent-persona-controller.js +135 -0
- package/src/agent-project-controller.js +117 -0
- package/src/agent-prompt-composition.js +55 -0
- package/src/agent-provider-config-controller.js +151 -0
- package/src/agent-secret-config-grant-controller.js +282 -0
- package/src/agent-session-transcript-controller.js +189 -0
- package/src/agent-stack-controller.js +421 -0
- package/src/agent-subagent-controller.js +160 -0
- package/src/agent-transport-binding-controller.js +121 -0
- package/src/agent-trigger-controller.js +387 -0
- package/src/agent-workspace-controller.js +702 -0
- package/src/agent-writeback-controller.js +302 -0
- package/src/api-controller.js +621 -0
- package/src/argocd-gitops.js +43 -0
- package/src/artifact-registry-controller.js +542 -0
- package/src/assistant-runtime.js +284 -0
- package/src/async-controller.js +207 -0
- package/src/audit-controller.js +191 -0
- package/src/auth.js +310 -0
- package/src/component-catalog.js +41 -0
- package/src/control-plane.js +136 -0
- package/src/controller-client.js +112 -0
- package/src/controller-ui.js +620 -0
- package/src/data-plane.js +179 -0
- package/src/event-bus.js +397 -0
- package/src/external/conflict-controller.js +225 -0
- package/src/external/github/auth.js +96 -0
- package/src/external/github/cicd.js +180 -0
- package/src/external/github/git-forge.js +240 -0
- package/src/external/github/index.js +144 -0
- package/src/external/github/issue-tracking.js +163 -0
- package/src/external/provider-adapter.js +161 -0
- package/src/external/provider-resource-factory.js +221 -0
- package/src/external/sync-controller.js +235 -0
- package/src/external/webhook-controller.js +144 -0
- package/src/external/write-controller.js +283 -0
- package/src/gitea-backend.js +131 -0
- package/src/gitea-service.js +173 -0
- package/src/handoff.js +98 -0
- package/src/health-probes.js +134 -0
- package/src/hooks-events.js +63 -0
- package/src/hooks-lifecycle.js +117 -0
- package/src/http-server.js +409 -0
- package/src/identity-policy.js +86 -0
- package/src/index.js +71 -0
- package/src/jitsi-agent-bridge.js +141 -0
- package/src/jitsi-meeting-controller.js +291 -0
- package/src/jitsi-sync-controller.js +198 -0
- package/src/kradle-inference-service-controller.js +246 -0
- package/src/kubernetes-controller-async.js +531 -0
- package/src/kubernetes-controller.js +904 -0
- package/src/kubernetes-resource-gateway.js +48 -0
- package/src/model-route-controller.js +364 -0
- package/src/notification-controller.js +178 -0
- package/src/operations.js +112 -0
- package/src/org-scoping.js +5 -0
- package/src/resource-model.js +282 -0
- package/src/runner-controller.js +272 -0
- package/src/runners-ci.js +48 -0
- package/src/runtime.js +196 -0
- package/src/snapshot-cache.js +157 -0
- package/src/virtual-model-controller.js +538 -0
- package/src/virtual-model-hook-bridge.js +200 -0
- package/src/web-ui.js +40 -0
- package/tests/agent-adapter-controller.test.js +361 -0
- package/tests/agent-approval-controller.test.js +173 -0
- package/tests/agent-context-bundles.test.js +278 -0
- package/tests/agent-dispatch-controller.test.js +679 -0
- package/tests/agent-gateway-config-controller.test.js +386 -0
- package/tests/agent-identity-migration.test.js +87 -0
- package/tests/agent-memory-controller.test.js +461 -0
- package/tests/agent-memory-import-snapshot.test.js +477 -0
- package/tests/agent-memory-query.test.js +404 -0
- package/tests/agent-memory-repository-source.test.js +514 -0
- package/tests/agent-mux-client.test.js +389 -0
- package/tests/agent-mux-integration.test.js +971 -0
- package/tests/agent-permission-review-v2.test.js +317 -0
- package/tests/agent-permission-review.test.js +209 -0
- package/tests/agent-persona-controller.test.js +127 -0
- package/tests/agent-project-controller.test.js +302 -0
- package/tests/agent-prompt-composition.test.js +76 -0
- package/tests/agent-provider-config-controller.test.js +376 -0
- package/tests/agent-resources.test.js +303 -0
- package/tests/agent-secret-config-grant.test.js +231 -0
- package/tests/agent-session-transcript-controller.test.js +499 -0
- package/tests/agent-stack-controller.test.js +283 -0
- package/tests/agent-subagent-controller.test.js +201 -0
- package/tests/agent-transport-binding-controller.test.js +294 -0
- package/tests/agent-trigger-controller.test.js +271 -0
- package/tests/agent-trigger-routes.test.js +190 -0
- package/tests/agent-trigger-sources.test.js +245 -0
- package/tests/agent-workspace-controller.test.js +181 -0
- package/tests/agent-writeback.test.js +292 -0
- package/tests/approval-persistence.test.js +171 -0
- package/tests/artifact-registry.test.js +511 -0
- package/tests/assistant-runtime.test.js +506 -0
- package/tests/async-controller.test.js +252 -0
- package/tests/audit-controller.test.js +227 -0
- package/tests/codespace-controller.test.js +318 -0
- package/tests/controller-client.test.js +133 -0
- package/tests/deployment.test.js +527 -0
- package/tests/e2e/lifecycle.test.js +120 -0
- package/tests/event-bus-integration.test.js +355 -0
- package/tests/external-github-forge.test.js +560 -0
- package/tests/external-github-issues-cicd.test.js +520 -0
- package/tests/external-integration.test.js +470 -0
- package/tests/external-persistence.test.js +415 -0
- package/tests/external-provider-adapter.test.js +365 -0
- package/tests/external-resource-model.test.js +223 -0
- package/tests/external-webhook-sync.test.js +287 -0
- package/tests/external-write-conflict.test.js +353 -0
- package/tests/gitea-service.test.js +253 -0
- package/tests/health-check-real.test.js +165 -0
- package/tests/health-probes.test.js +90 -0
- package/tests/hooks-lifecycle.test.js +364 -0
- package/tests/integration/full-flow.test.js +266 -0
- package/tests/jitsi-agent-bridge.test.js +119 -0
- package/tests/jitsi-helm-integration.test.js +77 -0
- package/tests/jitsi-meeting-controller.test.js +170 -0
- package/tests/jitsi-resource-model.test.js +73 -0
- package/tests/jitsi-sync-controller.test.js +112 -0
- package/tests/kradle-inference-service.test.js +689 -0
- package/tests/kradle.test.js +779 -0
- package/tests/memory-search-wiring.test.js +270 -0
- package/tests/model-route-controller.test.js +733 -0
- package/tests/notification-controller.test.js +196 -0
- package/tests/notification-integration.test.js +179 -0
- package/tests/org-scoping.test.js +687 -0
- package/tests/runner-controller.test.js +327 -0
- package/tests/runner-integration.test.js +231 -0
- package/tests/session-cookie-hmac.test.js +151 -0
- package/tests/snapshot-performance.test.js +315 -0
- package/tests/sse-events.test.js +107 -0
- package/tests/virtual-model-controller.test.js +877 -0
- package/tests/virtual-model-hook-bridge.test.js +384 -0
- package/tests/webhook-trigger.test.js +198 -0
- package/tests/workspace-volumes.test.js +312 -0
- package/tests/writeback-persistence.test.js +207 -0
|
@@ -0,0 +1,1530 @@
|
|
|
1
|
+
# Integration & Design Decisions
|
|
2
|
+
|
|
3
|
+
Supplementary specification covering external dependencies, scope boundaries,
|
|
4
|
+
architectural trade-offs, and system nuances for the Kradle project.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## 1. External Dependencies & Integration Points
|
|
9
|
+
|
|
10
|
+
### 1.1 Agent-Mux Dependency
|
|
11
|
+
|
|
12
|
+
#### What Kradle Imports from @a5c-ai/agent-mux
|
|
13
|
+
|
|
14
|
+
Kradle does not import agent-mux as an npm dependency. Instead, it communicates
|
|
15
|
+
with the agent-mux gateway over HTTP. The integration surface lives entirely
|
|
16
|
+
within `core/src/agent-mux-client.js`, which provides:
|
|
17
|
+
|
|
18
|
+
- **queryCapabilities(adapter)** -- GET `/api/v1/agents/{adapter}/capabilities`
|
|
19
|
+
- **launchSession({stack, contextBundle, permissionSnapshot, workspace})** -- POST `/api/v1/sessions`
|
|
20
|
+
- **getSessionStatus(sessionId)** -- GET `/api/v1/sessions/{sessionId}`
|
|
21
|
+
- **subscribeToEvents(runId, callback)** -- GET `/api/v1/runs/{runId}/events` (SSE)
|
|
22
|
+
- **reconcileTranscript(sessionId, events, options)** -- local data transformation
|
|
23
|
+
|
|
24
|
+
The boundary declaration (`AGENT_MUX_CLIENT_BOUNDARY`) explicitly states:
|
|
25
|
+
- Owns: gateway HTTP calls, SSE event streaming, transcript reconciliation
|
|
26
|
+
- Delegates to: resource-model (for creating AgentSessionTranscript resources)
|
|
27
|
+
- Must not own: secret values, permission review, resource persistence
|
|
28
|
+
|
|
29
|
+
#### How agent-mux-client.js Connects to the Gateway
|
|
30
|
+
|
|
31
|
+
Connection is HTTP-only, using Node.js built-in `node:http` / `node:https`.
|
|
32
|
+
It has no external fetch or HTTP client dependencies. The internal `httpRequest()`
|
|
33
|
+
helper performs raw `transport.request()` calls with JSON serialization.
|
|
34
|
+
|
|
35
|
+
Connection parameters:
|
|
36
|
+
- `gateway` string -- full base URL (e.g. `http://agent-mux-gateway:8080`)
|
|
37
|
+
- `enabled` boolean -- client methods return `null` early when disabled
|
|
38
|
+
- Timeout: 30s default per request
|
|
39
|
+
- Protocol: auto-detected from URL scheme (http:// vs https://)
|
|
40
|
+
|
|
41
|
+
SSE streaming uses a persistent HTTP connection with:
|
|
42
|
+
- Reconnection via exponential backoff (1s, 2s, 4s... capped at 30s)
|
|
43
|
+
- Backoff reset on successful connection establishment
|
|
44
|
+
- Graceful abort via returned `{ abort }` handle
|
|
45
|
+
- Buffer-based SSE parsing (splits on `\n\n`, extracts `data:` lines)
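
The backoff schedule and buffer-based parsing listed above can be sketched as follows. This is a minimal illustration, not the code in `agent-mux-client.js`: the names `backoffDelayMs` and `parseSseBuffer` are hypothetical, and the parser is simplified (it assumes `\n` line endings and ignores `event:`, `id:`, and `retry:` fields).

```javascript
// Hypothetical helper names; not taken from agent-mux-client.js.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;

// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, ... capped at 30s.
function backoffDelayMs(attempt) {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

// Consume complete SSE frames from `buffer`. Frames are separated by a blank
// line ("\n\n"); the trailing partial frame is returned as `rest` so the
// caller can prepend it to the next network chunk.
function parseSseBuffer(buffer) {
  const frames = buffer.split('\n\n');
  const rest = frames.pop(); // incomplete frame (possibly an empty string)
  const events = [];
  for (const frame of frames) {
    const data = frame
      .split('\n')
      .filter((line) => line.startsWith('data:'))
      .map((line) => line.slice(5).trimStart())
      .join('\n');
    if (data) events.push(data);
  }
  return { events, rest };
}
```

A caller would keep `rest` between chunks (`buffer = rest + nextChunk`) and reset the attempt counter to 0 once a connection is established, which matches the backoff-reset behavior described above.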
|
|
46
|
+
|
|
47
|
+
#### What Works WITHOUT Agent-Mux
|
|
48
|
+
|
|
49
|
+
The following subsystems are fully operational without agent-mux:
|
|
50
|
+
|
|
51
|
+
1. **Resource CRUD** -- All 76 resource kinds can be created, listed, updated, deleted via kubectl
|
|
52
|
+
2. **Web Console** -- All 57 pages render; agent dispatch pages show "gateway unavailable" state
|
|
53
|
+
3. **Auth & Sessions** -- OAuth login, cookie sessions, delegated identity all functional
|
|
54
|
+
4. **Project Management** -- KradleProject, Issue, PullRequest lifecycle fully local
|
|
55
|
+
5. **Memory System** -- AgentMemoryRepository, Source, Ontology, Query, Import all operate on CRDs
|
|
56
|
+
6. **Workspace Provisioning** -- KradleWorkspace PVC specs, git ops, codespace generation
|
|
57
|
+
7. **External Backends** -- Provider registration, webhook delivery, sync events
|
|
58
|
+
8. **Audit Logging** -- Audit controller records all mutations regardless of agent-mux state
|
|
59
|
+
9. **Policy Engine** -- Kyverno integration, PolicyProfile, PolicyBinding
|
|
60
|
+
10. **Runner Pool Management** -- RunnerPool specs, scheduling policies
|
|
61
|
+
11. **Notification System** -- Resource-change notifications via event bus
|
|
62
|
+
12. **MCP Server** -- All 14 tools operational; `kradle_dispatch_agent` returns error status
|
|
63
|
+
|
|
64
|
+
#### What REQUIRES Agent-Mux
|
|
65
|
+
|
|
66
|
+
The following operations fail gracefully (return null/error) without a running gateway:
|
|
67
|
+
|
|
68
|
+
1. **Real Session Creation** -- `launchSession()` returns null; AgentSession resource created but status stays `Pending`
|
|
69
|
+
2. **Event Streaming** -- `subscribeToEvents()` keeps retrying indefinitely, with the delay between reconnect attempts capped at 30s
|
|
70
|
+
3. **Transcript Reconciliation** -- No events to reconcile; AgentSessionTranscript never moves to `Reconciled` phase
|
|
71
|
+
4. **Adapter Capability Discovery** -- `queryCapabilities()` returns null; stack readiness condition degraded
|
|
72
|
+
5. **Live Session Status** -- `getSessionStatus()` returns null; UI shows "status unavailable"
|
|
73
|
+
6. **Cost Calculation** -- Token usage comes from SSE events; without them, cost fields remain zero
|
|
74
|
+
|
|
75
|
+
#### Gateway URL Configuration
|
|
76
|
+
|
|
77
|
+
The gateway URL flows through the system as follows:
|
|
78
|
+
|
|
79
|
+
```
Helm values.yaml:
  agents:
    agentMux:
      enabled: false    # Must be true for integration
      gateway: ""       # Full URL to agent-mux gateway

  --> Rendered into Deployment env:
      KRADLE_AGENT_MUX_ENABLED=true
      KRADLE_AGENT_MUX_GATEWAY=http://agent-mux-gateway:8080

  --> Read by API controller initialization:
      createAgentMuxClient({
        gateway: process.env.KRADLE_AGENT_MUX_GATEWAY || '',
        enabled: process.env.KRADLE_AGENT_MUX_ENABLED === 'true'
      })

  --> Persisted in AgentGatewayConfig CRD:
      spec.gatewayUrl (for UI display and runtime reconfiguration)
```
|
|
99
|
+
|
|
100
|
+
The web container does NOT communicate with agent-mux directly. All agent
|
|
101
|
+
operations route through the API container's `/api/agents/*` endpoints, which
|
|
102
|
+
delegate to the mux client instance.
|
|
103
|
+
|
|
104
|
+
#### Runtime Availability Check
|
|
105
|
+
|
|
106
|
+
```javascript
client.isAvailable()  // returns: enabled && !!gateway
```
|
|
109
|
+
|
|
110
|
+
Every client method checks `isAvailable()` first and returns `null` immediately
|
|
111
|
+
when the gateway is not configured. This prevents connection errors from
|
|
112
|
+
propagating into the resource reconciliation loops.
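
The guard pattern can be sketched like this. It is an illustration under assumptions, not the actual client: `createClientSketch` and the injected `request` function are hypothetical stand-ins for the real factory and its internal HTTP helper.

```javascript
// Hypothetical sketch; `request` stands in for the client's internal HTTP helper.
function createClientSketch({ gateway = '', enabled = false, request }) {
  // Available only when the integration is enabled AND a gateway URL is set.
  const isAvailable = () => enabled && !!gateway;

  return {
    isAvailable,
    async getSessionStatus(sessionId) {
      if (!isAvailable()) return null; // no network call is attempted
      return request('GET', `${gateway}/api/v1/sessions/${sessionId}`);
    },
  };
}
```

Because the guard returns before any request is built, a disabled or unconfigured client cannot raise connection errors; callers only need to handle a `null` result.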
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
### 1.2 Transport-Mux Dependency
|
|
117
|
+
|
|
118
|
+
#### How Transport-Mux Handles Protocol Translation
|
|
119
|
+
|
|
120
|
+
Transport-mux is an external component (not bundled with kradle) that provides
|
|
121
|
+
protocol translation between different agent communication channels:
|
|
122
|
+
|
|
123
|
+
- **stdio** -- stdin/stdout JSON-RPC for local CLI agents
|
|
124
|
+
- **http** -- REST/SSE for cloud-hosted agents
|
|
125
|
+
- **websocket** -- bidirectional streaming for persistent connections
|
|
126
|
+
- **unix** -- Unix domain socket for same-host agents
|
|
127
|
+
|
|
128
|
+
The transport-mux runtime sits between the agent-mux gateway and the actual
|
|
129
|
+
agent process, handling message framing, connection lifecycle, and protocol
|
|
130
|
+
negotiation.
|
|
131
|
+
|
|
132
|
+
#### What Kradle's AgentTransportBinding Models
|
|
133
|
+
|
|
134
|
+
The `AgentTransportBinding` CRD captures connection configuration:
|
|
135
|
+
|
|
136
|
+
```yaml
spec:
  adapterRef: "claude-adapter-v1"
  endpoint: "wss://agent.example.com/ws"
  protocol: "websocket"  # One of: stdio, http, websocket, unix
  reconnectPolicy:
    maxRetries: 3
    backoffMs: 1000
    maxBackoffMs: 30000
  auth:
    type: "bearer"
    secretRef: "agent-token-secret"
  healthCheck:
    endpoint: "/health"
    intervalMs: 30000
```
|
|
152
|
+
|
|
153
|
+
The controller (`agent-transport-binding-controller.js`) provides:
|
|
154
|
+
- **Validation** -- Ensures required fields are present and that the protocol is one of the 4 valid types
|
|
155
|
+
- **Connection Status Tracking** -- Reads `status.connectionStatus` (set externally)
|
|
156
|
+
- **Reconnect Policy Resolution** -- Merges spec overrides with defaults (3 retries, 1s-30s backoff)
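
A minimal sketch of the validation and policy-resolution behavior described above. The function names are hypothetical, and the set of required fields (`adapterRef`, `endpoint`, `protocol`) is an assumption made for illustration; only the four protocols and the default values come from this document.

```javascript
const SUPPORTED_PROTOCOLS = ['stdio', 'http', 'websocket', 'unix'];
const DEFAULT_RECONNECT_POLICY = { maxRetries: 3, backoffMs: 1000, maxBackoffMs: 30000 };

// Returns a list of validation errors for a binding spec (empty when valid).
// The required-field list is an assumption, not the controller's actual rule set.
function validateBindingSpec(spec = {}) {
  const errors = [];
  for (const field of ['adapterRef', 'endpoint', 'protocol']) {
    if (!spec[field]) errors.push(`spec.${field} is required`);
  }
  if (spec.protocol && !SUPPORTED_PROTOCOLS.includes(spec.protocol)) {
    errors.push(`spec.protocol must be one of: ${SUPPORTED_PROTOCOLS.join(', ')}`);
  }
  return errors;
}

// Spec overrides win; any field left unset falls back to the default.
function resolveReconnectPolicy(spec = {}) {
  return { ...DEFAULT_RECONNECT_POLICY, ...(spec.reconnectPolicy || {}) };
}
```

For example, a spec that sets only `reconnectPolicy.maxRetries: 5` would resolve to 5 retries with the default 1s to 30s backoff window.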
|
|
157
|
+
|
|
158
|
+
#### Gap: Kradle Validates Transport but Never Activates the Runtime
|
|
159
|
+
|
|
160
|
+
The transport binding controller is purely declarative. It:
|
|
161
|
+
- Validates the spec shape
|
|
162
|
+
- Reports connection status from the resource's status field
|
|
163
|
+
- Returns supported protocols list
|
|
164
|
+
|
|
165
|
+
It does NOT:
|
|
166
|
+
- Open actual TCP/WebSocket connections
|
|
167
|
+
- Start stdio processes
|
|
168
|
+
- Perform health checks (despite modeling healthCheck in spec)
|
|
169
|
+
- Activate the transport-mux runtime component
|
|
170
|
+
- Signal transport-mux to register a new binding
|
|
171
|
+
|
|
172
|
+
The intent is that transport-mux watches AgentTransportBinding resources via
|
|
173
|
+
Kubernetes watch and self-reconciles. Kradle's role is to persist the desired
|
|
174
|
+
state and present it in the UI.
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
### 1.3 Hooks-Mux Dependency
|
|
179
|
+
|
|
180
|
+
#### What Hooks-Mux Provides
|
|
181
|
+
|
|
182
|
+
The hooks-mux system (external to kradle) provides lifecycle event dispatching
|
|
183
|
+
for agent runs:
|
|
184
|
+
|
|
185
|
+
- `RUN_CREATED` -- Agent dispatch initiated
|
|
186
|
+
- `STEP_STARTED` -- Agent begins a tool use or reasoning step
|
|
187
|
+
- `STEP_COMPLETED` -- Agent finishes a step
|
|
188
|
+
- `APPROVAL_REQUESTED` -- Agent needs human gate
|
|
189
|
+
- `APPROVAL_GRANTED` / `APPROVAL_DENIED` -- Gate resolved
|
|
190
|
+
- `RUN_COMPLETED` -- Agent run finished (success/failure/timeout)
|
|
191
|
+
- `RUN_CANCELLED` -- Agent run externally terminated
|
|
192
|
+
|
|
193
|
+
These events flow to registered webhook subscribers, trigger rules, and
|
|
194
|
+
the notification system.
|
|
195
|
+
|
|
196
|
+
#### What Kradle's Event Bus Does Instead
|
|
197
|
+
|
|
198
|
+
Kradle has its own in-process event bus (`core/src/event-bus.js`) that provides:
|
|
199
|
+
|
|
200
|
+
```javascript
const bus = createEventBus();
bus.subscribe(fn);
bus.emit({ type: 'resource-change', kind, name, operation, timestamp });
bus.emitResourceChange('Repository', 'my-repo', 'apply');
```
|
|
206
|
+
|
|
207
|
+
This bus handles:
|
|
208
|
+
- Real-time resource change notifications to SSE clients (web console live updates)
|
|
209
|
+
- Cache invalidation signals
|
|
210
|
+
- UI refresh triggers
|
|
211
|
+
|
|
212
|
+
The event bus is limited to **resource-change events only**. It does not model:
|
|
213
|
+
- Agent lifecycle events (run created/completed)
|
|
214
|
+
- Step-level granularity (tool use, reasoning)
|
|
215
|
+
- Cross-service event routing
|
|
216
|
+
- Durable event delivery with retries
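
To make the scope limits concrete, here is a minimal sketch of an in-process bus with the same three calls. It is not the code in `event-bus.js`: `createEventBusSketch` is a hypothetical name, and returning an unsubscribe function from `subscribe` is an assumption.

```javascript
function createEventBusSketch() {
  const subscribers = new Set();

  // Deliver synchronously to every subscriber in this process only.
  function emit(event) {
    for (const fn of subscribers) {
      try {
        fn(event);
      } catch {
        // A failing subscriber must not block the others; nothing is retried.
      }
    }
  }

  return {
    // Registers a listener; returning an unsubscribe function is an assumption.
    subscribe(fn) {
      subscribers.add(fn);
      return () => subscribers.delete(fn);
    },
    emit,
    emitResourceChange(kind, name, operation) {
      emit({ type: 'resource-change', kind, name, operation, timestamp: new Date().toISOString() });
    },
  };
}
```

Delivery in a design like this is fire-and-forget within a single process, which is why durable delivery and cross-service routing fall outside what such a bus can offer.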
|
|
217
|
+
|
|
218
|
+
#### Gap: No HookDispatcher Integration
|
|
219
|
+
|
|
220
|
+
Kradle has a `WebhookBus` class (`core/src/hooks-events.js`) that handles
|
|
221
|
+
outbound webhook delivery for resource events, but it is NOT connected to
|
|
222
|
+
hooks-mux lifecycle events. Specifically:
|
|
223
|
+
|
|
224
|
+
- `WebhookBus.deliver()` creates `WebhookDelivery` resources for webhook subscribers
|
|
225
|
+
- It does NOT receive or forward agent-lifecycle events from hooks-mux
|
|
226
|
+
- `AgentTriggerRule` resources define event-to-stack routing but the trigger
|
|
227
|
+
evaluation is purely resource-driven (no real-time hook stream)
|
|
228
|
+
- `AgentTriggerExecution` records are created by the trigger controller but
|
|
229
|
+
the trigger is evaluated on resource watch events, not on hooks-mux push
|
|
230
|
+
|
|
231
|
+
The missing integration:
|
|
232
|
+
1. No hooks-mux webhook receiver endpoint in kradle's HTTP server
|
|
233
|
+
2. No translation layer from hooks-mux event format to kradle event bus
|
|
234
|
+
3. No lifecycle event emission when AgentDispatchRun status changes
|
|
235
|
+
4. No step-level event tracking (only run-level status)
|
|
236
|
+
|
|
237
|
+
---
|
|
238
|
+
|
|
239
|
+
### 1.4 Babysitter-SDK Dependency
|
|
240
|
+
|
|
241
|
+
#### The .a5c/processes Pattern
|
|
242
|
+
|
|
243
|
+
Babysitter-SDK uses a file-based process definition pattern:
|
|
244
|
+
|
|
245
|
+
```
.a5c/
  processes/
    my-process.json     # Process definition
  runs/
    <runId>/
      journal.json      # Event log
      state.json        # Current state
      effects/          # Effect outputs
```
|
|
255
|
+
|
|
256
|
+
Kradle processes (if any) would follow this same pattern. However, kradle does
|
|
257
|
+
not currently define babysitter processes for its own operations. The
|
|
258
|
+
integration point is on the **import** side -- kradle can ingest babysitter
|
|
259
|
+
run artifacts into its memory system.
|
|
260
|
+
|
|
261
|
+
#### How AgentRunMemoryImport Connects to Babysitter Journal Parsing
|
|
262
|
+
|
|
263
|
+
The `agent-memory-import.js` module provides:
|
|
264
|
+
|
|
265
|
+
```javascript
parseJournalForImport(journal)
// Returns: { summary, keyEvents, effectSummary }
```
|
|
269
|
+
|
|
270
|
+
This function:
|
|
271
|
+
1. Accepts a babysitter `.a5c` journal array (raw event objects)
|
|
272
|
+
2. Extracts structural metadata only (no raw task content, no effect payloads)
|
|
273
|
+
3. Produces a summary with: runId, processId, eventCount, durationMs, runStatus
|
|
274
|
+
4. Extracts key events: run_start, task_completed (with effect kind/result), breakpoint, run_end
|
|
275
|
+
5. Computes effect summary: successCount, failureCount, effectKinds array
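
The steps above can be illustrated with a sketch. This is not the implementation of `parseJournalForImport`: the function name `summarizeJournalSketch` is hypothetical, and the journal event fields used here (`type`, `timestamp`, `runId`, `processId`, `status`, `effectKind`, `ok`) are assumptions, since the babysitter journal schema is not shown in this document.

```javascript
const KEY_EVENT_TYPES = new Set(['run_start', 'task_completed', 'breakpoint', 'run_end']);

// Assumed event shape: { type, timestamp, runId?, processId?, status?, effectKind?, ok? }
function summarizeJournalSketch(journal = []) {
  const first = journal[0] || {};
  const last = journal[journal.length - 1] || {};
  const runEnd = journal.find((e) => e.type === 'run_end');

  const summary = {
    runId: first.runId ?? null,
    processId: first.processId ?? null,
    eventCount: journal.length,
    durationMs: journal.length ? Date.parse(last.timestamp) - Date.parse(first.timestamp) : 0,
    runStatus: runEnd ? runEnd.status : 'unknown',
  };

  // Keep only structural fields; task content and effect payloads are dropped.
  const keyEvents = journal
    .filter((e) => KEY_EVENT_TYPES.has(e.type))
    .map((e) => ({ type: e.type, timestamp: e.timestamp, effectKind: e.effectKind, ok: e.ok }));

  const completed = journal.filter((e) => e.type === 'task_completed');
  const effectSummary = {
    successCount: completed.filter((e) => e.ok === true).length,
    failureCount: completed.filter((e) => e.ok === false).length,
    effectKinds: [...new Set(completed.map((e) => e.effectKind).filter(Boolean))],
  };

  return { summary, keyEvents, effectSummary };
}
```

The point of the projection step is that anything not explicitly copied into `keyEvents` never reaches the memory system, which is one way to enforce the "structural metadata only" rule.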
|
|
276
|
+
|
|
277
|
+
The `AgentRunMemoryImport` CRD then stores this parsed summary alongside:
|
|
278
|
+
- `memoryRepository` reference (where to store)
|
|
279
|
+
- `source` information (which babysitter run)
|
|
280
|
+
- `include` filters (what to import)
|
|
281
|
+
- Review and redaction status
|
|
282
|
+
|
|
283
|
+
#### The Orchestration Boundary
|
|
284
|
+
|
|
285
|
+
```
Babysitter-SDK                  Kradle
-------------------------------------------
Process definition        <-->  (not used)
Run lifecycle             <-->  AgentDispatchRun (mirrors status)
Journal events            --->  AgentRunMemoryImport (parsed summary)
Session binding           <-->  AgentSession (Kradle projection)
Hook-driven continuation  <-->  AgentApproval (approval gates)
```
|
|
294
|
+
|
|
295
|
+
- **Babysitter owns**: run lifecycle (create, iterate, complete), journal/state, session binding
|
|
296
|
+
- **Kradle owns**: resource state, memory ingestion, audit trail, approval workflows
|
|
297
|
+
- The boundary is clean: babysitter does execution, kradle does desired-state persistence
|
|
298
|
+
|
|
299
|
+
---
|
|
300
|
+
|
|
301
|
+
### 1.5 Atlas Dependency
|
|
302
|
+
|
|
303
|
+
#### How the Stack Builder Queries Atlas Graph
|
|
304
|
+
|
|
305
|
+
The web console's stack builder uses Atlas as a knowledge source for populating
|
|
306
|
+
layer options. The query path is:
|
|
307
|
+
|
|
308
|
+
```
Browser (stack-builder-graph.jsx)
  --> /api/atlas/search (Next.js route handler)
    --> fetch(ATLAS_BASE_URL + /api/v1/search?q=...&kind=...&limit=...)
    --> fetch(ATLAS_BASE_URL + /api/v1/kinds/{kind}?limit=...)
```
|
|
314
|
+
|
|
315
|
+
The proxy route (`web/app/api/atlas/search/route.js`) handles:
|
|
316
|
+
- **Browse mode**: fetches instances by kind (no search query needed)
|
|
317
|
+
- **Search mode**: full-text search via Atlas's Fuse.js-based endpoint
|
|
318
|
+
- **Multi-kind search**: parallel queries per kind, merged and deduplicated by id
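
The multi-kind mode (parallel queries, then merge and deduplicate by id) can be sketched as below. This is an assumption-laden illustration rather than the route handler's code: `searchAcrossKinds` and the injected `searchKind` function are hypothetical, and records are assumed to carry an `id` field.

```javascript
// `searchKind` is a hypothetical async function: (kind) => Promise<Array<{ id }>>.
async function searchAcrossKinds(kinds, searchKind) {
  // One query per kind, issued in parallel.
  const perKind = await Promise.all(kinds.map((kind) => searchKind(kind)));

  // Merge in kind order and keep the first record seen for each id.
  const byId = new Map();
  for (const record of perKind.flat()) {
    if (!byId.has(record.id)) byId.set(record.id, record);
  }
  return [...byId.values()];
}
```

Note that `Promise.all` rejects as soon as any single query fails; whether the real route tolerates partial failures is not stated here, so a production version might prefer `Promise.allSettled`.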
|
|
319
|
+
|
|
320
|
+
The SDK also provides a direct client (`sdk/src/atlas-graph-client.js`):
|
|
321
|
+
- `fetchAtlasRecordsByKinds(atlasBaseUrl, kinds, options)` -- browse by NodeKind
|
|
322
|
+
- `searchAtlasGraph(atlasBaseUrl, query, options)` -- full-text with optional kind filter
|
|
323
|
+
|
|
324
|
+
#### The 11 Stack Layers + 4 Composition Facets
|
|
325
|
+
|
|
326
|
+
**STACK_LAYERS** (11 layers, each with associated Atlas NodeKinds):
|
|
327
|
+
|
|
328
|
+
| # | Layer | Atlas NodeKinds |
|---|-------|-----------------|
| 1 | Model | ModelFamily, ModelVersion, SessionModel |
| 2 | Provider | Provider, ModelProviderProduct, ModelProviderVersion |
| 3 | Transport | TransportProtocol, ModelTransportProtocol |
| 4 | Agent Core | AgentCoreImpl, Capability, CapabilitySupport |
| 5 | Agent Runtime | AgentProduct, AgentRuntimeImpl, AgentVersion, Subagent |
| 6 | Agent Platform | AgentPlatformImpl, Platform, PlatformService |
| 7 | Workspace | Workspace, Project, SharedContextSpec |
| 8 | Execution | Workflow, LibraryProcess, Phase, HookSurface |
| 9 | Sandbox | PermissionMode, DeploymentTarget |
| 10 | Interaction | Tool, ToolDescriptor, ToolServer, PluginArtifact, MCPPrompt |
| 11 | Presentation | AgentUIImpl, Page, APIEndpoint, Presentation |
|
|
341
|
+
|
|
342
|
+
**COMPOSITION_FACETS** (4 cross-cutting concerns):
|
|
343
|
+
|
|
344
|
+
| Facet | Atlas NodeKinds |
|
|
345
|
+
|-------|-----------------|
|
|
346
|
+
| Roles and Teams | Role, Responsibility, OrgUnit, AgentTeam |
|
|
347
|
+
| Skills and Capabilities | Skill, LibrarySkill, SkillArea, Capability |
|
|
348
|
+
| Evaluation and Governance | Benchmark, TestSet, EvalRun |
|
|
349
|
+
| Environment and Data | StackPart, VectorStore, MemoryStore |
|
|
350
|
+
|
|
351
|
+
#### Atlas Node Kinds Mapped to Stack Builder Layers

Each stack builder layer queries Atlas for specific NodeKinds. The mapping
(`atlasKinds` array per layer) drives which records appear as selectable
options in the UI. Users compose an AgentStack by picking one or more
records from each layer.

The resolution path:
1. User opens stack builder page
2. UI requests `/api/atlas/search?mode=browse&kinds=ModelFamily,ModelVersion`
3. Route handler fetches from Atlas API
4. Results rendered as selectable cards/chips in the layer panel
5. User selections flow into AgentStack spec fields
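
Step 2 of the resolution path can be illustrated with a small sketch that turns a layer's `atlasKinds` array into the browse request. The layer objects below copy two rows from the STACK_LAYERS table, but their object shape and the helper name `browseUrlForLayer` are assumptions made for the example.

```javascript
// Two layers copied from the STACK_LAYERS table; the { name, atlasKinds } shape is assumed.
const EXAMPLE_LAYERS = [
  { name: 'Model', atlasKinds: ['ModelFamily', 'ModelVersion', 'SessionModel'] },
  { name: 'Sandbox', atlasKinds: ['PermissionMode', 'DeploymentTarget'] },
];

// Build the proxy-route URL the UI would request for one layer (hypothetical helper).
// Kinds are encoded individually and joined with literal commas, matching the
// `kinds=ModelFamily,ModelVersion` form shown in the resolution path.
function browseUrlForLayer(layer) {
  const kinds = layer.atlasKinds.map(encodeURIComponent).join(',');
  return `/api/atlas/search?mode=browse&kinds=${kinds}`;
}
```

For the Sandbox layer this yields `/api/atlas/search?mode=browse&kinds=PermissionMode,DeploymentTarget`.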

#### What Happens When Atlas Is Unavailable

When Atlas is unreachable:
- Browse queries return empty arrays `[]` (no crash)
- Search queries return `{ total: 0, hits: [] }`
- The proxy route returns `{ total: 0, hits: [], error: <message> }` with status 502
- Stack builder layers show "No options available" state
- Users can still manually type stack configuration without Atlas suggestions
- The `ATLAS_BASE_URL` env var defaults to `https://atlas-staging.a5c.ai`
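
A minimal sketch of the degrade-gracefully behaviour listed above, assuming a wrapper around the upstream call. `proxyAtlasSearch` and `atlasFailureBody` are hypothetical names, and `fetchImpl` is injected (it is expected to behave like the standard `fetch`) so the sketch makes no claim about how the real route is wired.

```javascript
// Build the empty-result body the proxy route is described as returning on failure.
function atlasFailureBody(error) {
  const message = error && error.message ? error.message : String(error);
  return { total: 0, hits: [], error: message };
}

// Call Atlas through an injected fetch-like function; never throw to the caller.
// Any network error or non-2xx upstream response becomes a 502 with an empty hit list.
async function proxyAtlasSearch(url, fetchImpl) {
  try {
    const res = await fetchImpl(url);
    if (!res.ok) throw new Error(`Atlas responded with ${res.status}`);
    return { status: 200, body: await res.json() };
  } catch (error) {
    return { status: 502, body: atlasFailureBody(error) };
  }
}
```

Because the body keeps the `total`/`hits` shape even on failure, the UI can render its "No options available" state without a separate error branch.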

---

## 2. Scope Boundaries

### 2.1 What Kradle Owns

#### Kubernetes CRD Resource Model (76 Kinds)

Kradle defines and manages 76 resource kinds across two storage tiers:

**Config kinds (44, etcd-stored):**
Organization, OrgNamespaceBinding, User, Team, Invite, IdentityMapping,
AuthProvider, Repository, SSHKey, RepositoryPermission, WebhookSubscription,
RefPolicy, BranchProtection, PolicyProfile, PolicyTemplate, PolicyBinding,
PolicyExceptionRequest, RunnerPool, View, Selector, AgentStack, AgentSubagent,
AgentToolProfile, AgentMcpServer, AgentSkill, AgentTriggerRule, AgentContextLabel,
KradleWorkspacePolicy, AgentServiceAccount, AgentRoleBinding, AgentSecretGrant,
AgentConfigGrant, AgentAdapter, AgentTransportBinding, AgentProviderConfig,
KradleProject, AgentGatewayConfig, AgentMemoryRepository, AgentMemorySource,
AgentMemoryOntology, AgentMemoryAssociation, KradleWorkspace,
ExternalBackendProvider, ExternalBackendBinding, ExternalBackendSyncPolicy,
ExternalProviderCapabilityManifest

**Aggregated kinds (32, postgres-stored):**
PullRequest, Issue, Review, Pipeline, Job, WebhookDelivery, AgentDispatchRun,
AgentDispatchAttempt, AgentSession, AgentContextBundle, KradleArtifact,
AgentApproval, AgentTriggerExecution, AgentCapabilityRequirement,
WorkItemSessionLink, WorkItemWorkspaceLink, AgentSessionTranscript,
AgentSessionAttachment, KradleWorkspaceRuntime, AgentMemorySnapshot,
AgentMemoryQuery, AgentMemoryUpdate, AgentRunMemoryImport,
ExternalWebhookDelivery, ExternalSyncEvent, ExternalSyncState,
ExternalWriteIntent, ExternalSyncConflict, ExternalObjectLink

#### Resource CRUD via kubectl

All resource operations use `spawnSync('kubectl', ...)` or async `spawn('kubectl', ...)`:
- `kubectl get <resource> -n <namespace> -o json`
- `kubectl apply -f - -o json` (with JSON manifest piped to stdin)
- `kubectl delete <resource> <name> -n <namespace>`
- `kubectl get <resource> --watch -o json` (for live updates)
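
As an illustration, the four command forms above can be expressed as argument arrays suitable for passing to `spawn` or `spawnSync`. The `kubectlArgs` helper and its `op` object are hypothetical; only the kubectl flags themselves come from the list above.

```javascript
// Map a small operation descriptor to the kubectl argv listed above (hypothetical helper).
function kubectlArgs(op) {
  switch (op.verb) {
    case 'get':
      return ['get', op.resource, '-n', op.namespace, '-o', 'json'];
    case 'apply':
      // The JSON manifest itself is written to the child's stdin, not passed as an argument.
      return ['apply', '-f', '-', '-o', 'json'];
    case 'delete':
      return ['delete', op.resource, op.name, '-n', op.namespace];
    case 'watch':
      return ['get', op.resource, '--watch', '-o', 'json'];
    default:
      throw new Error(`unsupported verb: ${op.verb}`);
  }
}
```

Passing an argument array rather than a single shell string means resource names and namespaces are never interpreted by a shell.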

#### Web Console (57+ Pages)

Seven page modules:
1. **agent** -- Stacks, dispatch, sessions, triggers, memory, projects, adapters, workspaces
2. **repo** -- Repositories, pull requests, issues, reviews, pipelines
3. **manage** -- Organizations, users, teams, invites, identity mappings
4. **settings** -- Auth providers, webhooks, runners, policies, views
5. **external** -- Backend providers, bindings, sync policies, conflicts
6. **lib/kradle-ui** -- Shared UI components (tables, forms, modals, badges)
7. **lib/page-frame** -- Layout shell, navigation, breadcrumbs

#### Auth (OAuth, Session Cookies, Middleware)

- GitHub OAuth (authorization code flow)
- Workspace SSO (OIDC authorization code flow)
- Delegated identity (proxy headers: x-forwarded-user/groups/email)
- Local development bypass (auto-login for localhost)
- HMAC-SHA256 signed session cookies
- `parseSessionCookie` / `createSessionCookie` with timing-safe verification

#### Workspace Provisioning

- `KradleWorkspace` CRD: PVC specs, volume lifecycle, repository binding
- `KradleWorkspacePolicy` CRD: trust tiers, cleanup retention, provisioning mode
- `KradleWorkspaceRuntime` CRD: process status, environment, preview URLs
- Git worktree integration specs (branch/commit binding)
- Runner mount specifications

#### Memory System

- `AgentMemoryRepository`: org-level git pointer with layout profile and index policy
- `AgentMemorySource`: read policy for paths/kinds per repository, team, stack, or trigger
- `AgentMemoryOntology`: ontology with nodeKinds, edgeKinds, controlled vocabulary
- `AgentMemoryQuery`: graph/grep retrieval with ranking metadata
- `AgentRunMemoryImport`: babysitter run ingestion with redaction and review
- `AgentMemorySnapshot`: dispatch-time pin with resolved commit and digests
- `AgentMemoryUpdate`: reviewable proposed mutations with branch and validation
- `AgentMemoryAssociation`: bridge records linking memory to Kradle resources

#### External Backend Abstraction

- `ExternalBackendProvider`: registration (type, endpoint, auth, capability discovery)
- `ExternalBackendBinding`: binding to org with credential reference and sync scope
- `ExternalBackendSyncPolicy`: interval, conflict resolution, field mapping, retry policy
- `ExternalProviderCapabilityManifest`: discovered API capabilities
- `ExternalWebhookDelivery`: inbound webhook processing
- `ExternalSyncEvent`: discrete sync event with dedupe/ordering
- `ExternalSyncState`: current sync phase per resource
- `ExternalWriteIntent`: queued write-back with approval state
- `ExternalSyncConflict`: detected conflicts with resolution outcomes
- `ExternalObjectLink`: stable local-to-external ID mapping

#### Notification System

- `globalEventBus` singleton for in-process pub/sub
- SSE endpoint for real-time web console updates
- `emitResourceChange(kind, name, operation)` on every mutation
- Listener registration/deregistration for per-connection subscriptions
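
The pub/sub pattern above can be sketched as a minimal in-process bus. This is an illustration only: `createEventBus`, `subscribe`, and the `{ kind, name, operation }` event shape are assumptions; only the `emitResourceChange(kind, name, operation)` signature comes from the text.

```javascript
// Minimal in-process event bus (hypothetical; not the actual globalEventBus).
function createEventBus() {
  const listeners = new Set();
  return {
    // Register a listener and return a function that deregisters it.
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
    // Notify every current listener about one resource mutation.
    emitResourceChange(kind, name, operation) {
      const event = { kind, name, operation };
      // Iterate over a copy so a listener may unsubscribe while being notified.
      for (const listener of [...listeners]) listener(event);
    },
    listenerCount: () => listeners.size,
  };
}
```

An SSE handler would typically call `subscribe` when a connection opens and invoke the returned function when the connection closes, so that closed connections do not accumulate as dead listeners.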

#### Runner Pool Management

- `RunnerPool` CRD: capacity (warmReplicas, maxReplicas), cache policy, trust boundary
- Scheduling policy specs (not execution)
- Runner identity binding via AgentServiceAccount

#### Audit Logging

- Audit controller records all resource mutations
- Queryable via MCP tool (`kradle_audit_query`)
- Correlation IDs on all snapshot fetches

#### MCP Server (14 Tools)

Exposed via stdio (`kradle mcp`):
- `kradle_snapshot`, `kradle_list_resources`, `kradle_get_resource`
- `kradle_apply_resource`, `kradle_delete_resource`, `kradle_search`
- `kradle_list_stacks`, `kradle_create_stack`, `kradle_dispatch_agent`
- `kradle_list_secrets`, `kradle_create_secret`
- `kradle_sync_external`, `kradle_resolve_conflict`, `kradle_audit_query`

---

### 2.2 What Agent-Mux Owns

#### Actual Agent Spawning and Management

Agent-mux is responsible for:
- Instantiating agent processes (Claude, Codex, Gemini, etc.)
- Managing process lifecycle (start, monitor, terminate)
- Resource isolation between concurrent agent sessions
- Process cleanup on timeout or cancellation

#### Session Lifecycle

- Create session from stack parameters (model, prompt, tools, workspace)
- Stream events from running session to subscribers
- Terminate sessions (graceful and forced)
- Track session state transitions (Pending, Running, Completed, Failed, Cancelled)

#### Adapter Registry

Agent-mux maintains the runtime adapter registry:
- Claude adapter (Anthropic API)
- Codex adapter (OpenAI API)
- Gemini adapter (Google API)
- Pi adapter (Inflection API)
- Custom adapters via plugin system

Each adapter implements: capabilities query, session creation, event streaming.

#### Transport Codec

- Message format translation between kradle's JSON and adapter-native formats
- Streaming frame encoding (SSE, WebSocket, stdio JSON-RPC)
- Binary attachment handling
- Compression negotiation

#### Provider Client Instantiation

- API key retrieval and injection
- Base URL resolution per provider
- Rate limiting enforcement per provider/model combination
- Retry logic for transient provider failures

#### Real-Time Event Streaming

- SSE server for `/api/v1/runs/{runId}/events`
- Event buffering and replay for reconnecting clients
- Multi-subscriber fanout (multiple UI tabs)
- Connection keep-alive and heartbeat

#### Token Counting and Cost Calculation

- Per-message input/output token counting
- Model-specific pricing application
- Cumulative cost tracking per session/run
- Cost breakdown in event payloads (`event.usage.inputTokens`, `event.usage.outputTokens`)
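
A consumer of these events could accumulate usage along the following lines. This is a sketch under assumptions: only the `event.usage.inputTokens` and `event.usage.outputTokens` fields come from the text, the helper names are hypothetical, and the per-million-token prices in the usage note are invented purely for the arithmetic.

```javascript
// Sum token usage across a run's events; events without `usage` are skipped.
function sumUsage(events) {
  const totals = { inputTokens: 0, outputTokens: 0 };
  for (const event of events) {
    if (!event.usage) continue;
    totals.inputTokens += event.usage.inputTokens ?? 0;
    totals.outputTokens += event.usage.outputTokens ?? 0;
  }
  return totals;
}

// Apply a price table expressed per one million tokens (placeholder values only).
function estimateCost(totals, pricePerMillion) {
  const input = totals.inputTokens * pricePerMillion.input;
  const output = totals.outputTokens * pricePerMillion.output;
  return (input + output) / 1_000_000;
}
```

With 1,000,000 input tokens and 500,000 output tokens at placeholder prices of 2 and 4 per million, `estimateCost` returns 4.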

---

### 2.3 What Babysitter-SDK Owns

#### Process Definition and Orchestration

- `.a5c/processes/*.json` file format and schema
- Task graph resolution (dependencies, parallelism)
- Effect system (what a task produces)
- Breakpoint system (human gates in execution flow)

#### Run Lifecycle

- Run creation with process binding
- Task iteration (next task selection, execution, completion)
- Run completion detection (all tasks done, failure threshold)
- Run cancellation and cleanup

#### Journal and State Management

- Append-only journal (`journal.json`) with typed events
- Mutable state snapshot (`state.json`) for resumption
- Effect output persistence (`effects/`)
- Run metadata (timing, status, error details)

#### Session Binding

- Binding an agent session to a babysitter run
- Session-to-task mapping (which session handles which task)
- Multi-session orchestration (parallel tasks)

#### Hook-Driven Continuation

- Pre/post task hooks
- Breakpoint evaluation and resolution
- External trigger integration (webhook → resume)
- Timeout-based auto-continuation

---

### 2.4 The Gap Zone

The gap zone defines areas where kradle manages the **resource declaration** but
does not perform the **runtime execution**. This is by design -- kradle is the
"desired state" layer; execution is delegated to specialized runtimes.

#### AgentStack Exists but Isn't Resolved into a Running Adapter

- `AgentStack` CRD captures: baseAgent, adapter, model, prompt, tools, MCP servers, skills
- The stack controller reconciles readiness conditions (capability resolution, MCP health)
- But it never actually calls agent-mux to instantiate the adapter
- The gap: creating an AgentStack does not start anything

#### AgentDispatchRun Created but Agent Not Spawned by Kradle

- `AgentDispatchRun` CRD captures: repository, sourceRefs, agentStack, taskKind
- Status tracks: Queued, Running, Completed, Failed
- The dispatch controller creates the resource and validates the spec
- But the actual agent spawn happens in agent-mux (triggered by external controller)
- The gap: dispatch resource exists immediately, execution happens later

#### KradleWorkspace Generates Pod Specs but Doesn't Execute Them

- `KradleWorkspace` CRD captures: repository, volumeSpec, mount paths
- `KradleWorkspacePolicy` defines provisioning rules and trust tiers
- The workspace controller generates PVC manifests and mount specifications
- But it does not create the actual PVC or pod (that's the workspace-provisioner)
- The gap: workspace spec is declarative; provisioning is separate

#### RunnerPool Generates Schedules but Doesn't Provision Runners

- `RunnerPool` CRD captures: warmReplicas, maxReplicas, cache policy
- The runner controller validates pool specs and computes scheduling hints
- But it does not create actual runner pods or scale deployments
- The gap: pool definition is intent; ARC (Actions Runner Controller) or
  similar actually provisions the runners

#### The Intent

This pattern is intentional Kubernetes-native architecture:
1. Kradle manages CRD resources as desired state
2. Specialized operators/controllers watch these resources
3. Operators reconcile desired state into actual state
4. Status fields reflect observed state back into CRDs
5. Kradle's UI reads status to show current state

This separation enables:
- Independent scaling of control plane vs execution plane
- Pluggable execution backends (swap agent-mux implementation)
- GitOps-compatible declarative management
- Clear audit trail (every intent is a persisted resource)
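
Steps 3 and 4 of the pattern above (reconcile desired state, then write observed state back to status) can be sketched as a single pure reconcile pass. This is a generic illustration for a RunnerPool-like resource, not Kradle or ARC code: `reconcileOnce`, the `action` values, and the `status.phase` / `status.observedReplicas` fields are all hypothetical; only `spec.warmReplicas` appears in the source.

```javascript
// One reconcile pass of a hypothetical external operator.
// `resource.spec.warmReplicas` is the desired state; `observedReplicas` is what is running.
function reconcileOnce(resource, observedReplicas) {
  const desired = resource.spec.warmReplicas;
  const delta = desired - observedReplicas;
  const action = delta > 0 ? 'scale-up' : delta < 0 ? 'scale-down' : 'none';
  return {
    action, // what the operator would do to converge (step 3)
    status: {
      // what the operator would write back for the UI to read (steps 4 and 5)
      observedReplicas,
      phase: delta === 0 ? 'Ready' : 'Reconciling',
    },
  };
}
```

The point of the sketch is the direction of data flow: the control plane only persists `spec`, and whichever operator owns execution is the only writer of `status`.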

---

## 3. Architectural Choices & Trade-offs

### 3.1 Why CRD-First (vs Database-First)

**Decision:** All state stored as Kubernetes custom resources (CRDs for config
kinds, aggregated API server for data-plane kinds).

**Rationale:**
- GitOps-compatible: resources can be managed via Argo CD, Flux, or plain kubectl
- kubectl-native: operators and admins use familiar tooling
- Declarative: desired state vs imperative mutations
- No external DB dependency for control plane (etcd comes with K8s)
- Built-in RBAC: Kubernetes RBAC applies to all resource operations
- Watch support: built-in change notification for controllers
- Namespace isolation: multi-tenancy via namespace-per-org

**Trade-off:**
- Slower than direct database queries (kubectl spawnSync overhead)
- etcd size limits (~1.5MB per resource, cluster-wide storage cap)
- No complex queries (no JOIN, no WHERE with multiple conditions)
- No full-text search (client-side filtering only)
- Pagination via continue tokens (not offset-based)

**Mitigation:**
- Snapshot cache (30s TTL) for dashboard queries
- Per-org caching reduces repeated cross-namespace queries
- Background async snapshot refresh (stale-while-revalidate)
- Aggregated API server pattern for high-volume data (PullRequest, Pipeline, etc.)
- `getPartialSnapshot()` for pages that only need a subset of kinds

**When It Breaks:**
- Large clusters with 1000+ resources per kind (kubectl list becomes slow)
- Complex joins needed (e.g. "all runs for repositories owned by team X")
- Full-text search across resource content (need external search index)
- High-frequency writes (etcd write throughput is ~10K/s cluster-wide)
- Time-series data (audit logs, metrics -- needs separate store)

---

### 3.2 Why kubectl spawnSync (vs K8s client-go or @kubernetes/client-node)

**Decision:** Shell out to kubectl binary for all Kubernetes operations using
Node.js `child_process.spawnSync()` and `child_process.spawn()`.

**Rationale:**
- Zero npm dependencies for K8s operations (no `@kubernetes/client-node`)
- Works with any kubeconfig (user's local config, service account, EKS/GKE auth plugins)
- kubectl handles all auth complexity (OIDC, exec-based plugins, certificates)
- No Node.js K8s client compatibility bugs
- kubectl is battle-tested and always up-to-date with K8s API changes
- Debugging: can reproduce any operation by copying the kubectl command

**Trade-off:**
- `spawnSync` blocks the event loop, so kubectl calls within a request run one at a time
- Cold starts are slow (kubectl binary startup + API server TLS handshake)
- No watch support in sync mode (separate spawn needed)
- Process spawn overhead (~20-50ms per call)
- Max buffer limits on large responses (configurable via `KRADLE_KUBECTL_MAX_BUFFER_BYTES`)

**Mitigation:**
- `kubernetes-controller-async.js` uses `spawn()` + `Promise.all()` for parallel queries
- Snapshot cache means most page loads skip kubectl entirely
- `getPartialSnapshot()` queries only needed kinds (not all 76)
- `KRADLE_KUBECTL_TIMEOUT_MS` (default 3s) prevents hung processes
- `runKubectlAsync()` for non-blocking operations in the async controller
- In-cluster detection auto-adds `--server`, `--token`, `--certificate-authority`
  flags (no kubeconfig file needed in-cluster)

**Future Options:**
- Could add in-cluster HTTP client using service account token at
  `/var/run/secrets/kubernetes.io/serviceaccount/token`
- Could use `@kubernetes/client-node` for watch-only operations
- Could implement a sidecar proxy that exposes a local HTTP API

---

### 3.3 Why Next.js App Router (vs Pages Router or Remix)

**Decision:** Next.js 16 with App Router, React 19 server components.

**Rationale:**
- Server-side rendering for dashboard pages (no client-side waterfall)
- Streaming responses for progressive rendering
- Parallel data loading (multiple server components fetch independently)
- File-based routing matches the 7-module page structure
- React Server Components reduce client bundle size
- Built-in API routes for proxy endpoints (Atlas, auth callbacks)

**Trade-off:**
- Complex server/client boundary (`'use client'` directive management)
- Cannot pass functions or non-serializable props from server to client components
- Larger initial bundle than SPA alternatives
- Build time increases with page count
- Turbopack compatibility issues in Docker builds (fallback to webpack)

**Mitigation:**
- Clear module split: `lib/pages/` (server), `components/` (client where needed)
- `'use client'` only on interactive components (forms, drag-drop, modals)
- SDK resolveAlias for monorepo imports (relative path workaround)
- Standalone output mode reduces Docker image size
- `export const metadata` pattern for static head content

**Gotcha:**
- `export const metadata` cannot coexist with barrel re-exports
  (must be in the page file itself, not re-exported from index)
- `dynamic = 'force-dynamic'` required on proxy routes (Atlas, auth)
- `process.env` access works differently in server components vs route handlers

---

### 3.4 Why Pure ESM JavaScript (vs TypeScript)

**Decision:** Zero TypeScript across all kradle packages. Pure `.js`/`.jsx` with
JSDoc annotations for type information.

**Rationale:**
- No build step for core package (run directly with `node`)
- Instant startup (no compilation delay in development)
- Simpler debugging (source maps not needed, line numbers match)
- No `tsconfig.json` complexity (path aliases, module resolution, strict modes)
- JSDoc types are optional and incremental (add where valuable)
- Core package uses only Node.js built-in modules (no bundler needed)

**Trade-off:**
- No compile-time type safety
- JSDoc is more verbose than TypeScript type annotations
- IDE IntelliSense weaker for complex types (generics, discriminated unions)
- Refactoring tools less reliable without type information
- No `enum` or `interface` -- must use `@typedef` or constants

**Mitigation:**
- Comprehensive test suite (1440+ tests across core, SDK, CLI, web)
- Controller boundary declarations (BOUNDARY objects) enforce contracts at runtime
- `validateResource()` checks required fields at runtime
- Every controller factory function documents its API via JSDoc
- Resource model has `requiredSpec` arrays that enforce schema at create/apply time

---

### 3.5 Why Stale-While-Revalidate (vs K8s Watch Streams)

**Decision:** 30s TTL cache with stale-while-revalidate pattern for all
Kubernetes resource reads.

**Rationale:**
- Simple implementation (Map-based cache, no persistent connections)
- Works with kubectl (no long-lived watch connections needed)
- Handles cold starts gracefully (first request blocks, subsequent use cache)
- Predictable memory usage (one snapshot per org)
- No reconnection logic needed for reads

**Trade-off:**
- 30s staleness window (UI may show outdated data)
- Cache miss on first request after deploy or cache clear
- Background revalidation fires even when no one is watching
- Multiple simultaneous requests may all miss cache (thundering herd)

**Mitigation:**
- `clearSnapshotCache()` called on every write (apply/delete) -- immediate consistency for mutations
- Per-org isolation: cache key includes organization parameter
- `revalidating` flag prevents duplicate background fetches
- `staleMs` threshold (5x TTL = 150s) before forcing blocking revalidation
- Configurable via `KRADLE_SNAPSHOT_CACHE_TTL_MS` env var
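
The caching behaviour described in this section can be sketched as follows. This is a simplified illustration, not the actual snapshot cache: `classify` and `createSnapshotCache` are hypothetical names, the loader and clock are injected, and the constants simply mirror the 30s TTL and 5x stale threshold quoted above.

```javascript
const TTL_MS = 30_000;       // fresh window, per the text
const STALE_MS = 5 * TTL_MS; // 150s: beyond this, block on a fresh load

// Decide how a cached entry may be used at time `now`.
function classify(entry, now) {
  if (!entry) return 'miss';
  const age = now - entry.fetchedAt;
  if (age <= TTL_MS) return 'fresh';
  if (age <= STALE_MS) return 'stale';
  return 'expired';
}

// `load(org)` is an async function that fetches a snapshot; `now` is an injectable clock.
function createSnapshotCache(load, now = Date.now) {
  const entries = new Map(); // one entry per organization
  async function refresh(org) {
    const value = await load(org);
    entries.set(org, { value, fetchedAt: now(), revalidating: false });
    return value;
  }
  return {
    async get(org) {
      const entry = entries.get(org);
      const state = classify(entry, now());
      if (state === 'fresh') return entry.value;
      if (state === 'stale') {
        if (!entry.revalidating) {
          entry.revalidating = true; // prevents duplicate background fetches
          refresh(org).catch(() => { entry.revalidating = false; });
        }
        return entry.value; // serve stale data immediately
      }
      return refresh(org); // miss or expired: caller waits for a fresh snapshot
    },
    clear: () => entries.clear(), // would be called after every apply/delete
  };
}
```

The sketch does not address the thundering-herd case listed under trade-offs: concurrent callers that all see a miss each trigger their own load.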

**Future:**
- `watchResourceChanges()` is implemented in `kubernetes-controller-async.js`
- Watches key kinds (Organization, AgentStack, AgentSession) and clears cache on change
- Not yet wired to the web layer HTTP server
- Could enable near-real-time UI updates without polling

---

### 3.6 Why SDK Re-export Layer (vs Direct Imports)

**Decision:** `@a5c-ai/kradle-sdk` re-exports from core, and web/CLI import
from the SDK rather than directly from core internals.

**Rationale:**
- Decouples web/CLI from internal core file paths
- Single import target for consumers (`import { ... } from '@a5c-ai/kradle-sdk'`)
- SDK can expose a stable API surface while core refactors internally
- SDK adds web-specific helpers (UI model mappers, auth wrappers)
- Clear dependency direction: web -> SDK -> core

**Trade-off:**
- Extra level of indirection for simple re-exports
- Turbopack/webpack need `resolveAlias` configuration for monorepo resolution
- Circular dependency risk if SDK imports from web accidentally
- Version coupling (SDK must update when core export shapes change)

**Gotcha:**
- Turbopack requires **relative path** in resolveAlias (not absolute path)
- The alias target is `'../sdk/src/index.js'` from the web package root
- Monorepo workspace root must be correctly set in `next.config.js`
- A wrong alias path fails silently: the build succeeds and the module-not-found
  error only surfaces at runtime

---

### 3.7 Why x-kubernetes-preserve-unknown-fields (vs Strict Schemas)

**Decision:** All CRD specs use `x-kubernetes-preserve-unknown-fields: true`
allowing arbitrary additional fields.

**Rationale:**
- Rapid iteration: UI can add new spec fields without CRD redeployment
- Forward-compatible: older CRD versions accept newer resource manifests
- No validation failures during development cycles
- Reduces coupling between Helm chart releases and feature development
- Enables spec exploration (prototype fields in UI, formalize later)

**Trade-off:**
- No server-side validation of field names or types
- Typos in spec field names silently accepted (e.g. `organisationRef` vs `organizationRef`)
- No OpenAPI schema generation for CRD fields
- `kubectl explain` shows no field documentation
- etcd stores whatever is submitted (no normalization)

**Mitigation:**
- Client-side validation in controllers (`validate*` functions)
- `requiredSpec` arrays in RESOURCE_DEFINITIONS enforce mandatory fields at apply time
- Test suite covers all valid/invalid field combinations
- Controller boundary declarations document expected spec shapes
- UI forms constrain input to valid fields
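
A `requiredSpec` check of the kind mentioned above might look roughly like this. The sketch is illustrative only: the real shape of `RESOURCE_DEFINITIONS` is not shown in this document, so `EXAMPLE_DEFINITIONS`, its single entry, and the helper `checkRequiredSpec` are all assumptions.

```javascript
// Hypothetical definitions table; the required field is chosen to match the typo example above.
const EXAMPLE_DEFINITIONS = {
  Repository: { requiredSpec: ['organizationRef'] },
};

// Return { ok, errors } for one resource manifest.
function checkRequiredSpec(resource, definitions) {
  const definition = definitions[resource.kind];
  if (!definition) return { ok: false, errors: [`unknown kind: ${resource.kind}`] };
  const spec = resource.spec ?? {};
  const errors = definition.requiredSpec
    .filter((field) => spec[field] === undefined || spec[field] === null)
    .map((field) => `spec.${field} is required`);
  return { ok: errors.length === 0, errors };
}
```

A manifest that misspells the field as `organisationRef` would pass the permissive CRD schema but fail this check with `spec.organizationRef is required`, which is the gap the client-side validation is meant to close.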

**Note:** Helm only installs CRDs on `helm install`, not `helm upgrade`.
Explicit `kubectl apply -f packages/kradle/charts/crds/ --server-side --force-conflicts`
is needed in CI before helm upgrade to update CRD definitions.

---

### 3.8 Why Single-Container-Per-Role (api + controllers + web + webhook-worker)

**Decision:** 4 deployment containers (roles), each running a different entry
command from the same or related images.

**Actual Layout:**
| Role | Image | Entry Command | Port |
|------|-------|--------------|------|
| api | kradle-controller | `node src/http-server.js` | 3080 |
| controllers | kradle-controller | `node src/control-plane.js` | - |
| web | kradle-web | `node .next/standalone/server.js` | 3000 |
| webhook-worker | kradle-controller | `node src/external/webhook-controller.js` | - |

**Rationale:**
- Separation of concerns (API serving vs background reconciliation vs UI)
- Independent scaling (web can scale to 3 replicas while api stays at 1)
- Failure isolation (webhook worker crash doesn't affect UI)
- Resource tuning (web needs more memory for SSR, api needs more CPU)

**Trade-off:**
- More pods consuming cluster resources
- More complexity in Helm chart (4 Deployments, 4 Services)
- Internal service discovery needed (web → api via `KRADLE_CONTROLLER_URL`)
- Shared code must be in core package (duplicated in controller image)

**Actual State:**
- api and controllers share the `kradle-controller` image (same codebase, different entrypoint)
- web uses the `kradle-web` image (Next.js standalone build)
- webhook-worker is architecturally separate but shares the controller image

---

## 4. System Nuances & Gotchas
|
|
911
|
+
|
|
912
|
+
### 4.1 Namespace Discovery Fallback Chain
|
|
913
|
+
|
|
914
|
+
The system must determine which Kubernetes namespaces to query for org-scoped
|
|
915
|
+
resources. The resolution follows a priority chain:
|
|
916
|
+
|
|
917
|
+
**Step 1:** Check Organization resources in platform namespace
|
|
918
|
+
```javascript
|
|
919
|
+
organizations.map(org => org.spec?.namespaceName || org.metadata?.labels?.['kradle.a5c.ai/namespace'])
|
|
920
|
+
```
|
|
921
|
+
If Organization CRDs exist, derive namespaces from their `spec.namespaceName` field.
|
|
922
|
+
|
|
923
|
+
**Step 2:** Check OrgNamespaceBinding resources
|
|
924
|
+
```javascript
|
|
925
|
+
bindings.map(binding => binding.spec?.namespace || binding.metadata?.labels?.['kradle.a5c.ai/namespace'])
|
|
926
|
+
```
|
|
927
|
+
Bindings explicitly declare the namespace for each org.
|
|
928
|
+
|
|
929
|
+
**Step 3:** Environment variable fallback
|
|
930
|
+
```javascript
|
|
931
|
+
if (process.env.KRADLE_ADMIN_ORG) fallbackOrgs.add(orgNamespaceName(process.env.KRADLE_ADMIN_ORG));
|
|
932
|
+
fallbackOrgs.add(orgNamespaceName(process.env.KRADLE_ORG || 'default'));
|
|
933
|
+
// Result with KRADLE_ADMIN_ORG=admin and KRADLE_ORG unset: ['kradle-org-admin', 'kradle-org-default']
|
|
934
|
+
```
|
|
935
|
+
|
|
936
|
+
**Step 4:** Last resort
|
|
937
|
+
```javascript
|
|
938
|
+
return [KRADLE_PLATFORM_NAMESPACE]; // 'kradle-system'
|
|
939
|
+
```
|
|
940
|
+
|
|
941
|
+
**Why this chain exists:** Fresh deployments have no Organization resource yet
|
|
942
|
+
(it's created on first admin login), but the UI needs to list resources in
|
|
943
|
+
an org namespace. The fallback ensures the system bootstraps correctly.
|
|
944
|
+
|
|
945
|
+
**Edge cases:**
|
|
946
|
+
- Multiple orgs: all discovered namespaces are queried (flat merge, no hierarchy)
|
|
947
|
+
- Namespace doesn't exist yet: kubectl returns empty list (no error with `--ignore-not-found`)
|
|
948
|
+
- Stale bindings: namespace listed but org deleted → empty results (harmless)
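
The four steps above can be read as one function. The following is a minimal sketch, not the package's actual code: it assumes the chain short-circuits at the first step that yields a namespace, and the slug rule inside `orgNamespaceName` and the `discoverOrgNamespaces` name are illustrative assumptions.

```javascript
const KRADLE_PLATFORM_NAMESPACE = 'kradle-system';
const NAMESPACE_LABEL = 'kradle.a5c.ai/namespace';

// Assumed slug rule: lowercase, runs of other characters collapsed to '-'.
// Returns null when nothing usable is left.
function orgNamespaceName(org) {
  const slug = String(org).toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
  return slug ? `kradle-org-${slug}` : null;
}

// Drop empty values and duplicates while keeping order.
function unique(values) {
  return [...new Set(values.filter(Boolean))];
}

function discoverOrgNamespaces({ organizations = [], bindings = [], env = process.env } = {}) {
  // Step 1: Organization resources in the platform namespace.
  const fromOrgs = unique(organizations.map(
    (org) => org.spec?.namespaceName || org.metadata?.labels?.[NAMESPACE_LABEL]));
  if (fromOrgs.length > 0) return fromOrgs;

  // Step 2: OrgNamespaceBinding resources.
  const fromBindings = unique(bindings.map(
    (binding) => binding.spec?.namespace || binding.metadata?.labels?.[NAMESPACE_LABEL]));
  if (fromBindings.length > 0) return fromBindings;

  // Step 3: environment variable fallback.
  const fromEnv = unique([
    env.KRADLE_ADMIN_ORG ? orgNamespaceName(env.KRADLE_ADMIN_ORG) : null,
    orgNamespaceName(env.KRADLE_ORG || 'default'),
  ]);
  if (fromEnv.length > 0) return fromEnv;

  // Step 4: last resort.
  return [KRADLE_PLATFORM_NAMESPACE];
}
```

Because step 3 always falls back to `'default'`, step 4 is only reachable in this sketch when the configured org name normalizes to an empty slug.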
|
|
949
|
+
|
|
950
|
+
---
|
|
951
|
+
|
|
952
|
+
### 4.2 KRADLE_CONTROLLER_URL Indirection
|
|
953
|
+
|
|
954
|
+
**Architecture:**
|
|
955
|
+
```
|
|
956
|
+
Browser → Web Container (Next.js) → API Container (HTTP server)
|
|
957
|
+
↓
|
|
958
|
+
kubectl → K8s API
|
|
959
|
+
```
|
|
960
|
+
|
|
961
|
+
**How it works:**
|
|
962
|
+
- Web container has `KRADLE_CONTROLLER_URL` env var pointing to api's internal K8s Service URL
|
|
963
|
+
(e.g. `http://kradle-api.kradle-system.svc.cluster.local:80`)
|
|
964
|
+
- Web does not run kubectl for resource operations (no kubeconfig mounted in web container); the auth callback is the one exception, covered below
|
|
965
|
+
- All resource operations go through fetch() to the api container
|
|
966
|
+
|
|
967
|
+
**If api container is down:**
|
|
968
|
+
- Web's server-side data fetching returns a clean error model
|
|
969
|
+
- Pages render with error state (not a crash/500)
|
|
970
|
+
- No kubectl fallback from web container (by design)
|
|
971
|
+
|
|
972
|
+
**If api returns degraded data:**
|
|
973
|
+
- Web may probe local snapshot for comparison (modelResourceScore heuristic)
|
|
974
|
+
- Used to detect api container serving stale cache vs fresh data
|
|
975
|
+
- Not a correctness requirement, just a freshness indicator
|
|
976
|
+
|
|
977
|
+
**Why not direct kubectl from web?**
|
|
978
|
+
- Security: web container is publicly exposed (ingress), kubectl access is dangerous
|
|
979
|
+
- Image size: the web image carries the kubectl binary only for the auth callback, not for general resource access
|
|
980
|
+
- Separation: web handles presentation, api handles data operations
|
|
981
|
+
- Exception: `registerLoginProfile()` in auth callback does use kubectl (web image
|
|
982
|
+
includes kubectl for this single operation -- registers User/IdentityMapping on login)
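
As an illustration of the indirection, here is a hedged sketch of a server-side helper that goes through `KRADLE_CONTROLLER_URL` and turns every failure into an error model instead of throwing. Only the environment variable name comes from this document; `fetchFromController`, the result shape, and the request path in the usage note are hypothetical. It relies on the global `fetch` available in Node.js 18 and later.

```javascript
// Hypothetical helper: returns { ok, error, data } and never throws, so a page
// can render an error state when the api container is down.
async function fetchFromController(path, options = {}) {
  const baseUrl = options.baseUrl ?? process.env.KRADLE_CONTROLLER_URL;
  const fetchImpl = options.fetchImpl ?? fetch;
  if (!baseUrl) {
    return { ok: false, error: 'KRADLE_CONTROLLER_URL is not set', data: null };
  }
  try {
    const response = await fetchImpl(new URL(path, baseUrl), {
      headers: { accept: 'application/json' },
    });
    if (!response.ok) {
      return { ok: false, error: `controller responded ${response.status}`, data: null };
    }
    return { ok: true, error: null, data: await response.json() };
  } catch (err) {
    // Connection refused, DNS failure, invalid JSON, and similar.
    return { ok: false, error: `controller unreachable: ${err.message}`, data: null };
  }
}
```

A page component could call `await fetchFromController('/api/orgs/acme/resources')` (an illustrative path) and branch on `result.ok` rather than wrapping the call in its own `try`/`catch`.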
|
|
983
|
+
|
|
984
|
+
---
|
|
985
|
+
|
|
986
|
+
### 4.3 Cache + Write Interaction
|
|
987
|
+
|
|
988
|
+
**Write path (mutation):**
|
|
989
|
+
```
|
|
990
|
+
applyResource(resource) / deleteResource(kind, name)
|
|
991
|
+
→ kubectl apply / delete
|
|
992
|
+
→ clearSnapshotCache() // Invalidate ALL cached data
|
|
993
|
+
→ globalEventBus.emitResourceChange(kind, name, 'apply'|'delete')
|
|
994
|
+
→ SSE clients receive notification
|
|
995
|
+
→ Next page load fetches fresh data from kubectl
|
|
996
|
+
```
|
|
997
|
+
|
|
998
|
+
**Read path (query):**
|
|
999
|
+
```
|
|
1000
|
+
Page server component calls controller
|
|
1001
|
+
→ staleWhileRevalidate(org, revalidateFn)
|
|
1002
|
+
→ If cache fresh (< 30s): return immediately
|
|
1003
|
+
→ If cache stale (30s-150s): return stale, refresh in background
|
|
1004
|
+
→ If cache too old (> 150s) or missing: block on fresh fetch
|
|
1005
|
+
```
|
|
1006
|
+
|
|
1007
|
+
**Key behaviors:**
|
|
1008
|
+
- `clearSnapshotCache()` clears ALL orgs (global invalidation)
|
|
1009
|
+
- `clearOrgCache(org)` clears single org (surgical invalidation)
|
|
1010
|
+
- Per-org cache isolation: different orgs don't interfere
|
|
1011
|
+
- `revalidating` flag prevents thundering herd (only one background refresh)
|
|
1012
|
+
- Write + immediate read: always gets fresh data (cache cleared on write)
|
|
1013
|
+
|
|
1014
|
+
**Race condition:**
|
|
1015
|
+
If two users write simultaneously:
|
|
1016
|
+
1. User A writes → cache cleared
|
|
1017
|
+
2. User B writes → cache cleared (already empty)
|
|
1018
|
+
3. User A reads → fresh fetch, sets cache
|
|
1019
|
+
4. User B reads → gets User A's fresh data (which includes both writes if kubectl returned both)
|
|
1020
|
+
|
|
1021
|
+
This is safe because kubectl always returns the latest server state.
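
The read and write paths above can be sketched as a small per-org cache. This is an illustrative reconstruction under stated assumptions, not the shipped implementation: the function names and the 30s/150s thresholds come from this section, while the `Map` layout, the `cacheState` helper, the injectable clock, and the guard against storing a refresh that raced with a write are assumptions.

```javascript
const FRESH_MS = 30_000;      // younger than this: serve directly
const MAX_STALE_MS = 150_000; // older than this: block on a fresh fetch
const cache = new Map();      // org -> { value, fetchedAt, revalidating }

// Classify a cache entry by age.
function cacheState(entry, now) {
  if (!entry) return 'missing';
  const age = now - entry.fetchedAt;
  if (age < FRESH_MS) return 'fresh';
  return age <= MAX_STALE_MS ? 'stale' : 'expired';
}

async function staleWhileRevalidate(org, revalidateFn, now = Date.now) {
  const entry = cache.get(org);
  const state = cacheState(entry, now());
  if (state === 'fresh') return entry.value;
  if (state === 'stale') {
    if (!entry.revalidating) {
      entry.revalidating = true; // only one background refresh per org
      Promise.resolve()
        .then(() => revalidateFn(org))
        .then((value) => {
          // Skip the store if a write cleared or replaced the entry meanwhile.
          if (cache.get(org) === entry) {
            cache.set(org, { value, fetchedAt: now(), revalidating: false });
          }
        })
        .catch(() => { entry.revalidating = false; });
    }
    return entry.value; // stale data now, fresh data on a later read
  }
  // Missing or expired: the caller waits for the fetch.
  const value = await revalidateFn(org);
  cache.set(org, { value, fetchedAt: now(), revalidating: false });
  return value;
}

const clearSnapshotCache = () => cache.clear();  // global invalidation
const clearOrgCache = (org) => cache.delete(org); // surgical invalidation
```

The identity check in the background branch is a design choice of this sketch: without it, a refresh that started before a write could repopulate the cache with pre-write data after `clearSnapshotCache()` ran.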
|
|
1022
|
+
|
|
1023
|
+
---
|
|
1024
|
+
|
|
1025
|
+
### 4.4 Auth Cookie Security
|
|
1026
|
+
|
|
1027
|
+
**Cookie creation:**
|
|
1028
|
+
```javascript
|
|
1029
|
+
// Payload: base64url(JSON({ provider, subject, user }))
|
|
1030
|
+
// With secret: payload.hmac-sha256(payload, secret) → base64url
|
|
1031
|
+
// Without secret: payload only (plain base64url, no signature)
|
|
1032
|
+
```
|
|
1033
|
+
|
|
1034
|
+
**Verification matrix:**
|
|
1035
|
+
|
|
1036
|
+
| Cookie State | Secret Configured | Result |
|
|
1037
|
+
|-------------|-------------------|--------|
|
|
1038
|
+
| Signed | Yes | Verify HMAC, constant-time compare |
|
|
1039
|
+
| Signed | No | Reject (can't verify) |
|
|
1040
|
+
| Unsigned | Yes | Reject (could be tampered) |
|
|
1041
|
+
| Unsigned | No | Accept (backward compatible) |
|
|
1042
|
+
|
|
1043
|
+
**Security properties:**
|
|
1044
|
+
- HMAC-SHA256 signing ONLY when `KRADLE_SESSION_SECRET` env var is set
|
|
1045
|
+
- Without secret: cookie is plain base64url with no integrity protection, so anyone can forge it (development only)
|
|
1046
|
+
- Constant-time comparison via `crypto.timingSafeEqual` (prevents timing attacks)
|
|
1047
|
+
- HttpOnly flag (no JavaScript access)
|
|
1048
|
+
- SameSite=Lax (cookie is withheld from cross-site POST requests, which blunts CSRF)
|
|
1049
|
+
- No Secure flag by default (set at ingress/proxy level)
|
|
1050
|
+
|
|
1051
|
+
**Tampered cookie handling:**
|
|
1052
|
+
- Invalid HMAC → `parseSessionCookie` returns `null`
|
|
1053
|
+
- null session → middleware rejects request
|
|
1054
|
+
- Rejection → 307 redirect to `/login`
|
|
1055
|
+
- No error message exposed (prevents oracle attacks)
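
The cookie format and verification matrix can be sketched with Node's built-in `crypto` module. This is a hedged illustration: `parseSessionCookie` is named in this section, but its signature here, the `createSessionCookie` and `sign` helpers, and the exact encoding details are assumptions.

```javascript
const crypto = require('node:crypto');

// HMAC-SHA256 over the payload string, encoded as base64url.
function sign(payload, secret) {
  return crypto.createHmac('sha256', secret).update(payload).digest('base64url');
}

// payload = base64url(JSON(session)); signed form is "payload.signature".
function createSessionCookie(session, secret) {
  const payload = Buffer.from(JSON.stringify(session)).toString('base64url');
  return secret ? `${payload}.${sign(payload, secret)}` : payload;
}

// Returns the session object, or null for anything that fails the matrix.
function parseSessionCookie(cookie, secret) {
  if (typeof cookie !== 'string' || cookie === '') return null;
  const [payload, signature, ...rest] = cookie.split('.');
  if (rest.length > 0) return null;
  const signed = signature !== undefined;
  // Signed without a secret, or unsigned with a secret: reject.
  if (signed !== Boolean(secret)) return null;
  if (signed) {
    const expected = Buffer.from(sign(payload, secret));
    const actual = Buffer.from(signature);
    // timingSafeEqual throws on unequal lengths, so compare lengths first.
    if (expected.length !== actual.length || !crypto.timingSafeEqual(expected, actual)) {
      return null;
    }
  }
  try {
    return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
  } catch {
    return null; // malformed payload is treated like a tampered cookie
  }
}
```

Returning `null` for every failure mode, without distinguishing the cause, matches the "no error message exposed" property listed above.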
|
|
1056
|
+
|
|
1057
|
+
---
|
|
1058
|
+
|
|
1059
|
+
### 4.5 CRD Lifecycle in CI
|
|
1060
|
+
|
|
1061
|
+
**Problem:** Helm's CRD handling has a well-known limitation:
|
|
1062
|
+
- `helm install` -- applies CRDs from the `crds/` directory
|
|
1063
|
+
- `helm upgrade` -- does NOT update CRDs (by design, to prevent data loss)
|
|
1064
|
+
- `helm uninstall` -- does NOT delete CRDs (by design, to prevent data loss)
|
|
1065
|
+
|
|
1066
|
+
**CI workaround:**
|
|
1067
|
+
```bash
|
|
1068
|
+
# Before helm upgrade, explicitly apply CRDs
|
|
1069
|
+
kubectl apply -f packages/kradle/charts/crds/ --server-side --force-conflicts
|
|
1070
|
+
```
|
|
1071
|
+
|
|
1072
|
+
`--server-side` enables server-side apply (handles field ownership correctly).
|
|
1073
|
+
`--force-conflicts` resolves ownership conflicts (Helm vs kubectl managers).
|
|
1074
|
+
|
|
1075
|
+
**Implications for development:**
|
|
1076
|
+
- Adding a new field to an existing CRD: no CRD redeployment needed (the schemas set `x-kubernetes-preserve-unknown-fields`)
|
|
1077
|
+
- Adding a new CRD kind: must deploy the CRD yaml file before resources can be created
|
|
1078
|
+
- Removing a field from CRD: preserve-unknown-fields means old resources still have the field
|
|
1079
|
+
- Changing field type in CRD: no validation exists, so no conflict (but client code may break)
|
|
1080
|
+
|
|
1081
|
+
**Best practices:**
|
|
1082
|
+
- Always add new kinds in the same PR that adds the CRD yaml
|
|
1083
|
+
- CI pipeline runs CRD apply before helm upgrade
|
|
1084
|
+
- Never rename CRD group/version/plural (breaks all existing resources)
|
|
1085
|
+
- Use annotations to mark deprecated fields (spec.deprecated.fieldName: "reason")
|
|
1086
|
+
|
|
1087
|
+
---
|
|
1088
|
+
|
|
1089
|
+
### 4.6 Org-Scoped vs Platform-Scoped Resources
|
|
1090
|
+
|
|
1091
|
+
**Platform-scoped resources** (exist in kradle-system namespace only):
|
|
1092
|
+
- `Organization` -- represents an org identity
|
|
1093
|
+
- `OrgNamespaceBinding` -- binds org to a namespace
|
|
1094
|
+
|
|
1095
|
+
These are special because they exist "above" org namespaces -- they define
|
|
1096
|
+
the org structure itself.
|
|
1097
|
+
|
|
1098
|
+
**Org-scoped resources** (exist in kradle-org-{slug} namespaces):
|
|
1099
|
+
- All other 74 resource kinds
|
|
1100
|
+
- Always have `spec.organizationRef` field
|
|
1101
|
+
- Namespace derived from org: `kradle-org-${normalizeOrgSlug(org)}`
|
|
1102
|
+
|
|
1103
|
+
**Enforcement:**
|
|
1104
|
+
```javascript
|
|
1105
|
+
// In withOrgScope():
|
|
1106
|
+
if (resource.metadata?.namespace && resource.metadata.namespace !== namespace) {
|
|
1107
|
+
throw new Error(`namespace ${resource.metadata.namespace} does not match organization ${org}`);
|
|
1108
|
+
}
|
|
1109
|
+
```
|
|
1110
|
+
|
|
1111
|
+
Cross-org denial: `applyResource()` calls `withOrgScope()` which rejects any
|
|
1112
|
+
resource whose explicit namespace conflicts with its `organizationRef`.
|
|
1113
|
+
|
|
1114
|
+
**KRADLE_RESOURCES array has `platformScoped: true` flag:**
|
|
1115
|
+
- Platform-scoped: only queried from `KRADLE_PLATFORM_NAMESPACE` (kradle-system)
|
|
1116
|
+
- Org-scoped: queried from all discovered org namespaces
|
|
1117
|
+
|
|
1118
|
+
**Multi-org queries:**
|
|
1119
|
+
Snapshot fetches resources from ALL org namespaces. The flattened result includes
|
|
1120
|
+
resources from all orgs. UI filters by `spec.organizationRef` for the current org view.
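
To make the scoping rules concrete, here is a minimal sketch that combines the namespace check, the `platformScoped` flag, and the UI-side filter. The body of `normalizeOrgSlug`, the helper names `namespacesForKind` and `resourcesForOrg`, and the assumption that `spec.organizationRef` is a plain string are illustrative, not taken from the package.

```javascript
const KRADLE_PLATFORM_NAMESPACE = 'kradle-system';

// Assumed normalization: lowercase, other characters collapsed to '-'.
function normalizeOrgSlug(org) {
  return String(org).toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Pin a resource to its org namespace, rejecting an explicit mismatch.
function withOrgScope(resource, org) {
  const namespace = `kradle-org-${normalizeOrgSlug(org)}`;
  if (resource.metadata?.namespace && resource.metadata.namespace !== namespace) {
    throw new Error(`namespace ${resource.metadata.namespace} does not match organization ${org}`);
  }
  return { ...resource, metadata: { ...resource.metadata, namespace } };
}

// Platform-scoped kinds are read from kradle-system only.
function namespacesForKind(kindDef, orgNamespaces) {
  return kindDef.platformScoped ? [KRADLE_PLATFORM_NAMESPACE] : orgNamespaces;
}

// UI-side filter over the flattened multi-org snapshot.
function resourcesForOrg(resources, org) {
  return resources.filter((resource) => resource.spec?.organizationRef === org);
}
```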
|
|
1121
|
+
|
|
1122
|
+
---
|
|
1123
|
+
|
|
1124
|
+
### 4.7 Web Container Architecture
|
|
1125
|
+
|
|
1126
|
+
**Dockerfile structure (multi-stage):**
|
|
1127
|
+
```dockerfile
|
|
1128
|
+
# Stage 1: deps - install node_modules
|
|
1129
|
+
FROM node:20 AS deps
WORKDIR /app
|
|
1130
|
+
COPY package*.json ./
|
|
1131
|
+
RUN npm ci
|
|
1132
|
+
|
|
1133
|
+
# Stage 2: build - Next.js production build
|
|
1134
|
+
FROM node:20 AS build
WORKDIR /app
|
|
1135
|
+
COPY --from=deps /app/node_modules ./node_modules
|
|
1136
|
+
COPY . .
|
|
1137
|
+
RUN npm run build
|
|
1138
|
+
|
|
1139
|
+
# Stage 3: runtime - minimal production image
|
|
1140
|
+
FROM node:20-slim AS runtime
|
|
1141
|
+
COPY --from=build /app/.next/standalone ./
|
|
1142
|
+
COPY --from=build /app/.next/static ./.next/static
|
|
1143
|
+
COPY --from=build /app/public ./public
|
|
1144
|
+
# kubectl for auth callback
|
|
1145
|
+
COPY --from=bitnami/kubectl:latest /opt/bitnami/kubectl/bin/kubectl /usr/local/bin/
|
|
1146
|
+
```
|
|
1147
|
+
|
|
1148
|
+
**Build uses webpack (not turbopack) for Docker:**
|
|
1149
|
+
- Turbopack has issues with Docker layer caching
|
|
1150
|
+
- Webpack is more predictable in CI environments
|
|
1151
|
+
- `--webpack` flag in `next build` command
|
|
1152
|
+
|
|
1153
|
+
**Runtime image includes kubectl:**
|
|
1154
|
+
- Needed for `registerLoginProfile()` during auth callback
|
|
1155
|
+
- Called once per user login (not hot path)
|
|
1156
|
+
- Could be removed if auth moved fully to api container
|
|
1157
|
+
|
|
1158
|
+
**SDK resolution via turbopack resolveAlias:**
|
|
1159
|
+
```javascript
|
|
1160
|
+
// next.config.js
|
|
1161
|
+
experimental: {
|
|
1162
|
+
turbopack: {
|
|
1163
|
+
resolveAlias: {
|
|
1164
|
+
'@a5c-ai/kradle-sdk': '../sdk/src/index.js' // MUST be relative
|
|
1165
|
+
}
|
|
1166
|
+
}
|
|
1167
|
+
}
|
|
1168
|
+
```
|
|
1169
|
+
|
|
1170
|
+
**Standalone output mode:**
|
|
1171
|
+
- `.next/standalone/` contains full Node.js app (no node_modules needed at runtime)
|
|
1172
|
+
- Reduces Docker image from ~1GB to ~150MB
|
|
1173
|
+
- Entry: `node .next/standalone/server.js`
|
|
1174
|
+
- Static assets served from `.next/static/` (can be CDN-fronted)
|
|
1175
|
+
|
|
1176
|
+
---
|
|
1177
|
+
|
|
1178
|
+
## 5. Integration Gaps (Known, Documented)
|
|
1179
|
+
|
|
1180
|
+
The following integration gap categories track areas where kradle has resource
|
|
1181
|
+
definitions and controller logic but previously lacked (or still lacks) runtime
|
|
1182
|
+
integration. Items marked **RESOLVED** have been implemented in the K8s Job
|
|
1183
|
+
dispatch architecture.
|
|
1184
|
+
|
|
1185
|
+
### Gap 1: Session Lifecycle Sync — RESOLVED
|
|
1186
|
+
|
|
1187
|
+
**Previous state:**
|
|
1188
|
+
Created `AgentSession` resources with `spec.agentMuxSessionId`. Status depended
|
|
1189
|
+
on mux client responses; if mux was unavailable, status stayed `Pending`.
|
|
1190
|
+
|
|
1191
|
+
**Resolution:**
|
|
1192
|
+
Agent pods now POST directly to the Kradle callback endpoint
|
|
1193
|
+
(`POST /api/orgs/{org}/agents/runs/{name}/callback`) on completion.
|
|
1194
|
+
`persistSessionEvent()` applies the result to `AgentSession` and `AgentDispatchRun`
|
|
1195
|
+
in a single atomic update. No polling or mux webhook receiver needed.
|
|
1196
|
+
|
|
1197
|
+
---
|
|
1198
|
+
|
|
1199
|
+
### Gap 2: Adapter Capability Caching
|
|
1200
|
+
|
|
1201
|
+
**What kradle does now:**
|
|
1202
|
+
`queryCapabilities(adapter)` called during stack reconciliation. Result is used
|
|
1203
|
+
once and discarded (no caching of adapter capabilities).
|
|
1204
|
+
|
|
1205
|
+
**What it should do:**
|
|
1206
|
+
Cache adapter capabilities with TTL, invalidate on adapter CRD changes.
|
|
1207
|
+
Reduces mux API calls during frequent stack reconciliation cycles.
|
|
1208
|
+
|
|
1209
|
+
**What blocks it:**
|
|
1210
|
+
No cache layer between mux client and stack controller. Low priority since
|
|
1211
|
+
mux is usually fast and capabilities rarely change.
|
|
1212
|
+
|
|
1213
|
+
**Estimated effort:** 0.5 day (add to snapshot cache pattern)
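
One possible shape for the missing cache layer is sketched below. It is a proposal-level illustration only: `queryCapabilities` is the function named above, while `createCapabilityCache`, the 60-second default TTL, and the injectable clock are assumptions.

```javascript
// Wraps a capability lookup with a TTL cache keyed by adapter name.
// The cached value is whatever queryCapabilities returned (value or promise),
// so concurrent callers share one in-flight lookup.
function createCapabilityCache(queryCapabilities, { ttlMs = 60_000, now = Date.now } = {}) {
  const entries = new Map(); // adapter name -> { result, expiresAt }
  return {
    get(adapterName) {
      const hit = entries.get(adapterName);
      if (hit && hit.expiresAt > now()) return hit.result;
      const result = queryCapabilities(adapterName);
      const entry = { result, expiresAt: now() + ttlMs };
      entries.set(adapterName, entry);
      // Do not keep a rejected lookup cached.
      Promise.resolve(result).catch(() => {
        if (entries.get(adapterName) === entry) entries.delete(adapterName);
      });
      return result;
    },
    // Call this when the adapter resource changes.
    invalidate(adapterName) {
      entries.delete(adapterName);
    },
  };
}
```

Wiring `invalidate()` to the existing resource-change events would cover the "invalidate on adapter CRD changes" requirement without shortening the TTL.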
|
|
1214
|
+
|
|
1215
|
+
---
|
|
1216
|
+
|
|
1217
|
+
### Gap 3: Transport Resolution and Codec Injection — RESOLVED
|
|
1218
|
+
|
|
1219
|
+
**Previous state:**
|
|
1220
|
+
Validated `AgentTransportBinding` spec, but never activated the transport runtime
|
|
1221
|
+
or injected settings into agent processes.
|
|
1222
|
+
|
|
1223
|
+
**Resolution:**
|
|
1224
|
+
`resolveTransport(stack, resources)` reads the `AgentTransportBinding` referenced
|
|
1225
|
+
by the stack's adapter and injects `AGENT_MUX_TRANSPORT` and `TRANSPORT_MUX_CODEC`
|
|
1226
|
+
as environment variables directly into the `batch/v1` Job spec via `createAgentJob()`.
|
|
1227
|
+
Agent pods receive transport configuration at startup without any manual wiring.
|
|
1228
|
+
|
|
1229
|
+
---
|
|
1230
|
+
|
|
1231
|
+
### Gap 4: Lifecycle Event Emission — RESOLVED
|
|
1232
|
+
|
|
1233
|
+
**Previous state:**
|
|
1234
|
+
`globalEventBus.emitResourceChange()` fired on resource CRUD operations only.
|
|
1235
|
+
Purely internal, resource-level granularity. No agent lifecycle events.
|
|
1236
|
+
|
|
1237
|
+
**Resolution:**
|
|
1238
|
+
`createHooksLifecycleEmitter(bus)` emits 9 structured lifecycle events
|
|
1239
|
+
(RUN_CREATED, RUN_QUEUED, RUN_STARTED, STEP_STARTED, STEP_COMPLETED,
|
|
1240
|
+
APPROVAL_REQUESTED, APPROVAL_GRANTED, APPROVAL_DENIED, and the terminal RUN_COMPLETED/RUN_FAILED pair, counted as one)
|
|
1241
|
+
at the correct dispatch lifecycle points. Events are forwarded to registered
|
|
1242
|
+
`WebhookSubscription` endpoints via the existing webhook delivery system.
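
A lifecycle emitter of this kind could look roughly like the following. The function name and the event names come from the text above; the `'lifecycle'` channel name, the event object shape, and the use of Node's `EventEmitter` as the bus are assumptions made for this sketch.

```javascript
const { EventEmitter } = require('node:events');

const LIFECYCLE_EVENTS = Object.freeze([
  'RUN_CREATED', 'RUN_QUEUED', 'RUN_STARTED', 'STEP_STARTED', 'STEP_COMPLETED',
  'APPROVAL_REQUESTED', 'APPROVAL_GRANTED', 'APPROVAL_DENIED',
  'RUN_COMPLETED', 'RUN_FAILED',
]);

// Wraps a bus so that only known lifecycle event types can be published.
function createHooksLifecycleEmitter(bus) {
  return {
    emit(type, runName, detail = {}) {
      if (!LIFECYCLE_EVENTS.includes(type)) {
        throw new Error(`unknown lifecycle event: ${type}`);
      }
      const event = { type, runName, detail, emittedAt: new Date().toISOString() };
      bus.emit('lifecycle', event); // a webhook forwarder would subscribe here
      return event;
    },
  };
}

// Usage: log every lifecycle event for one run.
const exampleBus = new EventEmitter();
exampleBus.on('lifecycle', (event) => console.log(event.type, event.runName));
createHooksLifecycleEmitter(exampleBus).emit('RUN_CREATED', 'run-1');
```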
|
|
1243
|
+
|
|
1244
|
+
---
|
|
1245
|
+
|
|
1246
|
+
### Gap 5: Cost Aggregation — RESOLVED
|
|
1247
|
+
|
|
1248
|
+
**Previous state:**
|
|
1249
|
+
`reconcileTranscript()` summed token usage from SSE events. Cost fields were zero
|
|
1250
|
+
when no events were available.
|
|
1251
|
+
|
|
1252
|
+
**Resolution:**
|
|
1253
|
+
`checkBudget()` + `estimateCost()` enforce budget before dispatch. The agent pod's
|
|
1254
|
+
callback payload includes `costUsd` (actual incurred cost). `persistSessionEvent()`
|
|
1255
|
+
records this against `AgentDispatchRun.status.costUsd`. Budget ceiling is enforced
|
|
1256
|
+
at the infrastructure level via `activeDeadlineSeconds` on the Job spec, ensuring
|
|
1257
|
+
agents cannot run past their wall-clock limit (a time-based proxy for spend) even if the callback is never delivered.
|
|
1258
|
+
|
|
1259
|
+
---
|
|
1260
|
+
|
|
1261
|
+
### Gap 6: Real-Time Session Streaming to UI
|
|
1262
|
+
|
|
1263
|
+
**What kradle does now:**
|
|
1264
|
+
Web console shows session status from cached snapshot (fresh for 30s, served stale for up to 150s).
|
|
1265
|
+
SSE endpoint only emits resource-change events (kind/name/operation).
|
|
1266
|
+
|
|
1267
|
+
**What it should do:**
|
|
1268
|
+
Proxy or relay agent step events to the web console for live session
|
|
1269
|
+
viewing (token-by-token streaming).
|
|
1270
|
+
|
|
1271
|
+
**What blocks it:**
|
|
1272
|
+
Web container cannot reach agent pods directly. API container would need to relay
|
|
1273
|
+
agent SSE or WebSocket events to its own SSE endpoint. Significant complexity for
|
|
1274
|
+
multi-subscriber fanout with back-pressure.
|
|
1275
|
+
|
|
1276
|
+
**Estimated effort:** 5-8 days (SSE relay, subscriber management, back-pressure)
|
|
1277
|
+
|
|
1278
|
+
---
|
|
1279
|
+
|
|
1280
|
+
### Gap 7: Approval Gate Integration
|
|
1281
|
+
|
|
1282
|
+
**What kradle does now:**
|
|
1283
|
+
`AgentApproval` resources created with spec describing the gate (action, requestedBy).
|
|
1284
|
+
Status updated manually (via UI form or API call). `APPROVAL_REQUESTED` hooks event
|
|
1285
|
+
is now emitted when the gate is created.
|
|
1286
|
+
|
|
1287
|
+
**What it should do:**
|
|
1288
|
+
When approval is granted/denied, automatically resume or cancel the blocked agent Job.
|
|
1289
|
+
Currently approval resolution updates the AgentDispatchRun phase but does not signal
|
|
1290
|
+
the suspended Job pod to continue.
|
|
1291
|
+
|
|
1292
|
+
**What blocks it:**
|
|
1293
|
+
Agent Job pods are not designed to pause mid-execution awaiting approval. The approval
|
|
1294
|
+
gate must be enforced at dispatch time (before Job creation), not within a running pod.
|
|
1295
|
+
|
|
1296
|
+
**Estimated effort:** 1-2 days (pre-dispatch approval gate enforcement tightening)
|
|
1297
|
+
|
|
1298
|
+
---
|
|
1299
|
+
|
|
1300
|
+
### Gap 8: Context Bundle Delivery
|
|
1301
|
+
|
|
1302
|
+
**What kradle does now:**
|
|
1303
|
+
`AgentContextBundle` resources store prompt/context snapshots with digest.
|
|
1304
|
+
Created during dispatch, immutable after creation. Bundle digest is stored in
|
|
1305
|
+
the Job's env vars.
|
|
1306
|
+
|
|
1307
|
+
**What it should do:**
|
|
1308
|
+
Deliver the full bundle payload (attachments, provenance, redaction manifest) to
|
|
1309
|
+
the agent pod at startup, not just the digest.
|
|
1310
|
+
|
|
1311
|
+
**What blocks it:**
|
|
1312
|
+
Bundle payloads may be too large for env var injection. Need a signed URL or
|
|
1313
|
+
in-cluster object store reference that the pod can fetch at startup.
|
|
1314
|
+
|
|
1315
|
+
**Estimated effort:** 1-2 days (signed URL generation or object store integration)
|
|
1316
|
+
|
|
1317
|
+
---
|
|
1318
|
+
|
|
1319
|
+
### Gap 9: Workspace Mount Coordination — RESOLVED
|
|
1320
|
+
|
|
1321
|
+
**Previous state:**
|
|
1322
|
+
`KradleWorkspace` spec defined volume mounts. `launchSession()` passed
|
|
1323
|
+
`workspace.mountPath` to mux. Assumed workspace was pre-provisioned.
|
|
1324
|
+
|
|
1325
|
+
**Resolution:**
|
|
1326
|
+
The dispatch flow now verifies workspace phase before Job creation.
|
|
1327
|
+
`findReusableWorkspace()` returns only `Ready` workspaces. If none found,
|
|
1328
|
+
`createWorkspace()` provisions a new PVC. The Job is only submitted after the
|
|
1329
|
+
workspace PVC is `Bound`. The PVC is mounted at `/workspace` in the Job pod spec
|
|
1330
|
+
via `getMountSpec()`.
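
The workspace checks described above reduce to a few small predicates. In this hedged sketch the function names `findReusableWorkspace` and `getMountSpec` come from the text, but their signatures, the `canSubmitJob` helper, and the `workspace` volume name are assumptions; the `persistentVolumeClaim.claimName` and `mountPath` fields are standard Kubernetes pod spec fields.

```javascript
// Return the first workspace whose phase is Ready, or null if none qualifies.
function findReusableWorkspace(workspaces) {
  return workspaces.find((workspace) => workspace.status?.phase === 'Ready') ?? null;
}

// The Job is only submitted once the PVC reports phase Bound.
function canSubmitJob(pvc) {
  return pvc?.status?.phase === 'Bound';
}

// Volume and volumeMount entries that place the PVC at /workspace.
function getMountSpec(claimName, mountPath = '/workspace') {
  return {
    volume: { name: 'workspace', persistentVolumeClaim: { claimName } },
    volumeMount: { name: 'workspace', mountPath },
  };
}
```

The `volume` entry belongs in the pod's `spec.volumes` array and the `volumeMount` entry in the agent container's `volumeMounts` array.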
|
|
1331
|
+
|
|
1332
|
+
---
|
|
1333
|
+
|
|
1334
|
+
### Gap 10: Trigger-to-Dispatch Pipeline — RESOLVED
|
|
1335
|
+
|
|
1336
|
+
**Previous state:**
|
|
1337
|
+
`AgentTriggerRule` created `AgentDispatchRun` on match, but dispatch run creation
|
|
1338
|
+
was the terminal action — no agent was actually launched.
|
|
1339
|
+
|
|
1340
|
+
**Resolution:**
|
|
1341
|
+
`createManualDispatch()` now continues through the full dispatch flow after creating
|
|
1342
|
+
the run resource: it runs `checkBudget()`, calls `createAgentJob()`, and submits the
|
|
1343
|
+
Job to Kubernetes via `submitAgentJob()`. The trigger-to-execution pipeline is now
|
|
1344
|
+
complete end-to-end.
|
|
1345
|
+
|
|
1346
|
+
---
|
|
1347
|
+
|
|
1348
|
+
### Gap 11: Multi-Session Orchestration
|
|
1349
|
+
|
|
1350
|
+
**What kradle does now:**
|
|
1351
|
+
`AgentSubagent` defines child-agent roles with task kinds and tool subsets.
|
|
1352
|
+
Stack reconciliation resolves subagent references.
|
|
1353
|
+
|
|
1354
|
+
**What it should do:**
|
|
1355
|
+
Orchestrate multiple concurrent K8s Jobs (one per subagent) for a single
|
|
1356
|
+
dispatch run. Track progress, handle dependencies between subagent tasks.
|
|
1357
|
+
|
|
1358
|
+
**What blocks it:**
|
|
1359
|
+
No multi-Job coordinator. Would need significant orchestration logic
|
|
1360
|
+
(task graph, dependency resolution, failure handling, rollback).
|
|
1361
|
+
|
|
1362
|
+
**Estimated effort:** 10-15 days (orchestrator, state machine, failure handling)
|
|
1363
|
+
|
|
1364
|
+
---
|
|
1365
|
+
|
|
1366
|
+
### Gap 12: Memory Query at Dispatch Time
|
|
1367
|
+
|
|
1368
|
+
**What kradle does now:**
|
|
1369
|
+
`AgentMemorySnapshot` pins memory state at dispatch time. `AgentMemoryQuery`
|
|
1370
|
+
records retrieval requests. Both are CRD resources. A snapshot is created during
|
|
1371
|
+
dispatch if an `AgentMemoryRepository` exists.
|
|
1372
|
+
|
|
1373
|
+
**What it should do:**
|
|
1374
|
+
Automatically inject the resolved memory snapshot content into the context bundle
|
|
1375
|
+
delivered to the agent Job as a mounted file or env var.
|
|
1376
|
+
|
|
1377
|
+
**What blocks it:**
|
|
1378
|
+
Memory content may be large (full git-backed repository). Need efficient delivery
|
|
1379
|
+
mechanism (in-cluster object store or read-only PVC mount).
|
|
1380
|
+
|
|
1381
|
+
**Estimated effort:** 2-3 days (snapshot injection into Job spec)
|
|
1382
|
+
|
|
1383
|
+
---
|
|
1384
|
+
|
|
1385
|
+
### Gap 13: Provider Config Resolution
|
|
1386
|
+
|
|
1387
|
+
**What kradle does now:**
|
|
1388
|
+
`AgentProviderConfig` stores API base URLs, auth types, and model rate tables.
|
|
1389
|
+
`checkBudget()` and `estimateCost()` use these rate tables for budget enforcement.
|
|
1390
|
+
|
|
1391
|
+
**What it should do:**
|
|
1392
|
+
Resolve provider config at Job creation time and pass credential references to
|
|
1393
|
+
the agent pod via env var secret refs (`valueFrom.secretKeyRef`).
|
|
1394
|
+
|
|
1395
|
+
**What blocks it:**
|
|
1396
|
+
Security boundary -- kradle should not pass raw API keys in Job env vars.
|
|
1397
|
+
Need `secretKeyRef` protocol (reference existing K8s Secret by name/key).
|
|
1398
|
+
|
|
1399
|
+
**Estimated effort:** 1-2 days (secretKeyRef injection in createAgentJob)
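
For reference, a `secretKeyRef` environment entry is a small object. The sketch below shows the shape such an injection could produce; `providerCredentialEnv` and the example names in the usage note are hypothetical, while `valueFrom.secretKeyRef` with `name` and `key` is the standard Kubernetes container env field.

```javascript
// Build a container env entry that references a Secret instead of embedding
// the raw credential in the Job spec.
function providerCredentialEnv(envName, secretName, secretKey) {
  if (!envName || !secretName || !secretKey) {
    throw new Error('envName, secretName and secretKey are all required');
  }
  return {
    name: envName,
    valueFrom: { secretKeyRef: { name: secretName, key: secretKey } },
  };
}
```

For example, `providerCredentialEnv('PROVIDER_API_KEY', 'provider-credentials', 'apiKey')` yields an entry that can be appended to the agent container's `env` array. The kubelet resolves the Secret when the pod starts, so the key never appears in the Job manifest.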
|
|
1400
|
+
|
|
1401
|
+
---
|
|
1402
|
+
|
|
1403
|
+
### Summary of Integration Gaps
|
|
1404
|
+
|
|
1405
|
+
| # | Gap | Status | Effort | Priority |
|
|
1406
|
+
|---|-----|--------|--------|----------|
|
|
1407
|
+
| 1 | Session Lifecycle Sync | **RESOLVED** (callback endpoint) | — | — |
|
|
1408
|
+
| 2 | Adapter Capability Caching | Open | 0.5d | Low |
|
|
1409
|
+
| 3 | Transport Resolution and Codec Injection | **RESOLVED** (env var injection) | — | — |
|
|
1410
|
+
| 4 | Lifecycle Event Emission | **RESOLVED** (9 hooks events) | — | — |
|
|
1411
|
+
| 5 | Cost Aggregation | **RESOLVED** (checkBudget + deadline) | — | — |
|
|
1412
|
+
| 6 | Real-Time Session Streaming | Open | 5-8d | High |
|
|
1413
|
+
| 7 | Approval Gate Integration | Partial (pre-dispatch) | 1-2d | Medium |
|
|
1414
|
+
| 8 | Context Bundle Delivery | Open | 1-2d | Medium |
|
|
1415
|
+
| 9 | Workspace Mount Coordination | **RESOLVED** (PVC mount in Job) | — | — |
|
|
1416
|
+
| 10 | Trigger-to-Dispatch Pipeline | **RESOLVED** (K8s Job submission) | — | — |
|
|
1417
|
+
| 11 | Multi-Session Orchestration | Open | 10-15d | Low |
|
|
1418
|
+
| 12 | Memory Query at Dispatch Time | Partial (snapshot created) | 2-3d | Medium |
|
|
1419
|
+
| 13 | Provider Config Resolution | Partial (rates used, creds pending) | 1-2d | Medium |
|
|
1420
|
+
|
|
1421
|
+
**Remaining open effort:** ~20.5-32.5 developer-days, summing the Effort column above (down from 35-60)
|
|
1422
|
+
|
|
1423
|
+
**Resolved critical path:** Gap 10 (Trigger-to-Dispatch) is complete. The basic
|
|
1424
|
+
dispatch-to-completion lifecycle (Gaps 1, 9, 10) is now end-to-end operational
|
|
1425
|
+
via K8s Jobs. The largest remaining gaps (6, 11) are enhancements, not blockers for that lifecycle; gaps 2, 7, 8, 12, and 13 also remain open or partial.
|
|
1426
|
+
|
|
1427
|
+
---
|
|
1428
|
+
|
|
1429
|
+
## 6. Architectural Choice: Why K8s Jobs (vs DaemonSet, Deployment, Raw Pod)
|
|
1430
|
+
|
|
1431
|
+
### Decision
|
|
1432
|
+
|
|
1433
|
+
Agent execution uses `batch/v1` Jobs (not DaemonSets, Deployments, or raw Pods).
|
|
1434
|
+
|
|
1435
|
+
### Rationale
|
|
1436
|
+
|
|
1437
|
+
| Option | Why Rejected |
|
|
1438
|
+
|--------|-------------|
|
|
1439
|
+
| **Raw Pod** | No automatic restart semantics; pod evictions or node failures lose the run. Pod lifecycle not tracked by K8s controller. Manual cleanup required. |
|
|
1440
|
+
| **Deployment** | Designed for long-lived services, not one-shot tasks. Scales by replica count, not by task. Restart policy conflicts with agent "run once to completion" semantics. |
|
|
1441
|
+
| **DaemonSet** | Runs one pod per node — not suitable for per-run isolation. Cannot express per-run resource limits or deadlines. |
|
|
1442
|
+
| **StatefulSet** | Designed for ordered, persistent services. Overkill for ephemeral agent runs. |
|
|
1443
|
+
| **batch/v1 Job (chosen)** | Native support for `completionMode`, `backoffLimit`, `activeDeadlineSeconds`. K8s garbage-collects finished Jobs automatically when `ttlSecondsAfterFinished` is set. Pod failure is surfaced as Job failure. Integrates with K8s scheduler for resource-aware placement. Works with cluster autoscaler for node provisioning. |
|
|
1444
|
+
|
|
1445
|
+
### Why `activeDeadlineSeconds` for Budget Enforcement
|
|
1446
|
+
|
|
1447
|
+
Budget enforcement requires a hard ceiling that survives process crashes, pod
|
|
1448
|
+
restarts, and network partitions. `activeDeadlineSeconds` is enforced by the
|
|
1449
|
+
K8s Job controller at the infrastructure level — even if the agent pod loses
|
|
1450
|
+
connectivity to Kradle, Kubernetes will terminate the pod when the deadline is
|
|
1451
|
+
exceeded and mark the Job as Failed. This gives Kradle a guaranteed out-of-band
|
|
1452
|
+
termination path independent of the callback mechanism.
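
To show where the deadline sits, here is a minimal sketch of a Job manifest builder. `createAgentJob` is named elsewhere in this document, but this signature, the container name, the `backoffLimit` of 0, and the one-hour `ttlSecondsAfterFinished` are assumptions; the manifest fields themselves are standard `batch/v1` Job fields.

```javascript
// Build a batch/v1 Job manifest for one dispatch run.
function createAgentJob({ runName, image, deadlineSeconds, env = {} }) {
  if (!Number.isInteger(deadlineSeconds) || deadlineSeconds <= 0) {
    throw new Error('deadlineSeconds must be a positive integer');
  }
  return {
    apiVersion: 'batch/v1',
    kind: 'Job',
    metadata: { name: `agent-${runName}` },
    spec: {
      activeDeadlineSeconds: deadlineSeconds, // hard wall-clock ceiling
      backoffLimit: 0,                        // assumed: no automatic retries
      ttlSecondsAfterFinished: 3600,          // assumed: clean up after one hour
      template: {
        spec: {
          restartPolicy: 'Never',
          containers: [{
            name: 'agent',
            image,
            env: Object.entries(env).map(([name, value]) => ({ name, value: String(value) })),
          }],
        },
      },
    },
  };
}
```

Transport settings such as `AGENT_MUX_TRANSPORT` and `TRANSPORT_MUX_CODEC` would be passed through the `env` argument, so they reach the pod at startup alongside the deadline.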
|
|
1453
|
+
|
|
1454
|
+
### Why One Job per Dispatch Run (vs Shared Runner Pool)
|
|
1455
|
+
|
|
1456
|
+
- **Isolation:** Each run gets its own process namespace, filesystem, and network
|
|
1457
|
+
policy. A bug in one agent cannot corrupt another run's workspace.
|
|
1458
|
+
- **Resource accounting:** Job resource requests/limits are per-run, enabling
|
|
1459
|
+
accurate cost tracking and scheduler placement decisions.
|
|
1460
|
+
- **Auditability:** K8s Job name is derived from the dispatch run name (`agent-{runName}`),
|
|
1461
|
+
making cross-referencing between Kradle resources and cluster logs trivial.
|
|
1462
|
+
- **Simplicity:** No need for a long-lived runner daemon to multiplex tasks; K8s
|
|
1463
|
+
handles pod lifecycle, log collection, and cleanup.
|
|
1464
|
+
|
|
1465
|
+
### Tradeoffs
|
|
1466
|
+
|
|
1467
|
+
- **Cold start latency:** Each run incurs image pull + pod scheduling time (~5-30s
|
|
1468
|
+
depending on cluster). RunnerPool warm replicas can mitigate this for CI Jobs,
|
|
1469
|
+
but agent Jobs currently pay the full cold start cost.
|
|
1470
|
+
- **Cluster resource pressure:** Many concurrent dispatch runs create many pods.
|
|
1471
|
+
Cluster autoscaler must be configured to scale node groups for agent workloads.
|
|
1472
|
+
- **Log retention:** K8s deletes finished Jobs and their pods once `ttlSecondsAfterFinished` elapses. Kradle must ship
|
|
1473
|
+
logs to an external store (or use `persistSessionEvent` artifact logging) before
|
|
1474
|
+
the TTL expires.
|
|
1475
|
+
|
|
1476
|
+
|
|
1477
|
+
---
|
|
1478
|
+
|
|
1479
|
+
## KServe Integration: CRD Wrapper Architecture
|
|
1480
|
+
|
|
1481
|
+
**Decision:** Wrap KServe `InferenceService` and `ServingRuntime` CRDs via Kradle resources (`KradleInferenceService`, `KradleServingRuntime`) rather than exposing KServe CRDs directly to Kradle users.
|
|
1482
|
+
|
|
1483
|
+
**Rationale:**
|
|
1484
|
+
- Kradle resources follow the standard Kradle CRD model (`apiVersion: kradle.a5c.ai/v1alpha1`), keeping a consistent API surface for all platform resources
|
|
1485
|
+
- The wrapper validates model formats, generates correct KServe manifests, and abstracts away KServe-specific versioning (`serving.kserve.io/v1beta1`)
|
|
1486
|
+
- `toProviderConfig()` creates a seamless bridge to `AgentProviderConfig`, allowing agent stacks to reference on-cluster models without knowing they are backed by KServe
|
|
1487
|
+
- Enables future backend swap (e.g., Seldon, Triton standalone) without changing consumer API
|
|
1488
|
+
|
|
1489
|
+
**Alternative considered:** Direct KServe CRD exposure via kubectl pass-through. Rejected because it breaks the Kradle CRD model and makes agent integration harder.
|
|
1490
|
+
|
|
1491
|
+
---
|
|
1492
|
+
|
|
1493
|
+
## Artifact Registry: Internal + External Dual-Mode Design
|
|
1494
|
+
|
|
1495
|
+
**Decision:** Support both internal storage (Kradle-managed etcd/S3/GCS/Azure Blob) and external integration (GitHub Packages, etc.) within the same `ArtifactRegistry` resource via `storageBackend` and `externalIntegration` fields.
|
|
1496
|
+
|
|
1497
|
+
**Rationale:**
|
|
1498
|
+
- Teams often have existing external registries (npm, PyPI, GitHub Packages) they need to access; forcing migration to internal storage would block adoption
|
|
1499
|
+
- Internal storage (especially S3/GCS) provides full control over retention, access policy, and cost; external-only would lose that control
|
|
1500
|
+
- Mirror mode syncs published versions to external backends, enabling gradual migration or dual-publish workflows
|
|
1501
|
+
- A single resource kind covering both modes reduces API surface complexity
|
|
1502
|
+
|
|
1503
|
+
**Alternative considered:** External-only (proxy to existing registries). Rejected because Kradle needs to enforce access policies and track download analytics that external registries do not expose.
|
|
1504
|
+
|
|
1505
|
+
---
|
|
1506
|
+
|
|
1507
|
+
## Assistant Runtime: In-Process vs. K8s Job
|
|
1508
|
+
|
|
1509
|
+
**Decision:** Implement the assistant as an in-process runtime (`assistant-runtime.js`) using the Anthropic API directly, rather than dispatching K8s Jobs.
|
|
1510
|
+
|
|
1511
|
+
**Rationale:**
|
|
1512
|
+
- Assistant chat requires low latency (< 1s to first token); K8s Job startup overhead (5-30s) is unacceptable for interactive chat
|
|
1513
|
+
- Session state (message history) must survive across requests within a session; in-process `globalThis` store achieves this trivially without distributed state management
|
|
1514
|
+
- Streaming (SSE) is native to the Anthropic API and the Next.js runtime; routing through K8s Jobs would require complex proxying
|
|
1515
|
+
- Assistant workloads are lightweight (API calls, no GPU) and do not benefit from the isolation that K8s Jobs provide for agent workloads
|
|
1516
|
+
|
|
1517
|
+
**Alternative considered:** K8s Job dispatch (same as agent dispatch). Rejected due to latency, streaming complexity, and session state management overhead.
|
|
1518
|
+
|
|
1519
|
+
**Trade-off accepted:** In-process runtime means the assistant does not get the resource isolation and budget enforcement of K8s Jobs. This is acceptable because the assistant uses a shared Anthropic API key with org-level rate limiting.
|
|
1520
|
+
|
|
1521
|
+
---
|
|
1522
|
+
|
|
1523
|
+
## Integration Gap Status Update
|
|
1524
|
+
|
|
1525
|
+
| Gap | Status | Notes |
|
|
1526
|
+
|-----|--------|-------|
|
|
1527
|
+
| KServe inference | **Implemented** | KradleInferenceService + KradleServingRuntime controllers, 7 API routes, web console pages |
|
|
1528
|
+
| Artifact registry | **Implemented** | 5 resource kinds, 6 API routes, web console pages |
|
|
1529
|
+
| Assistant runtime | **Implemented** | In-process runtime, SSE streaming, session management, 5 API routes |
|
|
1530
|
+
| Auth on mutating routes | **Implemented** | withAuth applied to all POST/DELETE/PUT handlers; webhook routes (verified by HMAC instead) and the agent callback intentionally skip withAuth |
|