switchroom 0.13.2 → 0.13.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (69) hide show
  1. package/dist/agent-scheduler/index.js +2 -2
  2. package/dist/auth-broker/index.js +2 -2
  3. package/dist/cli/switchroom.js +132 -214
  4. package/dist/host-control/main.js +2 -2
  5. package/dist/vault/approvals/kernel-server.js +2 -2
  6. package/dist/vault/broker/server.js +2 -2
  7. package/package.json +1 -1
  8. package/profiles/_base/start.sh.hbs +8 -8
  9. package/profiles/default/CLAUDE.md.hbs +1 -1
  10. package/telegram-plugin/dist/gateway/gateway.js +42 -10
  11. package/telegram-plugin/gateway/boot-probes.ts +13 -6
  12. package/telegram-plugin/gateway/gateway.ts +44 -6
  13. package/telegram-plugin/hooks/silent-end-interrupt-stop.mjs +5 -1
  14. package/telegram-plugin/silent-end.ts +56 -0
  15. package/telegram-plugin/tests/boot-probes.test.ts +26 -2
  16. package/telegram-plugin/tests/silent-end.test.ts +69 -0
  17. package/telegram-plugin/uat/scenarios/bridge-flap-resilience-dm.test.ts +166 -0
  18. package/skills/buildkite-agent-infrastructure/SKILL.md +0 -321
  19. package/skills/buildkite-agent-infrastructure/agents/openai.yaml +0 -6
  20. package/skills/buildkite-agent-infrastructure/assets/buildkite-icon-large.png +0 -0
  21. package/skills/buildkite-agent-infrastructure/assets/buildkite-icon-small.png +0 -0
  22. package/skills/buildkite-agent-infrastructure/references/audit-logging.md +0 -87
  23. package/skills/buildkite-agent-infrastructure/references/graphql-mutations.md +0 -690
  24. package/skills/buildkite-agent-infrastructure/references/instance-shapes.md +0 -38
  25. package/skills/buildkite-agent-infrastructure/references/pipeline-templates.md +0 -73
  26. package/skills/buildkite-agent-infrastructure/references/self-hosted-agents.md +0 -137
  27. package/skills/buildkite-agent-infrastructure/references/sso-saml.md +0 -92
  28. package/skills/buildkite-agent-runtime/SKILL.md +0 -509
  29. package/skills/buildkite-agent-runtime/agents/openai.yaml +0 -6
  30. package/skills/buildkite-agent-runtime/assets/buildkite-icon-large.png +0 -0
  31. package/skills/buildkite-agent-runtime/assets/buildkite-icon-small.png +0 -0
  32. package/skills/buildkite-agent-runtime/references/flag-reference.md +0 -417
  33. package/skills/buildkite-agent-runtime/references/patterns-and-recipes.md +0 -555
  34. package/skills/buildkite-api/SKILL.md +0 -308
  35. package/skills/buildkite-api/agents/openai.yaml +0 -6
  36. package/skills/buildkite-api/assets/buildkite-icon-large.png +0 -0
  37. package/skills/buildkite-api/assets/buildkite-icon-small.png +0 -0
  38. package/skills/buildkite-api/references/graphql-reference.md +0 -195
  39. package/skills/buildkite-api/references/patterns.md +0 -44
  40. package/skills/buildkite-api/references/webhooks.md +0 -161
  41. package/skills/buildkite-cli/SKILL.md +0 -397
  42. package/skills/buildkite-cli/agents/openai.yaml +0 -6
  43. package/skills/buildkite-cli/assets/buildkite-icon-large.png +0 -0
  44. package/skills/buildkite-cli/assets/buildkite-icon-small.png +0 -0
  45. package/skills/buildkite-cli/references/command-reference.md +0 -181
  46. package/skills/buildkite-migration/SKILL.md +0 -195
  47. package/skills/buildkite-pipelines/SKILL.md +0 -481
  48. package/skills/buildkite-pipelines/agents/openai.yaml +0 -6
  49. package/skills/buildkite-pipelines/assets/buildkite-icon-large.png +0 -0
  50. package/skills/buildkite-pipelines/assets/buildkite-icon-small.png +0 -0
  51. package/skills/buildkite-pipelines/examples/basic-pipeline.yml +0 -24
  52. package/skills/buildkite-pipelines/examples/optimized-pipeline.yml +0 -100
  53. package/skills/buildkite-pipelines/references/advanced-patterns.md +0 -286
  54. package/skills/buildkite-pipelines/references/retry-and-error-codes.md +0 -131
  55. package/skills/buildkite-pipelines/references/step-types-reference.md +0 -225
  56. package/skills/buildkite-secure-delivery/SKILL.md +0 -182
  57. package/skills/buildkite-secure-delivery/agents/openai.yaml +0 -6
  58. package/skills/buildkite-secure-delivery/assets/buildkite-icon-large.png +0 -0
  59. package/skills/buildkite-secure-delivery/assets/buildkite-icon-small.png +0 -0
  60. package/skills/buildkite-secure-delivery/references/oidc-cloud-providers.md +0 -83
  61. package/skills/buildkite-secure-delivery/references/package-publishing.md +0 -100
  62. package/skills/buildkite-test-engine/SKILL.md +0 -256
  63. package/skills/buildkite-test-engine/agents/openai.yaml +0 -6
  64. package/skills/buildkite-test-engine/assets/buildkite-icon-large.png +0 -0
  65. package/skills/buildkite-test-engine/assets/buildkite-icon-small.png +0 -0
  66. package/skills/buildkite-test-engine/examples/bktec-splitting.yml +0 -16
  67. package/skills/buildkite-test-engine/examples/collector-pipeline.yml +0 -11
  68. package/skills/buildkite-test-engine/references/collectors.md +0 -198
  69. package/skills/buildkite-test-engine/references/splitting-examples.md +0 -93
@@ -0,0 +1,166 @@
1
+ /**
2
+ * Bridge-flap resilience scenario — regression guard for #1613 / #1616.
3
+ *
4
+ * ## The bug this guards
5
+ *
6
+ * The handoff-briefing summarizer shells out to a headless `claude -p`
7
+ * once per turn (handoff Stop hook). Before #1616 it ran without
8
+ * `--strict-mcp-config`, so it auto-discovered the agent's project
9
+ * `.mcp.json` and started every MCP server in it — including
10
+ * `switchroom-telegram`. That spun up a *second* telegram bridge
11
+ * process which registered against the same gateway socket as the
12
+ * live agent's real bridge; the two collided under the gateway's
13
+ * register-race close, producing an A↔B "bridge reconnect race" flap
14
+ * every ~2s for the ~7-9s the `claude -p` lived. The handoff hook
15
+ * fires every turn, so did the flap. A turn whose completion landed
16
+ * inside a flap burst could have its `turn_end` signal eaten — the
17
+ * agent looked wedged for that turn.
18
+ *
19
+ * The fix (#1616): the summarizer passes `--strict-mcp-config`, so
20
+ * the headless `claude -p` loads zero MCP servers and never spawns a
21
+ * competing bridge. The structural guard against a new offending
22
+ * callsite is `tests/bridge-flap-regression-guard.test.ts`; this
23
+ * scenario is the behavioural backstop.
24
+ *
25
+ * ## What this scenario asserts (root-cause-agnostic by design)
26
+ *
27
+ * The checks are symptom-based, so they catch a flap reintroduced by
28
+ * ANY future change — not only a regression of #1616:
29
+ *
30
+ * 1. Send a handful of DMs in succession — each drives a turn (and a
31
+ * handoff-hook fire). **Primary assertion:** every DM gets a reply
32
+ * within budget. Directly catches both the flap (eats turn_end)
33
+ * and the wedge (a zero-bridge gap strands the inbound).
34
+ * 2. **Forensic assertion:** inspect the agent's gateway-supervisor.log
35
+ * over the test window and assert the `bridge disconnected` density
36
+ * stays BELOW a flap threshold. One healthy persistent bridge
37
+ * produces only a trickle of disconnects; a sustained reconnect
38
+ * race produces dozens in tight ~2s bursts.
39
+ *
40
+ * ## Why the log inspection
41
+ *
42
+ * A flap is a server-side phenomenon that does not always surface as
43
+ * a missed reply (a burst can self-heal in ~20-30s). The `bridge
44
+ * disconnected` count is the transport-agnostic flap symptom. This
45
+ * scenario shells into the agent container via `docker exec` to read
46
+ * the gateway log; if docker is unavailable the log assertion is
47
+ * skipped with a warning and the responsiveness checks still run.
48
+ *
49
+ * ## Tolerances
50
+ *
51
+ * - `DISCONNECT_FLAP_THRESHOLD` is the max acceptable `bridge
52
+ * disconnected` count over the window. Post-#1616 a healthy ~4-turn
53
+ * run sits well under 16 (measured ~8-13, including anonymous
54
+ * probe-connection churn); a sustained flap is 20-40+. 16 sits
55
+ * comfortably in the gap.
56
+ */
57
+
58
+ import { describe, expect, it } from "vitest";
59
+ import { execSync } from "node:child_process";
60
+ import { spinUp } from "../harness.js";
61
+ import type { ObservedMessage } from "../driver.js";
62
+
63
+ const AGENT = "test-harness";
64
+ const CONTAINER = `switchroom-${AGENT}`;
65
+ const GATEWAY_LOG = "/var/log/switchroom/gateway-supervisor.log";
66
+
67
+ const DM_COUNT = 4;
68
+ const PER_DM_TIMEOUT_MS = 30_000;
69
+ const OVERALL_DEADLINE_MS = 180_000;
70
+
71
+ // Post-#1616 a healthy ~4-turn run logs ~8-13 `bridge disconnected`
72
+ // lines (one persistent bridge + anonymous probe-connection churn).
73
+ // A sustained A↔B flap produces 20-40+ in tight ~2s bursts. 16 sits
74
+ // in the gap.
75
+ const DISCONNECT_FLAP_THRESHOLD = 16;
76
+
77
+ /** Total line count of the agent's gateway-supervisor.log, or null if
78
+ * the container/log is unreachable (CI without the container). */
79
+ function gatewayLogLineCount(): number | null {
80
+ try {
81
+ const out = execSync(
82
+ `docker exec ${CONTAINER} sh -lc 'wc -l < ${GATEWAY_LOG}'`,
83
+ { encoding: "utf8", stdio: ["ignore", "pipe", "ignore"] },
84
+ );
85
+ return parseInt(out.trim(), 10);
86
+ } catch {
87
+ return null;
88
+ }
89
+ }
90
+
91
+ /** Count `bridge disconnected` lines after `sinceLine` in the log. */
92
+ function disconnectCountSince(sinceLine: number): number | null {
93
+ try {
94
+ const out = execSync(
95
+ `docker exec ${CONTAINER} sh -lc ` +
96
+ `'awk "NR>${sinceLine}" ${GATEWAY_LOG} | grep -c "bridge disconnected" || true'`,
97
+ { encoding: "utf8", stdio: ["ignore", "pipe", "ignore"] },
98
+ );
99
+ return parseInt(out.trim(), 10);
100
+ } catch {
101
+ return null;
102
+ }
103
+ }
104
+
105
+ describe("uat: bridge-flap resilience — agent stays responsive, gateway does not flap", () => {
106
+ it(
107
+ "every DM gets a reply and the gateway does not flap across turns",
108
+ async () => {
109
+ const baselineLine = gatewayLogLineCount();
110
+ const sc = await spinUp({ agent: AGENT });
111
+ try {
112
+ const overallDeadline = Date.now() + OVERALL_DEADLINE_MS;
113
+
114
+ for (let i = 1; i <= DM_COUNT; i++) {
115
+ await sc.sendDM(`flap-resilience probe ${i}/${DM_COUNT}: reply with OK${i}`);
116
+
117
+ const remaining = Math.min(
118
+ PER_DM_TIMEOUT_MS,
119
+ overallDeadline - Date.now(),
120
+ );
121
+ expect(
122
+ remaining,
123
+ `overall deadline hit before DM ${i} — earlier turns were too slow`,
124
+ ).toBeGreaterThan(0);
125
+
126
+ const reply = await sc.expectMessage(
127
+ (m: ObservedMessage) => m.fromBot && !m.edited,
128
+ { from: "bot", timeout: remaining },
129
+ );
130
+ expect(
131
+ reply.text.length,
132
+ `DM ${i}/${DM_COUNT} produced an empty reply — a flap may have ` +
133
+ `eaten the turn_end signal`,
134
+ ).toBeGreaterThan(0);
135
+ }
136
+
137
+ // Responsiveness held for all DM_COUNT turns. Now check the
138
+ // server-side flap signal.
139
+ if (baselineLine == null) {
140
+ console.warn(
141
+ "[bridge-flap-resilience] docker exec unavailable — skipping " +
142
+ "the gateway-log flap assertion; responsiveness checks passed.",
143
+ );
144
+ return;
145
+ }
146
+
147
+ const disconnectCount = disconnectCountSince(baselineLine);
148
+ expect(
149
+ disconnectCount,
150
+ "could not read gateway log after the run — container went away",
151
+ ).not.toBeNull();
152
+ expect(
153
+ disconnectCount as number,
154
+ `gateway logged ${disconnectCount} "bridge disconnected" lines across ` +
155
+ `${DM_COUNT} turns — at/above the flap threshold ` +
156
+ `(${DISCONNECT_FLAP_THRESHOLD}). A parasitic bridge is racing the ` +
157
+ `live one — check for a headless 'claude -p' spawned without ` +
158
+ `--strict-mcp-config (#1613/#1616).`,
159
+ ).toBeLessThan(DISCONNECT_FLAP_THRESHOLD);
160
+ } finally {
161
+ await sc.tearDown();
162
+ }
163
+ },
164
+ OVERALL_DEADLINE_MS + 30_000,
165
+ );
166
+ });
@@ -1,321 +0,0 @@
1
- ---
2
- name: buildkite-agent-infrastructure
3
- description: >
4
- Buildkite cluster / organization / platform administration. Whenever
5
- the user's message starts with the phrase "In Buildkite cluster
6
- admin," — regardless of what follows — use this skill; that prefix
7
- is a hard trigger that wins over `buildkite-api`, `buildkite-cli`,
8
- and `buildkite-agent-runtime`. Provision and govern Buildkite CI
9
- infrastructure: creating clusters, creating queues, scaling queues,
10
- setting up hosted agents, right-sizing instance shapes, optimizing
11
- CI costs, managing agent tokens, managing cluster secrets,
12
- configuring SSO, setting up SAML, setting up audit logging, creating
13
- pipeline templates, and standardizing pipelines across teams. Use
14
- when the user says, verbatim: "set up SAML", "manage agent tokens",
15
- "configure SSO", "set up audit logging", "Let's configure SSO.",
16
- "I need to configure SSO.", "Could you scale queues for me?",
17
- "Scale queues, please.", "scale queues", "Create a queue, please.",
18
- "Create a cluster, please.", "set up hosted agents", "manage
19
- cluster secrets", "right-size instance shapes", "optimize CI
20
- costs", "standardize pipelines across teams", "create a pipeline
21
- template", "configure agents", and typo'd variants like "manage
22
- clusetr secrets", "configuree agents", "set up hostted agents".
23
- Anything about buildkite-agent.cfg, agent tags, agent tokens, cluster
24
- queues, hosted agent instance shapes, pipeline templates, audit
25
- events, SSO/SAML providers, queue wait time, agent lifecycle hooks,
26
- or Buildkite platform governance fires this skill — even when the
27
- request mentions GraphQL or API calls (the rival `buildkite-api` is
28
- for generic webhook/pagination/scripting, NOT for SSO/queue/cluster
29
- admin which always belongs here).
30
- Do NOT use when the user is calling `buildkite-agent <subcommand>` from
31
- inside a running step (token use, artifact upload, annotate) — that's
32
- `buildkite-agent-runtime`; or when the user just wants cluster CLI
33
- shortcuts like `bk cluster ...` — that's `buildkite-cli`.
34
- ---
35
-
36
- # Buildkite Platform Engineering
37
-
38
- Provision and govern Buildkite CI infrastructure at scale: clusters, queues, hosted agent sizing, secrets, agent tokens, self-hosted configuration, lifecycle hooks, pipeline templates, audit logging, SSO/SAML, and cost optimization.
39
-
40
- ## Quick Start
41
-
42
- Create a cluster with a hosted queue to get builds running immediately. **Start with hosted agents unless there is a specific reason to self-host** (GPU workloads, on-prem, custom hardware). Self-hosted queues require provisioning your own agents; builds hang "scheduled" until agents connect.
43
-
44
- All GraphQL mutations go to `https://graphql.buildkite.com/v1` with a Bearer token:
45
-
46
- ```bash
47
- curl -sS -X POST "https://graphql.buildkite.com/v1" \
48
- -H "Authorization: Bearer $BUILDKITE_API_TOKEN" \
49
- -H "Content-Type: application/json" \
50
- -d '{"query": "<GRAPHQL_QUERY_OR_MUTATION>", "variables": { ... }}'
51
- ```
52
-
53
- **Step 1:** Get the organization ID: `query { organization(slug: "my-org") { id } }`
54
-
55
- **Step 2:** Create a cluster:
56
-
57
- ```graphql
58
- mutation {
59
- clusterCreate(input: {
60
- organizationId: "org-id"
61
- name: "Production"
62
- description: "Production CI cluster"
63
- emoji: ":rocket:"
64
- color: "#14CC80"
65
- }) { cluster { id uuid name } }
66
- }
67
- ```
68
-
69
- **Step 3:** Create a hosted queue with a specific instance shape:
70
-
71
- ```graphql
72
- mutation {
73
- clusterQueueCreate(input: {
74
- organizationId: "org-id"
75
- clusterId: "cluster-id"
76
- key: "linux-large"
77
- description: "Linux 8 vCPU / 32 GB for heavy compilation"
78
- hostedAgents: { instanceShape: LINUX_AMD64_8X32 }
79
- }) { clusterQueue { id key } }
80
- }
81
- ```
82
-
83
- **Step 4:** Create a pipeline in the cluster via GraphQL `pipelineCreate` or the REST API, then trigger a build.
84
-
85
- > For pipeline creation via REST and GraphQL, see the **buildkite-api** skill.
86
-
87
- > For pipeline YAML syntax including `agents:` routing and `secrets:` access, see the **buildkite-pipelines** skill.
88
- > For `bk cluster` CLI commands, see the **buildkite-cli** skill.
89
-
90
- ## Clusters
91
-
92
- A cluster is the top-level container for queues, agent tokens, and secrets. Every organization starts with one default cluster; create additional clusters to isolate workloads (e.g., production vs. staging, team-specific).
93
-
94
- ### Create a cluster
95
-
96
- ```bash
97
- curl -s -X POST "https://api.buildkite.com/v2/organizations/my-org/clusters" \
98
- -H "Authorization: Bearer $BUILDKITE_API_TOKEN" \
99
- -H "Content-Type: application/json" \
100
- -d '{
101
- "name": "Backend",
102
- "description": "Backend team CI cluster",
103
- "emoji": ":gear:",
104
- "color": "#0B79CE"
105
- }'
106
- ```
107
-
108
- Fields: `name` (required), `description`, `emoji`, `color`, `default_queue_id` (optional).
109
-
110
- > For full REST and GraphQL API reference, see the **buildkite-api** skill.
111
-
112
- ## Queues and Hosted Agents
113
-
114
- Queues route builds to agents. **Hosted queues** (Buildkite-managed compute) are the recommended starting point — builds run immediately. **Self-hosted queues** require connecting your own agents; builds remain "scheduled" until agents connect.
115
-
116
- Create queues with the `clusterQueueCreate` GraphQL mutation (shown in Quick Start above). To create a **self-hosted queue**, omit `hostedAgents`. Self-hosted agents connect by targeting the queue key in their configuration.
117
-
118
- ### Instance shapes and sizing guide
119
-
120
- Full list: `references/instance-shapes.md`. Quick sizing:
121
-
122
- | Workload | Shape |
123
- |----------|-------|
124
- | Linting, unit tests | `LINUX_AMD64_2X4` |
125
- | Monorepos, multi-service | `LINUX_AMD64_4X16` |
126
- | Heavy compilation (C++, Rust) | `LINUX_AMD64_8X32` |
127
- | Docker builds, ML prep | `LINUX_AMD64_16X64` |
128
- | iOS / macOS | `MACOS_M4_6X28` or `MACOS_M4_12X56` |
129
-
130
- Start with the smallest shape that keeps builds under target time. Scale up if queue wait exceeds 2 minutes.
131
-
132
- ### Queue design patterns
133
-
134
- - **Keep 1-2 static instances in the default queue** — avoids cold-start latency on pipeline uploads
135
- - **Retire oldest agents first during scale-down** — preserves warm caches
136
- - **Trial pattern** — test new shapes/architectures on a separate queue before migrating
137
- - **Tag builds with metadata** for cost attribution by queue or team
138
-
139
- Temporarily pause dispatch to a queue for maintenance or cost control using `clusterQueuePauseDispatch` / `clusterQueueResumeDispatch` GraphQL mutations. See `references/graphql-mutations.md` for examples.
140
-
141
- ## Cluster Secrets
142
-
143
- Cluster secrets are encrypted, cluster-scoped values accessible from pipeline steps. They replace hardcoded credentials and environment-hook-based secret injection. Create, update, and rotate secrets via the REST API at `/v2/organizations/{org}/clusters/{cluster_id}/secrets`.
144
-
145
- ### Secret key constraints
146
-
147
- | Rule | Detail |
148
- |------|--------|
149
- | Must start with | A letter (A-Z, a-z) |
150
- | Allowed characters | Letters, numbers, underscores only |
151
- | Prohibited prefixes | `buildkite`, `bk` (reserved) |
152
- | Max key length | 255 characters |
153
- | Max value size | 8 KB |
154
-
155
- ### Access policies
156
-
157
- Restrict which pipelines and branches can access a secret by adding a `policy` object with `claims`. Available claim types: `pipeline_slug`, `build_branch`, `build_creator`, `build_source`, `build_creator_team`, `cluster_queue_key`. Claims support `*` wildcards. See [Buildkite Secrets docs](https://buildkite.com/docs/pipelines/security/secrets/buildkite-secrets.md) for policy examples.
158
-
159
- Value rotation uses a separate endpoint (`PUT .../secrets/{id}/value`) from description/policy updates (`PUT .../secrets/{id}`).
160
-
161
- > For `secrets:` YAML syntax, see the **buildkite-pipelines** skill. For `buildkite-agent secret get`, see the **buildkite-agent-runtime** skill.
162
-
163
- ## Agent Tokens
164
-
165
- Agent tokens authenticate agents connecting to a cluster. Each token is scoped to a single cluster.
166
-
167
- ### Create a token
168
-
169
- ```bash
170
- curl -s -X POST "https://api.buildkite.com/v2/organizations/my-org/clusters/$CLUSTER_ID/tokens" \
171
- -H "Authorization: Bearer $BUILDKITE_API_TOKEN" \
172
- -H "Content-Type: application/json" \
173
- -d '{
174
- "description": "Backend CI agents - production",
175
- "allowed_ip_addresses": "10.0.0.0/8"
176
- }'
177
- ```
178
-
179
- | Field | Required | Description |
180
- |-------|----------|-------------|
181
- | `description` | Yes | Human-readable token description |
182
- | `allowed_ip_addresses` | No | Comma-separated CIDR ranges restricting agent connections |
183
- | `expires_at` | No | ISO 8601 expiry timestamp |
184
-
185
- The token value is only returned at creation time. Store it in a secrets manager immediately.
186
-
187
- ## Self-Hosted Agents and Lifecycle Hooks
188
-
189
- Self-hosted agents run on your own infrastructure, configured via `buildkite-agent.cfg`. Prefer clustered agents for new deployments — they provide secret scoping, queue isolation, and better organizational control. For full configuration reference, `buildkite-agent.cfg` examples, and clustered vs. unclustered agent details, see `references/self-hosted-agents.md`.
190
-
191
- Agent lifecycle hooks execute at specific points during job execution: `environment` → `pre-checkout` → `checkout` → `post-checkout` → `pre-command` → `command` → `post-command` → `pre-exit` → `pre-artifact`. Agent-level hooks run first, then repository hooks, then plugin hooks. For hook details and examples, see `references/self-hosted-agents.md`.
192
-
193
- ### Hosted agent caching behavior
194
-
195
- **Cache volumes on hosted agents are non-deterministic** — jobs may or may not get a warm cache. Treat cache volumes as performance accelerators, not guarantees. Cache volumes are **pipeline-scoped** (not shared across pipelines). For deterministic caching, use Docker images with pre-built dependencies instead. Git mirrors can be enabled via cache volumes to accelerate checkout; mount `.git/lfs/objects` in cache volumes and pre-install `git-lfs` in the agent image.
196
-
197
- ### Hosted agent checkout performance
198
-
199
- Buildkite's default checkout **prioritizes completeness over speed** — it may be noticeably slower than GitHub Actions for the same repo. Optimize with the Sparse Checkout plugin (monorepos), Git mirrors (frequent builds), or the Git Shallow Clone plugin (repos where full history is unnecessary).
200
-
201
- ### Hosted agent custom hooks
202
-
203
- Hosted agents support custom hooks via a custom agent image. Add hooks in a Dockerfile:
204
-
205
- ```dockerfile
206
- FROM buildkite/agent:latest
207
-
208
- ENV BUILDKITE_ADDITIONAL_HOOKS_PATHS=/custom/hooks
209
- COPY ./hooks/*.sh /custom/hooks/
210
- RUN chmod +x /custom/hooks/*.sh
211
- ```
212
-
213
- ### Hosted agent pre-installed tools
214
-
215
- Linux hosted agents include: `bash`, `curl`, `wget`, `git`, `docker`, `python3`, `jq`.
216
-
217
- **`nvm` is NOT pre-installed.** Do not source `~/.nvm/nvm.sh` — it will fail silently or exit 127. Use `fnm` instead:
218
-
219
- ```bash
220
- curl -fsSL https://fnm.vercel.app/install | bash -s -- --install-dir "$HOME/.fnm" --skip-shell
221
- export PATH="$HOME/.fnm:$PATH" && eval "$(fnm env --use-on-cd)"
222
- fnm install 20 && fnm use 20
223
- ```
224
-
225
- `fnm` downloads from `nodejs.org` directly and works for all versions including EOL.
226
-
227
- **GitHub release asset downloads may be blocked.** `release-assets.githubusercontent.com` is unreachable from hosted agents. Pre-install tools distributed as GitHub release binaries (CodeQL, Scorecard, Trivy) in a custom agent image using `agentImageRef`.
228
-
229
- **Always verify queue creation** after a GraphQL mutation by listing queues via `GET /v2/organizations/{org}/clusters/{cluster_id}/queues`. Silent GraphQL errors can leave the cluster without a hosted queue — if the list is empty, retry via the REST API.
230
-
231
- ## Plugin Security Controls
232
-
233
- Restrict which plugins agents can run:
234
-
235
- - **Agent-level allowlisting** — use `allowed-plugins` in `buildkite-agent.cfg` to restrict to approved plugins
236
- - **`no-plugins=true`** — disable all plugins on sensitive agents
237
- - **Cluster-based policies** — apply different plugin restrictions per cluster based on security requirements
238
-
239
- Audit plugin repositories proactively — Buildkite does not automatically alert to plugin vulnerabilities.
240
-
241
- ## Pipeline Templates
242
-
243
- Pipeline templates (Enterprise-only) standardize pipeline YAML across the organization. See `references/pipeline-templates.md`.
244
-
245
- ## Audit Logging
246
-
247
- Audit logging (Enterprise-only) tracks organization-level events for compliance. Query via GraphQL or stream to a SIEM via Amazon EventBridge. See `references/audit-logging.md`.
248
-
249
- ## SSO/SAML
250
-
251
- Buildkite supports SAML 2.0 (Okta, Azure AD, Google Workspace, OneLogin). See `references/sso-saml.md` for setup flow.
252
-
253
- ## Cost Optimization
254
-
255
- ### Cost reduction patterns
256
-
257
- | Pattern | Savings | How |
258
- |---------|---------|-----|
259
- | Right-size instance shapes | 20-40% | Match shape to actual resource needs |
260
- | Use `disconnect-after-job` for self-hosted | 10-20% | Ephemeral agents don't idle between jobs |
261
- | Pause queues during off-hours | 10-30% | `clusterQueuePauseDispatch` on nights/weekends |
262
- | Skip unnecessary work with `if_changed` | 10-30% | Only run tests for changed code paths |
263
- | Use `priority` to run critical jobs first | Indirect | Reduces developer wait time for important builds |
264
-
265
- > For `if_changed` and pipeline optimization patterns, see the **buildkite-pipelines** skill.
266
-
267
- ## Observability and Queue Monitoring
268
-
269
- | Tool | Purpose | How it works |
270
- |------|---------|-------------|
271
- | `buildkite-agent-metrics` | Fleet-level queue and job metrics | Polls the Buildkite API; emits to CloudWatch, Datadog, StatsD |
272
- | Agent health check service | Per-agent process health | Exposes Prometheus endpoint; scrape from each agent host |
273
-
274
- **Start with queue profiling** — wait time and checkout time are the biggest, cheapest wins. Target: queue wait time under 2 minutes.
275
-
276
- ### Scaling decision flow
277
-
278
- ```
279
- Queue wait > 2 min?
280
- ├── Yes → Check agent count
281
- │ ├── Agents maxed out → Scale up (add agents or increase shape)
282
- │ ├── Agents idle → Check for job distribution issues (tags, queue routing)
283
- │ └── No agents → Check token, connectivity, agent health
284
- └── No → Queue is healthy
285
- ```
286
-
287
- ## Common Mistakes
288
-
289
- | Mistake | Fix |
290
- |---------|-----|
291
- | Secret key starting with `buildkite` or `bk` | Use a different prefix — these are reserved |
292
- | Secret key with dashes or dots | Only letters, numbers, underscores allowed: `MY_SECRET_KEY` not `my-secret-key` |
293
- | Not storing agent token at creation time | Token is only shown once — store in secrets manager immediately |
294
- | Org-level tokens for clustered agents | Use cluster-scoped tokens (`clusterAgentTokenCreate`) |
295
- | Over-provisioning instance shapes | Start small, monitor, scale up only when builds are slow |
296
- | No `disconnect-after-job` on autoscaled agents | Set `disconnect-after-job=true` for ephemeral pools |
297
- | One large queue for all workloads | Create specialized queues per workload type |
298
- | Cluster creation returns HTTP 500 | List existing clusters first; rename the Default cluster via PATCH as a workaround |
299
- | "Upgrade to Platform Pro" on hosted queue creation | Fall back: create self-hosted queue, install `buildkite-agent` locally with `--spawn 3` |
300
- | Expecting cache volumes to always be warm | Design builds to work without cache — volumes are non-deterministic |
301
- | One IAM role for all queues | Assign different IAM roles per queue; scope secrets by `cluster_queue_key` |
302
- | Scaling down newest agents first | Retire oldest agents first to preserve warm caches |
303
- | Jobs hang "scheduled" with agents connected | Check `default_queue_id` matches the agent's queue tag; update via PATCH |
304
-
305
- ## Additional Resources
306
-
307
- - **`references/graphql-mutations.md`** — GraphQL mutations for clusters, queues, tokens, templates, SSO, audit
308
- - **`references/instance-shapes.md`** — All hosted agent instance shapes
309
- - **`references/self-hosted-agents.md`** — Agent config, clustered vs. unclustered, lifecycle hooks
310
- - **`references/pipeline-templates.md`** — Template mutations and strategy (Enterprise)
311
- - **`references/audit-logging.md`** — Audit queries, SIEM/EventBridge integration (Enterprise)
312
- - **`references/sso-saml.md`** — SSO/SAML provider setup
313
-
314
- ## Further Reading
315
-
316
- - [Buildkite Docs for LLMs](https://buildkite.com/docs/llms.txt)
317
- - [Manage clusters](https://buildkite.com/docs/clusters/manage-clusters.md)
318
- - [Manage cluster queues](https://buildkite.com/docs/clusters/manage-queues.md)
319
- - [Manage cluster secrets](https://buildkite.com/docs/pipelines/security/secrets/buildkite-secrets.md)
320
- - [Agent configuration](https://buildkite.com/docs/agent/v3/configuration.md)
321
- - [Agent hooks](https://buildkite.com/docs/agent/v3/hooks.md)
@@ -1,6 +0,0 @@
1
- interface:
2
- display_name: "Buildkite Agent Infrastructure"
3
- short_description: "Clusters, queues, hosted agents, secrets, SSO, and cost optimization"
4
- icon_small: "./assets/buildkite-icon-small.png"
5
- icon_large: "./assets/buildkite-icon-large.png"
6
- brand_color: "#00D974"
@@ -1,87 +0,0 @@
1
- # Audit Logging and SIEM Integration
2
-
3
- Audit logging (Enterprise-only) tracks organization-level events for compliance and security monitoring.
4
-
5
- ## Query audit events
6
-
7
- ```graphql
8
- query {
9
- organization(slug: "my-org") {
10
- auditEvents(
11
- first: 50
12
- occurredAtFrom: "2026-03-01T00:00:00Z"
13
- occurredAtTo: "2026-03-26T23:59:59Z"
14
- ) {
15
- edges {
16
- node {
17
- type
18
- occurredAt
19
- actor { name type uuid }
20
- subject { name type uuid }
21
- data
22
- }
23
- }
24
- }
25
- }
26
- }
27
- ```
28
-
29
- | Filter | Description |
30
- |--------|-------------|
31
- | `occurredAtFrom` / `occurredAtTo` | ISO 8601 time range |
32
- | `type` | Specific audit event type (e.g., `ORGANIZATION_UPDATED`) |
33
- | `subjectType` | Filter by subject type (e.g., `PIPELINE`, `AGENT_TOKEN`) |
34
- | `subjectUUID` | Filter by specific subject |
35
- | `order` | `RECENTLY_OCCURRED` (default) or `OLDEST_OCCURRED` |
36
-
37
- ## High-severity events to monitor
38
-
39
- | Event type | Why it matters |
40
- |------------|---------------|
41
- | `agent_token.created` / `.deleted` | Agent authentication changes |
42
- | `member.invited` / `.removed` | Team membership changes |
43
- | `sso_provider.created` / `.updated` | SSO configuration changes |
44
- | `pipeline_schedule.created` | New automated triggers |
45
- | `cluster_secret.created` / `.deleted` | Secret management changes |
46
- | `organization.updated` | Org-level setting changes |
47
-
48
- ## SIEM integration via Amazon EventBridge
49
-
50
- Stream audit events to a SIEM in real time using EventBridge:
51
-
52
- - **Source:** `aws.partner/buildkite.com/buildkite/<partner-event-source-id>`
53
- - **Detail type:** `"Audit Event Logged"`
54
-
55
- Event payload structure:
56
-
57
- ```json
58
- {
59
- "organization": {
60
- "uuid": "org-uuid",
61
- "graphql_id": "T3JnYW5pemF0aW9u...",
62
- "slug": "my-org"
63
- },
64
- "event": {
65
- "uuid": "event-uuid",
66
- "occurred_at": "2026-03-26T14:30:00Z",
67
- "type": "agent_token.created",
68
- "data": { },
69
- "subject_type": "AgentToken",
70
- "subject_uuid": "token-uuid",
71
- "subject_name": "Production agents",
72
- "context": {
73
- "request_id": "req-uuid",
74
- "request_ip": "203.0.113.42",
75
- "session_user_uuid": "user-uuid",
76
- "request_user_agent": "Mozilla/5.0..."
77
- }
78
- },
79
- "actor": {
80
- "name": "Jane Engineer",
81
- "type": "USER",
82
- "uuid": "user-uuid"
83
- }
84
- }
85
- ```
86
-
87
- Route high-severity events to PagerDuty, Splunk, or Datadog via EventBridge rules matching on `detail.event.type`.