cpflow 5.0.4 → 5.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,649 @@
1
+ # Connecting Control Plane workloads to a private AWS RDS/Aurora database
2
+
3
+ Control Plane (CPLN) does not offer a managed Postgres service. For production Postgres on AWS, the typical setup
4
+ is **Amazon RDS** (or **Aurora**) in a **private VPC subnet**, reached from CPLN workloads over a private network
5
+ path — not the public internet.
6
+
7
+ This guide covers the recommended setup: **CPLN Cloud Wormhole via an Agent**.
8
+
9
+ > **Sourcing note — verify field casing before you apply.** Field names, schema, and limits in this guide
10
+ > are sourced from the public Control Plane documentation at <https://shakadocs.controlplane.com> as of
11
+ > May 2026 and have **not** been end-to-end verified against a live org. This matters because a casing
12
+ > mismatch **fails silently**: `cpln apply` accepts the file, ignores the unrecognized field, and the
13
+ > workload then can't reach the database — with no error pointing back to the YAML. Before applying in
14
+ > production, diff your edited files against a fresh `cpln identity get <name> -o yaml-slim` (and
15
+ > `cpln agent get <name> -o yaml`) export to confirm the exact field names, and consult
16
+ > `cpln <command> --help` for the latest CLI flags.
17
+
18
+ ## Why private networking
19
+
20
+ - **Security.** The database stays in private subnets with no public IP and no Internet Gateway egress. Inbound
21
+ traffic to the DB only comes from a specific security group inside your VPC — never the public internet.
22
+ - **Compliance.** SOC 2, HIPAA, and most internal security reviews expect "no publicly addressable database."
23
+ - **Stable allowlists.** No need to maintain CPLN egress IP allowlists on the RDS security group as CPLN's
24
+ infrastructure evolves — the Agent runs inside *your* VPC.
25
+ - **Cost.** No NAT or data-egress fees for in-region database traffic when the Agent is in the same region as RDS.
26
+
27
+ If you only need RDS for **development/test** review apps and don't care about exposing it publicly, see the
28
+ [Database section of the README](../README.md#database) for the simpler "public RDS + security-group allowlist"
29
+ pattern. Don't run a production database that way.
30
+
31
+ ## Architecture
32
+
33
+ ```text
34
+ ┌──────────────────────────────────────────────────────────────────────────────┐
35
+ │ Control Plane (CPLN) │
36
+ │ │
37
+ │ ┌──────────────┐ ┌──────────────────────┐ │
38
+ │ │ Workload │ ──────► │ Identity │ │
39
+ │ │ (rails app) │ │ networkResources: │ │
40
+ │ │ │ │ - name: db-primary │ │
41
+ │ │ DATABASE_URL│ │ FQDN: db.rds… │ │
42
+ │ │ = postgres:// │ resolverIP: │ │
43
+ │ │ user:pwd@ │ 10.0.0.2 │ │
44
+ │ │ db.rds…:5432/myapp │ ports: [5432] │ │
45
+ │ │ │ agentLink: │ │
46
+ │ │ │ //agent/vpc1 │ │
47
+ │ └──────────────┘ └──────────┬───────────┘ │
48
+ │ │ │
49
+ │ Wormhole (encrypted) │
50
+ └───────────────────────────────────────┼──────────────────────────────────────┘
51
+ │ outbound TLS, agent-initiated
52
+ ┌───────────────────────────────────────▼──────────────────────────────────────┐
53
+ │ Your AWS VPC (e.g. us-east-2) │
54
+ │ │
55
+ │ ┌─────────────────────────────────┐ ┌─────────────────────────┐ │
56
+ │ │ Auto Scaling Group (Agents) │ ────► │ RDS / Aurora cluster │ │
57
+ │ │ - 2× EC2 (t3.medium+ prod) │ 5432 │ private subnets only │ │
58
+ │ │ - Ubuntu 24.04 LTS │ │ SG: allow from agent │ │
59
+ │ │ - SG: egress to RDS SG:5432 │ │ SG only │ │
60
+ │ └─────────────────────────────────┘ └─────────────────────────┘ │
61
+ └──────────────────────────────────────────────────────────────────────────────┘
62
+ ```
63
+
64
+ Key properties:
65
+
66
+ - The Agent **dials out** to CPLN over TLS. No inbound connection from CPLN to your VPC is required, so RDS stays fully
67
+ private.
68
+ - Traffic through the wormhole is **encrypted in transit** — the agent uses WireGuard for the tunnel between
69
+ your VPC and CPLN.
70
+ - The workload reaches the resource through the wormhole. For a **TLS** database like RDS/Aurora, connect by
71
+ the cluster's **FQDN** (the real endpoint) so the certificate matches; CPLN routes that FQDN through the
72
+ agent because it's declared in `networkResources`. The short `name` is the resource's identifier — usable
73
+ as the host only for non-TLS targets.
74
+ - The Agent is **org-scoped** — one agent can serve workloads in any GVC in the org.
75
+ - The Identity (and its `networkResources`) is **GVC-scoped** — an org with staging and production GVCs
76
+ needs the same `networkResources` declared in each GVC's identity, but can share a single agent.
77
+
78
+ ## Prerequisites
79
+
80
+ - An AWS account with:
81
+ - A VPC with at least one **private subnet** for RDS and at least one **subnet with outbound internet access**
82
+ for the agent ASG (the agent needs to reach CPLN; this is typically a public subnet or a private subnet
83
+ behind a NAT gateway).
84
+ - An RDS or Aurora cluster in the private subnets, with a security group you can edit.
85
+ - IAM permissions to create launch templates, ASGs, and security groups.
86
+ - A Control Plane org where you have the **`agent.create`**, **`identity.edit`**, and **`workload.edit`**
87
+ permissions.
88
+ - `cpln` CLI installed and authenticated against the target org.
89
+
90
+ ## Step 1 — Create the Agent and AWS user-data script
91
+
92
+ The CPLN UI first generates a **bootstrap config** for the agent, then uses that config to generate the
93
+ cloud-provider script. For AWS, the EC2 Launch Template needs the generated **Userdata Script** — not the
94
+ raw bootstrap config JSON.
95
+
96
+ 1. In the CPLN UI, go to **Agents** → **New**.
97
+ 2. Give it a stable name (e.g. `aws-us-east-2-prod`). The org-scoped link will be
98
+ `//agent/aws-us-east-2-prod`.
99
+ 3. Pick **AWS** as the platform.
100
+ 4. Click **Create**, then save the bootstrap config JSON manually or with **Download Config File**.
101
+ Treat this like a secret. It is not retrievable after you close the modal — if you lose it, delete
102
+ the agent and recreate it.
103
+ 5. Click **Next** and copy or download the AWS **Userdata Script**. This is the script you paste into the
104
+ EC2 Launch Template in step 2.
105
+
106
+ If you already created the agent but still have the bootstrap config, open the agent's page, choose
107
+ **Actions** → **Download Scripts**, paste/import the bootstrap config, and copy the YAML from the
108
+ **Userdata Script** tab.
109
+
110
+ > **CLI creation note:** do not use `cpln apply --file agent.yaml` for this step. It can create the
111
+ > Agent object, but it does not produce the bootstrap config needed by the EC2 host. Use the Console
112
+ > path above, or use `cpln agent create` and capture stdout to a bootstrap config file:
113
+ >
114
+ > ```sh
115
+ > cpln agent create \
116
+ > --name aws-us-east-2-prod \
117
+ > --description "Wormhole agent in customer AWS VPC, us-east-2." \
118
+ > --org my-org > bootstrap-config.json
119
+ > ```
120
+ >
121
+ > Do not paste `bootstrap-config.json` directly into EC2 user data. Render an AWS Userdata Script from
122
+ > that config first (Console **Download Scripts**, or a CLI/scripted equivalent that you have verified
123
+ > in your org), then use the rendered script in the Launch Template.
124
+
125
+ ## Step 2 — Launch the Agent on AWS
126
+
127
+ Recommended baseline:
128
+
129
+ - **AMI:** Ubuntu Server 24.04 LTS.
130
+ - **Instance type:** `t3.small` for testing; `t3.medium` or larger for production (CPLN recommends a minimum of
131
+ 2 vCPU / 4 GiB). In practice the binding constraint is **network bandwidth**, not CPU/RAM, so size on the
132
+ instance's *baseline* (not burst) bandwidth for your expected DB throughput. The agent ships in both Intel
133
+ (x86-64) and ARM (Graviton) builds, so ARM families cost less for the same capability — use `t4g` for
134
+ burstable workloads (e.g. `t4g.medium` in place of `t3.medium`) and `c7g`/`c6g` when you need sustained
135
+ bandwidth at higher load.
136
+ - **Launch Template:** set the user data to the AWS **Userdata Script** generated in step 1.
137
+ - **Security note:** EC2 user data is visible to anyone with `ec2:DescribeLaunchTemplates` /
138
+ `ec2:DescribeInstanceAttribute`, and readable from the instance itself at
139
+ `http://169.254.169.254/latest/user-data` via IMDS. For production, prefer storing the generated
140
+ Userdata Script (or the bootstrap config plus a verified render step) in AWS Secrets Manager or SSM
141
+ Parameter Store and having a small bootstrap snippet in user data fetch and execute it at startup
142
+ (granting the instance role `secretsmanager:GetSecretValue` or `ssm:GetParameter` on that specific
143
+ resource ARN). That requires an IAM instance profile: create or reuse one, attach only the narrowly
144
+ scoped `GetSecretValue` / `GetParameter` permission on the specific secret or parameter ARN, and set
145
+ the profile in the Launch Template under **Advanced details** → **IAM instance profile** before relying
146
+ on the user-data fetch. At minimum, audit who has those `Describe*` permissions in the account.
147
+ - **Enforce IMDSv2.** Set the Launch Template's metadata options to require a session token
148
+ (`HttpTokens: required`, a low `HttpPutResponseHopLimit` such as `1`) so a server-side request forgery
149
+ (SSRF) bug in a workload can't read the user data or instance role credentials from `169.254.169.254`
150
+ without a token. IMDSv1 leaves that endpoint open to any unauthenticated in-instance request.
151
+ - **Auto Scaling Group:**
152
+ - Testing: desired 1 / min 1 / max 1.
153
+ - Production: desired 2 / min 2 / max 4 across at least two availability zones. Two agents give you
154
+ a rolling-restart path during upgrades and survive a single-AZ outage.
155
+ - **Subnets:** subnets must have outbound internet access (public subnet with auto-assign IPv4, or private
156
+ subnet with NAT). The agent dials CPLN over TLS; it does *not* need any inbound rule from the internet.
157
+ When choosing AZs, note that not every EC2 instance type is offered in every AZ — confirm your chosen type
158
+ is available in the AZ(s) you target, especially if you co-locate the agent with the RDS writer's AZ to
159
+ avoid cross-AZ data-transfer charges (see [Cost](#cost)).
160
+
161
+ After the ASG is healthy, verify the agent registered:
162
+
163
+ 1. CPLN UI → **Agents** → select your agent → a green heartbeat should appear within 2–3 minutes.
164
+ 2. Or: `cpln agent get aws-us-east-2-prod -o yaml` and look for a recent `lastModified` / status.
165
+
166
+ ## Step 3 — Security groups
167
+
168
+ You need two security groups. The RDS rule is the obvious one, but the agent also needs outbound paths for
168
+ CPLN control-plane TLS and DNS; without them it can register but fail to resolve the endpoint (or fail to
169
+ register at all) even when the DB rule is correct:
171
+
172
+ - **Agent SG** (attached to the ASG instances):
173
+ - Egress to the **RDS SG** on port `5432` (or `3306` for MySQL) — the database traffic itself.
174
+ - Egress to `0.0.0.0/0` on **TCP `443`** — the agent dials out to CPLN over TLS to register and
175
+ tunnel traffic. If your environment forbids `0.0.0.0/0`, restrict to your CPLN region's
176
+ documented egress endpoints (or route via a NAT gateway / egress proxy with that allowlist).
177
+ - Egress to the **VPC DNS resolver** on **UDP/TCP `53`** — required when `networkResources` uses
178
+ `FQDN` + `resolverIP` so the agent can resolve the cluster endpoint. The VPC `.2` resolver lives
179
+ inside the VPC, so this is typically already permitted by default egress; lock it down to the
180
+ resolver IP if your default egress is restrictive.
181
+ - **RDS SG** (attached to the DB cluster): ingress from the Agent SG on port `5432`.
182
+
183
+ ```text
184
+ Agent SG egress ──► RDS SG port 5432 (database)
185
+ Agent SG egress ──► 0.0.0.0/0 TCP 443 (CPLN control plane)
186
+ Agent SG egress ──► VPC .2 DNS UDP/TCP 53 (FQDN resolution)
187
+ RDS SG ingress ◄── Agent SG port 5432
188
+ ```
189
+
190
+ Reference the agent SG by ID, not by CIDR. That way, the rule stays correct as ASG instances are recycled.
191
+
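As a rough sketch, the database rule pair above could be scripted with the AWS CLI. The function names, the `AWS_CLI` override, and the security group IDs passed in are illustrative assumptions, not part of cpflow or the CPLN docs. Both rules reference the peer security group by ID rather than by CIDR.

```sh
# Hypothetical helpers. AWS_CLI defaults to the real CLI; set AWS_CLI=echo for a dry run
# that prints each command instead of executing it.
AWS_CLI=${AWS_CLI:-aws}

# RDS SG ingress: allow the agent SG (referenced by ID, not CIDR) on the DB port.
allow_agent_into_rds() {
  agent_sg=$1
  rds_sg=$2
  port=${3:-5432}
  "$AWS_CLI" ec2 authorize-security-group-ingress \
    --group-id "$rds_sg" \
    --ip-permissions "IpProtocol=tcp,FromPort=$port,ToPort=$port,UserIdGroupPairs=[{GroupId=$agent_sg}]"
}

# Agent SG egress: allow the agent instances to reach the RDS SG on the DB port.
allow_agent_out_to_rds() {
  agent_sg=$1
  rds_sg=$2
  port=${3:-5432}
  "$AWS_CLI" ec2 authorize-security-group-egress \
    --group-id "$agent_sg" \
    --ip-permissions "IpProtocol=tcp,FromPort=$port,ToPort=$port,UserIdGroupPairs=[{GroupId=$rds_sg}]"
}
```

Running with `AWS_CLI=echo` first is a cheap way to review the exact calls before they touch the account. The `443` and DNS egress rules are left out here because many security groups already allow all outbound traffic by default.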
192
+ > **Simpler alternative: share the RDS SG.** Instead of managing a separate egress/ingress rule pair, you can
193
+ > attach the **RDS SG itself** to the agent instances. Each agent then carries two security groups — its own
194
+ > (egress to CPLN on `443` and the VPC DNS resolver on `53`) plus the RDS SG. If the RDS SG already has a
195
+ > self-referencing rule that allows ingress from itself on `5432`, this removes the need to maintain a
196
+ > dedicated Agent-SG → RDS-SG rule.
197
+
198
+ ## Step 4 — Declare the Network Resource on the Identity
199
+
200
+ cpflow provisions a **single identity per app**, named `<app>-identity`, that is shared by every
201
+ workload in the app (see `Config#identity` and the same `{{APP_IDENTITY_LINK}}` referenced by
202
+ `templates/rails.yml`, `templates/sidekiq.yml`, and `templates/daily-task.yml`). We need to add a
203
+ `networkResources` entry to **that existing identity**, not create a new one — and not a separate
204
+ identity per workload.
205
+
206
+ > **Important: `cpln apply` is a full replace, not a merge.** Use `yaml-slim` for manifests you plan
207
+ > to re-apply; full `yaml` output can include server-managed metadata such as IDs, versions, and
208
+ > timestamps. Applying a stripped-down identity YAML
209
+ > that only contains `networkResources` will silently drop any other fields already set on the
210
+ > identity (tags, description, and any previously configured `networkResources`). Always **export
211
+ > the live identity first**, edit it in place, then re-apply. (Policy bindings themselves live on
212
+ > the Policy resource — not on the identity — so applying the identity won't touch them, but the
213
+ > verification step below still checks the policy → identity link in case anything else recreated it.)
214
+
215
+ First, find the identity name. cpflow's `Config#identity` defines this as `<app>-identity` and the
216
+ `{{APP_IDENTITY}}` template variable expands to the same — so for an app named `my-app-production` the
217
+ identity is `my-app-production-identity`. Confirm against your org (`cpln identity get` with no ref returns
218
+ all identities in the GVC; there is no `list` subcommand):
219
+
220
+ ```sh
221
+ cpln identity get --gvc my-app-production --org my-org \
222
+ | grep my-app-production-identity
223
+ ```
224
+
225
+ Export it and add the `networkResources` block. Look up your VPC's DNS resolver IP (typically the VPC CIDR
226
+ base + 2, e.g. `10.0.0.2` for a `10.0.0.0/16` VPC) and your RDS cluster endpoint:
227
+
228
+ ```sh
229
+ # Replace my-app-production-identity with whatever `cpln identity get` shows for your app.
230
+ cpln identity get my-app-production-identity \
231
+ --gvc my-app-production --org my-org -o yaml-slim > identity-db.yaml
232
+ ```
233
+
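The "VPC CIDR base + 2" rule for the resolver IP can be computed rather than worked out by hand. A minimal sketch, assuming the argument is the VPC's primary IPv4 CIDR; the function name is hypothetical:

```sh
# Hypothetical helper: print the VPC DNS resolver address (network base + 2)
# for a primary IPv4 CIDR such as 10.0.0.0/16.
# The body runs in a subshell so the IFS change does not leak into the caller.
vpc_resolver_ip() (
  base=${1%/*}   # strip the /prefix length
  IFS=.
  set -- $base   # split the base address into four octets
  printf '%s.%s.%s.%s\n' "$1" "$2" "$3" "$(( $4 + 2 ))"
)
```

For example, `vpc_resolver_ip 10.0.0.0/16` prints `10.0.0.2`, the value used for `resolverIP` in this guide.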
234
+ Edit `identity-db.yaml` and **add** the `networkResources` block — keep every other apply-safe field
235
+ exactly as exported:
236
+
237
+ > ⚠️ **Verify field casing against your org before applying.** The field names below (`FQDN`, `IPs`,
238
+ > `agentLink`, `resolverIP`) are sourced from public CPLN docs. If the live API uses different casing,
239
+ > `cpln apply` will accept the file but silently ignore the resource — workloads will hit
240
+ > `could not translate host name` with no obvious link to the identity YAML. Diff your edited file
241
+ > against the original `yaml-slim` export from `cpln identity get` to confirm.
242
+
243
+ ```yaml
244
+ # identity-db.yaml (after edit — abbreviated, your file will have more fields)
245
+ kind: identity
246
+ name: my-app-production-identity # output of `cpln identity get` for your app
247
+ description: ... # leave existing description alone
248
+ # … any other existing fields stay as-is …
249
+ networkResources:
250
+ - name: db-primary # resource identifier (TLS workloads connect via the FQDN below)
251
+ agentLink: //agent/aws-us-east-2-prod
252
+ FQDN: myapp-prod.cluster-xxxxx.us-east-2.rds.amazonaws.com # ← workload connects to this endpoint
253
+ resolverIP: 10.0.0.2 # your VPC's .2 resolver, reachable from the agent
254
+ ports:
255
+ - 5432
256
+ # Optional second resource for Aurora reader endpoint:
257
+ - name: db-readers
258
+ agentLink: //agent/aws-us-east-2-prod
259
+ FQDN: myapp-prod.cluster-ro-xxxxx.us-east-2.rds.amazonaws.com
260
+ resolverIP: 10.0.0.2
261
+ ports:
262
+ - 5432
263
+ ```
264
+
265
+ If you'd rather pin to specific IPs instead of an FQDN, swap `FQDN` + `resolverIP` for an `IPs:` array of
266
+ 1–5 IPv4 addresses. See [IPs vs FQDN](#ips-vs-fqdn--which-to-use) below for trade-offs.
267
+
268
+ Apply it back:
269
+
270
+ ```sh
271
+ cpln apply --file identity-db.yaml --gvc my-app-production --org my-org
272
+ rm identity-db.yaml # optional: the live identity is the source of truth; re-export anytime with `cpln identity get`
273
+ ```
274
+
275
+ (Use whatever `--org` / `--gvc` flags your team uses, or rely on the org/GVC defaults set by `cpln profile`.)
276
+
277
+ Confirm the identity now has the new `networkResources` and that the policy still references it with the
278
+ `reveal` permission:
279
+
280
+ ```sh
281
+ # 1. The identity itself should now list `networkResources`.
282
+ # Use -A 15 so the full block (name, agentLink, FQDN, resolverIP, ports) is visible without re-running.
283
+ cpln identity get my-app-production-identity --gvc my-app-production --org my-org -o yaml \
284
+ | grep -A 15 networkResources
285
+
286
+ # 2. Policies are separate resources — they reference the identity by link, not by replacing it.
287
+ # Confirm the existing cpflow-generated policy (typically <app-prefix>-secrets-policy) has `reveal`
288
+ # and the identity link in the same binding block. The full identity link cpflow uses is
289
+ # /org/<org>/gvc/<app>/identity/<app>-identity (see Config#identity_link).
290
+ # (The `cpln` CLI exposes `policy get`/`policy query` — there is no `policy list` subcommand.)
291
+ cpln policy get my-app-secrets-policy --org my-org -o yaml
292
+ ```
293
+
294
+ Look for a binding shaped like this; both `reveal` and the identity link must be in the same item:
295
+
296
+ ```yaml
297
+ bindings:
298
+ - permissions:
299
+ - reveal
300
+ principalLinks:
301
+ - /org/my-org/gvc/my-app-production/identity/my-app-production-identity
302
+ ```
303
+
304
+ If the binding is missing, no longer contains the identity link, or does not grant `reveal`, your workload
305
+ won't be able to read its secrets — re-bind the identity to the secrets policy directly with the CPLN CLI:
306
+
307
+ ```sh
308
+ # Verify this subcommand exists in your installed cpln version first: cpln policy --help
309
+ cpln policy add-binding my-app-secrets-policy --org my-org \
310
+ --identity /org/my-org/gvc/my-app-production/identity/my-app-production-identity \
311
+ --permission reveal
312
+ ```
313
+
314
+ This is the same call `cpflow setup-app` makes internally (see `Controlplane#bind_identity_to_policy`).
315
+ `cpflow apply-template app` does **not** recreate the binding — its `app` template only defines the
316
+ GVC, and `--add-app-identity` only inserts an identity object, not the policy binding. Adjust the
317
+ policy name if you've overridden `secrets_policy_name` in `controlplane.yml`.
318
+
319
+ > **Naming note.** The secret and policy default to `<app-prefix>-secrets` / `<app-prefix>-secrets-policy`,
320
+ > where the prefix is the matched `controlplane.yml` entry name (`Config#secrets`). That prefix can be
321
+ > **shorter than the full app name** used for the GVC and identity — e.g. an app `my-app-production` matched
322
+ > by a `my-app` entry has identity `my-app-production-identity` but secret `my-app-secrets` and policy
323
+ > `my-app-secrets-policy`. Confirm yours with `cpln secret get --org <org>` (secrets are org-scoped) if unsure.
324
+
325
+ Schema notes (per CPLN's documented `networkResources` schema):
326
+
327
+ - `name` — a short, stable identifier for the resource (`db-primary`, `db-readers`). A workload may address
328
+ the resource by this name **or** by its `FQDN` — but a **TLS** target like RDS/Aurora must be reached by
329
+ the `FQDN` (see Step 5), so the name serves mainly as the resource label here.
330
+ - `agentLink` — `//agent/<agent-name>` (org-scoped).
331
+ - Exactly one of `IPs` or `FQDN` is required per resource:
332
+ - `IPs` — array of **1 to 5** IPv4 addresses. The agent routes to exactly those IPs.
333
+ - `FQDN` — a fully qualified domain name; the agent resolves it from inside your VPC. Pair it with
334
+ `resolverIP` (below) unless the agent can already resolve the FQDN on its own.
335
+ - `resolverIP` — **optional** IPv4 of a DNS server the agent uses to resolve the `FQDN` from inside the
336
+ private VPC (typically the VPC's `.2` resolver, e.g. `10.0.0.2`). Not needed if the agent can already
337
+ resolve the FQDN; when set, the agent queries this resolver for the FQDN.
338
+ - `ports` — array of **1 to 10** ports, each `0–65535`. Required.
339
+ - One identity may declare **up to 50** `networkResources`.
340
+ - For native AWS PrivateLink, an alternative `awsPrivateLink` field exists on identities (see
341
+ [Alternatives](#alternatives--when-not-to-use-an-agent) below).
342
+
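Because a malformed entry may not produce a useful error, it can help to check values against the documented limits before applying. The sketch below covers only the `ports` limits listed above (1 to 10 entries, each 0 to 65535); the function name is hypothetical and it does not parse the YAML for you.

```sh
# Hypothetical pre-flight check mirroring the documented `ports` limits:
# between 1 and 10 entries, each an integer from 0 to 65535.
check_ports() {
  if [ "$#" -lt 1 ] || [ "$#" -gt 10 ]; then
    echo "expected 1-10 ports, got $#" >&2
    return 1
  fi
  for port in "$@"; do
    case $port in
      '' | *[!0-9]*) echo "not a port number: $port" >&2; return 1 ;;
    esac
    if [ "$port" -gt 65535 ]; then
      echo "port out of range: $port" >&2
      return 1
    fi
  done
  return 0
}
```

Usage: `check_ports 5432` succeeds silently, while `check_ports 70000` prints a message to stderr and returns a non-zero status.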
343
+ ### IPs vs FQDN — which to use?
344
+
345
+ - **FQDN** (recommended for Aurora and any RDS cluster with failover): set `FQDN` to the cluster endpoint,
346
+ and set `resolverIP` to the VPC's `.2` resolver unless the agent can already resolve the endpoint on its
347
+ own. The agent re-resolves on connect, so no identity change is required on failover, but plan for the
348
+ full 60–120 s window (failure detection, replica promotion, DNS update, propagation), not just Aurora's
349
+ ~30 s DNS TTL. See [Aurora failover](#aurora-failover) in Operations.
350
+ - **IPs** (only when you control the target's IP stability): a single-instance RDS that you don't expect to
351
+ recycle, or a static private IP behind a network appliance. The agent will route to exactly those IPs
352
+ — if RDS recycles the underlying instance and the IP changes, you must update the identity manually.
353
+
354
+ For most ShakaCode setups, use FQDN — and a TLS RDS/Aurora connection needs the FQDN as the `DATABASE_URL`
355
+ host anyway so the certificate matches (see Step 5). The IPs form is shown here mainly to make the schema
356
+ concrete and for the rare case where you want to pin to a specific IP.
357
+
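One way to plan for the 60–120 s failover window in deploy or health-check scripts is to retry a connectivity probe until a deadline instead of failing on the first error. This is a sketch under that assumption; the helper name and the probe command are illustrative, and application-level connection pools need their own retry settings.

```sh
# Hypothetical helper: run a command every INTERVAL seconds until it succeeds,
# giving up once MAX_WAIT seconds of sleeping have accumulated.
# Usage: retry_for MAX_WAIT INTERVAL command [args...]
retry_for() {
  max_wait=$1
  interval=$2
  shift 2
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$max_wait" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

For example, `retry_for 120 5 pg_isready -d "$DATABASE_URL"` would probe every 5 seconds for up to two minutes, assuming `pg_isready` is installed in the image.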
358
+ ## Step 5 — Point the workload at the resource
359
+
360
+ The workload's `DATABASE_URL` uses the **RDS/Aurora endpoint** (the `FQDN` you declared in step 4) as the
361
+ hostname, not the short resource `name` — see the TLS note below for why. CPLN still routes the endpoint
362
+ through the agent because it's declared in `networkResources`.
363
+
364
+ CPLN's `cpln://secret/<name>.<key>` syntax substitutes the **entire env var value** at workload startup — it
365
+ is not a substring interpolation. So you have three options for assembling a `DATABASE_URL` that includes a
366
+ secret password:
367
+
368
+ > **TLS requirement — use the FQDN endpoint as the host.** RDS and Aurora require TLS by default in current
369
+ > parameter groups, and their certificate is issued for the **actual cluster endpoint**. Per Control Plane,
370
+ > a TLS resource must be reached by its FQDN: connecting through the short `networkResources` name
371
+ > (`db-primary`) fails the TLS handshake unless you disable certificate validation. So use the real endpoint
372
+ > (e.g. `myapp-prod.cluster-xxxxx.us-east-2.rds.amazonaws.com`) as the `DATABASE_URL` host in every example
373
+ > below. CPLN routes it through the agent because that FQDN is declared in `networkResources`.
374
+ >
375
+ > Always include `sslmode=require` so the connection must be encrypted; libpq's default (`prefer`) tries TLS
376
+ > but silently falls back to plaintext if the server allows it. Because the host now matches the certificate,
377
+ > the stricter `verify-ca` / `verify-full` modes also work — add the AWS RDS CA bundle to the container image
378
+ > and reference it with `sslrootcert=/path/to/rds-ca.pem` (or `PGSSLROOTCERT`). See the
379
+ > [AWS RDS SSL/TLS docs](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.SSL.html).
380
+
381
+ **Option A: store the full URL in a secret (simpler).** Recommended for production where the DB credentials
382
+ rarely change.
383
+
384
+ Add the connection string to the existing app dictionary secret created by `cpflow setup-app`
385
+ (`my-app-secrets` in these examples). **Use the CPLN UI** (Secrets → select the app secret → add keys) —
386
+ it's the safest place to enter credentials and avoids shell history, tty echo, and plaintext request bodies
387
+ entirely.
388
+
389
+ > ⚠️ **CLI form below is reference-only.** It writes credentials to your shell history, local disk, and the
390
+ > `cpln` request body in plaintext. The common "prefix with a space" trick only works when `HISTCONTROL`
391
+ > includes `ignorespace` or `ignoreboth`, which is **not** set by default on many stripped-down bastion/EC2
392
+ > shells. Prefer the UI for any real credential.
393
+
394
+ ```sh
395
+ # Reference only — prefer the UI. mktemp gives an unpredictable, owner-only (mode 600) file —
396
+ # safer than a fixed path under world-readable /tmp.
397
+ secret_file=$(mktemp)
398
+ cpln secret reveal my-app-secrets --org my-org -o yaml-slim > "$secret_file"
399
+ # Edit "$secret_file" and add these keys under data, preserving existing entries:
400
+ # url: "postgres://app:supersecret@myapp-prod.cluster-xxxxx.us-east-2.rds.amazonaws.com:5432/myapp_production?sslmode=require"
401
+ # url_readers: "postgres://app_readonly:readsecret@myapp-prod.cluster-ro-xxxxx.us-east-2.rds.amazonaws.com:5432/myapp_production?sslmode=require"
402
+ cpln apply --file "$secret_file" --org my-org
403
+ rm -f "$secret_file"
404
+ ```
405
+
406
+ > **Secret policy target.** `cpflow setup-app` creates the app secret (`my-app-secrets`) and a secrets
407
+ > policy that targets only that secret. If you instead store database values in a separate secret such as
408
+ > `my-app-database`, add `//secret/my-app-database` to the existing policy's `targetLinks` while preserving
409
+ > `//secret/my-app-secrets`; otherwise `cpln://secret/my-app-database...` references will fail at workload
410
+ > startup even if the identity has a `reveal` binding.
411
+ >
412
+ > ```sh
413
+ > cpln policy get my-app-secrets-policy --org my-org -o yaml-slim > /tmp/my-app-secrets-policy.yml
414
+ > # Edit targetLinks to include both:
415
+ > # - //secret/my-app-secrets
416
+ > # - //secret/my-app-database
417
+ > cpln apply --file /tmp/my-app-secrets-policy.yml --org my-org
418
+ > ```
419
+ >
420
+ > See [secrets-and-env-values.md](./secrets-and-env-values.md) for the generated app secret/policy flow.
421
+
422
+ Then in your workload template:
423
+
424
+ ```yaml
425
+ # In your rails.yml workload template
426
+ spec:
427
+ containers:
428
+ - name: rails
429
+ env:
430
+ - name: DATABASE_URL
431
+ value: cpln://secret/my-app-secrets.url
432
+ # Optional, for read replicas:
433
+ - name: DATABASE_REPLICA_URL
434
+ value: cpln://secret/my-app-secrets.url_readers
435
+ ```
436
+
437
+ **Option B: keep the password in a secret, assemble the URL in app code.** Use this if you want the URL host
438
+ to live in plaintext config so it's easy to grep for in templates.
439
+
440
+ Add just the password to the existing app secret. Again, **prefer the CPLN UI** to enter the credential; the
441
+ CLI form below is reference-only (same shell-history caveat as Option A). If you create
442
+ `my-app-database` instead of adding the password to `my-app-secrets`, update the app secrets policy
443
+ `targetLinks` as shown in Option A.
444
+
445
+ ```sh
446
+ # Reference only — prefer the UI. mktemp gives an unpredictable, owner-only (mode 600) file —
447
+ # safer than a fixed path under world-readable /tmp.
448
+ secret_file=$(mktemp)
449
+ cpln secret reveal my-app-secrets --org my-org -o yaml-slim > "$secret_file"
450
+ # Edit "$secret_file" and add this key under data, preserving existing entries:
451
+ # password: "supersecret"
452
+ cpln apply --file "$secret_file" --org my-org
453
+ rm -f "$secret_file"
454
+ ```
455
+
456
+ Then set the workload env:
457
+
458
+ ```yaml
459
+ env:
460
+ - name: DATABASE_HOST
461
+ value: myapp-prod.cluster-xxxxx.us-east-2.rds.amazonaws.com # the networkResources[].FQDN from step 4
462
+ - name: DATABASE_PORT
463
+ value: "5432"
464
+ - name: DATABASE_NAME
465
+ value: myapp_production
466
+ - name: DATABASE_USER
467
+ value: app
468
+ - name: DATABASE_PASSWORD
469
+ value: cpln://secret/my-app-secrets.password
470
+ ```
471
+
472
+ …and in `config/database.yml`:
473
+
474
+ ```yaml
475
+ production:
476
+ adapter: postgresql
477
+ host: <%= ENV.fetch("DATABASE_HOST") %>
478
+ port: <%= ENV.fetch("DATABASE_PORT") %>
479
+ database: <%= ENV.fetch("DATABASE_NAME") %>
480
+ username: <%= ENV.fetch("DATABASE_USER") %>
481
+ password: <%= ENV.fetch("DATABASE_PASSWORD") %>
482
+ sslmode: require
483
+ ```
484
+
485
+ **Option C: compose the URL at the CPLN env layer with `$(VAR)` interpolation.** Use this to keep credentials
486
+ as separate secret keys *and* avoid touching `database.yml` — the workload still receives a single
487
+ `DATABASE_URL`.
488
+
489
+ While `cpln://secret/...` replaces a whole value, CPLN *also* supports `$(VAR)` references that interpolate
490
+ **other env vars** defined on the same workload. Combine the two: back each component with a secret, then
491
+ assemble `DATABASE_URL` from those env vars.
492
+
493
+ ```yaml
494
+ env:
495
+ - name: DATABASE_USER
496
+ value: cpln://secret/my-app-secrets.DATABASE_USER
497
+ - name: DATABASE_PASSWORD
498
+ value: cpln://secret/my-app-secrets.DATABASE_PASSWORD
499
+ - name: DATABASE_HOST
500
+ value: myapp-prod.cluster-xxxxx.us-east-2.rds.amazonaws.com # the networkResources[].FQDN from step 4
501
+ - name: DATABASE_URL
502
+ value: postgres://$(DATABASE_USER):$(DATABASE_PASSWORD)@$(DATABASE_HOST):5432/myapp_production?sslmode=require
503
+ ```
504
+
505
+ With this form, `config/database.yml` can read `DATABASE_URL` directly
506
+ (`url: <%= ENV.fetch("DATABASE_URL") %>`) — no host/user/password plumbing required. cpflow template
507
+ variables such as `{{APP_NAME}}` also expand inside the value if you want the database name to track the app
508
+ name.
509
+
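One detail every URL form above depends on: a user name or password embedded in a `postgres://` URL must be percent-encoded if it contains reserved characters such as `@`, `:`, or `/`, otherwise the URL parses incorrectly. A minimal sketch of an encoder for ASCII credentials follows; the function name is hypothetical, and the encoded value is what you would store in the secret for Option A or Option C.

```sh
# Hypothetical helper: percent-encode a string for use inside a connection URL.
# Handles ASCII input only; unreserved characters (letters, digits, - . _ ~) pass through.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [A-Za-z0-9_.~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;   # "'c" makes printf emit the character code
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}
```

Option B is unaffected, since `database.yml` passes the password as a separate field rather than inside a URL.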
510
+ Whichever option you choose:
511
+
512
+ - The host in `DATABASE_URL` is the **`FQDN`** from step 4 (e.g.
513
+ `myapp-prod.cluster-xxxxx.us-east-2.rds.amazonaws.com`), so the TLS certificate matches. The short
514
+ `networkResources` `name` (`db-primary` / `db-readers`) is the resource's identifier, not the connection
515
+ host for a TLS database.
516
+ - If your workload can't reach the endpoint, diagnose with
517
+ `cpln workload exec <workload> -- nslookup <endpoint>`. The Verification section below covers the
518
+ most common resolution and connectivity failures.
519
+
520
+ ## Step 6 — Bind the identity to the workload
521
+
522
+ The wormhole only takes effect once the workload's `spec.identityLink` points at the identity from step 4.
523
+ cpflow's stock templates already do this — see `{{APP_IDENTITY_LINK}}` in `templates/rails.yml`,
524
+ `templates/sidekiq.yml`, and `templates/daily-task.yml`:
525
+
+ ```yaml
+ # templates/rails.yml (excerpt)
+ spec:
+   # Identity is used for binding workload to secrets and, after Step 4, also to network resources.
+   identityLink: {{APP_IDENTITY_LINK}}
+ ```
+
+ If you're using the cpflow templates as-is, no change is needed here; re-applying the templates
+ (`cpflow apply-template rails -a my-app-production` etc.) is enough. If you wrote a custom workload without
+ that line, add it now. The most reliable way to set `identityLink` on an existing workload is the same
+ export-edit-apply loop used in Step 4: `cpln workload get <name> --gvc <gvc> -o yaml-slim > workload.yaml`,
+ edit `spec.identityLink`, then `cpln apply --file workload.yaml`. (`cpln workload update --help` may also
+ list a `--set` flag for this; verify against your installed CLI version before relying on it.)
+
+ Re-apply the workload template:
+
+ ```sh
+ cpflow apply-template rails -a my-app-production
+ ```
+
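When using the export-edit-apply loop, it can help to confirm that the exported manifest actually carries the link before and after editing. A minimal Ruby sketch, assuming the export is plain YAML with the field at `spec.identityLink` as described above; `identity_link` is a hypothetical helper, and the link value in the sample is made up for illustration.

```ruby
require "yaml"

# Returns spec.identityLink from an exported workload manifest,
# or nil when the field is missing or blank.
def identity_link(manifest_yaml)
  doc = YAML.safe_load(manifest_yaml)
  return nil unless doc.is_a?(Hash)

  link = doc.dig("spec", "identityLink").to_s.strip
  link.empty? ? nil : link
end

# Hypothetical manifest excerpt; the identity path is illustrative only.
sample = <<~MANIFEST
  kind: workload
  name: rails
  spec:
    identityLink: //gvc/my-app-production/identity/my-app-production-identity
MANIFEST

puts identity_link(sample) || "spec.identityLink is not set"
```

Against a real export, the call would be `identity_link(File.read("workload.yaml"))`.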
+ ## Verification
+
+ Open an interactive shell in a one-off copy of the rails workload (with the identity, env, and image of
+ the live workload), then run `psql` from inside it. `cpflow run` flattens argv with `join(" ")` and does
+ no shell escaping (see
+ [`Command::Base.args_join`](https://github.com/shakacode/control-plane-flow/blob/main/lib/command/base.rb)),
+ so quoted multi-arg commands like
+ `-- bash -c 'psql "$DATABASE_URL" -c "select 1"'` won't survive round-tripping. Running `psql` from
+ inside the interactive shell sidesteps the issue entirely.
+
+ ```sh
+ # Step 1: open a one-off interactive shell inside the rails workload.
+ cpflow run -a my-app-production -w rails
+
+ # Step 2: from inside the workload shell, confirm DATABASE_URL is set and reachable.
+ # Note: echo prints the password in clear text; skip it in shared or recorded sessions.
+ echo "$DATABASE_URL"
+ psql "$DATABASE_URL" -c 'select now(), version();'
+ ```
+
+ Expected: a current timestamp and the RDS Postgres version. Common failures:
+
+ - `could not translate host name "<rds-endpoint>" to address` usually has one of these causes:
+   - The identity isn't bound to the workload: check `spec.identityLink` on the workload, re-apply the
+     template, and re-run.
+   - The workload was running **before** you added `networkResources` to the identity: existing replicas
+     don't pick up identity changes automatically. Recycle the workload with
+     `cpflow ps:restart -a my-app-production` (or `cpflow ps:restart -a my-app-production -w <workload>`
+     for one workload; lower-level equivalent:
+     `cpln workload force-redeployment <workload> --gvc <gvc> --org <org>`) so new replicas start with the
+     updated network resource map.
+   - The agent can't resolve the endpoint: confirm the `FQDN` in `networkResources` exactly matches the RDS
+     endpoint in `DATABASE_URL`, and that `resolverIP` (if set) points at a DNS server the agent can reach
+     (the VPC `.2` resolver), with the agent SG allowing egress on port `53`.
+ - `connection refused` or timeouts to the resource: the agent → RDS path is wrong. Verify the agent has a
+   green heartbeat in the CPLN UI, the agent SG has egress to the RDS SG on port `5432`, and the RDS SG
+   accepts ingress from the agent SG on that port.
+ - `FATAL: no pg_hba.conf entry … SSL off` or `SSL connection is required`: `sslmode=require` is missing
+   from the connection string or `database.yml`.
+
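The first two failure classes above can be told apart without `psql` by probing name resolution and the TCP connect separately. A minimal Ruby sketch using only the standard library, meant to be pasted into `irb` or a Rails console inside the workload; `probe_tcp` and `PROBE_HINTS` are hypothetical names, and the hints simply restate the list above.

```ruby
require "socket"

# Where to look next for each probe outcome (restating the failure list).
PROBE_HINTS = {
  ok: "TCP path works; if psql still fails, check sslmode and credentials",
  dns_failure: "check spec.identityLink, restart the workload, verify the networkResources FQDN",
  refused: "check the agent heartbeat and the agent/RDS security groups on port 5432",
  timeout: "check the agent heartbeat and the agent/RDS security groups on port 5432"
}.freeze

# Classifies reachability of host:port as :ok, :dns_failure, :refused, or :timeout.
def probe_tcp(host, port, timeout: 5)
  # Raises SocketError when the name does not resolve.
  Addrinfo.getaddrinfo(host, port, nil, :STREAM)
  # Opens and immediately closes a TCP connection.
  Socket.tcp(host, port, connect_timeout: timeout) { :ok }
rescue SocketError
  :dns_failure
rescue Errno::ECONNREFUSED
  :refused
rescue Errno::ETIMEDOUT, Errno::EHOSTUNREACH, Errno::ENETUNREACH
  :timeout
end
```

For example, `PROBE_HINTS.fetch(probe_tcp(ENV.fetch("DATABASE_HOST"), 5432))` prints the next thing to check, assuming the `DATABASE_HOST` variable from Step 5 is set.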
+ ## Operations
+
+ ### High availability
+
+ - Run **at least 2 agents** in the ASG in production. Agent upgrades are rolling, and a single agent is a
+   single point of failure for *all* private connectivity from the GVC.
+ - Spread the ASG across at least 2 availability zones.
+
+ ### Aurora failover
+
+ - With **IPs** in `networkResources`: failover swaps which underlying instance the writer endpoint resolves
+   to. The IPs you declared are routing targets, so they may now forward to the wrong instance. For Aurora
+   specifically, prefer the **FQDN** form so the agent re-resolves on each connect.
+ - With **FQDN**: the agent re-resolves the cluster endpoint. Aurora endpoint DNS records are short-lived (AWS
+   advises caching them for less than 30 seconds), but the full failover window (detection → replica
+   promotion → DNS update → propagation to your VPC resolver) is typically **60–120 seconds** in practice.
+   Size connection-pool timeouts and circuit breakers for the upper end of that range, not just the DNS TTL.
+ - Either way, the Rails connection pool will see a burst of errors during failover. The app should be
+   configured to reconnect cleanly on `PG::ConnectionBad` and similar errors. Test failover behavior
+   before assuming the app recovers without intervention.
+
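One way to size retries for the upper end of that window is a capped exponential backoff with a total wait budget. The sketch below is a generic illustration, not cpflow or Rails behavior: `with_reconnect` and its defaults (120-second budget, 0.5-second base, 10-second cap) are assumptions to tune, and the error classes to retry are passed in by the caller.

```ruby
# Retries the block on the given error classes, sleeping with capped
# exponential backoff, until the total wait budget (seconds) is spent.
# The last error is re-raised once the next delay would exceed the budget.
def with_reconnect(retry_on:, budget: 120.0, base: 0.5, cap: 10.0, sleeper: Kernel.method(:sleep))
  waited = 0.0
  attempt = 0
  begin
    yield
  rescue *retry_on
    delay = [base * (2**attempt), cap].min
    raise if waited + delay > budget

    sleeper.call(delay)
    waited += delay
    attempt += 1
    retry
  end
end

# Illustrative call inside a Rails app (requires the pg gem and ActiveRecord):
#   with_reconnect(retry_on: [PG::ConnectionBad]) do
#     ActiveRecord::Base.connection_pool.with_connection { |conn| conn.execute("select 1") }
#   end
```

The `sleeper` parameter exists only so the backoff schedule can be exercised in tests without real waiting.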
+ ### Agent sizing and upgrades
+
+ - For production, start with `t3.medium` or larger so the agent has at least 2 vCPU / 4 GiB, matching the
+   CPLN recommendation from Step 2. Capacity-plan on the instance's baseline network bandwidth, not burst
+   bandwidth; `t3.small` is fine for testing, but its lower baseline makes it the wrong default for production
+   database connectivity. Watch agent CPU and active connection count and scale up the instance type if CPU
+   stays above ~70% or connection-queue latency rises.
+ - CPLN occasionally publishes new agent versions. The ASG + bootstrap script combination handles upgrades
+   by re-rolling instances; let it do that during low-traffic windows.
+
+ ### Cost
+
+ - Two `t3.medium` instances 24/7 in `us-east-2`: roughly $60/month on demand, before data transfer; check
+   current EC2 pricing for your region and instance family.
+ - Cross-AZ data transfer between agent and RDS is **billed per-GB in each direction** even within one region,
+   so steady query traffic, not just bulk loads, accumulates a charge whenever the agent and the RDS writer
+   sit in different AZs. To minimize it, place an agent in the **same AZ as the RDS writer**. This trades off
+   against the multi-AZ HA recommended above: for cost-sensitive setups, co-locate with the writer's AZ; for
+   HA, spread across AZs and accept the cross-AZ transfer cost. Either way, keep the agent in the **same
+   region** as RDS.
+
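The arithmetic behind these two bullets can be sketched as follows. Every rate here is an assumption supplied by the caller, not a quoted AWS price: the example uses 0.0416 USD/hour per instance and 0.01 USD/GB per direction purely as placeholders, and `monthly_agent_cost` is a hypothetical helper.

```ruby
HOURS_PER_MONTH = 730 # average number of hours in a month

# Estimated monthly cost in USD for the agent fleet plus cross-AZ traffic.
# gb_to_db and gb_from_db are the monthly volumes in each direction.
def monthly_agent_cost(instances:, hourly_rate:, gb_to_db: 0, gb_from_db: 0, per_gb_rate: 0)
  compute = instances * hourly_rate * HOURS_PER_MONTH
  transfer = (gb_to_db + gb_from_db) * per_gb_rate
  {
    compute: compute.round(2),
    transfer: transfer.round(2),
    total: (compute + transfer).round(2)
  }
end

# Example with ASSUMED rates and traffic volumes; substitute current pricing.
estimate = monthly_agent_cost(
  instances: 2,
  hourly_rate: 0.0416,
  gb_to_db: 50,
  gb_from_db: 450,
  per_gb_rate: 0.01
)
puts estimate
```

Setting both traffic volumes to zero models the co-located case, where the agent sits in the writer's AZ.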
+ ## Alternatives — when not to use an Agent
+
+ The Agent approach is universal: it works for any TCP target in any private network. Two CPLN-native
+ alternatives are worth knowing about:
+
+ - **AWS PrivateLink (`awsPrivateLink` on identity).** If your AWS team is willing to publish the database
+   behind a Network Load Balancer + VPC Endpoint Service, you can skip the Agent entirely and have CPLN
+   consume the endpoint service natively. Fewer moving parts to operate, but more AWS-side setup
+   (NLB + target group + endpoint service + permissions). Recommended when you're already standardized on
+   PrivateLink for other services.
+ - **Public RDS with a tight security-group allowlist.** Acceptable for dev/test review apps only. CPLN
+   workload egress IPs are not stable in the long term, so this approach is brittle for production.
+
+ ## See also
+
+ - [Migrating Postgres database from Heroku infrastructure](./postgres.md): covers Bucardo migration, which
+   still applies once the target RDS is reachable via the agent.
+ - [Secrets and env values](./secrets-and-env-values.md): how `cpln://secret/...` references resolve.
+ - [README: Database](../README.md#database): high-level options including dev/test public RDS.
+ - [Control Plane: Agent reference](https://shakadocs.controlplane.com/reference/agent).
+ - [Control Plane: Create an Identity guide](https://shakadocs.controlplane.com/guides/create-identity).
+ - [Control Plane: Setup Agent on AWS](https://shakadocs.controlplane.com/guides/setup-agent).
+ - [AWS: Scenarios for accessing a DB instance in a VPC](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html).