direxio-deployer 0.1.0
- package/AGENTS.md +92 -0
- package/LICENSE +21 -0
- package/README.md +221 -0
- package/README_zh.md +218 -0
- package/SKILL.md +722 -0
- package/agents/README.md +25 -0
- package/agents/openai.yaml +12 -0
- package/bin/direxio-deployer.mjs +375 -0
- package/package.json +28 -0
- package/references/agent-targets.md +128 -0
- package/references/architecture.md +44 -0
- package/references/bug-history.md +78 -0
- package/references/deployment-lessons.md +218 -0
- package/references/deployment-optimization-audit.md +317 -0
- package/references/deployment-workflow.md +341 -0
- package/references/iam-policy.json +52 -0
- package/references/runtime-wiring.md +209 -0
- package/references/state-machine.md +46 -0
- package/references/token-refresh.md +81 -0
- package/references/tooling.md +106 -0
- package/references/troubleshooting.md +26 -0
- package/references/user-journey.md +75 -0
- package/references/verification-recovery.md +84 -0
- package/references/voip-turn-runbook.md +154 -0
- package/references/windows-deployment-notes.md +119 -0
- package/scripts/aws-credentials.sh +195 -0
- package/scripts/cloud-init/Caddyfile +48 -0
- package/scripts/cloud-init/docker-compose.yml +125 -0
- package/scripts/cloud-init/init-tokens.sh +238 -0
- package/scripts/cloud-init/user-data.yaml +40 -0
- package/scripts/destroy.ps1 +77 -0
- package/scripts/destroy.sh +589 -0
- package/scripts/lib/aws.sh +73 -0
- package/scripts/lib/domain.sh +175 -0
- package/scripts/lib/operation_report.sh +240 -0
- package/scripts/lib/ops.sh +230 -0
- package/scripts/lib/paths.sh +35 -0
- package/scripts/lib/state.sh +137 -0
- package/scripts/mcp-tools-list.mjs +95 -0
- package/scripts/orchestrate.ps1 +112 -0
- package/scripts/orchestrate.sh +1126 -0
- package/scripts/phases/s0_prereq_aws.sh +39 -0
- package/scripts/phases/s1_preflight.sh +72 -0
- package/scripts/phases/s2_domain.sh +103 -0
- package/scripts/phases/s3_provision.sh +421 -0
- package/scripts/phases/s4_bootstrap_stack.sh +38 -0
- package/scripts/phases/s5_init_tokens.sh +118 -0
- package/scripts/phases/s6_wire_local.sh +1435 -0
- package/scripts/phases/s7_verify_e2e.sh +136 -0
- package/scripts/pricing-estimate.sh +256 -0
- package/scripts/render/render-userdata.sh +86 -0
- package/scripts/reset-app-data.sh +40 -0
- package/scripts/update.sh +30 -0
- package/tests/aws_credentials_test.sh +139 -0
- package/tests/connect_daemon_runtime_check_test.sh +120 -0
- package/tests/default_paths_test.sh +58 -0
- package/tests/destroy_local_bridge_test.sh +154 -0
- package/tests/destroy_root_identity_test.sh +91 -0
- package/tests/destroy_route53_zone_test.sh +80 -0
- package/tests/domain_authoritative_dns_test.sh +49 -0
- package/tests/mcp_doctor_runtime_check_test.sh +86 -0
- package/tests/mcp_smoke_runtime_check_test.sh +121 -0
- package/tests/mcp_tools_runtime_check_test.sh +123 -0
- package/tests/npm_skill_distribution_test.sh +95 -0
- package/tests/operation_report_test.sh +258 -0
- package/tests/orchestrate_status_recovery_test.sh +91 -0
- package/tests/phase_timeout_test.sh +88 -0
- package/tests/pricing_estimate_test.sh +159 -0
- package/tests/render_userdata_remote_nodes_test.sh +40 -0
- package/tests/root_volume_tracking_test.sh +41 -0
- package/tests/route53_overwrite_guard_test.sh +86 -0
- package/tests/route53_zone_auto_create_test.sh +66 -0
- package/tests/runtime_summary_check_test.sh +203 -0
- package/tests/s6_wire_local_test.sh +405 -0
- package/tests/skill_structure_test.sh +298 -0
- package/tests/update_reset_ops_test.sh +230 -0
- package/tests/user_confirmation_gates_test.sh +152 -0
|
@@ -0,0 +1,78 @@
|
|
|
1
|
+
# Bug History (Grouped by Fix PR)
|
|
2
|
+
|
|
3
|
+
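Every pitfall actually hit along the deployment path. **All fixes are already in the deployment files under `scripts/`**, so new deployments will not hit them again;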
|
|
4
|
+
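they are listed here to (1) explain where each design decision came from, and (2) make it quick to find the rollback point if someone breaks one of them.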
|
|
5
|
+
|
|
6
|
+
## p2p-matrix-as repository
|
|
7
|
+
|
|
8
|
+
### AS PR #4 - multi-arch image
|
|
9
|
+
- **Symptom**: on ARM EC2 instances (t4g family), the container fails with `exec format error` after `docker pull`.
|
|
10
|
+
- **Root cause**: the image was built for amd64 only.
|
|
11
|
+
- **Fix**: CI uses buildx to produce a multi-arch `amd64+arm64` image. `xuyanzu01/p2p-im-as:latest` is already multi-arch.
|
|
12
|
+
|
|
13
|
+
### AS PR #5 - containerization experience
|
|
14
|
+
- **Volume permissions**: a named volume defaults to root ownership with mode 700, so after AS drops privileges to asd (UID 10001) it cannot open sqlite → `SQLITE_CANTOPEN`.
|
|
15
|
+
Fix: the entrypoint runs chown on `/opt/p2p` and `/data`, then `su-exec asd`.
|
|
16
|
+
- **No health check**: compose could not tell when AS was ready. Fix: the Dockerfile adds a `HEALTHCHECK` that probes `:9090/healthz`.
|
|
17
|
+
- **Hard-coded registration path**: registration.yaml always landed in the cwd and was not configurable. Fix: add a `RegistrationPath` config option.
|
|
18
|
+
|
|
19
|
+
### AS PR #5 leftover - GHA cache (⚠️ not merged to main)
|
|
20
|
+
- **Symptom**: during long arm64 builds the Azure cache SAS token expires → `403 AuthenticationFailed`, and CI fails.
|
|
21
|
+
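- **Impact**: **only affects rebuilding the AS image**; deployment uses the published public image by default, so this is never hit unless you build the image yourself.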
|
|
22
|
+
- **Status**: a cache-disable fix was committed but did not make it into the PR #5 squash-merge. **A separate follow-up PR is still needed.**
|
|
23
|
+
|
|
24
|
+
## Deployment scripts
|
|
25
|
+
|
|
26
|
+
### ops PR #2 - one-click deployment core
|
|
27
|
+
Hit in this order, all fixed:
|
|
28
|
+
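1. **The docker compose plugin is not available on Amazon Linux** (and early versions downloaded the wrong architecture, installing x86 on ARM) → install via `get.docker.com`, which bundles compose v2.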
|
|
29
|
+
2. **asd.yaml with chmod 600** was unreadable by asd inside the container → cloud-init write_files keeps 0600 but sets the correct owner.
|
|
30
|
+
3. **Dendrite SQLite broken** (element-hq/dendrite#3435, `near "SEQUENCE" syntax error`) → add `postgres:16-alpine` and connect Dendrite to PG.
|
|
31
|
+
4. **Caddyfile `email {$ACME_EMAIL}` was empty** → `wrong argument count`. Fix: remove the default email block; Caddy's automatic certificate issuance does not need an email.
|
|
32
|
+
5. **Dendrite started before AS wrote the registration** → race condition. Fix: compose `depends_on: asd service_healthy` + `postgres service_healthy`.
|
|
33
|
+
6. **The old init-tokens.sh hit a Caddy 308 redirect** → HTTP initialization failed. The current flow no longer calls that endpoint and instead waits for AS to write `/opt/p2p/bootstrap.json`.
|
|
34
|
+
7. **init-tokens.sh had CRLF line endings** → `pipefail: invalid option name`. Fix: save the file with LF (`sed -i 's/\r$//'`).
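
The CRLF failure in item 7 can be caught before a script ever reaches the host. The following is only a sketch under stated assumptions: `has_crlf` and `strip_crlf` are hypothetical helper names that do not come from the repository, and `tr` is used because not every `sed` interprets `\r` in a pattern.

```bash
# Hypothetical helpers (not part of the repository).

# Succeeds if the file named by $1 contains at least one carriage return.
has_crlf() {
  local count
  count="$(tr -cd '\r' < "$1" | wc -c)"
  [ "$((count))" -gt 0 ]
}

# Copies stdin to stdout with every carriage return removed
# (all CR bytes, not only those at line ends).
strip_crlf() {
  tr -d '\r'
}

# Example:
#   has_crlf init-tokens.sh && strip_crlf < init-tokens.sh > init-tokens.lf.sh
```

Running such a check before rendering user-data would surface the problem locally instead of as a `pipefail` error inside cloud-init.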
|
|
35
|
+
|
|
36
|
+
### ops PR #3 - owner.json discovery
|
|
37
|
+
- **Symptom**: the client probes `/.well-known/portal/owner.json`, gets 404 → falsely reports "Portal not deployed".
|
|
38
|
+
- **Root cause**: AS has no HTTP handler for owner.json; it writes the file to the wellknown volume, and Caddy has to serve it statically.
|
|
39
|
+
- **Fix**:
|
|
40
|
+
- asd.yaml: `wellknown.output_dir: /opt/p2p/wellknown` (a subdirectory of the shared host directory `/opt/p2p`).
|
|
41
|
+
- Caddyfile:`handle_path /.well-known/portal/*` → `root * /srv/wellknown/wellknown` → `file_server`。
|
|
42
|
+
- compose: caddy mounts `/opt/p2p:/srv/wellknown:ro` to read the same copy.
|
|
43
|
+
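- **Follow-up finding**: when the Web client reads owner.json from a local dev origin, the response is HTTP 200 but, without CORS headers, the browser blocks it, which shows up as `net::ERR_FAILED 200 (OK)`. Fix: the portal well-known handler in Caddy adds `Access-Control-Allow-Origin *`, and S7 also validates the response header.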
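
The header validation mentioned for S7 could look roughly like the sketch below. It is an illustration only: `has_wildcard_cors` is a hypothetical name, the real S7 check is not shown in this note, and the domain in the usage comment is a placeholder.

```bash
# Hypothetical helper: reads HTTP response headers on stdin and succeeds only
# if one of them is "Access-Control-Allow-Origin: *" (case-insensitive name).
has_wildcard_cors() {
  tr -d '\r' | grep -qi '^access-control-allow-origin:[[:space:]]*\*[[:space:]]*$'
}

# Example against a deployed node (placeholder domain):
#   curl -sI "https://example.com/.well-known/portal/owner.json" | has_wildcard_cors \
#     || echo "owner.json is served without the CORS header" >&2
```

Checking the header rather than only the status code matters here because the failing case still returns HTTP 200.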
|
|
44
|
+
|
|
45
|
+
### ops VoIP - calls fail to connect = no TURN relay
|
|
46
|
+
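- **Symptom**: voice/video call signaling (`m.call.*`) works, but WebRTC cannot connect; `/_matrix/client/v3/voip/turnServer` returns `{}`, and ICE has only host/srflx candidates with no relay → calls across NAT/firewalls always fail. **A pure backend gap, not a frontend issue.**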
|
|
47
|
+
- **Fix** (first version uses plaintext turn:3478, no TLS):
|
|
48
|
+
- compose adds `coturn` (`network_mode: host` + `--use-auth-secret` + `--static-auth-secret=${TURN_SECRET}` + `--external-ip=${PUBLIC_IP}` + `--min/max-port 49160-49200`).
|
|
49
|
+
- The S3 security group opens `3478 udp/tcp` + `49160-49200/udp`.
|
|
50
|
+
- user-data reads `PUBLIC_IP` from IMDS public-ipv4, generates a random `TURN_SECRET`, and writes both into .env.
|
|
51
|
+
- The dendrite entrypoint appends `client_api.turn` (`turn_shared_secret` equal to TURN_SECRET + `turn_uris` turn:DOMAIN:3478 udp/tcp) → the homeserver issues short-lived credentials dynamically.
|
|
52
|
+
- S7 adds a check that turnServer is non-empty.
|
|
53
|
+
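- **Do not delete on redeploy**: coturn, the ports, and the turn section are now baked into the skill; do not remove them when simplifying the deployment later, or a teardown and rebuild will lose calling again.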
|
|
54
|
+
- **Hard to diagnose**: Dendrite's `turn_shared_secret` must equal coturn's `--static-auth-secret` (both come from `TURN_SECRET` in .env); if they differ → turnServer returns credentials but the relay rejects them. See `voip-turn-runbook.md`.
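
Why a secret mismatch produces "credentials returned but relay rejects" becomes clearer from how shared-secret TURN credentials are commonly derived: the password is an HMAC of the username keyed by the secret, so a relay holding a different secret computes a different password. The sketch below assumes the usual scheme for coturn's `--use-auth-secret` mode (username `<expiry-unix-time>:<user>`, password `base64(HMAC-SHA1(secret, username))`); the function names are hypothetical and `openssl` must be installed.

```bash
# Hypothetical helpers illustrating shared-secret TURN credentials.

# turn_username TTL_SECONDS USER -> "<expiry-unix-time>:<user>"
turn_username() {
  printf '%s:%s\n' "$(( $(date +%s) + $1 ))" "$2"
}

# turn_password SECRET USERNAME -> base64(HMAC-SHA1(SECRET, USERNAME))
turn_password() {
  printf '%s' "$2" | openssl dgst -sha1 -hmac "$1" -binary | base64
}

# Example: what the homeserver and coturn must agree on.
#   user="$(turn_username 3600 '@alice:example.com')"
#   turn_password "$TURN_SECRET" "$user"
```

Computing the password with the secret from each side and comparing the two results is a quick way to confirm that both were rendered from the same `TURN_SECRET`.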
|
|
55
|
+
|
|
56
|
+
## Instance type / memory
|
|
57
|
+
- **Default instance type is t3.small (2GB)**: postgres + dendrite + asd + caddy, four containers on one host, run stably on 2GB in practice without relying on swap.
|
|
58
|
+
To save more you can switch to t3.micro (1GB), but 1GB with 4 containers plus the first image pull tends to OOM, so before cloud-init installs Docker you need to
|
|
59
|
+
configure 2GB of swap yourself (`/swapfile`, `vm.swappiness=10`) as a safety net. It is off by default to avoid needless complexity.
|
|
60
|
+
- **Architecture is fixed to x86/amd64**: the AMI in state-machine phase S3 is pinned to `.../amd64/...` and the default instance type is t3.small, **no ARM**,
|
|
61
|
+
which avoids the AS PR #4 class of "single-arch image hits `exec format error` on ARM" problems (the current image is multi-arch, but x86 is less hassle).
|
|
62
|
+
|
|
63
|
+
## Pitfalls from a real fresh-account + macOS end-to-end run (2026-06-01, all fixed)
|
|
64
|
+
These come from one real run of the skill on a clean macOS machine with a brand-new AWS account, an end-to-end validation that had not been done before:
|
|
65
|
+
- **macOS default bash 3.2 does not support `declare -A`**: `orchestrate.sh` crashed on startup. Fix: the phase → script mapping now uses `case` (`phase_file()`) instead of an associative array.
|
|
66
|
+
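- **S0 only recognized AK/SK and reported "no credentials" even when `AWS_PROFILE` was set locally**: Fix: call `aws sts get-caller-identity` directly to decide whether credentials are valid (supports profile/AK/SK/role), and only treat the phase as waiting for the user when sts fails and neither AK/SK nor AWS_PROFILE is present.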
|
|
67
|
+
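- **user-data exceeded the 16384-byte hard limit**: inlining three deployment files as separate base64 blobs exceeds the limit, and AWS reports `User data is limited to 16384 bytes`. Fix: pack them into a single `bundle.tar.gz` inlined once, and unpack it to `/opt/p2p` as the first runcmd step at boot. Measured size dropped to ~11KB.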
|
|
68
|
+
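- **AS could not read `asd.yaml` (`permission denied`)**: overlay-mounting the read-only config into `/opt/p2p` conflicted with the cwd of the unprivileged user. Fix: mount `asd.yaml` at `/etc/p2p/asd.yaml` and point `--config` at the new path; `/opt/p2p` is only the shared runtime output directory.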
|
|
69
|
+
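- **Let's Encrypt will not issue for the EC2 default `*.compute.amazonaws.com` name** (`rejectedIdentifier`). A historical fix added a temporary `<public-IP>.sslip.io` domain to the acceptance/trial flow to work around this. The current production deployment interface has removed that path: without a final domain, S2 blocks immediately and no EC2 instance is created; the Matrix `server_name` must be a real domain that the user holds long-term and can manage DNS for.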
|
|
70
|
+
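- **The host could not read the credentials file**: AS writes `/opt/p2p/bootstrap.json` inside the container, and if compose does not bind-mount `/opt/p2p`, local S5 cannot read it from EC2. Fix: `asd` mounts `/opt/p2p:/opt/p2p`, and S5 uses `ssh ... sudo cat /opt/p2p/bootstrap.json`.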
|
|
71
|
+
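- **The old HTTP initialization raced ahead of Dendrite**: early scripts actively sent an initialization request; before Dendrite was ready it returned 500, yet the script could still misjudge it as success. The current flow has removed the active request, and AS writes `bootstrap.json` on its own after a successful start; `init-tokens.sh` only waits for the file to be complete, does not write `.deploy-done` on failure, and the state machine reports the failure faithfully.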
|
|
72
|
+
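- **(Client side, not in this repository)** `/_as/auth` returned the container-internal `http://dendrite:8008`, which the App cannot reach from the user's machine → when the frontend sees an internal host such as `dendrite`, it falls back to the public Portal address the user entered.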
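
The bash 3.2 fix in the first item of this list replaces an associative array with a `case` lookup. The sketch below shows the general shape only; it is a hypothetical reconstruction, not the repository's `phase_file()`. `S4_BOOTSTRAP_STACK` and `S5_INIT_TOKENS` appear elsewhere in these notes, while the remaining phase names and the relative `phases/` prefix are guesses derived from the file names under `scripts/phases/`.

```bash
# Hypothetical bash 3.2-compatible phase lookup (no `declare -A` needed).
# Phase names other than S4_BOOTSTRAP_STACK and S5_INIT_TOKENS are assumed.
phase_file() {
  case "$1" in
    S0_PREREQ_AWS)      echo "phases/s0_prereq_aws.sh" ;;
    S1_PREFLIGHT)       echo "phases/s1_preflight.sh" ;;
    S2_DOMAIN)          echo "phases/s2_domain.sh" ;;
    S3_PROVISION)       echo "phases/s3_provision.sh" ;;
    S4_BOOTSTRAP_STACK) echo "phases/s4_bootstrap_stack.sh" ;;
    S5_INIT_TOKENS)     echo "phases/s5_init_tokens.sh" ;;
    S6_WIRE_LOCAL)      echo "phases/s6_wire_local.sh" ;;
    S7_VERIFY_E2E)      echo "phases/s7_verify_e2e.sh" ;;
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# Example:
#   script="$(phase_file S4_BOOTSTRAP_STACK)" && bash "scripts/$script"
```

A `case` statement works in every POSIX shell, which is why it sidesteps the macOS default bash limitation.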
|
|
73
|
+
|
|
74
|
+
## Instance type / memory (continued)
|
|
75
|
+
|
|
76
|
+
## Local machine / environment (not in the repository, operational experience)
|
|
77
|
+
- **A local proxy intercepts TLS to AWS/lark**: the shared AWS preamble already does `export NO_PROXY="*"` and unsets the proxy variables. lark-cli uses `LARK_CLI_NO_PROXY=1`.
|
|
78
|
+
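- **Generating an SSH key in PowerShell with `-N '""'`** treats the literal `"` characters as the passphrase → login with the encrypted private key fails. Correct form: `-N ''` (empty single quotes).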
|
|
@@ -0,0 +1,218 @@
|
|
|
1
|
+
# Deployment Lessons From im2.jkmf.top
|
|
2
|
+
|
|
3
|
+
This note captures operational lessons from the production deployment of
|
|
4
|
+
`im2.jkmf.top` on AWS from a Windows workstation. Keep it short and practical:
|
|
5
|
+
symptom, cause, and what the next operator or agent should do.
|
|
6
|
+
|
|
7
|
+
## AS Bootstrap Initialization
|
|
8
|
+
|
|
9
|
+
Symptom:
|
|
10
|
+
|
|
11
|
+
```text
|
|
12
|
+
S5_INIT_TOKENS failed: read bootstrap.json timed out
|
|
13
|
+
/opt/p2p/bootstrap.json was missing or incomplete
|
|
14
|
+
```
|
|
15
|
+
|
|
16
|
+
Cause:
|
|
17
|
+
|
|
18
|
+
Current `p2p-matrix-as` builds initialize on service startup and write
|
|
19
|
+
`/opt/p2p/bootstrap.json` with the login `password`, `agent_token`, and owner
|
|
20
|
+
metadata. Calling the old bootstrap HTTP endpoint or scraping logs is no longer
|
|
21
|
+
part of the deploy path.
|
|
22
|
+
|
|
23
|
+
Fix now in ops:
|
|
24
|
+
|
|
25
|
+
- Cloud-side `scripts/cloud-init/init-tokens.sh` waits for AS `/healthz` and
|
|
26
|
+
the credentials file.
|
|
27
|
+
- `docker-compose.yml` bind-mounts host `/opt/p2p` into the AS container so the
|
|
28
|
+
file is readable from the EC2 host.
|
|
29
|
+
- Local S5 reads the file with `ssh ... sudo cat /opt/p2p/bootstrap.json`,
|
|
30
|
+
normalizes it into local `outputs.json`, and stores `password`/`agent_token`
|
|
31
|
+
in state.
|
|
32
|
+
|
|
33
|
+
## Windows Runtime Pitfalls
|
|
34
|
+
|
|
35
|
+
Do not hard-code a Git Bash path. Different machines install Git/MSYS/Cygwin in
|
|
36
|
+
different locations, and WSL may or may not be configured.
|
|
37
|
+
|
|
38
|
+
Use a POSIX shell that actually runs the deployment scripts:
|
|
39
|
+
|
|
40
|
+
```powershell
|
|
41
|
+
Get-Command bash.exe -All
|
|
42
|
+
bash -lc 'echo ok; command -v aws; command -v jq; command -v ssh; command -v scp; command -v curl'
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
If `bash` prints the Windows Subsystem for Linux installation prompt or exits
|
|
46
|
+
before running `echo ok`, it is only the Windows WSL launcher and cannot run ops
|
|
47
|
+
until a WSL distro is installed. Use another POSIX shell such as Git Bash, MSYS2,
|
|
48
|
+
Cygwin, or a working WSL distro.
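
Once a working POSIX shell is found, the same probe can be run from inside it. The sketch below is an assumption-labelled illustration: `missing_tools` is a hypothetical helper, and the tool list simply mirrors the commands probed in the PowerShell snippet above.

```bash
# Hypothetical helper: prints each named command that is not on PATH,
# one per line; prints nothing when every command is available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example:
#   missing="$(missing_tools aws jq ssh scp curl)"
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```

Reporting all missing tools at once avoids fixing them one failed run at a time.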
|
|
49
|
+
|
|
50
|
+
The orchestrator now prepends workspace-local `.tools/bin` when present. This
|
|
51
|
+
directory is an optional local tool cache that may be downloaded by the operator
|
|
52
|
+
or system; it is not assumed to come from the original repo or from the skill.
|
|
53
|
+
When `.tools/bin/jq.exe` exists, compatible Windows POSIX shells can discover it
|
|
54
|
+
without manual PATH surgery.
|
|
55
|
+
|
|
56
|
+
Prefer the `ssh`/`scp` that belongs to the same POSIX environment used for
|
|
57
|
+
`bash`. Windows OpenSSH can reject EC2 private keys because inherited ACLs make
|
|
58
|
+
the `.pem` look too open, even when Git/MSYS OpenSSH accepts it. If using Windows
|
|
59
|
+
OpenSSH directly, fix the key ACL instead of disabling SSH checks.
|
|
60
|
+
|
|
61
|
+
## Local Polling Can Hang While The Server Is Healthy
|
|
62
|
+
|
|
63
|
+
Symptom:
|
|
64
|
+
|
|
65
|
+
- `state.json` stays at `S4_BOOTSTRAP_STACK=polling`.
|
|
66
|
+
- `https://<domain>/healthz` returns `{"status":"ok"}` from another shell.
|
|
67
|
+
- A leftover local `curl -skf https://<domain>/healthz` or SSH child process is
|
|
68
|
+
still running after the agent/operator interrupted the deployment turn.
|
|
69
|
+
|
|
70
|
+
Cause:
|
|
71
|
+
|
|
72
|
+
The cloud side may have completed successfully, but a local network call can
|
|
73
|
+
hang long enough that the state machine never records the successful phase. This
|
|
74
|
+
is especially confusing on Windows when proxy settings, direct TCP reachability,
|
|
75
|
+
or interrupted terminal sessions leave child processes behind.
|
|
76
|
+
|
|
77
|
+
Fix now in ops:
|
|
78
|
+
|
|
79
|
+
- S4 health checks use per-attempt curl timeouts:
|
|
80
|
+
`HEALTH_CURL_CONNECT_TIMEOUT` and `HEALTH_CURL_MAX_TIME`.
|
|
81
|
+
- S5 SSH reads use non-interactive SSH options plus `SSH_COMMAND_TIMEOUT` when
|
|
82
|
+
the local `timeout` command is available.
|
|
83
|
+
- If a deployment was interrupted, inspect `scripts/orchestrate.sh status`,
|
|
84
|
+
stop only leftover local `orchestrate.sh`/`curl`/`ssh` children for that run,
|
|
85
|
+
and resume with `P2P_EXISTING_STATE_ACTION=continue`.
|
|
86
|
+
- If SSH to the instance is blocked but AWS access still works, attach a
|
|
87
|
+
temporary SSM role and use SSM Run Command to read `/opt/p2p/bootstrap.json`
|
|
88
|
+
without printing secrets. Remove or audit the temporary role after recovery.
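
The bounded calls described in the first two items of this list can be sketched as follows. This is not the repository's implementation: the default values, the helper names `run_with_timeout` and `health_once`, and the SSH user and key path in the usage comment are all assumptions; only the three variable names come from this note.

```bash
# Assumed defaults; the real values used by the S4/S5 scripts are not shown here.
HEALTH_CURL_CONNECT_TIMEOUT="${HEALTH_CURL_CONNECT_TIMEOUT:-5}"
HEALTH_CURL_MAX_TIME="${HEALTH_CURL_MAX_TIME:-10}"
SSH_COMMAND_TIMEOUT="${SSH_COMMAND_TIMEOUT:-30}"

# Runs "$@" under `timeout SECONDS` when that command exists, otherwise unbounded.
run_with_timeout() {
  local secs="$1"
  shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    "$@"
  fi
}

# One bounded health probe against https://<domain>/healthz; returns curl's status.
health_once() {
  curl -skf --connect-timeout "$HEALTH_CURL_CONNECT_TIMEOUT" \
    --max-time "$HEALTH_CURL_MAX_TIME" "https://$1/healthz"
}

# Example (user, host, and key path are placeholders):
#   run_with_timeout "$SSH_COMMAND_TIMEOUT" ssh -o BatchMode=yes -o ConnectTimeout=10 \
#     -i key.pem ec2-user@host 'sudo cat /opt/p2p/bootstrap.json'
```

Bounding every attempt means an interrupted or proxied connection fails fast and the polling loop can retry, instead of leaving a child process hanging.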
|
|
89
|
+
|
|
90
|
+
## DNS And State Handling
|
|
91
|
+
|
|
92
|
+
For `DOMAIN_MODE=user`, S3 intentionally stops after allocating the EIP and waits
|
|
93
|
+
until the real DNS A record points at that EIP. Continue only after public DNS
|
|
94
|
+
resolves correctly. This avoids Caddy and Let's Encrypt racing DNS propagation.
|
|
95
|
+
|
|
96
|
+
When rerunning after a resource was created, set:
|
|
97
|
+
|
|
98
|
+
```bash
|
|
99
|
+
P2P_EXISTING_STATE_ACTION=continue
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
This is deliberate. It prevents accidental duplicate EC2/EIP creation or unsafe
|
|
103
|
+
reuse of an old deployment state.
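
The guard behind this variable might be shaped like the sketch below. It is purely illustrative: `check_existing_state` is a hypothetical name, and the orchestrator's real logic, messages, and any other accepted values of `P2P_EXISTING_STATE_ACTION` are not shown in this note.

```bash
# Hypothetical guard: refuse to proceed when a state file already exists
# unless the operator has opted in to resuming it.
check_existing_state() {
  local state_file="$1"
  if [ -f "$state_file" ] && [ "${P2P_EXISTING_STATE_ACTION:-}" != "continue" ]; then
    echo "state exists at $state_file; set P2P_EXISTING_STATE_ACTION=continue to resume" >&2
    return 1
  fi
}

# Example:
#   check_existing_state "$HOME/.direxio/nodes/<service_id>/state.json" || exit 1
```

Failing closed when state exists is what prevents a rerun from silently allocating a second EC2 instance or EIP.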
|
|
104
|
+
|
|
105
|
+
## Credential Safety
|
|
106
|
+
|
|
107
|
+
Prefer a temporary `DirexioDeployer` IAM user or dedicated IAM role for routine
|
|
108
|
+
deployment. Root access keys are allowed when the operator explicitly chooses
|
|
109
|
+
them; report that the identity is root and remind the operator to rotate or
|
|
110
|
+
remove the key when it is no longer needed.
|
|
111
|
+
|
|
112
|
+
Do not store AWS AK/SK in skill files, docs, or committed repo files. Treat
|
|
113
|
+
`state.json`, `outputs.json`, and `~/.direxio/nodes/<service_id>/credentials.json` as local
|
|
114
|
+
secrets because they contain the portal/agent token after S5.
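
One way to treat these files as secrets is to restrict them to the owning user. The sketch below is an assumption: `secure_node_files` is a hypothetical helper, and it assumes that `state.json`, `outputs.json`, and `credentials.json` all live in the same node directory.

```bash
# Hypothetical helper: make the node directory and its secret-bearing files
# readable by the owner only. Files that do not exist are skipped.
secure_node_files() {
  local dir="$1" f
  chmod 700 "$dir" || return 1
  for f in state.json outputs.json credentials.json; do
    if [ -f "$dir/$f" ]; then
      chmod 600 "$dir/$f" || return 1
    fi
  done
}

# Example:
#   secure_node_files "$HOME/.direxio/nodes/<service_id>"
```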
|
|
115
|
+
|
|
116
|
+
## Route53 Delegation From Third-Party Registrar
|
|
117
|
+
|
|
118
|
+
Symptom:
|
|
119
|
+
|
|
120
|
+
- User chose `DOMAIN_MODE=route53` but the domain is registered at Alibaba Cloud / GoDaddy / Cloudflare (not AWS Route53 registrar).
|
|
121
|
+
- S3 creates or reuses a Route53 hosted zone and upserts the A record, but public
|
|
122
|
+
DNS still does not resolve to the new IP.
|
|
123
|
+
|
|
124
|
+
Cause:
|
|
125
|
+
|
|
126
|
+
S3 can create the Route53 hosted zone, but Route53 does not become
|
|
127
|
+
authoritative until the current registrar delegates the zone's NS records. When
|
|
128
|
+
the domain administrator is a third party, the user or a provider-specific DNS
|
|
129
|
+
connector must update NS delegation outside AWS.
|
|
130
|
+
|
|
131
|
+
Fix procedure:
|
|
132
|
+
|
|
133
|
+
1. Read the created or reused zone details from `state.json`:
|
|
134
|
+
```bash
|
|
135
|
+
jq '.resources | {route53_zone_id, route53_zone_name, route53_name_servers}' ~/.direxio/nodes/<service_id>/state.json
|
|
136
|
+
```
|
|
137
|
+
2. Delegate those NS servers at the current registrar, or use the provider API
|
|
138
|
+
if credentials are available.
|
|
139
|
+
3. Wait for authoritative NS and A-record propagation.
|
|
140
|
+
4. Re-run `scripts/orchestrate.sh` with `P2P_EXISTING_STATE_ACTION=continue`.
|
|
141
|
+
|
|
142
|
+
DNS propagation of new NS records can take minutes to hours. After the user
|
|
143
|
+
confirms the change, verify with `nslookup -type=NS <DOMAIN>` or
|
|
144
|
+
`dig NS <DOMAIN> +short`. The S3 phase's `_require_user_dns_ready()` will
|
|
145
|
+
handle the A-record wait loop.
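
Comparing the delegated name servers with the ones recorded for the hosted zone is easy to get wrong by hand because of case, trailing dots, and ordering. The sketch below normalizes both lists before comparing them. The helper names are hypothetical, and the `jq` path in the usage comment assumes `route53_name_servers` is stored as a JSON array.

```bash
# Hypothetical helpers for comparing two name-server lists.

# Lowercases, strips one trailing dot, drops empty lines, sorts uniquely.
normalize_ns() {
  tr 'A-Z' 'a-z' | sed -e 's/\.$//' -e '/^$/d' | sort -u
}

# ns_sets_match LIST_A LIST_B (newline-separated) succeeds when both sets are equal.
ns_sets_match() {
  [ "$(printf '%s\n' "$1" | normalize_ns)" = "$(printf '%s\n' "$2" | normalize_ns)" ]
}

# Example (assumes the state field is an array):
#   expected="$(jq -r '.resources.route53_name_servers[]' ~/.direxio/nodes/<service_id>/state.json)"
#   actual="$(dig NS <DOMAIN> +short)"
#   ns_sets_match "$expected" "$actual" && echo "delegation matches the hosted zone"
```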
|
|
146
|
+
|
|
147
|
+
Always report:
|
|
148
|
+
|
|
149
|
+
- App domain and eight-digit app initialization code, with the code sourced from the backend `password` field.
|
|
150
|
+
- Portal token or where it was written.
|
|
151
|
+
- `~/.direxio/nodes/<service_id>/credentials.json` status and profile shape.
|
|
152
|
+
- AWS region, EC2 instance ID, public IP, security group, state path, SSH command.
|
|
153
|
+
- Stop-billing guidance: ask the agent to destroy this node when finished; AWS resources keep billing until teardown completes.
|
|
154
|
+
- Any manual DNS record the user owns outside Route53.
|
|
155
|
+
|
|
156
|
+
## Let's Encrypt Certificate Rate Limits
|
|
157
|
+
|
|
158
|
+
Symptom:
|
|
159
|
+
|
|
160
|
+
- S4_BOOTSTRAP_STACK health check times out after 5-10 minutes.
|
|
161
|
+
- SSH reveals all containers are up and healthy (caddy, message-server, postgres, coturn).
|
|
162
|
+
- `docker logs p2p-caddy-1` shows repeated errors:
|
|
163
|
+
|
|
164
|
+
```text
|
|
165
|
+
HTTP 429 urn:ietf:params:acme:error:rateLimited - too many certificates (5)
|
|
166
|
+
already issued for this exact set of identifiers in the last 168h0m0s,
|
|
167
|
+
retry after ...
|
|
168
|
+
```
|
|
169
|
+
|
|
170
|
+
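Cause: Let's Encrypt issues at most 5 certificates for the same exact set of identifiers (hostnames) per 168 hours (7 days), as the error message states. Redeploying the same domain repeatedly within a week, each time requesting a fresh certificate, exhausts this quota.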
|
|
171
|
+
|
|
172
|
+
Workaround (use when the health check is the only blocker and the rate limit is temporary):
|
|
173
|
+
|
|
174
|
+
1. **Add `tls internal` to the Caddyfile** so Caddy issues the certificate from its built-in local CA, which browsers and other clients do not trust by default. The directive goes on the line after the site block opener.
|
|
175
|
+
|
|
176
|
+
2. Write the modified Caddyfile to the remote host. Use base64+SSH to avoid shell escaping issues:
|
|
177
|
+
```bash
|
|
178
|
+
echo '<base64-encoded-caddyfile>' | base64 -d | sudo tee /opt/p2p/Caddyfile
|
|
179
|
+
sudo docker compose -f /opt/p2p/docker-compose.yml restart caddy
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
3. Wait 5 seconds, then verify HTTPS works:
|
|
183
|
+
```bash
|
|
184
|
+
curl -sk --resolve <domain>:443:<EIP> https://<domain>/healthz
|
|
185
|
+
# Expected: {"status":"ok"}
|
|
186
|
+
```
|
|
187
|
+
|
|
188
|
+
4. Resume orchestrate.sh with:
|
|
189
|
+
```bash
|
|
190
|
+
P2P_EXISTING_STATE_ACTION=continue bash scripts/orchestrate.sh
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
5. **After deployment completes**, restore the original Caddyfile (remove `tls internal`) and restart Caddy. Caddy will retry the production Let's Encrypt cert when the rate limit resets. The self-signed cert is a temporary bridge; HTTPS will show a browser warning until the production cert is obtained.
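
Step 1 can be scripted instead of edited by hand. The sketch below is an illustration under assumptions, not a tool shipped with the deployer: `add_tls_internal` is a hypothetical name, and it treats the first unindented line that ends with `{` and is not a lone `{` as the site block opener. A Caddyfile that defines snippets or nested global options before the site block would need a different rule, and the helper is not idempotent.

```bash
# Hypothetical helper: copies a Caddyfile from stdin to stdout and inserts
# "tls internal" directly after the first unindented "<address> {" line.
add_tls_internal() {
  awk '
    !inserted && /^[^ \t#].*[{][ \t]*$/ {
      print
      print "\ttls internal"
      inserted = 1
      next
    }
    { print }
  '
}

# Example (run locally, then ship the result as in step 2):
#   add_tls_internal < Caddyfile > Caddyfile.internal
```

Keeping the original file untouched makes step 5 a simple matter of copying it back.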
|
|
194
|
+
|
|
195
|
+
Prevention:
|
|
196
|
+
|
|
197
|
+
- Use a separate subdomain per deployment cycle (e.g. `__DOMAIN_A__`, `__DOMAIN_B__`) when doing repeated test deployments within 7 days.
|
|
198
|
+
- Preserve the old `caddy-data` Docker volume on redeploy to carry forward the existing certificate.
|
|
199
|
+
|
|
200
|
+
## Route53 Duplicate Zone Detection
|
|
201
|
+
|
|
202
|
+
Symptom: A new hosted zone was created via `aws route53 create-hosted-zone` for a domain that already had a Route53 zone from a prior deployment. The NS records of the new zone do not match the NS records configured at the registrar, so DNS resolution still uses the old zone's servers.
|
|
203
|
+
|
|
204
|
+
Fix:
|
|
205
|
+
|
|
206
|
+
1. List all existing zones first: `aws route53 list-hosted-zones --query 'HostedZones[*].[Name,Id]'`
|
|
207
|
+
2. Check which zone's NS servers match DNS: `nslookup -type=NS <domain>`
|
|
208
|
+
3. If the domain is already delegated to Route53, use the matching existing zone.
|
|
209
|
+
4. Delete the duplicate: `aws route53 delete-hosted-zone --id /hostedzone/<DUPLICATE_ID>`
|
|
210
|
+
|
|
211
|
+
Prevention:
|
|
212
|
+
|
|
213
|
+
- Before deployment, check for existing zones. If one exists and its NS records
|
|
214
|
+
match current DNS delegation, no new zone is needed; S3 will reuse it.
|
|
215
|
+
- Let S3 create a new hosted zone only when deploying a domain with no matching
|
|
216
|
+
Route53 zone or when migrating DNS delegation for the first time.
|
|
217
|
+
- Destroy attempts to delete hosted zones recorded as created by the deployer;
|
|
218
|
+
user-owned or pre-existing zones are retained.
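
The pre-deployment check for existing zones can be partly automated by looking for zone names that appear more than once. The following sketch assumes tab- or space-separated `Name Id` lines such as those produced by adding `--output text` to the listing command above; `duplicate_zone_names` is a hypothetical helper.

```bash
# Hypothetical helper: reads "Name Id" lines on stdin and prints every zone
# name that occurs more than once, sorted.
duplicate_zone_names() {
  awk '{ count[$1]++ } END { for (name in count) if (count[name] > 1) print name }' | sort
}

# Example:
#   aws route53 list-hosted-zones --query 'HostedZones[*].[Name,Id]' --output text \
#     | duplicate_zone_names
```

Any name it prints still needs the manual NS comparison from the fix steps to decide which zone is the delegated one.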
|
|
@@ -0,0 +1,317 @@
|
|
|
1
|
+
# Deployment Optimization Audit
|
|
2
|
+
|
|
3
|
+
This file maps the 2026-06-26 deployment discussion checklist to the current
|
|
4
|
+
`direxio-deployer` branch. It is intentionally a deployer-side audit, not a
|
|
5
|
+
claim that every App or host-agent runtime has been proven in a real session.
|
|
6
|
+
|
|
7
|
+
## Status Legend
|
|
8
|
+
|
|
9
|
+
- Deployer-side implemented: implemented or guarded in this repository with
|
|
10
|
+
scripts, docs, and regression tests.
|
|
11
|
+
- Runtime evidence still required: this repository can prepare or check the
|
|
12
|
+
condition, but a real App, Codex, OpenClaw, Hermes, or MCP runtime must still
|
|
13
|
+
provide final evidence.
|
|
14
|
+
- Deferred by design: not part of the current EC2 MVP path.
|
|
15
|
+
|
|
16
|
+
## Current Best Plan
|
|
17
|
+
|
|
18
|
+
Current best plan is the stricter plan now encoded in this branch:
|
|
19
|
+
|
|
20
|
+
1. Keep one EC2 MVP path first, with `t3.small` default, dynamic cost estimate,
|
|
21
|
+
Route53 automation, temporary IAM user guidance, and destroy evidence.
|
|
22
|
+
2. Keep all node-local deployment state under `~/.direxio/nodes/<service_id>/`.
|
|
23
|
+
Do not mutate global host MCP configs or assume one computer has only one
|
|
24
|
+
agent or one backend node.
|
|
25
|
+
3. Treat S7 as automated foundation checks only. `verify runtime is an internal
|
|
26
|
+
non-polluting check`; it is not enough to declare the product complete.
|
|
27
|
+
4. Require explicit user App initialization and real chat evidence before user
|
|
28
|
+
gates can be confirmed.
|
|
29
|
+
5. Keep Agent/MCP validation non-polluting by default, then let the user decide
|
|
30
|
+
whether to send a real message in the Agent chat box.
|
|
31
|
+
6. Keep update/reset/destroy as separate operations with separate receipts;
|
|
32
|
+
update/reset are now first-class scripts, not runbook-only manual actions.
|
|
33
|
+
7. Treat update/reset follow-up as a Local refresh state: update/reset cleared
|
|
34
|
+
old credentials, user confirmations, runtime checks, and bridge install
|
|
35
|
+
proof, so the next action is to rerun S4-S7 and runtime checks.
|
|
36
|
+
8. Keep Lightsail out of the current user-facing path. Lightsail remains
|
|
37
|
+
deferred until it has an independent resource model, pricing, state,
|
|
38
|
+
destroy, and test matrix.
|
|
39
|
+
|
|
40
|
+
Audit anchors:
|
|
41
|
+
- verify runtime is an internal non-polluting check
|
|
42
|
+
- user App initialization and real chat evidence
|
|
43
|
+
- update/reset are now first-class scripts
|
|
44
|
+
- Local refresh
|
|
45
|
+
- Lightsail remains deferred
|
|
46
|
+
|
|
47
|
+
## Requirement Mapping
|
|
48
|
+
|
|
49
|
+
### DEPLOY-P0-001 - Do Not Declare Completion Early
|
|
50
|
+
|
|
51
|
+
Status: Deployer-side implemented, with Runtime evidence still required.
|
|
52
|
+
|
|
53
|
+
Current evidence:
|
|
54
|
+
- `SKILL.md` defines Product Completion Gates and says S7 green is not final
|
|
55
|
+
product completion.
|
|
56
|
+
- `scripts/orchestrate.sh confirm app_initialization|real_chat|agent_mcp_runtime`
|
|
57
|
+
requires explicit evidence and rejects short generic confirmations.
|
|
58
|
+
- `tests/user_confirmation_gates_test.sh` and `tests/operation_report_test.sh`
|
|
59
|
+
assert pending gates and redacted evidence.
|
|
60
|
+
|
|
61
|
+
Difference from the checklist:
|
|
62
|
+
- The checklist asked for product gates. The current branch implements them as
|
|
63
|
+
state/report fields instead of a single "done" word.
|
|
64
|
+
|
|
65
|
+
Remaining evidence:
|
|
66
|
+
- Real user App initialization and real chat evidence.
|
|
67
|
+
- Real selected runtime confirmation that service-scoped MCP tools loaded.
|
|
68
|
+
|
|
69
|
+
### DEPLOY-P0-002 - OpenClaw Runtime Acceptance
|
|
70
|
+
|
|
71
|
+
Status: Deployer-side implemented, with Runtime evidence still required.
|
|
72
|
+
|
|
73
|
+
Current evidence:
|
|
74
|
+
- Runtime snippets are written under `~/.direxio/nodes/<service_id>/mcp/`.
|
|
75
|
+
- `verify mcp_doctor`, `verify mcp_tools`, `verify mcp_smoke`, and
|
|
76
|
+
`verify runtime` are available.
|
|
77
|
+
- `mcp_smoke` uses a read-only backend action by default.
|
|
78
|
+
- `agent_mcp_runtime` confirmation requires both a passed runtime summary and
|
|
79
|
+
`DIREXIO_CONFIRM_RUNTIME_PROBE=1`.
|
|
80
|
+
|
|
81
|
+
Difference from the checklist:
|
|
82
|
+
- The deployer does not auto-send a test chat message. That is deliberate:
|
|
83
|
+
internal probes are non-polluting, while user-visible chat proof stays a user
|
|
84
|
+
action.
|
|
85
|
+
|
|
86
|
+
Remaining evidence:
|
|
87
|
+
- A real OpenClaw/Hermes/Codex runtime must prove it loaded the exact
|
|
88
|
+
service-scoped MCP snippet and can use it.
|
|
89
|
+
|
|
90
|
+
### DEPLOY-P0-003 - Cost And Destroy Loop
|
|
91
|
+
|
|
92
|
+
Status: Deployer-side implemented.
|
|
93
|
+
|
|
94
|
+
Current evidence:
|
|
95
|
+
- `scripts/pricing-estimate.sh` records EC2, EBS, public IPv4/EIP, and Route53
|
|
96
|
+
estimate fields, with fallback status when AWS Pricing cannot answer.
|
|
97
|
+
- `scripts/destroy.sh` reads AWS resources back and records `destroy.evidence`.
|
|
98
|
+
- `operation-report.json` includes `billing.destroy_cleanup_status` and
|
|
99
|
+
`billing.possible_remaining_billable_resources`.
|
|
100
|
+
- Tests reject old wording that attached public IPv4/EIP is free.
|
|
101
|
+
|
|
102
|
+
Difference from the checklist:
|
|
103
|
+
- The checklist asked for deploy and destroy receipts. The current branch adds
|
|
104
|
+
operation-scoped machine-readable reports so future agents can audit residue.
|
|
105
|
+
|
|
106
|
+
Remaining evidence:
|
|
107
|
+
- User should still review AWS Billing Console and AWS Budget status in the AWS
|
|
108
|
+
account, because credits, tax, transfer, and usage are account-specific.
|
|
109
|
+
|
|
110
|
+
### DEPLOY-P0-004 - Domain And DNS Automation Protection
|
|
111
|
+
|
|
112
|
+
Status: Deployer-side implemented, with external-provider limits.
|
|
113
|
+
|
|
114
|
+
Current evidence:
|
|
115
|
+
- Route53 hosted zone reuse/create is automated.
|
|
116
|
+
- Existing A record overwrite is blocked unless
|
|
117
|
+
`DIREXIO_CONFIRM_DNS_OVERWRITE=1`.
|
|
118
|
+
- Authoritative DNS checks and tests cover hosted-zone and overwrite behavior.
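
The overwrite protection in this list can be pictured as a small decision function. The sketch below is hypothetical and only illustrates the rule as described; in particular, treating an existing record that already holds the new IP as safe is an assumption, and `may_upsert_a_record` is not a function from the repository.

```bash
# Hypothetical guard: may_upsert_a_record EXISTING_IP NEW_IP
# Succeeds when writing the A record is allowed.
may_upsert_a_record() {
  local existing="$1" new="$2"
  [ -z "$existing" ] && return 0        # no record yet: nothing to overwrite
  [ "$existing" = "$new" ] && return 0  # record already points at the new IP (assumed safe)
  [ "${DIREXIO_CONFIRM_DNS_OVERWRITE:-0}" = "1" ]
}

# Example:
#   may_upsert_a_record "$current_ip" "$eip" || { echo "refusing to overwrite A record" >&2; exit 1; }
```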
|
|
119
|
+
|
|
120
|
+
Difference from the checklist:
|
|
121
|
+
- The current MVP automates Route53. If the domain is managed by another DNS
|
|
122
|
+
provider without an available API or authorization, the correct state is
|
|
123
|
+
waiting for authorization, not pretending completion.
|
|
124
|
+
|
|
125
|
+
Remaining evidence:
|
|
126
|
+
- Third-party DNS providers still need provider-specific automation before they
|
|
127
|
+
can be treated like Route53.
|
|
128
|
+
|
|
129
|
+
### DEPLOY-P0-005 - Authorization And Security Boundary
|
|
130
|
+
|
|
131
|
+
Status: Deployer-side implemented.
|
|
132
|
+
|
|
133
|
+
Current evidence:
|
|
134
|
+
- `scripts/aws-credentials.sh import-csv|verify` imports local CSV credentials,
|
|
135
|
+
tightens file permissions, blocks root identity, and redacts identity output.
|
|
136
|
+
- `SKILL.md` documents the temporary `DirexioDeployer` IAM user path with
|
|
137
|
+
temporary `AdministratorAccess`, then cleanup.
|
|
138
|
+
- Reports and tests assert secrets are redacted and not written to reports.
|
|
139
|
+
|
|
140
|
+
Difference from the checklist:
|
|
141
|
+
- The current branch chooses the practical MVP path: temporary IAM admin user,
|
|
142
|
+
no root access keys, and cleanup guidance after deployment.
|
|
143
|
+
|
|
144
|
+
Remaining evidence:
|
|
145
|
+
- Long-term least-privilege IAM generation is still a future hardening task.
|
|
146
|
+
|
|
147
|
+
### DEPLOY-P1-001 - Instance And Region Choice
|
|
148
|
+
|
|
149
|
+
Status: Deployer-side implemented.
|
|
150
|
+
|
|
151
|
+
Current evidence:
|
|
152
|
+
- `SKILL.md` keeps the current MVP path as EC2 `t3.small` by default.
|
|
153
|
+
- Pricing is region-aware where AWS Pricing lookup succeeds, otherwise marked
|
|
154
|
+
fallback.
|
|
155
|
+
- Docs steer ordinary users away from `t2.micro`/`t3.micro` as default
|
|
156
|
+
production nodes.
|
|
157
|
+
|
|
158
|
+
Difference from the checklist:
|
|
159
|
+
- The current plan is simpler than a user choice matrix: recommend one default
|
|
160
|
+
path, then allow explicit upgrade when the user asks for heavier usage.
|
|
161
|
+
|
|
162
|
+
Remaining evidence:
|
|
163
|
+
- Region choice still depends on where the user and their contacts are.
|
|
164
|
+
|
|
165
|
+
### DEPLOY-P1-002 - EC2 And Lightsail Path Separation
|
|
166
|
+
|
|
167
|
+
Status: Deployer-side implemented for EC2 boundary, Deferred by design for
|
|
168
|
+
Lightsail.
|
|
169
|
+
|
|
170
|
+
Current evidence:
|
|
171
|
+
- `SKILL.md` says the current MVP deployment path is EC2-only.
|
|
172
|
+
- Tests reject wording that offers Lightsail as an implemented automatic path.
|
|
173
|
+
|
|
174
|
+
Difference from the checklist:
|
|
175
|
+
- No current script attempts to mix Lightsail into the EC2 state machine. That
|
|
176
|
+
is the safer plan.
|
|
177
|
+
|
|
178
|
+
Remaining evidence:
|
|
179
|
+
- A future `deploy_mode=lightsail` must have independent provision, DNS,
|
|
180
|
+
state, pricing, destroy, and tests before being offered.
|
|
181
|
+
|
|
182
|
+
### DEPLOY-P1-003 - Recovery For Nontechnical Users
|
|
183
|
+
|
|
184
|
+
Status: Deployer-side implemented.
|
|
185
|
+
|
|
186
|
+
Current evidence:
|
|
187
|
+
- `orchestrate.sh status` prints a recovery summary with phase, billing impact,
|
|
188
|
+
resume safety, next action, and stop-loss guidance.
|
|
189
|
+
- `reset` warns that resetting state can lose destroy resource records.
|
|
190
|
+
- Tests cover recovery output shape.

Difference from the checklist:
- The branch moves recovery language into the state command so it is available
  during real interrupted deployments, not only in docs.

Remaining evidence:
- Real failures should still be reviewed after live deploy runs to improve
  phase-specific wording.

### DEPLOY-P1-004 - Operation-Specific Receipts

Status: Deployer-side implemented.

Current evidence:
- `operation-report.json` supports `new_deploy`, `repair_or_verify`, `update`,
  `reset_app_data`, and `destroy`.
- `scripts/update.sh` updates an existing EC2 node without recreating infra or
  deleting data volumes.
- `scripts/reset-app-data.sh` clears application data only after strong
  confirmation and preserves infra/TLS state.
- `scripts/destroy.sh` records AWS cleanup evidence.
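
A reader of `operation-report.json` could guard on the five operation types listed above. In this sketch the operation names come from the text, but the `operation` field name and `parseOperationReport` are assumptions, since the report schema is not shown here.

```javascript
// Operation names are taken from the list above.
const OPERATION_TYPES = ["new_deploy", "repair_or_verify", "update", "reset_app_data", "destroy"];

// Assumes the report stores its type under an `operation` key (hypothetical).
function parseOperationReport(jsonText) {
  const report = JSON.parse(jsonText);
  if (!OPERATION_TYPES.includes(report.operation)) {
    throw new Error(`unsupported operation: ${String(report.operation)}`);
  }
  return report;
}
```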

Difference from the checklist:
- The checklist said update/reset first-class scripts were still future work.
  They are now present in this branch.

Remaining evidence:
- Live update/reset should be exercised against a disposable deployed node
  before treating them as production-proven.

### DEPLOY-P1-005 - Redeploy Token Refresh

Status: Deployer-side implemented.

Current evidence:
- S5 refreshes bootstrap credentials from the server.
- S6 rewrites service-scoped `credentials.json`, `env`, cc-connect config, and
  MCP snippets.
- Update/reset mark S4-S7 pending and report refresh-pending status.
- Update/reset stop only the matching service-scoped direxio-connect daemon
  when its `WorkDir` matches the current service, so stale local bridge
  processes do not keep using old credentials.
- `status` reports Local refresh when update/reset cleared old credentials,
  user confirmations, runtime checks, and bridge install proof.
- Runtime checks fail closed when a stale service directory or wrong `WorkDir`
  is detected.
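
The `WorkDir` matching and fail-closed behaviour described above can be sketched as below. This is an illustration under assumptions: the daemon records, the `workDir` property, and all three function names are hypothetical, not the deployer's real data model.

```javascript
// Strip trailing slashes so "/srv/a/" and "/srv/a" compare equal.
function normalizeDir(dir) {
  return dir.replace(/\/+$/, "");
}

// Select only daemons whose WorkDir belongs to the current service;
// daemons for other services are left running.
function selectDaemonsToStop(daemons, serviceDir) {
  const target = normalizeDir(serviceDir);
  return daemons.filter(
    (d) => typeof d.workDir === "string" && normalizeDir(d.workDir) === target
  );
}

// Fail closed: a missing or mismatched WorkDir counts as a failed check.
function runtimeCheckPasses(daemon, serviceDir) {
  if (!daemon || typeof daemon.workDir !== "string") return false;
  return normalizeDir(daemon.workDir) === normalizeDir(serviceDir);
}
```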

Difference from the checklist:
- The current path is service-scoped by domain-derived service id, so multiple
  nodes can coexist without global credential pollution.

Remaining evidence:
- If a real runtime reports 401/403 after reset, first verify it is using the
  current service-scoped credential file before blaming the backend.

### DEPLOY-P2-001 - Automated And Human Acceptance Layers

Status: Deployer-side implemented, with Runtime evidence still required.

Current evidence:
- Delivery wording is "Automated Deployment Gates Passed", not final product
  completion.
- Reports keep automatic gates separate from user confirmation gates.
- Runtime confirmation has an explicit stricter path.

Difference from the checklist:
- The better plan is a layered state model: automated gates passed, user
  initialization pending, real chat pending, runtime confirmation pending.
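
The layered state model above could be represented as an ordered list of layers, each either passed or pending. The sketch below is hypothetical: the layer identifiers and both function names are assumptions chosen for illustration, and the deployer's own state keys may differ.

```javascript
// Ordered layers from the model above; identifiers are illustrative only.
const LAYERS = ["automated_gates", "user_initialization", "real_chat", "runtime_confirmation"];

// Reports every layer separately instead of one overall "done" flag.
function describeAcceptance(passed) {
  return LAYERS.map((layer) => `${layer}: ${passed.includes(layer) ? "passed" : "pending"}`);
}

// The first pending layer is the next thing to confirm, or null if none remain.
function firstPendingLayer(passed) {
  return LAYERS.find((layer) => !passed.includes(layer)) || null;
}
```

Reporting each layer on its own line avoids implying that passing the automated gates means the product is fully accepted.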

Remaining evidence:
- Human confirmation is still required for App initialization and real chat.

### DEPLOY-P2-002 - User Initialization And Message Loop

Status: Deployer-side implemented for wording and gates, Runtime evidence still
required for the App loop.

Current evidence:
- Docs and delivery call the user-facing value an eight-digit app
  initialization code.
- Old wording around direct IM login is rejected by structure tests.
- `confirm app_initialization` and `confirm real_chat` record user evidence.

Difference from the checklist:
- The deployer can record and guard the App path, but the App itself must prove
  the domain plus eight-digit code flow and message readback.

Remaining evidence:
- Real App or simulator evidence that initialization completes, a user message
  is stored/synced, and the message can be read after refresh/restart.

### DEPLOY-P2-003 - Agent Agents Room Loop

Status: Deployer-side implemented for internal checks, Runtime evidence still
required for user-visible chat.

Current evidence:
- Non-polluting MCP doctor, tools discovery, read-only smoke, and aggregate
  runtime checks exist.
- `real_chat` cannot be confirmed from internal non-polluting probes alone.
- `agent_mcp_runtime` requires explicit runtime probe evidence.
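
The two confirmation rules above can be sketched as a small guard. The gate names `real_chat` and `agent_mcp_runtime` come from the text; the evidence object, its `source` values, and `canConfirm` are assumptions made for this example.

```javascript
// Hypothetical evidence shape: { source: "user" | "internal_probe" | "runtime_probe" }.
function canConfirm(gate, evidence) {
  const source = evidence && evidence.source;
  if (gate === "real_chat") {
    // Internal non-polluting probes alone are never enough for real chat.
    return source === "user";
  }
  if (gate === "agent_mcp_runtime") {
    // Requires explicit runtime probe evidence.
    return source === "runtime_probe";
  }
  // Unknown gates are refused rather than silently accepted.
  return false;
}
```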

Difference from the checklist:
- The user-facing proof stays simple: the user sends a message and sees the
  agent reply. Internal gate details stay as agent diagnostics.

Remaining evidence:
- Real chat in the selected host runtime, using the current service-scoped
  agent room and token.

### DEPLOY-P2-004 - Call And TURN Acceptance

Status: Deployer-side implemented for basic deployment acceptance.

Current evidence:
- S7 checks Matrix `turnServer` and requires non-empty valid TURN credentials.
- Security group and compose include coturn ports and configuration.
- `references/voip-turn-runbook.md` distinguishes basic deploy acceptance from
  real media-call testing.
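
A check on the Matrix `turnServer` response might look like the sketch below. The `username`, `password`, `uris`, and `ttl` fields follow the Matrix client-server API; the specific validity rules and the name `isUsableTurnResponse` are assumptions, since the actual S7 criteria are not shown here.

```javascript
// Validates the parsed JSON body of a Matrix turnServer response.
// The exact rules used by S7 are not shown; these are illustrative.
function isUsableTurnResponse(body) {
  if (!body || typeof body !== "object") return false;
  const nonEmpty = (v) => typeof v === "string" && v.trim().length > 0;
  // Credentials must both be present and non-empty.
  if (!nonEmpty(body.username) || !nonEmpty(body.password)) return false;
  // At least one TURN URI, each using the turn: or turns: scheme.
  if (!Array.isArray(body.uris) || body.uris.length === 0) return false;
  if (!body.uris.every((u) => typeof u === "string" && /^turns?:/.test(u))) return false;
  // A positive lifetime in seconds.
  return Number.isInteger(body.ttl) && body.ttl > 0;
}
```

Passing this check only shows that the server hands out credentials; it says nothing about whether media actually flows between devices.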

Difference from the checklist:
- The current branch keeps real voice/video calls out of the blocking deploy
  gate. That is better because calls depend on devices, NAT, permissions, and
  client behavior beyond server deployment.

Remaining evidence:
- Real device call testing belongs to a VoIP-specific test pass, not every
  normal deployment.