@dp-pcs/ogp 0.7.2 → 0.8.1

Files changed (63)
  1. package/README.md +59 -12
  2. package/dist/cli/config.d.ts +4 -0
  3. package/dist/cli/config.d.ts.map +1 -1
  4. package/dist/cli/config.js +45 -2
  5. package/dist/cli/config.js.map +1 -1
  6. package/dist/cli/expose.d.ts +4 -1
  7. package/dist/cli/expose.d.ts.map +1 -1
  8. package/dist/cli/expose.js +7 -106
  9. package/dist/cli/expose.js.map +1 -1
  10. package/dist/cli/install.d.ts +1 -0
  11. package/dist/cli/install.d.ts.map +1 -1
  12. package/dist/cli/install.js +8 -2
  13. package/dist/cli/install.js.map +1 -1
  14. package/dist/cli/project.d.ts +24 -0
  15. package/dist/cli/project.d.ts.map +1 -1
  16. package/dist/cli/project.js +68 -15
  17. package/dist/cli/project.js.map +1 -1
  18. package/dist/cli/tunnel.d.ts +65 -0
  19. package/dist/cli/tunnel.d.ts.map +1 -0
  20. package/dist/cli/tunnel.js +413 -0
  21. package/dist/cli/tunnel.js.map +1 -0
  22. package/dist/cli.js +21 -8
  23. package/dist/cli.js.map +1 -1
  24. package/dist/daemon/contribution-signing.d.ts +49 -0
  25. package/dist/daemon/contribution-signing.d.ts.map +1 -0
  26. package/dist/daemon/contribution-signing.js +91 -0
  27. package/dist/daemon/contribution-signing.js.map +1 -0
  28. package/dist/daemon/message-handler.js +41 -18
  29. package/dist/daemon/message-handler.js.map +1 -1
  30. package/dist/daemon/openclaw-bridge.d.ts +6 -0
  31. package/dist/daemon/openclaw-bridge.d.ts.map +1 -1
  32. package/dist/daemon/openclaw-bridge.js +27 -12
  33. package/dist/daemon/openclaw-bridge.js.map +1 -1
  34. package/dist/daemon/peers.d.ts.map +1 -1
  35. package/dist/daemon/peers.js +19 -0
  36. package/dist/daemon/peers.js.map +1 -1
  37. package/dist/daemon/projects.d.ts +20 -0
  38. package/dist/daemon/projects.d.ts.map +1 -1
  39. package/dist/daemon/projects.js +70 -0
  40. package/dist/daemon/projects.js.map +1 -1
  41. package/dist/daemon/server.d.ts.map +1 -1
  42. package/dist/daemon/server.js +43 -2
  43. package/dist/daemon/server.js.map +1 -1
  44. package/dist/daemon/state-lock.d.ts +23 -0
  45. package/dist/daemon/state-lock.d.ts.map +1 -0
  46. package/dist/daemon/state-lock.js +115 -0
  47. package/dist/daemon/state-lock.js.map +1 -0
  48. package/package.json +13 -3
  49. package/scripts/completion.bash +25 -6
  50. package/scripts/completion.zsh +26 -8
  51. package/skills/ogp-expose/SKILL.md +40 -10
  52. package/docs/RC1-FEDERATION-TEST-CHECKLIST.md +0 -477
  53. package/docs/case-studies/CRASH_RESOLUTION_20260407.md +0 -190
  54. package/docs/case-studies/OpenClaw_Hermes_Status_Report_20260407.md +0 -142
  55. package/docs/case-studies/OpenClaw_Stability_Fix_Summary.md +0 -209
  56. package/docs/case-studies/README.md +0 -40
  57. package/docs/case-studies/crash_observations.md +0 -250
  58. package/docs/nat-hole-punch-spike.md +0 -399
  59. package/docs/project-intent-testing.md +0 -97
  60. package/scripts/render-ogp-overview-video.mjs +0 -454
  61. package/scripts/test-migration-execute.js +0 -74
  62. package/scripts/test-migration.js +0 -42
  63. package/scripts/test-project-intents.mjs +0 -614
@@ -1,142 +0,0 @@
1
- # OpenClaw & Hermes — Status Report
2
- **Date:** April 7, 2026
3
-
4
- > **Note:** This document is a sanitized version of internal status reporting. System-specific paths, PIDs, and operational details have been generalized. Created during OGP development to compare gateway stability.
5
-
6
- ---
7
-
8
- ## Executive Summary
9
-
10
- Both local AI gateways (OpenClaw, Hermes) were evaluated. OpenClaw crashed multiple times in 24 hours from two distinct bugs. Hermes has been stable. A fact-check of an agent-drafted comparison article revealed it was substantially wrong. Config changes were made to switch OpenClaw's primary model to GPT-5.4 and adjust provider configurations.
11
-
12
- ---
13
-
14
- ## OpenClaw Gateway — Crash Analysis
15
-
16
- ### Crash #1: OOM (April 6, evening)
17
-
18
- **Root cause:** The BrainLift plugin kicked off its nightly run for all 5 agents simultaneously. The agents hit Anthropic's rate limit (429) on Claude Sonnet 4.6. The embedded agent runner retried aggressively with no backoff ceiling and no memory cleanup between attempts. Heap grew to 4.08 GB and the Node.js process aborted with SIGABRT.
19
-
20
- **Contributing factors:**
21
- - All 5 agents scheduled at the same time
22
- - No exponential backoff on 429 retries
23
- - Default V8 heap limit (4 GB) with no `--max-old-space-size` override
24
- - Auth error mixed in (API key issue)
25
-
26
- **Evidence:** Logs showed repeated 429s, followed by `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory`
27
-
28
- ### Crash #2: Unhandled Promise Rejection (April 7, afternoon)
29
-
30
- **Root cause:** Bug in `pi-agent-core` — the exec tool's stdout handler fired a callback after the agent run had already ended. The gateway's global unhandled rejection handler treated this as fatal.
31
-
32
- **Stack trace origin:** `Agent.processEvents` in `pi-agent-core/src/agent.ts:533` — "Agent listener invoked outside active run"
33
-
34
- **Trigger:** Agent was editing files via the exec tool when the run completed but the exec process continued emitting stdout.
35
-
36
- ### Crash #3: Same as #2 (April 7, evening)
37
-
38
- Identical stack trace. Same exec lifecycle bug. Reproducible.
39
-
40
- ---
41
-
42
- ## OpenClaw — Config Issues Found
43
-
44
- ### 1. LaunchAgent Environment Variables Don't Work
45
-
46
- The LaunchAgent plist uses `$(security find-generic-password ...)` shell expansion syntax for API keys. **This doesn't work in launchd plists** — plist values are literal strings, not shell-evaluated. Keychain-derived env vars are empty when launched via launchd.
47
-
48
- **Impact:** Gateway starts without API keys → auth failures → retry loops → OOM.
49
-
50
- **Fix needed:** Either use a wrapper shell script in the plist that resolves keys before exec'ing the gateway, or store keys directly in the plist (less secure).
51
-
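To see why a `$(...)` value arrives unexpanded, it can help to compare running a command with and without a shell. The following is a minimal Python sketch, not launchd itself: it assumes launchd hands each plist string to the process literally, much as `subprocess.run` does with an argument list and no shell. The `echo hi` command is a stand-in for the elided `security find-generic-password ...` call, and the helper names are hypothetical.

```python
import subprocess
import sys

# Child program that simply prints its first argument back.
ECHO_ARG = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

def run_without_shell(value: str) -> str:
    """Pass `value` as a literal argument, with no shell involved."""
    result = subprocess.run(ECHO_ARG + [value], capture_output=True, text=True, check=True)
    return result.stdout.strip()

def run_with_shell(value: str) -> str:
    """Let /bin/sh evaluate `value` first, as a wrapper script would."""
    result = subprocess.run(f"echo {value}", shell=True, capture_output=True, text=True, check=True)
    return result.stdout.strip()

literal = run_without_shell("$(echo hi)")   # no shell: the text is passed through unchanged
expanded = run_with_shell("$(echo hi)")     # shell: the command substitution is evaluated
```

Under these assumptions, only the shell-evaluated path produces a usable value, which is why a wrapper script is one way to resolve keys before the gateway starts.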
52
- ### 2. Kimi Provider Configuration Issue
53
-
54
- The Kimi direct provider was referencing an API key that wasn't properly configured. The gateway's secret resolver treated this as a hard failure.
55
-
56
- **Current workaround:** Disabled the kimi plugin, removed kimi auth profile. The Fireworks-routed Kimi K2.5 still works via FIREWORKS_API_KEY.
57
-
58
- ### 3. Skills Loading Issues
59
-
60
- On every startup, many skills log `"Skipping skill path that resolves outside its configured root."` These are likely symlinks or relative path references. A significant portion of the skill set is silently not loading.
61
-
62
- **Impact:** Agent capabilities reduced without any user-visible error.
63
-
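The log message suggests that each skill path is resolved and rejected when its real location falls outside the configured root. How OpenClaw implements that check is not shown here, so the following is only a sketch under that assumption, with hypothetical function names. It could be used to list which entries in a skills directory would be skipped.

```python
import os

def resolves_outside_root(skill_path: str, root: str) -> bool:
    """Return True if skill_path, after following symlinks, is not under root."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(skill_path)
    # commonpath equals the root only when the resolved path sits inside it.
    return os.path.commonpath([real_root, real_path]) != real_root

def find_skipped(root: str) -> list[str]:
    """List direct children of root whose real location is outside root."""
    return sorted(
        name
        for name in os.listdir(root)
        if resolves_outside_root(os.path.join(root, name), root)
    )
```

Running `find_skipped` on the skills directory would print the symlinked entries that point elsewhere, which is one way to confirm or rule out the symlink hypothesis.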
64
- ### 4. BrainLift Double-Fires
65
-
66
- The BrainLift plugin logged two `"starting nightly run"` entries within seconds of each other — running the full 5-agent sweep twice. This doubles API usage and compounds the rate limit problem.
67
-
68
- ---
69
-
70
- ## Changes Made This Session
71
-
72
- | Change | File | Detail |
73
- |--------|------|--------|
74
- | Primary model → GPT-5.4 | `openclaw.json` | Was `anthropic/claude-sonnet-4-6` |
75
- | Fallback chain updated | `openclaw.json` | Multiple fallback providers configured |
76
- | `openai/gpt-5.4` added to models | `openclaw.json` | New model entry with Responses API |
77
- | Kimi plugin disabled | `openclaw.json` | `plugins.entries.kimi.enabled: false` |
78
- | Kimi auth profile removed | `openclaw.json` | Removed kimi auth profile |
79
- | Gateway started with 8GB heap | Manual launch | `--max-old-space-size=8192` |
80
- | Logs truncated | `logs/` | Log rotation applied |
81
-
82
- ---
83
-
84
- ## Hermes Gateway — Status
85
-
86
- Hermes has been stable throughout. Running Python 3.11, `hermes gateway run --replace`. Port responding (403 on unauthenticated requests, expected). Low resource usage.
87
-
88
- OGP bridge process also running.
89
-
90
- No crashes, no issues.
91
-
92
- ---
93
-
94
- ## Article Fact-Check: "Hermes vs OpenClaw"
95
-
96
- An agent-drafted article was **substantially wrong**. Its central thesis — "OpenClaw is desktop-first, Hermes is cloud-native" — is fabricated. Both are local daemons running on the same machine.
97
-
98
- **Key errors corrected:**
99
- - Hermes is NOT cloud-hosted (it's a local Python process)
100
- - Hermes storage is NOT cloud-backed (it's local SQLite + markdown)
101
- - Hermes skills are NOT synced cloud storage (local filesystem)
102
- - Hermes does NOT have built-in public endpoints (needs tunnels like OpenClaw)
103
- - "Turn off your phone and federation continues" is false (machine off = Hermes off)
104
-
105
- **Corrected article delivered** with fact-checked claims against live configs and running processes.
106
-
107
- ---
108
-
109
- ## Recommended Actions
110
-
111
- ### Immediate (Stability)
112
-
113
- 1. **Fix the exec lifecycle crash** — File issue against `pi-agent-core`. The unhandled rejection in `Agent.processEvents` when exec stdout fires after run completion is a repeatable crasher. Until fixed, the gateway will keep dying.
114
-
115
- 2. **Fix LaunchAgent env vars** — Replace the `$(...)` plist values with a wrapper script:
116
- ```bash
117
- #!/bin/bash
118
- export ANTHROPIC_API_KEY=$(security find-generic-password ...)
119
- # ... other keys ...
120
- exec <node-path> --max-old-space-size=8192 \
121
- <openclaw-path> gateway --port <port>
122
- ```
123
- Point the plist's ProgramArguments at this script instead of node directly.
124
-
125
- 3. **Add `--max-old-space-size=8192`** to the LaunchAgent permanently (via the wrapper script above).
126
-
127
- ### Short-term (Reliability)
128
-
129
- 4. **Stagger BrainLift agent runs** — Don't fire all agents at the same cron tick. Space them apart to avoid rate limit contention.
130
-
131
- 5. **Investigate the skipped skills issue** — Check for broken symlinks or path traversal in skills directory. These represent a significant portion of the skill set not loading.
132
-
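Staggering can be as simple as giving each agent its own start offset inside a scheduling window. The sketch below assumes the 5 agents mentioned in the report and a one-hour window, which is an illustrative choice rather than a documented BrainLift setting. The function name is hypothetical.

```python
def staggered_offsets(agent_count: int, window_seconds: int) -> list[int]:
    """Spread agent start times evenly across a window, starting at offset 0."""
    if agent_count <= 0:
        return []
    step = window_seconds // agent_count
    return [i * step for i in range(agent_count)]

# Five agents over one hour start 12 minutes apart.
offsets = staggered_offsets(5, 3600)  # [0, 720, 1440, 2160, 2880]
```

Each offset would then be added to the nightly start time, so no two agents hit the provider in the same tick.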
133
- ### Medium-term (Resilience)
134
-
135
- 6. **Request backoff/retry ceiling in embedded agent runner** — The 429 retry loop with no backoff is the #1 contributor to OOM crashes. Needs exponential backoff + max retry count + memory cleanup between attempts.
136
-
137
- 7. **Add process supervision** — Current state: launchd throttles after crash, manual nohup doesn't survive reboot. Consider a wrapper that catches SIGABRT and restarts with a cooldown.
138
-
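The backoff request in item 6 can be sketched as follows. This is a generic capped exponential backoff with jitter and a retry ceiling, written in Python for illustration; it is not the embedded agent runner's code, and the names `call_with_backoff`, `backoff_delay`, and `RetryLimitExceeded` are hypothetical. Memory cleanup between attempts is outside the scope of the sketch.

```python
import random
import time

class RetryLimitExceeded(Exception):
    """Raised when every allowed attempt has failed."""

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential delay for a zero-based attempt number, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))

def call_with_backoff(fn, is_retryable, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(); on a retryable error, wait with capped, jittered backoff and retry."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # e.g. an auth error: retrying will not help
            last_error = exc
        if attempt < max_attempts - 1:
            # Full jitter: sleep a random amount up to the capped delay.
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
    raise RetryLimitExceeded(f"gave up after {max_attempts} attempts") from last_error
```

A caller would pass an `is_retryable` predicate that returns true for 429 responses and false for 401s, so that auth failures surface immediately instead of feeding the retry loop.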
139
- ---
140
-
141
- **Document Created:** April 7, 2026
142
- **Sanitized for Publication:** April 8, 2026
@@ -1,209 +0,0 @@
1
- # OpenClaw Stability Fix Summary
2
- **Date:** April 7, 2026
3
- **Status:** RESOLVED - Mitigations Implemented
4
-
5
- > **Note:** This document is a sanitized version of internal debugging notes. File paths, process IDs, and system-specific details have been generalized. The original was created during OGP development to debug an unrelated OpenClaw regression.
6
-
7
- ---
8
-
9
- ## Problem Summary
10
-
11
- OpenClaw gateway (v2026.4.5) was crashing every 10-60 minutes with two distinct failure modes:
12
-
13
- 1. **Exec Lifecycle Bug** - "Agent listener invoked outside active run" error
14
- 2. **Browser Automation OOM** - V8 heap exhaustion from heavy browser use
15
-
16
- ---
17
-
18
- ## Root Cause Analysis
19
-
20
- ### Bug #1: Exec Lifecycle Crash (CRITICAL)
21
-
22
- **Status:** **KNOWN BUG in OpenClaw 2026.4.5** - Regression from 2026.4.2
23
-
24
- **Error:** `Unhandled promise rejection: Error: Agent listener invoked outside active run`
25
-
26
- **GitHub Issues:**
27
- - [#62137](https://github.com/openclaw/openclaw/issues/62137) - Exec/PTY unhandled promise rejection
28
- - [#61592](https://github.com/openclaw/openclaw/issues/61592) - Background exec process crashes
29
- - [#61812](https://github.com/openclaw/openclaw/issues/61812) - Regression in 2026.4.5
30
- - [#61733](https://github.com/openclaw/openclaw/issues/61733) - Windows crashes with same error
31
-
32
- **Technical Details:**
33
- When a background exec process emits stdout after the agent run has completed, the gateway crashes instead of safely ignoring or buffering the output. The `pi-agent-core` library's `Agent.processEvents` method throws when called outside an active run context.
34
-
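The "safely ignoring" behavior described above can be illustrated with a small guard. This is a hypothetical Python stand-in, not the `pi-agent-core` implementation: the `Run` class and its methods are invented for the sketch, and the only point is that a late stdout callback is counted and dropped rather than raised.

```python
class Run:
    """Hypothetical stand-in for an agent run that receives exec output events."""

    def __init__(self) -> None:
        self.active = True
        self.events: list[str] = []
        self.dropped = 0

    def finish(self) -> None:
        """Mark the run as complete; later output is no longer accepted."""
        self.active = False

    def on_stdout(self, chunk: str) -> None:
        # Output that arrives after the run ended is counted and dropped
        # instead of raising, so a late callback cannot crash the process.
        if not self.active:
            self.dropped += 1
            return
        self.events.append(chunk)
```

Keeping a `dropped` counter preserves some visibility into how often late output occurs, which a silent discard would hide.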
35
- **Trigger Scenarios:**
36
- - File operations
37
- - Long-running exec processes
38
- - Bash tools calling `openclaw message send`
39
- - Cron jobs spawning exec sessions
40
-
41
- **Impact:** Gateway crashes every 10-60 minutes during normal operation
42
-
43
- ### Bug #2: Browser Automation OOM
44
-
45
- **Error:** `FATAL ERROR: v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath Allocation failed - JavaScript heap out of memory`
46
-
47
- **Root Cause:** Heavy browser automation creates large serialized objects that overflow the default V8 heap limit (4GB)
48
-
49
- **Impact:** Gateway crashes after extended browser automation sessions (2-4 hours)
50
-
51
- ### Bug #3: Cron Job API Key Failures (FIXED)
52
-
53
- **Status:** RESOLVED by disabling cron jobs
54
-
55
- **Error:** `401 Incorrect API key provided` (environment variable expansion failing)
56
-
57
- **Root Cause:** Environment variable evaluation failing in LaunchAgent context when cron jobs execute, triggering cascading model fallback failures and eventual OOM
58
-
59
- **Fix:** Disabled all cron jobs + BrainLift plugin
60
-
61
- ---
62
-
63
- ## Solutions Implemented
64
-
65
- ### ✅ Solution #1: Wrapper Script with 8GB Heap Limit
66
-
67
- **File:** `$HOME/.openclaw/bin/gateway-wrapper.sh`
68
-
69
- **What it does:**
70
- - Sets all required environment variables explicitly
71
- - Launches gateway with `--max-old-space-size=8192` (8GB heap limit)
72
- - Provides logging for debugging
73
-
74
- **LaunchAgent Integration:**
75
- Updated LaunchAgent plist to use wrapper instead of calling node directly
76
-
77
- **Benefits:**
78
- - Doubles heap limit to prevent browser OOM crashes
79
- - Ensures env vars are always set correctly
80
- - Survives OpenClaw updates (wrapper script is outside node_modules)
81
-
82
- ### ✅ Solution #2: Disabled All Cron Jobs
83
-
84
- **Files Modified:**
85
- - `$HOME/.openclaw/openclaw.json` - BrainLift plugin disabled
86
- - `$HOME/.openclaw/cron/jobs.json` - All cron jobs disabled
87
-
88
- **Impact:**
89
- - Eliminates cron-triggered API key evaluation failures
90
- - Prevents BrainLift OOM crashes from simultaneous agent runs
91
- - Stops scheduled jobs that were triggering crashes
92
-
93
- ### ✅ Solution #3: API Keys in Config File
94
-
95
- **Status:** Already fixed via config modification
96
-
97
- **What happened:** The `env` section of the OpenClaw config was modified to contain the API keys directly instead of relying on shell command expansion
98
-
99
- **Effect:** Environment variables now always available, preventing auth cascades
100
-
101
- ---
102
-
103
- ## Remaining Issues
104
-
105
- ### ⚠️ Exec Lifecycle Bug - NOT FIXED, MITIGATED
106
-
107
- **Status:** Waiting for OpenClaw developers to fix in pi-agent-core
108
-
109
- **Mitigation:** Gateway will still crash when exec lifecycle bug triggers, but LaunchAgent will auto-restart it
110
-
111
- **Upstream Fix Options:**
112
- 1. Wait for OpenClaw team to release patch
113
- 2. Roll back to 2026.4.2 (workaround mentioned in GitHub issues)
114
- 3. Avoid file operations that trigger long-running exec processes
115
-
116
- **Recommended Action:** Monitor for OpenClaw 2026.4.6 or later that fixes these issues
117
-
118
- ---
119
-
120
- ## OGP Correlation
121
-
122
- **Conclusion:** OGP work is **NOT** the cause of crashes
123
-
124
- **Evidence:**
125
- - Both bugs are known OpenClaw 2026.4.5 regressions affecting all users
126
- - Crashes occur with zero OGP activity
127
- - GitHub issues filed by users not using OGP
128
- - Dual-assistant setup (OpenClaw + Hermes) may have exposed bugs faster due to higher load, but didn't create them
129
-
130
- ---
131
-
132
- ## Current Status
133
-
134
- **Gateway:** ✅ Running
135
- **Heap Limit:** ✅ 8GB (doubled from default 4GB)
136
- **Cron Jobs:** ✅ Disabled
137
- **BrainLift:** ✅ Disabled
138
- **Wrapper Script:** ✅ Active via LaunchAgent
139
-
140
- **Expected Stability:**
141
- - ✅ No more cron-triggered crashes
142
- - ✅ No more browser OOM crashes (unless >8GB heap usage)
143
- - ⚠️ Exec lifecycle bug may still cause occasional crashes (auto-restart enabled)
144
-
145
- ---
146
-
147
- ## Testing & Monitoring
148
-
149
- **To verify stability:**
150
-
151
- ```bash
152
- # Check gateway status
153
- launchctl list | grep openclaw
154
- lsof -i :<gateway-port>
155
-
156
- # Monitor for crashes
157
- tail -f ~/.openclaw/logs/gateway.err.log | grep -E "unhandled|crash|FATAL"
158
-
159
- # Check uptime
160
- ps aux | grep '[o]penclaw-gateway'  # bracket pattern keeps grep from matching itself
161
- ```
162
-
163
- **Success Metrics:**
164
- - Gateway uptime > 24 hours without manual restart
165
- - No API key evaluation errors in logs
166
- - No OOM crashes during browser automation
167
-
168
- ---
169
-
170
- ## Rollback Instructions
171
-
172
- If issues persist, roll back as follows:
173
-
174
- ```bash
175
- # Restore original LaunchAgent (assumes exactly one backup file matches the glob)
176
- cp $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist.backup-* \
177
- $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
178
-
179
- # Restore cron jobs (assumes exactly one backup file matches the glob)
180
- cp $HOME/.openclaw/cron/jobs.json.backup-* \
181
- $HOME/.openclaw/cron/jobs.json
182
-
183
- # Re-enable BrainLift in openclaw.json
184
- # (manually change "enabled": false to true)
185
-
186
- # Reload LaunchAgent
187
- launchctl unload $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
188
- launchctl load $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
189
- ```
190
-
191
- Or consider rolling back OpenClaw to 2026.4.2:
192
- ```bash
193
- npm install -g openclaw@2026.4.2
194
- # Note: May require removing plugins.entries.memory-core.config.dreaming from config
195
- ```
196
-
197
- ---
198
-
199
- ## Next Steps
200
-
201
- 1. ✅ Monitor gateway stability for 24-48 hours
202
- 2. ⏸️ Wait for OpenClaw 2026.4.6+ release with exec lifecycle fix
203
- 3. 🔍 Investigate skipped skills issue (low priority)
204
-
205
- ---
206
-
207
- **Document Created:** April 7, 2026
208
- **Last Updated:** April 7, 2026
209
- **Sanitized for Publication:** April 8, 2026
@@ -1,40 +0,0 @@
1
- # OGP Development Case Studies
2
-
3
- This directory contains sanitized debugging notes from real-world OGP development and deployment challenges. These documents capture the messy reality of building federated AI systems — including the false starts, red herrings, and lessons learned.
4
-
5
- > **⚠️ Note:** These files are sanitized versions of internal debugging notes. System-specific details (file paths, PIDs, API key fragments, port numbers) have been removed or generalized to protect operational security while preserving the technical narrative.
6
-
7
- ---
8
-
9
- ## Contents
10
-
11
- | File | Description |
12
- |------|-------------|
13
- | `OpenClaw_Stability_Fix_Summary.md` | Comprehensive analysis of OpenClaw 2026.4.5 regression bugs encountered during OGP development. Includes root cause analysis, mitigations, and wrapper script implementation. |
14
- | `CRASH_RESOLUTION_20260407.md` | Quick reference guide for the same stability issues — condensed version for immediate action. |
15
- | `crash_observations.md` | Raw timeline and observations from the debugging session. Shows the iterative process of elimination that ultimately cleared OGP of suspicion. |
16
- | `OpenClaw_Hermes_Status_Report_20260407.md` | Comparative analysis of OpenClaw vs. Hermes gateway stability during federation testing. Includes fact-check of an AI-drafted article that was substantially wrong. |
17
-
18
- ---
19
-
20
- ## Context
21
-
22
- These documents were created on April 7, 2026, during intensive OGP federation testing. The initial hypothesis was that OGP's dual-assistant setup (OpenClaw + Hermes) was causing gateway instability. **The reality:** OpenClaw 2026.4.5 had known regression bugs affecting all users.
23
-
24
- **Key Lesson:** When debugging complex systems, correlation is not causation. The OGP work exposed OpenClaw bugs faster due to higher load, but didn't create them.
25
-
26
- ---
27
-
28
- ## Related Article
29
-
30
- The debugging narrative behind these files is documented in:
31
-
32
- **"[Case Study] When Your AI Tools Keep Crashing: A Meta-Debugging Loop with OpenClaw and Claude"**
33
-
34
- This Substack article tells the story of using Claude (via Dispatch) to diagnose OpenClaw crashes while OpenClaw was down, then using OpenClaw/Claude Code to fix OGP bugs, then back to Claude when OpenClaw crashed again — a meta-loop that became the only way forward.
35
-
36
- ---
37
-
38
- **Why These Are Here:**
39
-
40
- The article promised these files would be "available in dp-pcs/ogp." Rather than leave them as unverified claims, we're publishing the sanitized source material. Real debugging is messy. Real systems fail in unexpected ways. Federation requires resilience not just in protocol design, but in the development process itself.
@@ -1,250 +0,0 @@
1
- # OpenClaw Crash Observations - April 7, 2026
2
-
3
- > **Note:** This document is a sanitized version of internal debugging notes. API key fragments, file paths, and system-specific details have been removed or generalized. Created during OGP development to document OpenClaw regression analysis.
4
-
5
- ---
6
-
7
- ## Timeline Summary
8
-
9
- The user reported that OpenClaw had been crashing continuously for the previous 24 hours, after working without issues before that. Multiple agents were affected, and the gateway itself crashed repeatedly.
10
-
11
- ---
12
-
13
- ## Issues Identified & Fixed
14
-
15
- ### 1. Cascading Authentication Failures (8:00 AM)
16
-
17
- **Symptoms:**
18
- - Agent failing with all configured model providers
19
- - Error sequence: Multiple providers failing in sequence
20
- - "All models failed" errors in logs
21
-
22
- **Root Causes:**
23
- - Kimi API: HTTP 401 - Invalid/expired API key
24
- - Anthropic API: HTTP 401 - Invalid API key (rate limits also hit)
25
- - OpenAI API: HTTP 401 - Malformed API key (env var expansion failing)
26
-
27
- **Analysis:**
28
- The `openclaw doctor` command appeared to have modified the configuration file, simplifying the env section and causing environment variable expansion to fail. The OpenAI provider had a misconfigured API key reference.
29
-
30
- **Fix Applied:**
31
- - Restored proper API key references in env section
32
- - Changed OpenAI provider to use environment variable references
33
-
34
- ### 2. Model Configuration Corruption (Multiple Occurrences)
35
-
36
- **Symptoms:**
37
- - Agent configuration simplified to only primary model, no fallbacks
38
- - Default model referencing non-existent model ID
39
- - Invalid model IDs causing 404 errors
40
-
41
- **Root Cause:**
42
- Configuration file was being modified (likely by `openclaw doctor` command or auto-formatting) which:
43
- - Removed fallback models from agent configurations
44
- - Simplified environment variable definitions
45
- - Changed model references
46
-
47
- **Example Configuration Issue:**
48
- ```json
49
- // Broken (no fallbacks)
50
- "model": {
51
- "primary": "anthropic/claude-sonnet-4-6"
52
- }
53
-
54
- // Restored (with fallbacks)
55
- "model": {
56
- "primary": "openai/gpt-5.4",
57
- "fallbacks": [
58
- "anthropic/claude-sonnet-4-6",
59
- "openai/gpt-4o",
60
- "fireworks/accounts/fireworks/models/kimi-k2p5"
61
- ]
62
- }
63
- ```
64
-
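The difference between the two configurations shows up in how many models are tried before the agent gives up. The following Python sketch assumes a simple "primary first, then fallbacks in order" policy. OpenClaw's actual selection logic is not shown in these notes, and `ordered_models`, `first_success`, and `AllModelsFailed` are hypothetical names.

```python
class AllModelsFailed(Exception):
    """Raised when the primary model and every fallback fail."""

def ordered_models(model_config: dict) -> list[str]:
    """Primary first, then fallbacks in their configured order."""
    return [model_config["primary"], *model_config.get("fallbacks", [])]

def first_success(model_config: dict, call):
    """Try each model in order; return (model, result) for the first that works."""
    errors = {}
    for model in ordered_models(model_config):
        try:
            return model, call(model)
        except Exception as exc:  # a real client would catch narrower errors
            errors[model] = exc
    raise AllModelsFailed(errors)
```

With the broken configuration, `ordered_models` yields a single entry, so one provider failure is enough to produce an "All models failed" style error.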
65
- ### 3. OpenAI API 404 Errors (9:00-9:50 AM)
66
-
67
- **Symptoms:**
68
- - Continuous 404 errors on OpenAI models
69
- - All OpenAI models failing despite being available via API
70
-
71
- **Root Cause:**
72
- GPT-5.4 and newer models require the **Responses API** endpoint (`/v1/responses`) instead of the Chat Completions API endpoint (`/v1/chat/completions`). OpenClaw was using the wrong endpoint.
73
-
74
- **Evidence:**
75
- - Manual API query confirmed models exist: `gpt-5.4`, `gpt-5.4-2026-03-05`, `gpt-4o` all available
76
- - OpenAI documentation confirmed GPT-5.4 requires Responses API
77
- - Error pattern: 404 with no body = wrong endpoint
78
-
79
- **Fix Applied:**
80
- Added `"api": "openai-responses"` to OpenAI provider configuration:
81
- ```json
82
- "openai": {
83
- "baseUrl": "https://api.openai.com/v1",
84
- "apiKey": "${PERSONAL_OPENAI_API_KEY}",
85
- "api": "openai-responses",
86
- "models": [...]
87
- }
88
- ```
89
-
90
- **References:**
91
- - GitHub Issue: openclaw/openclaw#38706 - "GPT-5.4 via openai-codex OAuth uses wrong API"
92
- - OpenAI Docs: Responses API required for GPT-5.4+
93
-
94
- ### 4. Gateway Crash - "Agent listener invoked outside active run" (12:08 PM)
95
-
96
- **Symptoms:**
97
- - Gateway completely unreachable
98
- - LaunchAgent exit status: -1
99
- - Error: `Unhandled promise rejection: Error: Agent listener invoked outside active run`
100
-
101
- **Stack Trace:**
102
- ```
103
- at Agent.processEvents (pi-agent-core/src/agent.ts:533:10)
104
- at emitUpdate (exec-defaults-*.js:1524:8)
105
- at handleStdout (exec-defaults-*.js:1546:4)
106
- ```
107
-
108
- **Context:**
109
- Crash occurred during OGP federation testing operations. Preceding log entries show:
110
- - Edit operations failing
111
- - Read operations failing for config files
112
- - Multiple edit retry attempts
113
-
114
- **Hypothesis:**
115
- The crash may be related to:
116
- 1. OGP operations triggering edge cases in the exec/agent framework
117
- 2. File operations failing and causing state inconsistencies
118
- 3. Agent event processing happening outside the expected execution context
119
-
120
- **Fix Applied:**
121
- Gateway restart resolved the immediate issue, but underlying cause remained unclear at the time.
122
-
123
- ---
124
-
125
- ## Configuration Stability Concerns
126
-
127
- ### Observed Pattern:
128
- 1. Manual configuration changes applied
129
- 2. Gateway restart
130
- 3. Configuration file modified by unknown process
131
- 4. Settings reverted or simplified
132
- 5. Agents fail again
133
-
134
- ### Suspected Culprits:
135
- - `openclaw doctor` command
136
- - Auto-formatting/validation on config reload
137
- - Hot reload mechanism modifying config
138
-
139
- ---
140
-
141
- ## Gateway Stability Observations
142
-
143
- ### Crash Frequency:
144
- - Multiple gateway restarts required during debugging session
145
- - Many restarts over 4-hour period
146
- - One complete crash requiring manual intervention
147
-
148
- ### Memory/CPU Usage:
149
- - Gateway process consistently using high CPU during startup
150
- - Process ID changing frequently
151
-
152
- ### LaunchAgent Behavior:
153
- - LaunchAgent showing status `-1` during crashes
154
- - Sometimes showing status `0` but process not actually running
155
- - Restart command occasionally reports "stale process" and force-kills
156
-
157
- ---
158
-
159
- ## OGP-Related Observations
160
-
161
- ### Timing Correlation:
162
- User mentioned doing OGP work and the timeline suggests:
163
- - OpenClaw was stable before OGP work
164
- - Issues began within last 24 hours
165
- - Gateway crash occurred during OGP federation operations
166
-
167
- ### OGP Operations Observed in Logs:
168
- - Federation requests to Clawporate gateway
169
- - Agent-to-agent communication attempts
170
- - File operations on OGP-related files
171
- - Attempts to read OGP config (file not found)
172
-
173
- ### Potential OGP-Related Issues:
174
- 1. **File Operation Failures**: Multiple edit/read failures on OGP-related files
175
- 2. **Agent Event Processing**: Crash occurred during stdout handling from supervised process
176
- 3. **Missing Config Files**: OGP config expected but not found in multiple locations
177
-
178
- ---
179
-
180
- ## Mitigation Steps Applied - 5:56 PM
181
-
182
- ### Changes Made to Test Crash Prevention:
183
-
184
- **1. Disabled Heartbeat Tasks**
185
- - Removed `heartbeat` configurations from agent defaults
186
- - **Hypothesis**: Hourly heartbeats triggering cron jobs that hit API failures and crashed the gateway
187
-
188
- **2. Replaced Keychain Lookups with Direct API Keys**
189
- - Changed from: `$(security find-generic-password ...)`
190
- - Changed to: Direct environment variable references
191
- - **Reason**: Keychain lookups repeatedly failing with env var expansion errors
192
-
193
- **3. BrainLift Plugin**
194
- - Already disabled (enabled: false)
195
- - No changes needed
196
-
197
- ---
198
-
199
- ## Current Working Configuration
200
-
201
- ### Models:
202
- - **Primary**: openai/gpt-5.4 (via Responses API)
203
- - **Fallbacks**: anthropic/claude-sonnet-4-6, openai/gpt-4o, fireworks/kimi-k2p5
204
-
205
- ### API Keys (via environment variables):
206
- - ANTHROPIC_API_KEY: Working
207
- - OPENAI_API_KEY: Working (via Responses API)
208
- - FIREWORKS_API_KEY: Working
209
-
210
- ### Critical Config Settings:
211
- ```json
212
- {
213
- "models": {
214
- "providers": {
215
- "openai": {
216
- "api": "openai-responses",
217
- "baseUrl": "https://api.openai.com/v1",
218
- "apiKey": "${PERSONAL_OPENAI_API_KEY}"
219
- }
220
- }
221
- }
222
- }
223
- ```
224
-
225
- ---
226
-
227
- ## Open Questions
228
-
229
- 1. **What triggers config file modifications?** Is it automatic or user-initiated?
230
- 2. **Is the OGP plugin causing instability?** The timing correlation suggests a possible connection
231
- 3. **Why are keychain lookups failing intermittently?** Sometimes work, sometimes fail
232
- 4. **What is the expected behavior for "Agent listener invoked outside active run"?** Is this a known edge case?
233
-
234
- ---
235
-
236
- ## Files Modified During Session
237
-
238
- - `$HOME/.openclaw/openclaw.json` (multiple times)
239
- - API key configurations
240
- - Model provider settings
241
- - Agent model configurations
242
- - Environment variables
243
-
244
- ---
245
-
246
- **Session Date**: April 7, 2026
247
- **OpenClaw Version**: 2026.4.5
248
- **Total Crashes**: Multiple
249
- **Average Uptime Between Crashes**: 10-20 minutes
250
- **Sanitized for Publication**: April 8, 2026