@dp-pcs/ogp 0.3.2 → 0.4.2
- package/README.md +275 -49
- package/dist/cli/completion.d.ts +5 -0
- package/dist/cli/completion.d.ts.map +1 -0
- package/dist/cli/completion.js +148 -0
- package/dist/cli/completion.js.map +1 -0
- package/dist/cli/config.d.ts +3 -0
- package/dist/cli/config.d.ts.map +1 -0
- package/dist/cli/config.js +207 -0
- package/dist/cli/config.js.map +1 -0
- package/dist/cli/expose.d.ts.map +1 -1
- package/dist/cli/expose.js +20 -13
- package/dist/cli/expose.js.map +1 -1
- package/dist/cli/federation.d.ts.map +1 -1
- package/dist/cli/federation.js +252 -9
- package/dist/cli/federation.js.map +1 -1
- package/dist/cli/setup.d.ts +19 -0
- package/dist/cli/setup.d.ts.map +1 -1
- package/dist/cli/setup.js +507 -32
- package/dist/cli/setup.js.map +1 -1
- package/dist/cli.js +348 -32
- package/dist/cli.js.map +1 -1
- package/dist/daemon/agent-comms.d.ts.map +1 -1
- package/dist/daemon/agent-comms.js +14 -9
- package/dist/daemon/agent-comms.js.map +1 -1
- package/dist/daemon/intent-registry.d.ts.map +1 -1
- package/dist/daemon/intent-registry.js +7 -4
- package/dist/daemon/intent-registry.js.map +1 -1
- package/dist/daemon/keypair.d.ts.map +1 -1
- package/dist/daemon/keypair.js +34 -13
- package/dist/daemon/keypair.js.map +1 -1
- package/dist/daemon/message-handler.d.ts.map +1 -1
- package/dist/daemon/message-handler.js +7 -0
- package/dist/daemon/message-handler.js.map +1 -1
- package/dist/daemon/notify.d.ts +19 -0
- package/dist/daemon/notify.d.ts.map +1 -1
- package/dist/daemon/notify.js +329 -73
- package/dist/daemon/notify.js.map +1 -1
- package/dist/daemon/openclaw-bridge.d.ts +34 -0
- package/dist/daemon/openclaw-bridge.d.ts.map +1 -0
- package/dist/daemon/openclaw-bridge.js +261 -0
- package/dist/daemon/openclaw-bridge.js.map +1 -0
- package/dist/daemon/peers.d.ts +8 -0
- package/dist/daemon/peers.d.ts.map +1 -1
- package/dist/daemon/peers.js +48 -14
- package/dist/daemon/peers.js.map +1 -1
- package/dist/daemon/projects.d.ts.map +1 -1
- package/dist/daemon/projects.js +7 -4
- package/dist/daemon/projects.js.map +1 -1
- package/dist/daemon/server.d.ts +16 -0
- package/dist/daemon/server.d.ts.map +1 -1
- package/dist/daemon/server.js +147 -46
- package/dist/daemon/server.js.map +1 -1
- package/dist/shared/config.d.ts +52 -1
- package/dist/shared/config.d.ts.map +1 -1
- package/dist/shared/config.js +18 -11
- package/dist/shared/config.js.map +1 -1
- package/dist/shared/framework-detection.d.ts +31 -0
- package/dist/shared/framework-detection.d.ts.map +1 -0
- package/dist/shared/framework-detection.js +91 -0
- package/dist/shared/framework-detection.js.map +1 -0
- package/dist/shared/help.d.ts +5 -0
- package/dist/shared/help.d.ts.map +1 -0
- package/dist/shared/help.js +280 -0
- package/dist/shared/help.js.map +1 -0
- package/dist/shared/meta-config.d.ts +44 -0
- package/dist/shared/meta-config.d.ts.map +1 -0
- package/dist/shared/meta-config.js +89 -0
- package/dist/shared/meta-config.js.map +1 -0
- package/dist/shared/migration.d.ts +57 -0
- package/dist/shared/migration.d.ts.map +1 -0
- package/dist/shared/migration.js +255 -0
- package/dist/shared/migration.js.map +1 -0
- package/docs/CLI-REFERENCE.md +1360 -0
- package/docs/GETTING-STARTED.md +942 -0
- package/docs/MIGRATION.md +202 -0
- package/docs/MULTI-FRAMEWORK-DEMO.md +352 -0
- package/docs/MULTI-FRAMEWORK-DESIGN.md +378 -0
- package/docs/MULTI-FRAMEWORK-IMPL.md +197 -0
- package/docs/case-studies/CRASH_RESOLUTION_20260407.md +190 -0
- package/docs/case-studies/OpenClaw_Hermes_Status_Report_20260407.md +142 -0
- package/docs/case-studies/OpenClaw_Stability_Fix_Summary.md +209 -0
- package/docs/case-studies/README.md +40 -0
- package/docs/case-studies/crash_observations.md +250 -0
- package/docs/federation-flow.md +21 -31
- package/docs/hermes-implementation-checklist.md +4 -0
- package/docs/rendezvous.md +13 -14
- package/package.json +9 -3
- package/scripts/completion.bash +123 -0
- package/scripts/completion.zsh +372 -0
- package/scripts/test-migration-execute.js +74 -0
- package/scripts/test-migration.js +42 -0
- package/skills/ogp/SKILL.md +197 -64
- package/skills/ogp-agent-comms/SKILL.md +107 -41
- package/skills/ogp-expose/SKILL.md +84 -21
- package/skills/ogp-project/SKILL.md +66 -58
|
@@ -0,0 +1,190 @@
|
|
|
1
|
+
# OpenClaw Crash Resolution
|
|
2
|
+
**Date:** April 7, 2026
|
|
3
|
+
**Status:** ✅ RESOLVED with mitigations
|
|
4
|
+
|
|
5
|
+
> **Note:** This document is a sanitized version of internal debugging notes. System-specific details have been generalized. Original created during OGP development to document OpenClaw regression debugging.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Quick Summary
|
|
10
|
+
|
|
11
|
+
Your OpenClaw crashes were caused by **two known bugs in version 2026.4.5**, not by your OGP work or dual-assistant setup. Multiple GitHub issues filed by other users in the last 1-2 days confirm this. A third problem, cron job API key failures, compounded the crashes and is covered below as Bug #3.
|
|
12
|
+
|
|
13
|
+
**Fixes implemented:**
|
|
14
|
+
1. ✅ Wrapper script with 8GB heap limit
|
|
15
|
+
2. ✅ All cron jobs disabled
|
|
16
|
+
3. ✅ BrainLift plugin disabled
|
|
17
|
+
4. ✅ Gateway auto-restart enabled
|
|
18
|
+
|
|
19
|
+
**Current Status:** Gateway running stable with mitigations. The exec lifecycle bug may still cause occasional crashes, but the gateway will auto-restart.
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## The Bugs
|
|
24
|
+
|
|
25
|
+
### Bug #1: Exec Lifecycle Crash
|
|
26
|
+
**GitHub Issues:** [#62137](https://github.com/openclaw/openclaw/issues/62137), [#61592](https://github.com/openclaw/openclaw/issues/61592), [#61812](https://github.com/openclaw/openclaw/issues/61812)
|
|
27
|
+
|
|
28
|
+
**Error:** `Unhandled promise rejection: Error: Agent listener invoked outside active run`
|
|
29
|
+
|
|
30
|
+
**Cause:** Regression in 2026.4.5 where background exec process stdout crashes gateway after agent run completes
|
|
31
|
+
|
|
32
|
+
**Platforms Affected:** Linux, Windows, macOS (all platforms)
|
|
33
|
+
|
|
34
|
+
**Your Impact:** Crashed every 10-60 minutes during normal operations
|
|
35
|
+
|
|
36
|
+
**Mitigation:** Wrapper script runs under the LaunchAgent, which auto-restarts the gateway after a crash; upstream fix pending
|
|
37
|
+
|
|
38
|
+
### Bug #2: Browser Automation OOM
|
|
39
|
+
**Error:** `FATAL ERROR: JavaScript heap out of memory`
|
|
40
|
+
|
|
41
|
+
**Cause:** Default 4GB V8 heap limit too small for heavy browser automation
|
|
42
|
+
|
|
43
|
+
**Your Impact:** Crashed after 2+ hours of browser activity
|
|
44
|
+
|
|
45
|
+
**Fix:** Increased heap to 8GB via `--max-old-space-size=8192` flag
|
|
46
|
+
|
|
47
|
+
### Bug #3: Cron Job API Key Failures
|
|
48
|
+
**Error:** `401 Incorrect API key provided` (env var expansion failing)
|
|
49
|
+
|
|
50
|
+
**Cause:** Environment variable evaluation failing when cron jobs execute
|
|
51
|
+
|
|
52
|
+
**Your Impact:** Cron job running every 5 minutes triggering cascading failures
|
|
53
|
+
|
|
54
|
+
**Fix:** Disabled all cron jobs and BrainLift plugin
|
|
55
|
+
|
|
56
|
+
---
|
|
57
|
+
|
|
58
|
+
## What We Did
|
|
59
|
+
|
|
60
|
+
### 1. Created Gateway Wrapper Script
|
|
61
|
+
|
|
62
|
+
**File:** `$HOME/.openclaw/bin/gateway-wrapper.sh`
|
|
63
|
+
|
|
64
|
+
**What it does:**
|
|
65
|
+
```bash
|
|
66
|
+
#!/bin/bash
|
|
67
|
+
# - Sets all environment variables
|
|
68
|
+
# - Launches gateway with 8GB heap limit
|
|
69
|
+
# - Enables auto-restart via LaunchAgent
|
|
70
|
+
exec <node-path>/bin/node --max-old-space-size=8192 \
|
|
71
|
+
<openclaw-path>/dist/index.js gateway --port <port>
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
### 2. Updated LaunchAgent
|
|
75
|
+
|
|
76
|
+
**File:** `$HOME/Library/LaunchAgents/ai.openclaw.gateway.plist`
|
|
77
|
+
|
|
78
|
+
**Change:** Now calls wrapper script instead of node directly
|
|
79
|
+
|
|
80
|
+
### 3. Disabled Cron Jobs
|
|
81
|
+
|
|
82
|
+
**Files:**
|
|
83
|
+
- `$HOME/.openclaw/openclaw.json` - BrainLift disabled
|
|
84
|
+
- `$HOME/.openclaw/cron/jobs.json` - All cron jobs disabled
|
|
85
|
+
|
|
86
|
+
---
|
|
87
|
+
|
|
88
|
+
## OGP Cleared of Suspicion
|
|
89
|
+
|
|
90
|
+
**Verdict:** Your OGP work is NOT causing the crashes.
|
|
91
|
+
|
|
92
|
+
**Evidence:**
|
|
93
|
+
- Same bugs reported by users not using OGP
|
|
94
|
+
- GitHub issues filed 1-2 days ago across all platforms
|
|
95
|
+
- Crashes occur with zero OGP activity
|
|
96
|
+
- Known regressions in OpenClaw 2026.4.5
|
|
97
|
+
|
|
98
|
+
**Your dual-assistant setup (OpenClaw + Hermes) may have exposed the bugs faster due to higher load, but didn't create them.**
|
|
99
|
+
|
|
100
|
+
---
|
|
101
|
+
|
|
102
|
+
## Current Gateway Status
|
|
103
|
+
|
|
104
|
+
- **PID:** [Running]
|
|
105
|
+
- **Port:** ✅ listening
|
|
106
|
+
- **Heap Limit:** 8GB (doubled from 4GB)
|
|
107
|
+
- **Wrapper:** ✅ Active
|
|
108
|
+
- **Cron Jobs:** ✅ Disabled
|
|
109
|
+
- **BrainLift:** ✅ Disabled
|
|
110
|
+
- **LaunchAgent:** ✅ Auto-restart enabled
|
|
111
|
+
|
|
112
|
+
**Uptime:** Started and currently stable
|
|
113
|
+
|
|
114
|
+
---
|
|
115
|
+
|
|
116
|
+
## Expected Behavior
|
|
117
|
+
|
|
118
|
+
**Fixed:**
|
|
119
|
+
- ✅ No more cron-triggered crashes
|
|
120
|
+
- ✅ No more browser OOM crashes (unless you exceed 8GB heap)
|
|
121
|
+
- ✅ Auto-restart on any crash
|
|
122
|
+
|
|
123
|
+
**Still Possible:**
|
|
124
|
+
- ⚠️ Exec lifecycle bug may still crash gateway occasionally
|
|
125
|
+
- When this happens, LaunchAgent will auto-restart within seconds
|
|
126
|
+
|
|
127
|
+
---
|
|
128
|
+
|
|
129
|
+
## Monitoring
|
|
130
|
+
|
|
131
|
+
**Check gateway status:**
|
|
132
|
+
```bash
|
|
133
|
+
launchctl list | grep openclaw
|
|
134
|
+
lsof -i :<port>
|
|
135
|
+
ps aux | grep openclaw-gateway
|
|
136
|
+
```
|
|
137
|
+
|
|
138
|
+
**Watch for crashes:**
|
|
139
|
+
```bash
|
|
140
|
+
tail -f ~/.openclaw/logs/gateway.err.log | grep -iE "unhandled|FATAL"
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
**Success metrics:**
|
|
144
|
+
- Uptime > 24 hours without manual intervention
|
|
145
|
+
- No API key errors in logs
|
|
146
|
+
- Auto-restart working if crashes occur
|
|
147
|
+
|
|
148
|
+
---
|
|
149
|
+
|
|
150
|
+
## Rollback (if needed)
|
|
151
|
+
|
|
152
|
+
```bash
|
|
153
|
+
# Restore original LaunchAgent
|
|
154
|
+
cp $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist.backup-* \
|
|
155
|
+
$HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
|
|
156
|
+
|
|
157
|
+
# Restore cron jobs
|
|
158
|
+
cp $HOME/.openclaw/cron/jobs.json.backup-* \
|
|
159
|
+
$HOME/.openclaw/cron/jobs.json
|
|
160
|
+
|
|
161
|
+
# Reload
|
|
162
|
+
launchctl unload $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
|
|
163
|
+
launchctl load $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
Or downgrade to OpenClaw 2026.4.2:
|
|
167
|
+
```bash
|
|
168
|
+
npm install -g openclaw@2026.4.2
|
|
169
|
+
```
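
The restore commands above copy from a `backup-*` glob. If more than one backup file matches, `cp` receives several sources and refuses to copy them onto a single non-directory target. A minimal sketch of picking only the newest backup follows; the function name `restore_latest_backup` is hypothetical, and it assumes backups follow the `<file>.backup-<suffix>` naming shown above and that modification time reflects backup order.

```bash
#!/bin/bash
# Sketch: restore the most recent backup of a file.
# Assumes backups are named "<file>.backup-<suffix>" and that the
# newest modification time identifies the backup you want.

restore_latest_backup() {
  local target="$1"
  local latest
  # ls -t sorts newest first; head keeps only the first match.
  latest=$(ls -t "${target}".backup-* 2>/dev/null | head -n 1)
  if [ -z "$latest" ]; then
    echo "no backup found for ${target}" >&2
    return 1
  fi
  cp "$latest" "$target"
  echo "restored ${target} from ${latest}"
}
```

For example, `restore_latest_backup "$HOME/.openclaw/cron/jobs.json"` would replace the single `cp` line for the cron jobs file. Parsing `ls` output is acceptable here only because the backup names contain no newlines.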
|
|
170
|
+
|
|
171
|
+
---
|
|
172
|
+
|
|
173
|
+
## Documentation
|
|
174
|
+
|
|
175
|
+
**Full Details:** See `OpenClaw_Stability_Fix_Summary.md`
|
|
176
|
+
**Original Analysis:** See `crash_observations.md`
|
|
177
|
+
**Status Report:** See `OpenClaw_Hermes_Status_Report_20260407.md`
|
|
178
|
+
|
|
179
|
+
**GitHub Issues to Watch:**
|
|
180
|
+
- https://github.com/openclaw/openclaw/issues/62137
|
|
181
|
+
- https://github.com/openclaw/openclaw/issues/61592
|
|
182
|
+
- https://github.com/openclaw/openclaw/issues/61812
|
|
183
|
+
- https://github.com/openclaw/openclaw/issues/61733
|
|
184
|
+
|
|
185
|
+
---
|
|
186
|
+
|
|
187
|
+
**Resolution Date:** April 7, 2026
|
|
188
|
+
**Gateway Status:** ✅ Running with mitigations
|
|
189
|
+
**Next Check:** Monitor for 24 hours
|
|
190
|
+
**Sanitized for Publication:** April 8, 2026
|
|
@@ -0,0 +1,142 @@
|
|
|
1
|
+
# OpenClaw & Hermes — Status Report
|
|
2
|
+
**Date:** April 7, 2026
|
|
3
|
+
|
|
4
|
+
> **Note:** This document is a sanitized version of internal status reporting. System-specific paths, PIDs, and operational details have been generalized. Created during OGP development to compare gateway stability.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## Executive Summary
|
|
9
|
+
|
|
10
|
+
Both local AI gateways (OpenClaw, Hermes) were evaluated. OpenClaw crashed multiple times in 24 hours from two distinct bugs. Hermes has been stable. A fact-check of an agent-drafted comparison article revealed it was substantially wrong. Config changes were made to switch OpenClaw's primary model to GPT-5.4 and adjust provider configurations.
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## OpenClaw Gateway — Crash Analysis
|
|
15
|
+
|
|
16
|
+
### Crash #1: OOM (April 6, evening)
|
|
17
|
+
|
|
18
|
+
**Root cause:** The BrainLift plugin kicked off its nightly run for all 5 agents simultaneously. The agents hit Anthropic's rate limit (429) on Claude Sonnet 4.6. The embedded agent runner retried aggressively with no backoff ceiling and no memory cleanup between attempts. Heap grew to 4.08 GB and Node.js aborted with SIGABRT.
|
|
19
|
+
|
|
20
|
+
**Contributing factors:**
|
|
21
|
+
- All 5 agents scheduled at the same time
|
|
22
|
+
- No exponential backoff on 429 retries
|
|
23
|
+
- Default V8 heap limit (4 GB) with no `--max-old-space-size` override
|
|
24
|
+
- Auth error mixed in (API key issue)
|
|
25
|
+
|
|
26
|
+
**Evidence:** Logs showed repeated 429s, followed by `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory`
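
The missing behavior described above (exponential backoff with a ceiling and a retry limit) can be sketched in shell. This is an illustration of the pattern only, not the embedded agent runner's code, which is not shown in this report; `retry_with_backoff`, `MAX_RETRIES`, `BASE_DELAY`, and `MAX_DELAY` are hypothetical names chosen for the example.

```bash
#!/bin/bash
# Sketch: retry a command with capped exponential backoff.
# MAX_RETRIES, BASE_DELAY, and MAX_DELAY are illustrative knobs.

retry_with_backoff() {
  local max_retries="${MAX_RETRIES:-5}"
  local delay="${BASE_DELAY:-1}"
  local max_delay="${MAX_DELAY:-60}"
  local attempt=1

  while true; do
    "$@" && return 0              # success: stop retrying
    if [ "$attempt" -ge "$max_retries" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))          # double the wait each round...
    if [ "$delay" -gt "$max_delay" ]; then
      delay="$max_delay"          # ...but never past the ceiling
    fi
  done
}
```

With the defaults, a command that keeps failing is tried five times with waits of 1, 2, 4, and 8 seconds, then abandoned instead of looping indefinitely.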
|
|
27
|
+
|
|
28
|
+
### Crash #2: Unhandled Promise Rejection (April 7, afternoon)
|
|
29
|
+
|
|
30
|
+
**Root cause:** Bug in `pi-agent-core` — the exec tool's stdout handler fired a callback after the agent run had already ended. The gateway's global unhandled rejection handler treated this as fatal.
|
|
31
|
+
|
|
32
|
+
**Stack trace origin:** `Agent.processEvents` in `pi-agent-core/src/agent.ts:533` — "Agent listener invoked outside active run"
|
|
33
|
+
|
|
34
|
+
**Trigger:** Agent was editing files via the exec tool when the run completed but the exec process continued emitting stdout.
|
|
35
|
+
|
|
36
|
+
### Crash #3: Same as #2 (April 7, evening)
|
|
37
|
+
|
|
38
|
+
Identical stack trace. Same exec lifecycle bug. Reproducible.
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## OpenClaw — Config Issues Found
|
|
43
|
+
|
|
44
|
+
### 1. LaunchAgent Environment Variables Don't Work
|
|
45
|
+
|
|
46
|
+
The LaunchAgent plist uses `$(security find-generic-password ...)` shell expansion syntax for API keys. **This doesn't work in launchd plists** — plist values are literal strings, not shell-evaluated. Keychain-derived env vars are empty when launched via launchd.
|
|
47
|
+
|
|
48
|
+
**Impact:** Gateway starts without API keys → auth failures → retry loops → OOM.
|
|
49
|
+
|
|
50
|
+
**Fix needed:** Either use a wrapper shell script in the plist that resolves keys before exec'ing the gateway, or store keys directly in the plist (less secure).
|
|
51
|
+
|
|
52
|
+
### 2. Kimi Provider Configuration Issue
|
|
53
|
+
|
|
54
|
+
The Kimi direct provider was referencing an API key that wasn't properly configured. The gateway's secret resolver treated this as a hard failure.
|
|
55
|
+
|
|
56
|
+
**Current workaround:** Disabled the kimi plugin, removed kimi auth profile. The Fireworks-routed Kimi K2.5 still works via FIREWORKS_API_KEY.
|
|
57
|
+
|
|
58
|
+
### 3. Skills Loading Issues
|
|
59
|
+
|
|
60
|
+
On every startup, many skills log `"Skipping skill path that resolves outside its configured root."` These are likely symlinks or relative path references. A significant portion of the skill set is silently not loading.
|
|
61
|
+
|
|
62
|
+
**Impact:** Agent capabilities reduced without any user-visible error.
|
|
63
|
+
|
|
64
|
+
### 4. BrainLift Double-Fires
|
|
65
|
+
|
|
66
|
+
The BrainLift plugin logged two `"starting nightly run"` entries within seconds of each other — running the full 5-agent sweep twice. This doubles API usage and compounds the rate limit problem.
|
|
67
|
+
|
|
68
|
+
---
|
|
69
|
+
|
|
70
|
+
## Changes Made This Session
|
|
71
|
+
|
|
72
|
+
| Change | File | Detail |
|
|
73
|
+
|--------|------|--------|
|
|
74
|
+
| Primary model → GPT-5.4 | `openclaw.json` | Was `anthropic/claude-sonnet-4-6` |
|
|
75
|
+
| Fallback chain updated | `openclaw.json` | Multiple fallback providers configured |
|
|
76
|
+
| `openai/gpt-5.4` added to models | `openclaw.json` | New model entry with Responses API |
|
|
77
|
+
| Kimi plugin disabled | `openclaw.json` | `plugins.entries.kimi.enabled: false` |
|
|
78
|
+
| Kimi auth profile removed | `openclaw.json` | Removed kimi auth profile |
|
|
79
|
+
| Gateway started with 8GB heap | Manual launch | `--max-old-space-size=8192` |
|
|
80
|
+
| Logs truncated | `logs/` | Log rotation applied |
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Hermes Gateway — Status
|
|
85
|
+
|
|
86
|
+
Hermes has been stable throughout. Running Python 3.11, `hermes gateway run --replace`. Port responding (403 on unauthenticated requests, expected). Low resource usage.
|
|
87
|
+
|
|
88
|
+
OGP bridge process also running.
|
|
89
|
+
|
|
90
|
+
No crashes, no issues.
|
|
91
|
+
|
|
92
|
+
---
|
|
93
|
+
|
|
94
|
+
## Article Fact-Check: "Hermes vs OpenClaw"
|
|
95
|
+
|
|
96
|
+
An agent-drafted article was **substantially wrong**. Its central thesis — "OpenClaw is desktop-first, Hermes is cloud-native" — is fabricated. Both are local daemons running on the same machine.
|
|
97
|
+
|
|
98
|
+
**Key errors corrected:**
|
|
99
|
+
- Hermes is NOT cloud-hosted (it's a local Python process)
|
|
100
|
+
- Hermes storage is NOT cloud-backed (it's local SQLite + markdown)
|
|
101
|
+
- Hermes skills are NOT synced cloud storage (local filesystem)
|
|
102
|
+
- Hermes does NOT have built-in public endpoints (needs tunnels like OpenClaw)
|
|
103
|
+
- "Turn off your phone and federation continues" is false (machine off = Hermes off)
|
|
104
|
+
|
|
105
|
+
**Corrected article delivered** with fact-checked claims against live configs and running processes.
|
|
106
|
+
|
|
107
|
+
---
|
|
108
|
+
|
|
109
|
+
## Recommended Actions
|
|
110
|
+
|
|
111
|
+
### Immediate (Stability)
|
|
112
|
+
|
|
113
|
+
1. **Fix the exec lifecycle crash** — File issue against `pi-agent-core`. The unhandled rejection in `Agent.processEvents` when exec stdout fires after run completion is a repeatable crasher. Until fixed, the gateway will keep dying.
|
|
114
|
+
|
|
115
|
+
2. **Fix LaunchAgent env vars** — Replace the `$(...)` plist values with a wrapper script:
|
|
116
|
+
```bash
|
|
117
|
+
#!/bin/bash
|
|
118
|
+
export ANTHROPIC_API_KEY=$(security find-generic-password ...)
|
|
119
|
+
# ... other keys ...
|
|
120
|
+
exec <node-path> --max-old-space-size=8192 \
|
|
121
|
+
<openclaw-path> gateway --port <port>
|
|
122
|
+
```
|
|
123
|
+
Point the plist's ProgramArguments at this script instead of node directly.
|
|
124
|
+
|
|
125
|
+
3. **Add `--max-old-space-size=8192`** to the LaunchAgent permanently (via the wrapper script above).
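
As a rough sketch of item 2, the function below prints a LaunchAgent plist whose `ProgramArguments` points at a wrapper script. The label `ai.openclaw.gateway` is inferred from the plist filename mentioned in the related notes, and the function name and argument order are hypothetical; the real plist may carry additional keys. Paths are inserted without XML escaping, so this assumes they contain no `&` or `<` characters.

```bash
#!/bin/bash
# Sketch: print a LaunchAgent plist that runs a wrapper script.
# The label and the choice of keys are assumptions for illustration.

print_gateway_plist() {
  local wrapper="$1"    # absolute path to the wrapper script
  local err_log="$2"    # file where launchd should write stderr
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>ai.openclaw.gateway</string>
  <key>ProgramArguments</key>
  <array>
    <string>${wrapper}</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardErrorPath</key>
  <string>${err_log}</string>
</dict>
</plist>
EOF
}
```

Because the wrapper resolves API keys itself, the plist needs no `$(...)` values, which launchd would treat as literal strings anyway.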
|
|
126
|
+
|
|
127
|
+
### Short-term (Reliability)
|
|
128
|
+
|
|
129
|
+
4. **Stagger BrainLift agent runs** — Don't fire all agents at the same cron tick. Space them apart to avoid rate limit contention.
|
|
130
|
+
|
|
131
|
+
5. **Investigate the skipped skills issue** — Check for broken symlinks or path traversal in skills directory. These represent a significant portion of the skill set not loading.
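
One way to start on item 5 is to list symlinks under the skills directory that are broken or that resolve outside it. The sketch below covers symlinks only, not relative path references inside skill files. `audit_skill_links` is a hypothetical helper, the skills root is whatever directory OpenClaw is configured with (not stated in this report), and the script assumes `realpath` is installed.

```bash
#!/bin/bash
# Sketch: report symlinks under a root that are broken or that
# resolve outside that root. Assumes `realpath` is available.

audit_skill_links() {
  local root target
  root=$(realpath "$1") || return 1
  find "$root" -type l | while IFS= read -r link; do
    if [ ! -e "$link" ]; then
      echo "BROKEN  $link"              # target does not exist
      continue
    fi
    target=$(realpath "$link")
    case "$target" in
      "$root"/*) ;;                     # resolves inside the root
      *) echo "OUTSIDE $link -> $target" ;;
    esac
  done
}
```

Any `OUTSIDE` line is a candidate for the "resolves outside its configured root" log message; `BROKEN` lines point at stale links.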
|
|
132
|
+
|
|
133
|
+
### Medium-term (Resilience)
|
|
134
|
+
|
|
135
|
+
6. **Request backoff/retry ceiling in embedded agent runner** — The 429 retry loop with no backoff is the #1 contributor to OOM crashes. Needs exponential backoff + max retry count + memory cleanup between attempts.
|
|
136
|
+
|
|
137
|
+
7. **Add process supervision** — Current state: launchd throttles after crash, manual nohup doesn't survive reboot. Consider a wrapper that catches SIGABRT and restarts with a cooldown.
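
Item 7 could look roughly like the loop below. A shell wrapper cannot catch a signal delivered to its child; it sees the result as an exit status of 128 plus the signal number, so a SIGABRT (signal 6) appears as 134. `supervise`, `COOLDOWN`, and `MAX_RESTARTS` are hypothetical names for this sketch, not OpenClaw settings.

```bash
#!/bin/bash
# Sketch: restart a command after abnormal exits, with a cooldown
# between attempts and an upper bound on restarts.

supervise() {
  local cooldown="${COOLDOWN:-10}"
  local max_restarts="${MAX_RESTARTS:-20}"
  local restarts=0
  local status

  while true; do
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
      return 0                        # clean exit: do not restart
    fi
    # Status above 128 means "killed by signal (status - 128)".
    echo "child exited with status ${status}" >&2
    restarts=$((restarts + 1))
    if [ "$restarts" -gt "$max_restarts" ]; then
      echo "restart limit reached; giving up" >&2
      return "$status"
    fi
    sleep "$cooldown"
  done
}
```

Unlike the wrapper in item 2, a supervising script must not `exec` the gateway, because it has to stay alive to observe the exit status and restart.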
|
|
138
|
+
|
|
139
|
+
---
|
|
140
|
+
|
|
141
|
+
**Document Created:** April 7, 2026
|
|
142
|
+
**Sanitized for Publication:** April 8, 2026
|
|
@@ -0,0 +1,209 @@
|
|
|
1
|
+
# OpenClaw Stability Fix Summary
|
|
2
|
+
**Date:** April 7, 2026
|
|
3
|
+
**Status:** RESOLVED - Mitigations Implemented
|
|
4
|
+
|
|
5
|
+
> **Note:** This document is a sanitized version of internal debugging notes. File paths, process IDs, and system-specific details have been generalized. The original was created during OGP development to debug an unrelated OpenClaw regression.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Problem Summary
|
|
10
|
+
|
|
11
|
+
OpenClaw gateway (v2026.4.5) was crashing every 10-60 minutes with two distinct failure modes:
|
|
12
|
+
|
|
13
|
+
1. **Exec Lifecycle Bug** - "Agent listener invoked outside active run" error
|
|
14
|
+
2. **Browser Automation OOM** - V8 heap exhaustion from heavy browser use
|
|
15
|
+
|
|
16
|
+
---
|
|
17
|
+
|
|
18
|
+
## Root Cause Analysis
|
|
19
|
+
|
|
20
|
+
### Bug #1: Exec Lifecycle Crash (CRITICAL)
|
|
21
|
+
|
|
22
|
+
**Status:** **KNOWN BUG in OpenClaw 2026.4.5** - Regression from 2026.4.2
|
|
23
|
+
|
|
24
|
+
**Error:** `Unhandled promise rejection: Error: Agent listener invoked outside active run`
|
|
25
|
+
|
|
26
|
+
**GitHub Issues:**
|
|
27
|
+
- [#62137](https://github.com/openclaw/openclaw/issues/62137) - Exec/PTY unhandled promise rejection
|
|
28
|
+
- [#61592](https://github.com/openclaw/openclaw/issues/61592) - Background exec process crashes
|
|
29
|
+
- [#61812](https://github.com/openclaw/openclaw/issues/61812) - Regression in 2026.4.5
|
|
30
|
+
- [#61733](https://github.com/openclaw/openclaw/issues/61733) - Windows crashes with same error
|
|
31
|
+
|
|
32
|
+
**Technical Details:**
|
|
33
|
+
When a background exec process emits stdout after the agent run has completed, the gateway crashes instead of safely ignoring or buffering the output. The `pi-agent-core` library's `Agent.processEvents` method throws when called outside an active run context.
|
|
34
|
+
|
|
35
|
+
**Trigger Scenarios:**
|
|
36
|
+
- File operations
|
|
37
|
+
- Long-running exec processes
|
|
38
|
+
- Bash tools calling `openclaw message send`
|
|
39
|
+
- Cron jobs spawning exec sessions
|
|
40
|
+
|
|
41
|
+
**Impact:** Gateway crashes every 10-60 minutes during normal operation
|
|
42
|
+
|
|
43
|
+
### Bug #2: Browser Automation OOM
|
|
44
|
+
|
|
45
|
+
**Error:** `FATAL ERROR: v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath Allocation failed - JavaScript heap out of memory`
|
|
46
|
+
|
|
47
|
+
**Root Cause:** Heavy browser automation creates large serialized objects that overflow the default V8 heap limit (4GB)
|
|
48
|
+
|
|
49
|
+
**Impact:** Gateway crashes after extended browser automation sessions (2-4 hours)
|
|
50
|
+
|
|
51
|
+
### Bug #3: Cron Job API Key Failures (FIXED)
|
|
52
|
+
|
|
53
|
+
**Status:** RESOLVED by disabling cron jobs
|
|
54
|
+
|
|
55
|
+
**Error:** `401 Incorrect API key provided` (environment variable expansion failing)
|
|
56
|
+
|
|
57
|
+
**Root Cause:** Environment variable evaluation failing in LaunchAgent context when cron jobs execute, triggering cascading model fallback failures and eventual OOM
|
|
58
|
+
|
|
59
|
+
**Fix:** Disabled all cron jobs + BrainLift plugin
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## Solutions Implemented
|
|
64
|
+
|
|
65
|
+
### ✅ Solution #1: Wrapper Script with 8GB Heap Limit
|
|
66
|
+
|
|
67
|
+
**File:** `$HOME/.openclaw/bin/gateway-wrapper.sh`
|
|
68
|
+
|
|
69
|
+
**What it does:**
|
|
70
|
+
- Sets all required environment variables explicitly
|
|
71
|
+
- Launches gateway with `--max-old-space-size=8192` (8GB heap limit)
|
|
72
|
+
- Provides logging for debugging
|
|
73
|
+
|
|
74
|
+
**LaunchAgent Integration:**
|
|
75
|
+
Updated LaunchAgent plist to use wrapper instead of calling node directly
|
|
76
|
+
|
|
77
|
+
**Benefits:**
|
|
78
|
+
- Doubles heap limit to prevent browser OOM crashes
|
|
79
|
+
- Ensures env vars are always set correctly
|
|
80
|
+
- Survives OpenClaw updates (wrapper script is outside node_modules)
|
|
81
|
+
|
|
82
|
+
### ✅ Solution #2: Disabled All Cron Jobs
|
|
83
|
+
|
|
84
|
+
**Files Modified:**
|
|
85
|
+
- `$HOME/.openclaw/openclaw.json` - BrainLift plugin disabled
|
|
86
|
+
- `$HOME/.openclaw/cron/jobs.json` - All cron jobs disabled
|
|
87
|
+
|
|
88
|
+
**Impact:**
|
|
89
|
+
- Eliminates cron-triggered API key evaluation failures
|
|
90
|
+
- Prevents BrainLift OOM crashes from simultaneous agent runs
|
|
91
|
+
- Stops scheduled jobs that were triggering crashes
|
|
92
|
+
|
|
93
|
+
### ✅ Solution #3: API Keys in Config File
|
|
94
|
+
|
|
95
|
+
**Status:** Already fixed via config modification
|
|
96
|
+
|
|
97
|
+
**What happened:** The `env` section of the OpenClaw config was modified to hold API keys directly instead of relying on shell command expansion
|
|
98
|
+
|
|
99
|
+
**Effect:** Environment variables now always available, preventing auth cascades
|
|
100
|
+
|
|
101
|
+
---
|
|
102
|
+
|
|
103
|
+
## Remaining Issues
|
|
104
|
+
|
|
105
|
+
### ⚠️ Exec Lifecycle Bug - NOT FIXED, MITIGATED
|
|
106
|
+
|
|
107
|
+
**Status:** Waiting for OpenClaw developers to fix in pi-agent-core
|
|
108
|
+
|
|
109
|
+
**Mitigation:** Gateway will still crash when exec lifecycle bug triggers, but LaunchAgent will auto-restart it
|
|
110
|
+
|
|
111
|
+
**Upstream Fix Options:**
|
|
112
|
+
1. Wait for OpenClaw team to release patch
|
|
113
|
+
2. Roll back to 2026.4.2 (workaround mentioned in GitHub issues)
|
|
114
|
+
3. Avoid file operations that trigger long-running exec processes
|
|
115
|
+
|
|
116
|
+
**Recommended Action:** Monitor for OpenClaw 2026.4.6 or later that fixes these issues
|
|
117
|
+
|
|
118
|
+
---
|
|
119
|
+
|
|
120
|
+
## OGP Correlation
|
|
121
|
+
|
|
122
|
+
**Conclusion:** OGP work is **NOT** the cause of crashes
|
|
123
|
+
|
|
124
|
+
**Evidence:**
|
|
125
|
+
- Both bugs are known OpenClaw 2026.4.5 regressions affecting all users
|
|
126
|
+
- Crashes occur with zero OGP activity
|
|
127
|
+
- GitHub issues filed by users not using OGP
|
|
128
|
+
- Dual-assistant setup (OpenClaw + Hermes) may have exposed bugs faster due to higher load, but didn't create them
|
|
129
|
+
|
|
130
|
+
---
|
|
131
|
+
|
|
132
|
+
## Current Status
|
|
133
|
+
|
|
134
|
+
**Gateway:** ✅ Running
|
|
135
|
+
**Heap Limit:** ✅ 8GB (doubled from default 4GB)
|
|
136
|
+
**Cron Jobs:** ✅ Disabled
|
|
137
|
+
**BrainLift:** ✅ Disabled
|
|
138
|
+
**Wrapper Script:** ✅ Active via LaunchAgent
|
|
139
|
+
|
|
140
|
+
**Expected Stability:**
|
|
141
|
+
- ✅ No more cron-triggered crashes
|
|
142
|
+
- ✅ No more browser OOM crashes (unless >8GB heap usage)
|
|
143
|
+
- ⚠️ Exec lifecycle bug may still cause occasional crashes (auto-restart enabled)
|
|
144
|
+
|
|
145
|
+
---
|
|
146
|
+
|
|
147
|
+
## Testing & Monitoring
|
|
148
|
+
|
|
149
|
+
**To verify stability:**
|
|
150
|
+
|
|
151
|
+
```bash
|
|
152
|
+
# Check gateway status
|
|
153
|
+
launchctl list | grep openclaw
|
|
154
|
+
lsof -i :<gateway-port>
|
|
155
|
+
|
|
156
|
+
# Monitor for crashes
|
|
157
|
+
tail -f ~/.openclaw/logs/gateway.err.log | grep -iE "unhandled|crash|FATAL"
|
|
158
|
+
|
|
159
|
+
# Check uptime
|
|
160
|
+
ps aux | grep openclaw-gateway
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
**Success Metrics:**
|
|
164
|
+
- Gateway uptime > 24 hours without manual restart
|
|
165
|
+
- No API key evaluation errors in logs
|
|
166
|
+
- No OOM crashes during browser automation
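
The 24-hour uptime metric can be checked mechanically. `ps -o etime=` reports elapsed time as `[[dd-]hh:]mm:ss`; the sketch below converts that to seconds and compares it with a threshold. Both function names are hypothetical, and finding the gateway PID (for example with `pgrep -f openclaw-gateway`) assumes the process name used in the commands above.

```bash
#!/bin/bash
# Sketch: check that a process has been up for a minimum duration.

# Convert ps "etime" output ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
  echo "$1" | awk -F'[-:]' '
    NF == 4 { print $1*86400 + $2*3600 + $3*60 + $4; next }
    NF == 3 { print $1*3600 + $2*60 + $3; next }
    NF == 2 { print $1*60 + $2; next }
    { exit 1 }'
}

# Succeeds if process $1 has been running at least $2 seconds
# (default: 86400, i.e. 24 hours).
gateway_uptime_ok() {
  local pid="$1" min_seconds="${2:-86400}" secs
  secs=$(etime_to_seconds "$(ps -o etime= -p "$pid")") || return 1
  [ "$secs" -ge "$min_seconds" ]
}
```

If the gateway was auto-restarted by launchd in the meantime, the elapsed time resets, so a passing check also implies no crash occurred within the window.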
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
## Rollback Instructions
|
|
171
|
+
|
|
172
|
+
If issues persist, roll back:
|
|
173
|
+
|
|
174
|
+
```bash
|
|
175
|
+
# Restore original LaunchAgent
|
|
176
|
+
cp $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist.backup-* \
|
|
177
|
+
$HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
|
|
178
|
+
|
|
179
|
+
# Restore cron jobs
|
|
180
|
+
cp $HOME/.openclaw/cron/jobs.json.backup-* \
|
|
181
|
+
$HOME/.openclaw/cron/jobs.json
|
|
182
|
+
|
|
183
|
+
# Re-enable BrainLift in openclaw.json
|
|
184
|
+
# (manually change "enabled": false to true)
|
|
185
|
+
|
|
186
|
+
# Reload LaunchAgent
|
|
187
|
+
launchctl unload $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
|
|
188
|
+
launchctl load $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
|
|
189
|
+
```
|
|
190
|
+
|
|
191
|
+
Or consider rolling back OpenClaw to 2026.4.2:
|
|
192
|
+
```bash
|
|
193
|
+
npm install -g openclaw@2026.4.2
|
|
194
|
+
# Note: May require removing plugins.entries.memory-core.config.dreaming from config
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
---
|
|
198
|
+
|
|
199
|
+
## Next Steps
|
|
200
|
+
|
|
201
|
+
1. ✅ Monitor gateway stability for 24-48 hours
|
|
202
|
+
2. ⏸️ Wait for OpenClaw 2026.4.6+ release with exec lifecycle fix
|
|
203
|
+
3. 🔍 Investigate skipped skills issue (low priority)
|
|
204
|
+
|
|
205
|
+
---
|
|
206
|
+
|
|
207
|
+
**Document Created:** April 7, 2026
|
|
208
|
+
**Last Updated:** April 7, 2026
|
|
209
|
+
**Sanitized for Publication:** April 8, 2026
|
|
@@ -0,0 +1,40 @@
|
|
|
1
|
+
# OGP Development Case Studies
|
|
2
|
+
|
|
3
|
+
This directory contains sanitized debugging notes from real-world OGP development and deployment challenges. These documents capture the messy reality of building federated AI systems — including the false starts, red herrings, and lessons learned.
|
|
4
|
+
|
|
5
|
+
> **⚠️ Note:** These files are sanitized versions of internal debugging notes. System-specific details (file paths, PIDs, API key fragments, port numbers) have been removed or generalized to protect operational security while preserving the technical narrative.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
## Contents
|
|
10
|
+
|
|
11
|
+
| File | Description |
|
|
12
|
+
|------|-------------|
|
|
13
|
+
| `OpenClaw_Stability_Fix_Summary.md` | Comprehensive analysis of OpenClaw 2026.4.5 regression bugs encountered during OGP development. Includes root cause analysis, mitigations, and wrapper script implementation. |
|
|
14
|
+
| `CRASH_RESOLUTION_20260407.md` | Quick reference guide for the same stability issues — condensed version for immediate action. |
|
|
15
|
+
| `crash_observations.md` | Raw timeline and observations from the debugging session. Shows the iterative process of elimination that ultimately cleared OGP of suspicion. |
|
|
16
|
+
| `OpenClaw_Hermes_Status_Report_20260407.md` | Comparative analysis of OpenClaw vs. Hermes gateway stability during federation testing. Includes fact-check of an AI-drafted article that was substantially wrong. |
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
## Context
|
|
21
|
+
|
|
22
|
+
These documents were created on April 7, 2026, during intensive OGP federation testing. The initial hypothesis was that OGP's dual-assistant setup (OpenClaw + Hermes) was causing gateway instability. **The reality:** OpenClaw 2026.4.5 had known regression bugs affecting all users.
|
|
23
|
+
|
|
24
|
+
**Key Lesson:** When debugging complex systems, correlation is not causation. The OGP work exposed OpenClaw bugs faster due to higher load, but didn't create them.
|
|
25
|
+
|
|
26
|
+
---
|
|
27
|
+
|
|
28
|
+
## Related Article
|
|
29
|
+
|
|
30
|
+
The debugging narrative behind these files is documented in:
|
|
31
|
+
|
|
32
|
+
**"[Case Study] When Your AI Tools Keep Crashing: A Meta-Debugging Loop with OpenClaw and Claude"**
|
|
33
|
+
|
|
34
|
+
This Substack article tells the story of using Claude (via Dispatch) to diagnose OpenClaw crashes while OpenClaw was down, then using OpenClaw/Claude Code to fix OGP bugs, then back to Claude when OpenClaw crashed again — a meta-loop that became the only way forward.
|
|
35
|
+
|
|
36
|
+
---
|
|
37
|
+
|
|
38
|
+
**Why These Are Here:**
|
|
39
|
+
|
|
40
|
+
The article promised these files would be "available in dp-pcs/ogp." Rather than leave them as unverified claims, we're publishing the sanitized source material. Real debugging is messy. Real systems fail in unexpected ways. Federation requires resilience not just in protocol design, but in the development process itself.
|