@dp-pcs/ogp 0.7.2 → 0.8.2
- package/README.md +59 -12
- package/dist/cli/config.d.ts +4 -0
- package/dist/cli/config.d.ts.map +1 -1
- package/dist/cli/config.js +45 -2
- package/dist/cli/config.js.map +1 -1
- package/dist/cli/expose.d.ts +4 -1
- package/dist/cli/expose.d.ts.map +1 -1
- package/dist/cli/expose.js +7 -106
- package/dist/cli/expose.js.map +1 -1
- package/dist/cli/install.d.ts +1 -0
- package/dist/cli/install.d.ts.map +1 -1
- package/dist/cli/install.js +8 -2
- package/dist/cli/install.js.map +1 -1
- package/dist/cli/project.d.ts +24 -0
- package/dist/cli/project.d.ts.map +1 -1
- package/dist/cli/project.js +68 -15
- package/dist/cli/project.js.map +1 -1
- package/dist/cli/tunnel.d.ts +65 -0
- package/dist/cli/tunnel.d.ts.map +1 -0
- package/dist/cli/tunnel.js +413 -0
- package/dist/cli/tunnel.js.map +1 -0
- package/dist/cli.js +21 -8
- package/dist/cli.js.map +1 -1
- package/dist/daemon/contribution-signing.d.ts +49 -0
- package/dist/daemon/contribution-signing.d.ts.map +1 -0
- package/dist/daemon/contribution-signing.js +102 -0
- package/dist/daemon/contribution-signing.js.map +1 -0
- package/dist/daemon/message-handler.js +41 -18
- package/dist/daemon/message-handler.js.map +1 -1
- package/dist/daemon/openclaw-bridge.d.ts +6 -0
- package/dist/daemon/openclaw-bridge.d.ts.map +1 -1
- package/dist/daemon/openclaw-bridge.js +27 -12
- package/dist/daemon/openclaw-bridge.js.map +1 -1
- package/dist/daemon/peers.d.ts.map +1 -1
- package/dist/daemon/peers.js +19 -0
- package/dist/daemon/peers.js.map +1 -1
- package/dist/daemon/projects.d.ts +20 -0
- package/dist/daemon/projects.d.ts.map +1 -1
- package/dist/daemon/projects.js +70 -0
- package/dist/daemon/projects.js.map +1 -1
- package/dist/daemon/server.d.ts.map +1 -1
- package/dist/daemon/server.js +43 -2
- package/dist/daemon/server.js.map +1 -1
- package/dist/daemon/state-lock.d.ts +23 -0
- package/dist/daemon/state-lock.d.ts.map +1 -0
- package/dist/daemon/state-lock.js +115 -0
- package/dist/daemon/state-lock.js.map +1 -0
- package/package.json +13 -3
- package/scripts/completion.bash +25 -6
- package/scripts/completion.zsh +26 -8
- package/skills/ogp-expose/SKILL.md +40 -10
- package/docs/RC1-FEDERATION-TEST-CHECKLIST.md +0 -477
- package/docs/case-studies/CRASH_RESOLUTION_20260407.md +0 -190
- package/docs/case-studies/OpenClaw_Hermes_Status_Report_20260407.md +0 -142
- package/docs/case-studies/OpenClaw_Stability_Fix_Summary.md +0 -209
- package/docs/case-studies/README.md +0 -40
- package/docs/case-studies/crash_observations.md +0 -250
- package/docs/nat-hole-punch-spike.md +0 -399
- package/docs/project-intent-testing.md +0 -97
- package/scripts/render-ogp-overview-video.mjs +0 -454
- package/scripts/test-migration-execute.js +0 -74
- package/scripts/test-migration.js +0 -42
- package/scripts/test-project-intents.mjs +0 -614
@@ -1,142 +0,0 @@
-# OpenClaw & Hermes — Status Report
-**Date:** April 7, 2026
-
-> **Note:** This document is a sanitized version of internal status reporting. System-specific paths, PIDs, and operational details have been generalized. Created during OGP development to compare gateway stability.
-
----
-
-## Executive Summary
-
-Both local AI gateways (OpenClaw, Hermes) were evaluated. OpenClaw crashed multiple times in 24 hours from two distinct bugs. Hermes has been stable. A fact-check of an agent-drafted comparison article revealed it was substantially wrong. Config changes were made to switch OpenClaw's primary model to GPT-5.4 and adjust provider configurations.
-
----
-
-## OpenClaw Gateway — Crash Analysis
-
-### Crash #1: OOM (April 6, evening)
-
-**Root cause:** The BrainLift plugin kicked off its nightly run for all 5 agents simultaneously. The agents hit Anthropic's rate limit (429) on Claude Sonnet 4.6. The embedded agent runner retried aggressively with no backoff ceiling and no memory cleanup between attempts. Heap grew to 4.08 GB and Node.js SIGABRT'd.
-
-**Contributing factors:**
-- All 5 agents scheduled at the same time
-- No exponential backoff on 429 retries
-- Default V8 heap limit (4 GB) with no `--max-old-space-size` override
-- Auth error mixed in (API key issue)
-
-**Evidence:** Logs showed repeated 429s, followed by `FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory`
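The retry behaviour described for Crash #1 (no backoff, no ceiling, no retry limit) can be contrasted with a bounded scheme. The following is a minimal Python sketch under stated assumptions, not the embedded agent runner's code: `RateLimited`, `backoff_delays`, and `call_with_backoff` are hypothetical names, and the memory cleanup between attempts mentioned in the report is not modelled here.

```python
import random
import time
from typing import Callable, Iterator


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   ceiling: float = 60.0, max_retries: int = 5) -> Iterator[float]:
    """Yield one delay per permitted retry, doubling each time up to `ceiling`."""
    for attempt in range(max_retries):
        yield min(ceiling, base * factor ** attempt)


def call_with_backoff(request: Callable[[], str], *,
                      base: float = 1.0, ceiling: float = 60.0, max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      jitter: Callable[[float], float] = lambda d: random.uniform(0.0, d),
                      ) -> str:
    """Call `request`, retrying on RateLimited at most `max_retries` times."""
    for delay in backoff_delays(base=base, ceiling=ceiling, max_retries=max_retries):
        try:
            return request()
        except RateLimited:
            sleep(jitter(delay))  # randomised wait so parallel agents do not retry in lockstep
    return request()  # last attempt; a RateLimited here propagates to the caller
```

With the defaults, a caller waits at most roughly 1 + 2 + 4 + 8 + 16 seconds in total before the error surfaces, instead of retrying indefinitely.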
-
-### Crash #2: Unhandled Promise Rejection (April 7, afternoon)
-
-**Root cause:** Bug in `pi-agent-core` — the exec tool's stdout handler fired a callback after the agent run had already ended. The gateway's global unhandled rejection handler treated this as fatal.
-
-**Stack trace origin:** `Agent.processEvents` in `pi-agent-core/src/agent.ts:533` — "Agent listener invoked outside active run"
-
-**Trigger:** Agent was editing files via the exec tool when the run completed but the exec process continued emitting stdout.
-
-### Crash #3: Same as #2 (April 7, evening)
-
-Identical stack trace. Same exec lifecycle bug. Reproducible.
-
----
-
-## OpenClaw — Config Issues Found
-
-### 1. LaunchAgent Environment Variables Don't Work
-
-The LaunchAgent plist uses `$(security find-generic-password ...)` shell expansion syntax for API keys. **This doesn't work in launchd plists** — plist values are literal strings, not shell-evaluated. Keychain-derived env vars are empty when launched via launchd.
-
-**Impact:** Gateway starts without API keys → auth failures → retry loops → OOM.
-
-**Fix needed:** Either use a wrapper shell script in the plist that resolves keys before exec'ing the gateway, or store keys directly in the plist (less secure).
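The mechanism behind this config issue can be reproduced without launchd: when a value is handed to a child process with no shell in between, `$(...)` is just text. The sketch below is an illustration under assumptions, not the gateway's launch path. `DEMO_API_KEY`, `child_sees`, and `resolve_secret` are hypothetical names, and a Python one-liner stands in for the keychain lookup.

```python
import os
import subprocess
import sys

# Child process that prints whatever it receives in DEMO_API_KEY.
PRINT_KEY = [sys.executable, "-c", "import os; print(os.environ['DEMO_API_KEY'])"]


def child_sees(value: str) -> str:
    """Launch a child with DEMO_API_KEY=value and return what the child observed."""
    env = dict(os.environ, DEMO_API_KEY=value)
    done = subprocess.run(PRINT_KEY, env=env, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def resolve_secret() -> str:
    """Stand-in for a keychain lookup: run a command and capture its output."""
    lookup = [sys.executable, "-c", "print('demo-secret')"]
    done = subprocess.run(lookup, capture_output=True, text=True, check=True)
    return done.stdout.strip()


# No shell is involved, so the "$(...)" text arrives unexpanded.
literal = child_sees("$(security find-generic-password ...)")
# Resolving first, as a wrapper script would, hands the child the real value.
resolved = child_sees(resolve_secret())
```

Here `literal` holds the unexpanded command text and `resolved` holds the looked-up value, which is the difference between the plist-only setup and the wrapper-script fix.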
-
-### 2. Kimi Provider Configuration Issue
-
-The Kimi direct provider was referencing an API key that wasn't properly configured. The gateway's secret resolver treated this as a hard failure.
-
-**Current workaround:** Disabled the kimi plugin, removed kimi auth profile. The Fireworks-routed Kimi K2.5 still works via FIREWORKS_API_KEY.
-
-### 3. Skills Loading Issues
-
-On every startup, many skills log `"Skipping skill path that resolves outside its configured root."` These are likely symlinks or relative path references. A significant portion of the skill set is silently not loading.
-
-**Impact:** Agent capabilities reduced without any user-visible error.
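One way to find which skills trip a "resolves outside its configured root" check is to resolve each entry and compare it with the resolved root. This is a minimal sketch assuming the check works on fully resolved paths. It is not OpenClaw's loader, and `resolves_inside` and `skipped_skills` are hypothetical helper names.

```python
from pathlib import Path


def resolves_inside(root: Path, candidate: Path) -> bool:
    """True if `candidate`, after following symlinks, is `root` or lies beneath it."""
    real_root = root.resolve()
    real_path = candidate.resolve()
    return real_path == real_root or real_root in real_path.parents


def skipped_skills(root: Path) -> list[Path]:
    """Entries directly under `root` whose resolved location is outside `root`."""
    return sorted(p for p in root.iterdir() if not resolves_inside(root, p))
```

Running `skipped_skills` on the skills directory would list the symlinked entries to inspect, turning the silent skip into a reviewable report.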
-
-### 4. BrainLift Double-Fires
-
-The BrainLift plugin logged two `"starting nightly run"` entries within seconds of each other — running the full 5-agent sweep twice. This doubles API usage and compounds the rate limit problem.
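A double-fire like this is commonly prevented with a guard that refuses a second start inside a time window. The sketch below is a hypothetical illustration rather than BrainLift's scheduler: `RunOnceGuard` is an invented name, and it only covers duplicate triggers inside one process.

```python
class RunOnceGuard:
    """Allow one run per key within `window_seconds`; reject repeats inside the window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_start: dict[str, float] = {}

    def try_start(self, key: str, now: float) -> bool:
        """Return True and record the start time, or False if a run started too recently."""
        last = self._last_start.get(key)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_start[key] = now
        return True
```

A scheduler would call `try_start("nightly", time.time())` and skip the sweep when it returns `False`. If the duplicate triggers came from separate processes, a shared lock file or similar would be needed instead.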
-
----
-
-## Changes Made This Session
-
-| Change | File | Detail |
-|--------|------|--------|
-| Primary model → GPT-5.4 | `openclaw.json` | Was `anthropic/claude-sonnet-4-6` |
-| Fallback chain updated | `openclaw.json` | Multiple fallback providers configured |
-| `openai/gpt-5.4` added to models | `openclaw.json` | New model entry with Responses API |
-| Kimi plugin disabled | `openclaw.json` | `plugins.entries.kimi.enabled: false` |
-| Kimi auth profile removed | `openclaw.json` | Removed kimi auth profile |
-| Gateway started with 8GB heap | Manual launch | `--max-old-space-size=8192` |
-| Logs truncated | `logs/` | Log rotation applied |
-
----
-
-## Hermes Gateway — Status
-
-Hermes has been stable throughout. Running Python 3.11, `hermes gateway run --replace`. Port responding (403 on unauthenticated requests, expected). Low resource usage.
-
-OGP bridge process also running.
-
-No crashes, no issues.
-
----
-
-## Article Fact-Check: "Hermes vs OpenClaw"
-
-An agent-drafted article was **substantially wrong**. Its central thesis — "OpenClaw is desktop-first, Hermes is cloud-native" — is fabricated. Both are local daemons running on the same machine.
-
-**Key errors corrected:**
-- Hermes is NOT cloud-hosted (it's a local Python process)
-- Hermes storage is NOT cloud-backed (it's local SQLite + markdown)
-- Hermes skills are NOT synced cloud storage (local filesystem)
-- Hermes does NOT have built-in public endpoints (needs tunnels like OpenClaw)
-- "Turn off your phone and federation continues" is false (machine off = Hermes off)
-
-**Corrected article delivered** with fact-checked claims against live configs and running processes.
-
----
-
-## Recommended Actions
-
-### Immediate (Stability)
-
-1. **Fix the exec lifecycle crash** — File issue against `pi-agent-core`. The unhandled rejection in `Agent.processEvents` when exec stdout fires after run completion is a repeatable crasher. Until fixed, the gateway will keep dying.
-
-2. **Fix LaunchAgent env vars** — Replace the `$(...)` plist values with a wrapper script:
-```bash
-#!/bin/bash
-export ANTHROPIC_API_KEY=$(security find-generic-password ...)
-# ... other keys ...
-exec <node-path> --max-old-space-size=8192 \
-  <openclaw-path> gateway --port <port>
-```
-Point the plist's ProgramArguments at this script instead of node directly.
-
-3. **Add `--max-old-space-size=8192`** to the LaunchAgent permanently (via the wrapper script above).
-
-### Short-term (Reliability)
-
-4. **Stagger BrainLift agent runs** — Don't fire all agents at the same cron tick. Space them apart to avoid rate limit contention.
-
-5. **Investigate the skipped skills issue** — Check for broken symlinks or path traversal in skills directory. These represent a significant portion of the skill set not loading.
-
-### Medium-term (Resilience)
-
-6. **Request backoff/retry ceiling in embedded agent runner** — The 429 retry loop with no backoff is the #1 contributor to OOM crashes. Needs exponential backoff + max retry count + memory cleanup between attempts.
-
-7. **Add process supervision** — Current state: launchd throttles after crash, manual nohup doesn't survive reboot. Consider a wrapper that catches SIGABRT and restarts with a cooldown.
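The supervision idea in item 7 (restart after a crash, but only after a cooldown) might look like the following. This is a rough Python sketch under assumptions, not a replacement for launchd. `supervise` is a hypothetical helper, and the restart limit is added here so a crash loop cannot run forever.

```python
import subprocess
import time
from typing import Callable, Sequence


def supervise(cmd: Sequence[str], max_restarts: int = 5, cooldown: float = 30.0,
              sleep: Callable[[float], None] = time.sleep) -> list[int]:
    """Run `cmd`, restarting after each non-zero exit with a pause in between.

    Returns the exit codes observed, oldest first. On POSIX a process killed by
    a signal reports a negative code (SIGABRT shows up as -6).
    """
    codes: list[int] = []
    for attempt in range(max_restarts + 1):
        code = subprocess.run(list(cmd)).returncode
        codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
        if attempt < max_restarts:
            sleep(cooldown)  # wait before restarting so a crash loop cannot spin
    return codes
```

The returned list doubles as a crash history, which is useful when deciding whether a restart limit was hit because of one bug or several.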
-
----
-
-**Document Created:** April 7, 2026
-**Sanitized for Publication:** April 8, 2026
@@ -1,209 +0,0 @@
-# OpenClaw Stability Fix Summary
-**Date:** April 7, 2026
-**Status:** RESOLVED - Mitigations Implemented
-
-> **Note:** This document is a sanitized version of internal debugging notes. File paths, process IDs, and system-specific details have been generalized. The original was created during OGP development to debug an unrelated OpenClaw regression.
-
----
-
-## Problem Summary
-
-OpenClaw gateway (v2026.4.5) was crashing every 10-60 minutes with two distinct failure modes:
-
-1. **Exec Lifecycle Bug** - "Agent listener invoked outside active run" error
-2. **Browser Automation OOM** - V8 heap exhaustion from heavy browser use
-
----
-
-## Root Cause Analysis
-
-### Bug #1: Exec Lifecycle Crash (CRITICAL)
-
-**Status:** **KNOWN BUG in OpenClaw 2026.4.5** - Regression from 2026.4.2
-
-**Error:** `Unhandled promise rejection: Error: Agent listener invoked outside active run`
-
-**GitHub Issues:**
-- [#62137](https://github.com/openclaw/openclaw/issues/62137) - Exec/PTY unhandled promise rejection
-- [#61592](https://github.com/openclaw/openclaw/issues/61592) - Background exec process crashes
-- [#61812](https://github.com/openclaw/openclaw/issues/61812) - Regression in 2026.4.5
-- [#61733](https://github.com/openclaw/openclaw/issues/61733) - Windows crashes with same error
-
-**Technical Details:**
-When a background exec process emits stdout after the agent run has completed, the gateway crashes instead of safely ignoring or buffering the output. The `pi-agent-core` library's `Agent.processEvents` method throws when called outside an active run context.
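The "safely ignoring" behaviour mentioned above can be sketched as a run-scoped wrapper around the listener. This is a hypothetical illustration, not the `pi-agent-core` API: `RunScope` and its methods are invented names, and late output is counted and dropped rather than buffered.

```python
from typing import Callable


class RunScope:
    """Forward events only while the run is active; count and drop anything later."""

    def __init__(self, on_event: Callable[[str], None]) -> None:
        self._on_event = on_event
        self._active = True
        self.dropped = 0

    def emit(self, chunk: str) -> None:
        """Deliver `chunk` to the listener, or discard it if the run has ended."""
        if not self._active:
            self.dropped += 1  # late stdout from a lingering process: ignore, do not raise
            return
        self._on_event(chunk)

    def close(self) -> None:
        """Mark the run as finished; later emit() calls become no-ops."""
        self._active = False
```

The `dropped` counter keeps the late output visible for diagnostics without letting it take the process down.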
-
-**Trigger Scenarios:**
-- File operations
-- Long-running exec processes
-- Bash tools calling `openclaw message send`
-- Cron jobs spawning exec sessions
-
-**Impact:** Gateway crashes every 10-60 minutes during normal operation
-
-### Bug #2: Browser Automation OOM
-
-**Error:** `FATAL ERROR: v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath Allocation failed - JavaScript heap out of memory`
-
-**Root Cause:** Heavy browser automation creates large serialized objects that overflow the default V8 heap limit (4GB)
-
-**Impact:** Gateway crashes after extended browser automation sessions (2-4 hours)
-
-### Bug #3: Cron Job API Key Failures (FIXED)
-
-**Status:** RESOLVED by disabling cron jobs
-
-**Error:** `401 Incorrect API key provided` (environment variable expansion failing)
-
-**Root Cause:** Environment variable evaluation failing in LaunchAgent context when cron jobs execute, triggering cascading model fallback failures and eventual OOM
-
-**Fix:** Disabled all cron jobs + BrainLift plugin
-
----
-
-## Solutions Implemented
-
-### ✅ Solution #1: Wrapper Script with 8GB Heap Limit
-
-**File:** `$HOME/.openclaw/bin/gateway-wrapper.sh`
-
-**What it does:**
-- Sets all required environment variables explicitly
-- Launches gateway with `--max-old-space-size=8192` (8GB heap limit)
-- Provides logging for debugging
-
-**LaunchAgent Integration:**
-Updated LaunchAgent plist to use wrapper instead of calling node directly
-
-**Benefits:**
-- Doubles heap limit to prevent browser OOM crashes
-- Ensures env vars are always set correctly
-- Survives OpenClaw updates (wrapper script is outside node_modules)
-
-### ✅ Solution #2: Disabled All Cron Jobs
-
-**Files Modified:**
-- `$HOME/.openclaw/openclaw.json` - BrainLift plugin disabled
-- `$HOME/.openclaw/cron/jobs.json` - All cron jobs disabled
-
-**Impact:**
-- Eliminates cron-triggered API key evaluation failures
-- Prevents BrainLift OOM crashes from simultaneous agent runs
-- Stops scheduled jobs that were triggering crashes
-
-### ✅ Solution #3: API Keys in Config File
-
-**Status:** Already fixed via config modification
-
-**What happened:** OpenClaw config's `env` section was modified to include API keys directly instead of shell command expansion
-
-**Effect:** Environment variables now always available, preventing auth cascades
-
----
-
-## Remaining Issues
-
-### ⚠️ Exec Lifecycle Bug - NOT FIXED, MITIGATED
-
-**Status:** Waiting for OpenClaw developers to fix in pi-agent-core
-
-**Mitigation:** Gateway will still crash when exec lifecycle bug triggers, but LaunchAgent will auto-restart it
-
-**Upstream Fix Options:**
-1. Wait for OpenClaw team to release patch
-2. Roll back to 2026.4.2 (workaround mentioned in GitHub issues)
-3. Avoid file operations that trigger long-running exec processes
-
-**Recommended Action:** Monitor for OpenClaw 2026.4.6 or later that fixes these issues
-
----
-
-## OGP Correlation
-
-**Conclusion:** OGP work is **NOT** the cause of crashes
-
-**Evidence:**
-- Both bugs are known OpenClaw 2026.4.5 regressions affecting all users
-- Crashes occur with zero OGP activity
-- GitHub issues filed by users not using OGP
-- Dual-assistant setup (OpenClaw + Hermes) may have exposed bugs faster due to higher load, but didn't create them
-
----
-
-## Current Status
-
-**Gateway:** ✅ Running
-**Heap Limit:** ✅ 8GB (doubled from default 4GB)
-**Cron Jobs:** ✅ Disabled
-**BrainLift:** ✅ Disabled
-**Wrapper Script:** ✅ Active via LaunchAgent
-
-**Expected Stability:**
-- ✅ No more cron-triggered crashes
-- ✅ No more browser OOM crashes (unless >8GB heap usage)
-- ⚠️ Exec lifecycle bug may still cause occasional crashes (auto-restart enabled)
-
----
-
-## Testing & Monitoring
-
-**To verify stability:**
-
-```bash
-# Check gateway status
-launchctl list | grep openclaw
-lsof -i :<gateway-port>
-
-# Monitor for crashes
-tail -f ~/.openclaw/logs/gateway.err.log | grep -E "unhandled|crash|FATAL"
-
-# Check uptime
-ps aux | grep openclaw-gateway
-```
-
-**Success Metrics:**
-- Gateway uptime > 24 hours without manual restart
-- No API key evaluation errors in logs
-- No OOM crashes during browser automation
-
----
-
-## Rollback Instructions
-
-If issues persist, to rollback:
-
-```bash
-# Restore original LaunchAgent
-cp $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist.backup-* \
-  $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
-
-# Restore cron jobs
-cp $HOME/.openclaw/cron/jobs.json.backup-* \
-  $HOME/.openclaw/cron/jobs.json
-
-# Re-enable BrainLift in openclaw.json
-# (manually change "enabled": false to true)
-
-# Reload LaunchAgent
-launchctl unload $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
-launchctl load $HOME/Library/LaunchAgents/ai.openclaw.gateway.plist
-```
-
-Or consider rolling back OpenClaw to 2026.4.2:
-```bash
-npm install -g openclaw@2026.4.2
-# Note: May require removing plugins.entries.memory-core.config.dreaming from config
-```
-
----
-
-## Next Steps
-
-1. ✅ Monitor gateway stability for 24-48 hours
-2. ⏸️ Wait for OpenClaw 2026.4.6+ release with exec lifecycle fix
-3. 🔍 Investigate skipped skills issue (low priority)
-
----
-
-**Document Created:** April 7, 2026
-**Last Updated:** April 7, 2026
-**Sanitized for Publication:** April 8, 2026
@@ -1,40 +0,0 @@
-# OGP Development Case Studies
-
-This directory contains sanitized debugging notes from real-world OGP development and deployment challenges. These documents capture the messy reality of building federated AI systems — including the false starts, red herrings, and lessons learned.
-
-> **⚠️ Note:** These files are sanitized versions of internal debugging notes. System-specific details (file paths, PIDs, API key fragments, port numbers) have been removed or generalized to protect operational security while preserving the technical narrative.
-
----
-
-## Contents
-
-| File | Description |
-|------|-------------|
-| `OpenClaw_Stability_Fix_Summary.md` | Comprehensive analysis of OpenClaw 2026.4.5 regression bugs encountered during OGP development. Includes root cause analysis, mitigations, and wrapper script implementation. |
-| `CRASH_RESOLUTION_20260407.md` | Quick reference guide for the same stability issues — condensed version for immediate action. |
-| `crash_observations.md` | Raw timeline and observations from the debugging session. Shows the iterative process of elimination that ultimately cleared OGP of suspicion. |
-| `OpenClaw_Hermes_Status_Report_20260407.md` | Comparative analysis of OpenClaw vs. Hermes gateway stability during federation testing. Includes fact-check of an AI-drafted article that was substantially wrong. |
-
----
-
-## Context
-
-These documents were created on April 7, 2026, during intensive OGP federation testing. The initial hypothesis was that OGP's dual-assistant setup (OpenClaw + Hermes) was causing gateway instability. **The reality:** OpenClaw 2026.4.5 had known regression bugs affecting all users.
-
-**Key Lesson:** When debugging complex systems, correlation is not causation. The OGP work exposed OpenClaw bugs faster due to higher load, but didn't create them.
-
----
-
-## Related Article
-
-The debugging narrative behind these files is documented in:
-
-**"[Case Study] When Your AI Tools Keep Crashing: A Meta-Debugging Loop with OpenClaw and Claude"**
-
-This Substack article tells the story of using Claude (via Dispatch) to diagnose OpenClaw crashes while OpenClaw was down, then using OpenClaw/Claude Code to fix OGP bugs, then back to Claude when OpenClaw crashed again — a meta-loop that became the only way forward.
-
----
-
-**Why These Are Here:**
-
-The article promised these files would be "available in dp-pcs/ogp." Rather than leave them as unverified claims, we're publishing the sanitized source material. Real debugging is messy. Real systems fail in unexpected ways. Federation requires resilience not just in protocol design, but in the development process itself.
@@ -1,250 +0,0 @@
-# OpenClaw Crash Observations - April 7, 2026
-
-> **Note:** This document is a sanitized version of internal debugging notes. API key fragments, file paths, and system-specific details have been removed or generalized. Created during OGP development to document OpenClaw regression analysis.
-
----
-
-## Timeline Summary
-
-User reported that OpenClaw has been crashing non-stop for the last 24 hours after previously working without issues. Multiple agents were affected, with the gateway itself crashing repeatedly.
-
----
-
-## Issues Identified & Fixed
-
-### 1. Cascading Authentication Failures (8:00 AM)
-
-**Symptoms:**
-- Agent failing with all configured model providers
-- Error sequence: Multiple providers failing in sequence
-- "All models failed" errors in logs
-
-**Root Causes:**
-- Kimi API: HTTP 401 - Invalid/expired API key
-- Anthropic API: HTTP 401 - Invalid API key (rate limits also hit)
-- OpenAI API: 401 - Malformed API key (env var expansion failing)
-
-**Analysis:**
-The `openclaw doctor` command appeared to have modified the configuration file, simplifying the env section and causing environment variable expansion to fail. The OpenAI provider had a misconfigured API key reference.
-
-**Fix Applied:**
-- Restored proper API key references in env section
-- Changed OpenAI provider to use environment variable references
-
-### 2. Model Configuration Corruption (Multiple Occurrences)
-
-**Symptoms:**
-- Agent configuration simplified to only primary model, no fallbacks
-- Default model referencing non-existent model ID
-- Invalid model IDs causing 404 errors
-
-**Root Cause:**
-Configuration file was being modified (likely by `openclaw doctor` command or auto-formatting) which:
-- Removed fallback models from agent configurations
-- Simplified environment variable definitions
-- Changed model references
-
-**Example Configuration Issue:**
-```json
-// Broken (no fallbacks)
-"model": {
-  "primary": "anthropic/claude-sonnet-4-6"
-}
-
-// Restored (with fallbacks)
-"model": {
-  "primary": "openai/gpt-5.4",
-  "fallbacks": [
-    "anthropic/claude-sonnet-4-6",
-    "openai/gpt-4o",
-    "fireworks/accounts/fireworks/models/kimi-k2p5"
-  ]
-}
-```
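The difference between the two configurations above is what happens after the primary model fails. The following is a minimal sketch of a primary-plus-fallbacks loop under assumptions, not OpenClaw's routing code. `complete_with_fallbacks`, `AllModelsFailed`, and the `call_model` callback are hypothetical names.

```python
from typing import Callable, Sequence


class AllModelsFailed(Exception):
    """Raised when the primary model and every fallback have failed."""


def complete_with_fallbacks(primary: str, fallbacks: Sequence[str],
                            call_model: Callable[[str], str]) -> tuple[str, str]:
    """Try the primary model, then each fallback in order; return (model, reply)."""
    errors: dict[str, str] = {}
    for model in (primary, *fallbacks):
        try:
            return model, call_model(model)
        except Exception as exc:  # a real client would catch narrower error types
            errors[model] = str(exc)
    raise AllModelsFailed(errors)
```

With an empty `fallbacks` list, as in the broken configuration, a single provider error is enough to reach `AllModelsFailed`.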
-
-### 3. OpenAI API 404 Errors (9:00-9:50 AM)
-
-**Symptoms:**
-- Continuous 404 errors on OpenAI models
-- All OpenAI models failing despite being available via API
-
-**Root Cause:**
-GPT-5.4 and newer models require the **Responses API** endpoint (`/v1/responses`) instead of the Chat Completions API endpoint (`/v1/chat/completions`). OpenClaw was using the wrong endpoint.
-
-**Evidence:**
-- Manual API query confirmed models exist: `gpt-5.4`, `gpt-5.4-2026-03-05`, `gpt-4o` all available
-- OpenAI documentation confirmed GPT-5.4 requires Responses API
-- Error pattern: 404 with no body = wrong endpoint
-
-**Fix Applied:**
-Added `"api": "openai-responses"` to OpenAI provider configuration:
-```json
-"openai": {
-  "baseUrl": "https://api.openai.com/v1",
-  "apiKey": "${PERSONAL_OPENAI_API_KEY}",
-  "api": "openai-responses",
-  "models": [...]
-}
-```
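How an `api` setting like the one above could select the request URL is sketched below. This is an assumption about the general shape of such a lookup, not OpenClaw's implementation: `ENDPOINT_PATHS` and `endpoint_url` are hypothetical, and the `openai-completions` label is invented here to name the Chat Completions style.

```python
# Hypothetical mapping from an "api" style to the path appended to baseUrl.
ENDPOINT_PATHS = {
    "openai-responses": "/responses",           # Responses API
    "openai-completions": "/chat/completions",  # assumed label for Chat Completions
}


def endpoint_url(base_url: str, api: str) -> str:
    """Join the provider's base URL with the path for the configured API style."""
    try:
        path = ENDPOINT_PATHS[api]
    except KeyError:
        raise ValueError(f"unknown api style: {api!r}") from None
    return base_url.rstrip("/") + path
```

Failing loudly on an unknown style would have surfaced the misconfiguration at startup rather than as a run of bodiless 404s.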
-
-**References:**
-- GitHub Issue: openclaw/openclaw#38706 - "GPT-5.4 via openai-codex OAuth uses wrong API"
-- OpenAI Docs: Responses API required for GPT-5.4+
-
-### 4. Gateway Crash - "Agent listener invoked outside active run" (12:08 PM)
-
-**Symptoms:**
-- Gateway completely unreachable
-- LaunchAgent exit status: -1
-- Error: `Unhandled promise rejection: Error: Agent listener invoked outside active run`
-
-**Stack Trace:**
-```
-    at Agent.processEvents (pi-agent-core/src/agent.ts:533:10)
-    at emitUpdate (exec-defaults-*.js:1524:8)
-    at handleStdout (exec-defaults-*.js:1546:4)
-```
-
-**Context:**
-Crash occurred during OGP federation testing operations. Preceding log entries show:
-- Edit operations failing
-- Read operations failing for config files
-- Multiple edit retry attempts
-
-**Hypothesis:**
-The crash may be related to:
-1. OGP operations triggering edge cases in the exec/agent framework
-2. File operations failing and causing state inconsistencies
-3. Agent event processing happening outside the expected execution context
-
-**Fix Applied:**
-Gateway restart resolved the immediate issue, but underlying cause remained unclear at the time.
-
----
-
-## Configuration Stability Concerns
-
-### Observed Pattern:
-1. Manual configuration changes applied
-2. Gateway restart
-3. Configuration file modified by unknown process
-4. Settings reverted or simplified
-5. Agents fail again
-
-### Suspected Culprits:
-- `openclaw doctor` command
-- Auto-formatting/validation on config reload
-- Hot reload mechanism modifying config
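Narrowing down which of these rewrites the file starts with detecting the change reliably. One simple approach is to fingerprint the config before and after each suspected step. This is a generic sketch using a content hash, with `fingerprint` and `changed_since` as hypothetical helper names.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_since(path: Path, known: str) -> bool:
    """True if the file's contents no longer match the recorded fingerprint."""
    return fingerprint(path) != known
```

Recording a fingerprint after a manual edit, then checking `changed_since` after a gateway restart and again after a doctor run, would show which step rewrote the file.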
-
----
-
-## Gateway Stability Observations
-
-### Crash Frequency:
-- Multiple gateway restarts required during debugging session
-- Many restarts over 4-hour period
-- One complete crash requiring manual intervention
-
-### Memory/CPU Usage:
-- Gateway process consistently using high CPU during startup
-- Process ID changing frequently
-
-### LaunchAgent Behavior:
-- LaunchAgent showing status `-1` during crashes
-- Sometimes showing status `0` but process not actually running
-- Restart command occasionally reports "stale process" and force-kills
-
----
-
-## OGP-Related Observations
-
-### Timing Correlation:
-User mentioned doing OGP work and the timeline suggests:
-- OpenClaw was stable before OGP work
-- Issues began within last 24 hours
-- Gateway crash occurred during OGP federation operations
-
-### OGP Operations Observed in Logs:
-- Federation requests to Clawporate gateway
-- Agent-to-agent communication attempts
-- File operations on OGP-related files
-- Attempts to read OGP config (file not found)
-
-### Potential OGP-Related Issues:
-1. **File Operation Failures**: Multiple edit/read failures on OGP-related files
-2. **Agent Event Processing**: Crash occurred during stdout handling from supervised process
-3. **Missing Config Files**: OGP config expected but not found in multiple locations
-
----
-
-## Mitigation Steps Applied - 5:56 PM
-
-### Changes Made to Test Crash Prevention:
-
-**1. Disabled Heartbeat Tasks**
-- Removed `heartbeat` configurations from agent defaults
-- **Hypothesis**: Hourly heartbeats triggering cron jobs that hit API failures and crashed the gateway
-
-**2. Replaced Keychain Lookups with Direct API Keys**
-- Changed from: `$(security find-generic-password ...)`
-- Changed to: Direct environment variable references
-- **Reason**: Keychain lookups repeatedly failing with env var expansion errors
-
-**3. BrainLift Plugin**
-- Already disabled (enabled: false)
-- No changes needed
-
----
-
-## Current Working Configuration
-
-### Models:
-- **Primary**: openai/gpt-5.4 (via Responses API)
-- **Fallbacks**: anthropic/claude-sonnet-4-6, openai/gpt-4o, fireworks/kimi-k2p5
-
-### API Keys (via environment variables):
-- ANTHROPIC_API_KEY: Working
-- OPENAI_API_KEY: Working (via Responses API)
-- FIREWORKS_API_KEY: Working
-
-### Critical Config Settings:
-```json
-{
-  "models": {
-    "providers": {
-      "openai": {
-        "api": "openai-responses",
-        "baseUrl": "https://api.openai.com/v1",
-        "apiKey": "${PERSONAL_OPENAI_API_KEY}"
-      }
-    }
-  }
-}
-```
-
----
-
-## Open Questions
-
-1. **What triggers config file modifications?** Is it automatic or user-initiated?
-2. **Is OGP plugin causing instability?** Correlation suggests possible connection
-3. **Why are keychain lookups failing intermittently?** Sometimes work, sometimes fail
-4. **What is the expected behavior for "Agent listener invoked outside active run"?** Is this a known edge case?
-
----
-
-## Files Modified During Session
-
-- `$HOME/.openclaw/openclaw.json` (multiple times)
-  - API key configurations
-  - Model provider settings
-  - Agent model configurations
-  - Environment variables
-
----
-
-**Session Date**: April 7, 2026
-**OpenClaw Version**: 2026.4.5
-**Total Crashes**: Multiple
-**Average Uptime Between Crashes**: 10-20 minutes
-**Sanitized for Publication**: April 8, 2026