the-grid-cc 1.7.13 → 1.7.14
- package/02-SUMMARY.md +156 -0
- package/agents/grid-accountant.md +519 -0
- package/agents/grid-git-operator.md +661 -0
- package/agents/grid-researcher.md +421 -0
- package/agents/grid-scout.md +376 -0
- package/commands/grid/VERSION +1 -1
- package/commands/grid/branch.md +567 -0
- package/commands/grid/budget.md +438 -0
- package/commands/grid/daemon.md +637 -0
- package/commands/grid/init.md +375 -18
- package/commands/grid/mc.md +103 -1098
- package/commands/grid/resume.md +656 -0
- package/docs/BUDGET_SYSTEM.md +745 -0
- package/docs/DAEMON_ARCHITECTURE.md +780 -0
- package/docs/GIT_AUTONOMY.md +981 -0
- package/docs/MC_OPTIMIZATION.md +181 -0
- package/docs/MC_PROTOCOLS.md +950 -0
- package/docs/PERSISTENCE.md +962 -0
- package/docs/RESEARCH_FIRST.md +591 -0
- package/package.json +1 -1
@@ -0,0 +1,962 @@
# Grid State Persistence - Technical Design Document

**Version:** 1.0
**Author:** Program 5 (Persistence Orchestration)
**Date:** 2026-01-23

---

## Executive Summary

This document specifies the state persistence system that enables Grid missions to survive session death. When a session ends unexpectedly (context exhaustion, timeout, user disconnect), the Grid preserves complete state in `.grid/` files. A fresh Master Control can resume any in-progress mission from the exact stopping point using the `/grid:resume` command.

**Design Principles:**
- **File-only persistence** - No external databases, Redis, or services
- **Human-readable state** - All files are markdown/YAML for debugging
- **Checkpoint-based recovery** - Saga pattern with compensating actions
- **Warmth preservation** - Institutional knowledge survives across sessions

---

## Architecture Overview

### Saga Pattern Implementation

The Grid implements the **orchestration saga pattern** for distributed workflow management:

```
┌─────────────────────────────────────────────────────────────────────┐
│                          SAGA ORCHESTRATOR                          │
│                          (Master Control)                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Step 1       Step 2       Step 3       Step N                     │
│  ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐                    │
│  │Plan  │ ──▶ │Exec  │ ──▶ │Verify│ ──▶ │ ...  │                    │
│  └──────┘     └──────┘     └──────┘     └──────┘                    │
│     │            │            │            │                        │
│     ▼            ▼            ▼            ▼                        │
│  ┌──────┐     ┌──────┐     ┌──────┐     ┌──────┐                    │
│  │State │     │State │     │State │     │State │                    │
│  │ File │     │ File │     │ File │     │ File │                    │
│  └──────┘     └──────┘     └──────┘     └──────┘                    │
│                                                                     │
│                      STATE.md (Central State)                       │
└─────────────────────────────────────────────────────────────────────┘
```

**Recovery Modes:**
1. **Forward Recovery (Continuation)** - Resume from last checkpoint
2. **Backward Recovery (Compensation)** - Rollback partial work if needed
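
The two recovery modes can be illustrated with a minimal sketch of an orchestration saga. This is not the Grid's implementation: `SagaStep` and `run_saga` are hypothetical names introduced here, and each step is assumed to carry its own compensating action.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    """One saga step: a forward action plus the action that undoes it (hypothetical type)."""
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: List[SagaStep], start_at: int = 0) -> bool:
    """Run steps from `start_at`; on failure, compensate the steps completed in this run.

    `start_at` models forward recovery: a resumed session skips steps that an
    earlier session already finished. The `except` branch models backward
    recovery: completed steps are undone in reverse order.
    """
    done: List[SagaStep] = []
    for step in steps[start_at:]:
        try:
            step.action()
            done.append(step)
        except Exception:
            for finished in reversed(done):
                finished.compensate()
            return False
    return True
```

In this sketch a resumed Master Control would call `run_saga(steps, start_at=n)` with `n` taken from the persisted position, while a failed step triggers the compensations for whatever ran in the current session.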

### Checkpoint Hierarchy

State is captured at multiple granularities:

| Level | Frequency | File | Purpose |
|-------|-----------|------|---------|
| **Session** | On start/end | `STATE.md` | Top-level position tracking |
| **Wave** | After each wave | `WAVE_STATUS.md` | Wave completion status |
| **Block** | After each block | `*-SUMMARY.md` | Detailed work records |
| **Thread** | On checkpoint | `CHECKPOINT.md` | Interrupted thread state |
| **Discovery** | During work | `SCRATCHPAD.md` | Live findings |

---

## State That Must Survive

### 1. Execution Position

```yaml
# .grid/STATE.md
---
cluster: "james-weatherhead-blog"
session_id: "2026-01-23T14:30:00Z-blog"
status: interrupted  # active | checkpoint | interrupted | completed | failed

position:
  phase: 2
  phase_total: 4
  phase_name: "Authentication"
  block: 1
  block_total: 3
  wave: 2
  wave_total: 3
  thread: 3
  thread_total: 5
  thread_name: "Implement JWT refresh"

progress_percent: 45
energy_remaining: 7500
---
```
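
State files such as this one wrap their YAML between `---` lines, so a reader has to isolate that block before parsing it. A minimal sketch of that step follows; `split_front_matter` is a hypothetical helper, not a Grid function, and it assumes the file begins with the opening `---` line (the `# .grid/STATE.md` line in the example is taken to be a label, not file content).

```python
from typing import Tuple


def split_front_matter(text: str) -> Tuple[str, str]:
    """Return (front_matter, body) for a document delimited by '---' lines.

    If the text does not open with a '---' line, or the block is never
    closed, the front matter is empty and the whole text is the body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text
```

The returned front-matter string can then be handed to any YAML parser; the body keeps whatever markdown follows the closing delimiter.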

### 2. Completed Work

All completed work persists in SUMMARY files with commit hashes:

```yaml
# .grid/phases/01-foundation/01-SUMMARY.md
---
cluster: james-weatherhead-blog
block: 01
status: complete
completed_at: "2026-01-23T14:00:00Z"

commits:
  - hash: "abc123"
    message: "feat(01): Initialize Astro project"
    files: ["package.json", "astro.config.mjs"]
  - hash: "def456"
    message: "feat(01): Configure Tailwind dark mode"
    files: ["tailwind.config.mjs"]

artifacts_created:
  - path: "package.json"
    lines: 45
    verified: true
  - path: "astro.config.mjs"
    lines: 32
    verified: true
---
```

### 3. Pending Work

Plans not yet executed are preserved in their original form:

```
.grid/plans/
├── blog-block-01.md          # Contains full PLAN spec
├── blog-block-02.md
├── blog-block-03.md
├── blog-block-04.md
├── blog-block-05.md
├── blog-block-06.md
└── blog-PLAN-SUMMARY.md      # Master plan with wave structure
```

### 4. Interrupted Thread State

When a checkpoint or interruption occurs mid-thread, the thread state is written to `CHECKPOINT.md`:

```yaml
# .grid/CHECKPOINT.md
---
type: session_death  # human_verify | decision | human_action | session_death | failure
timestamp: "2026-01-23T15:45:00Z"
block: "02"
thread: 3
thread_name: "Implement dark mode toggle"

completed_threads:
  - id: "02.1"
    name: "Create BaseLayout"
    commit: "ghi789"
    status: complete
  - id: "02.2"
    name: "Build Header component"
    commit: "jkl012"
    status: complete

current_thread:
  id: "02.3"
  name: "Implement dark mode toggle"
  status: in_progress
  partial_work:
    files_created: ["src/components/DarkModeToggle.astro"]
    files_modified: ["src/layouts/BaseLayout.astro"]
    staged_changes: true
    last_action: "Writing localStorage persistence logic"

pending_threads:
  - id: "02.4"
    name: "CHECKPOINT: Human verification"
    status: pending
---
```
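
Given a parsed checkpoint of this shape, the work left in the block can be derived from its three thread lists. The sketch below is illustrative only: `remaining_thread_ids` is a hypothetical helper, and it assumes the checkpoint has already been parsed into a dict with the field names used in the example.

```python
from typing import List


def remaining_thread_ids(checkpoint: dict) -> List[str]:
    """List thread ids still to run: the interrupted thread first, then pending ones."""
    done = {t["id"] for t in checkpoint.get("completed_threads", [])}
    queue = []
    current = checkpoint.get("current_thread")
    if current and current.get("status") != "complete":
        queue.append(current["id"])
    queue.extend(t["id"] for t in checkpoint.get("pending_threads", []))
    # Drop anything already recorded as complete, preserving order.
    return [thread_id for thread_id in queue if thread_id not in done]
```

For the example checkpoint this yields `02.3` followed by `02.4`, so the interrupted thread is retried before the human-verification checkpoint.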

### 5. Warmth (Institutional Knowledge)

Accumulated learnings from all Programs:

```yaml
# .grid/WARMTH.md
---
cluster: james-weatherhead-blog
accumulated_from:
  - executor-01 (block 01)
  - executor-02 (block 02)
  - executor-03 (block 02 parallel)
---

## Codebase Patterns
- "This project uses Astro content collections for blog posts"
- "Dark mode uses class strategy with localStorage"
- "Tailwind typography plugin required for prose styling"

## Gotchas
- "Astro config must use .mjs extension for ESM"
- "Shiki syntax highlighting is build-time only"
- "Content collection schema in src/content/config.ts"

## User Preferences
- "User prefers minimal dependencies"
- "User wants dark mode as default"
- "No unnecessary abstractions"

## Decisions Made
- "Chose serverless adapter over static for future flexibility"
- "Using Shiki over Prism for better VS Code parity"

## Almost Did (Rejected)
- "Considered MDX but stuck with plain MD for simplicity"
- "Considered Zustand for state but localStorage sufficient"
```

### 6. User Decisions

Critical decisions made by the user during I/O Tower interactions:

```yaml
# .grid/DECISIONS.md
---
cluster: james-weatherhead-blog
---

## Decision 1: 2026-01-23T14:15:00Z
**Question:** Deploy to Vercel or Netlify?
**Options Presented:**
- vercel: "Native Astro support, edge functions"
- netlify: "Simpler config, build plugins"
**User Choice:** vercel
**Rationale:** "Already have Vercel account"
**Affects:** Block 01 (adapter choice), Block 06 (deploy config)

## Decision 2: 2026-01-23T14:30:00Z
**Question:** Dark mode default state?
**Options Presented:**
- light: "Traditional default"
- dark: "Modern preference"
- system: "Respect OS setting"
**User Choice:** dark
**Rationale:** "Developer blog, dark is expected"
**Affects:** Block 02 (toggle default)
```

### 7. Blockers Encountered

Issues that blocked progress:

```yaml
# .grid/BLOCKERS.md
---
cluster: james-weatherhead-blog
---

## Blocker 1: RESOLVED
**Block:** 02
**Thread:** 02.3
**Type:** dependency_missing
**Description:** @tailwindcss/typography not installed
**Resolution:** Added dependency in thread 02.1
**Resolved At:** 2026-01-23T14:20:00Z

## Blocker 2: ACTIVE
**Block:** 06
**Thread:** 06.3
**Type:** human_action_required
**Description:** Vercel CLI authentication needed
**Waiting For:** User to run `vercel login`
**Created At:** 2026-01-23T16:00:00Z
```

---

## Directory Structure

### Complete .grid/ Layout

```
.grid/
├── STATE.md                    # Central state file (ALWAYS read first)
├── CHECKPOINT.md               # Current checkpoint if interrupted
├── WARMTH.md                   # Accumulated institutional knowledge
├── DECISIONS.md                # User decisions log
├── BLOCKERS.md                 # Active and resolved blockers
├── SCRATCHPAD.md               # Live discoveries during execution
├── SCRATCHPAD_ARCHIVE.md       # Archived scratchpad entries
├── LEARNINGS.md                # Cross-project patterns
├── config.json                 # Grid configuration (model tier, etc.)
│
├── plans/                      # Execution plans
│   ├── {cluster}-PLAN-SUMMARY.md
│   ├── {cluster}-block-01.md
│   ├── {cluster}-block-02.md
│   └── ...
│
├── phases/                     # Execution artifacts
│   ├── 01-foundation/
│   │   ├── 01-PLAN.md          # Plan (copy from plans/)
│   │   └── 01-SUMMARY.md       # Completion record
│   ├── 02-auth/
│   │   ├── 02-PLAN.md
│   │   └── 02-SUMMARY.md
│   └── ...
│
├── debug/                      # Debug session state
│   └── {timestamp}-{slug}/
│       └── session.md
│
└── refinement/                 # Refinement swarm outputs
    ├── screenshots/
    ├── e2e/
    └── personas/
```

### File Ownership and Locking

| File | Written By | Read By | Update Frequency |
|------|------------|---------|------------------|
| `STATE.md` | MC | MC, Resume | Every wave |
| `CHECKPOINT.md` | Executor | MC, Resume | On interrupt |
| `WARMTH.md` | MC | Executors | After each block |
| `*-SUMMARY.md` | Executor | MC, Recognizer | On block complete |
| `SCRATCHPAD.md` | Executors | MC, Executors | During execution |
| `DECISIONS.md` | MC | All | On user decision |

---

## Resume Protocol

### Pre-Resume State Validation

Before resuming, validate state consistency:

```python
def validate_state():
    """Validate .grid/ state is consistent and resumable."""

    checks = [
        # 1. STATE.md exists and is parseable
        ("STATE.md exists", file_exists(".grid/STATE.md")),

        # 2. Claimed commits actually exist
        ("Commits exist", verify_commits(get_claimed_commits())),

        # 3. Claimed files exist
        ("Files exist", verify_files(get_claimed_artifacts())),

        # 4. No conflicting checkpoints
        ("Single checkpoint", count_checkpoints() <= 1),

        # 5. Plan files exist for pending blocks
        ("Plans available", verify_pending_plans()),
    ]

    return all(passed for _name, passed in checks)
```
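
The validation code calls `verify_commits` and `verify_files` without defining them. One possible shape for those two helpers is sketched below, assuming commits are checked with `git cat-file -e` and artifacts are plain files relative to a project root; the `repo_dir` and `root` parameters are additions made here for illustration.

```python
import os
import subprocess
from typing import Iterable


def verify_commits(hashes: Iterable[str], repo_dir: str = ".") -> bool:
    """Return True only if every hash names a commit in the repo at `repo_dir`."""
    for commit_hash in hashes:
        try:
            result = subprocess.run(
                ["git", "cat-file", "-e", f"{commit_hash}^{{commit}}"],
                cwd=repo_dir,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except FileNotFoundError:
            # git is not installed, so nothing can be verified
            return False
        if result.returncode != 0:
            return False
    return True


def verify_files(paths: Iterable[str], root: str = ".") -> bool:
    """Return True only if every claimed artifact exists as a file under `root`."""
    return all(os.path.isfile(os.path.join(root, path)) for path in paths)
```

Both helpers fail closed: a missing repository, a missing `git` binary, or a missing file makes the check return `False`, which would route the resume into recovery rather than continuing on unverified state.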

### Resume Flow

```
┌─────────────────────────────────────────────────────────────────────┐
│                            /grid:resume                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. READ STATE                                                      │
│     ├─ Load .grid/STATE.md                                          │
│     ├─ Parse position (phase/block/wave/thread)                     │
│     └─ Determine status (interrupted/checkpoint/failed)             │
│                                                                     │
│  2. VALIDATE STATE                                                  │
│     ├─ Verify commits exist in git                                  │
│     ├─ Verify claimed files exist                                   │
│     ├─ Check for state corruption                                   │
│     └─ If invalid: ENTER RECOVERY MODE                              │
│                                                                     │
│  3. RECONSTRUCT CONTEXT                                             │
│     ├─ Load WARMTH.md (institutional knowledge)                     │
│     ├─ Load DECISIONS.md (user decisions)                           │
│     ├─ Load CHECKPOINT.md if exists (interrupted thread)            │
│     ├─ Collect all SUMMARY.md files (completed work)                │
│     └─ Build execution context object                               │
│                                                                     │
│  4. DETERMINE RESUME POINT                                          │
│     ├─ If CHECKPOINT: Resume from interrupted thread                │
│     ├─ If wave complete: Start next wave                            │
│     ├─ If block complete: Start next block in wave                  │
│     └─ If session death: Resume from last checkpoint                │
│                                                                     │
│  5. SPAWN CONTINUATION                                              │
│     ├─ Pass warmth to fresh executor                                │
│     ├─ Pass completed_threads table                                 │
│     ├─ Pass remaining plan                                          │
│     └─ Execute from resume point                                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Context Reconstruction

When resuming, MC must rebuild complete context from files:

```python
def reconstruct_context():
    """Rebuild execution context from .grid/ files."""

    context = {
        "cluster": None,
        "position": None,
        "completed_blocks": [],
        "completed_threads": [],
        "pending_plans": [],
        "warmth": None,
        "decisions": [],
        "blockers": [],
        "checkpoint": None,
    }

    # 1. Load central state
    state = parse_yaml(read(".grid/STATE.md"))
    context["cluster"] = state["cluster"]
    context["position"] = state["position"]

    # 2. Collect completed work (summaries are named NN-SUMMARY.md)
    for summary_path in sorted(glob(".grid/phases/*/*-SUMMARY.md")):
        summary = parse_yaml(read(summary_path))
        context["completed_blocks"].append(summary)

        # Extract completed threads from summary
        for commit in summary.get("commits", []):
            context["completed_threads"].append({
                "block": summary["block"],
                "commit": commit["hash"],
                "files": commit["files"],
            })

    # 3. Load warmth
    if file_exists(".grid/WARMTH.md"):
        context["warmth"] = read(".grid/WARMTH.md")

    # 4. Load decisions
    if file_exists(".grid/DECISIONS.md"):
        context["decisions"] = parse_decisions(".grid/DECISIONS.md")

    # 5. Load checkpoint if exists
    if file_exists(".grid/CHECKPOINT.md"):
        context["checkpoint"] = parse_yaml(read(".grid/CHECKPOINT.md"))

    # 6. Identify pending plans (compare block numbers as integers, since
    #    summaries may store them as 1 or "01")
    completed = {int(b["block"]) for b in context["completed_blocks"]}
    for plan_path in sorted(glob(".grid/plans/*-block-*.md")):
        block_num = int(extract_block_number(plan_path))
        if block_num not in completed:
            context["pending_plans"].append({
                "path": plan_path,
                "block": block_num,
                "content": read(plan_path),
            })

    return context
```
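
The reconstruction code relies on an `extract_block_number` helper that the document does not define. A minimal sketch is given here, assuming plan files follow the `{cluster}-block-NN.md` naming shown in the directory layout; the regular expression and the error behaviour are choices made for this example.

```python
import re
from pathlib import PurePosixPath

# Matches the trailing "-block-NN.md" part of a plan filename.
_BLOCK_RE = re.compile(r"-block-(\d+)\.md$")


def extract_block_number(plan_path: str) -> int:
    """Return the block number encoded in a plan filename such as 'blog-block-02.md'."""
    match = _BLOCK_RE.search(PurePosixPath(plan_path).name)
    if match is None:
        raise ValueError(f"Not a block plan file: {plan_path!r}")
    return int(match.group(1))
```

Returning an integer keeps comparisons against completed blocks independent of zero padding, and raising on non-matching names keeps files like `blog-PLAN-SUMMARY.md` from being mistaken for a block plan.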

---

## Checkpoint Persistence

### Session Death Detection

Grid cannot always write a checkpoint before death. Detection strategies:

1. **Graceful shutdown** - MC writes checkpoint before context exhaustion
2. **Heartbeat timeout** - Scratchpad not updated in 10 minutes
3. **Partial state** - CHECKPOINT.md exists but no SUMMARY.md
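
The heartbeat strategy can be approximated by checking the modification time of `SCRATCHPAD.md`. The sketch below is one way to do that, not the Grid's code: `is_stale` and `STALE_AFTER_SECONDS` are hypothetical names, and treating a missing file as stale is an assumption made here.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 10 * 60  # the 10-minute heartbeat window described above


def is_stale(path: str, now: Optional[float] = None,
             max_age: float = STALE_AFTER_SECONDS) -> bool:
    """Return True if `path` is missing or was last modified more than `max_age` seconds ago."""
    try:
        modified = os.path.getmtime(path)
    except OSError:
        # Assumption: a scratchpad that cannot be read counts as a dead session.
        return True
    current = time.time() if now is None else now
    return (current - modified) > max_age
```

A resume command could call `is_stale(".grid/SCRATCHPAD.md")` when `STATE.md` still says `active`; passing `now` explicitly keeps the check easy to test.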

### Checkpoint Types

| Type | Trigger | State Captured | Resume Strategy |
|------|---------|----------------|-----------------|
| `human_verify` | Thread requires verification | Full thread state | Wait for user approval, continue |
| `human_action` | External action needed | Blocker details | Wait for user to act, continue |
| `decision` | Architectural fork | Options, context | Present options, continue on choice |
| `session_death` | Context exhaustion | Last known state | Validate and continue |
| `failure` | Unrecoverable error | Error details | Present to user, may require rollback |
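
The table can be encoded as a lookup so that the resume command dispatches on the checkpoint `type` field. This is a sketch under that assumption; `RESUME_STRATEGIES`, `NEEDS_USER`, and `resume_strategy` are hypothetical names, and the strategy strings are copied from the table.

```python
RESUME_STRATEGIES = {
    "human_verify": "Wait for user approval, continue",
    "human_action": "Wait for user to act, continue",
    "decision": "Present options, continue on choice",
    "session_death": "Validate and continue",
    "failure": "Present to user, may require rollback",
}

# Every type except session_death involves the user before work continues.
NEEDS_USER = frozenset(RESUME_STRATEGIES) - {"session_death"}


def resume_strategy(checkpoint_type: str) -> str:
    """Look up the resume strategy for a checkpoint type, rejecting unknown types."""
    if checkpoint_type not in RESUME_STRATEGIES:
        raise ValueError(f"Unknown checkpoint type: {checkpoint_type!r}")
    return RESUME_STRATEGIES[checkpoint_type]
```

Rejecting unknown types early means a checkpoint with a misspelled or outdated `type` is reported as a problem instead of being resumed with the wrong strategy.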

### Checkpoint File Format

```yaml
# .grid/CHECKPOINT.md
---
# Checkpoint metadata
type: human_verify
timestamp: "2026-01-23T15:45:00Z"
expires: null  # or ISO timestamp if time-sensitive

# Position
cluster: james-weatherhead-blog
block: "02"
wave: 2
thread: 4
thread_name: "CHECKPOINT: Human verification of layout"

# Progress
progress:
  blocks_complete: 1
  blocks_total: 6
  threads_complete: 3
  threads_total: 4

# Completed work (essential for continuation)
completed_threads:
  - id: "02.1"
    name: "Create BaseLayout"
    commit: "abc123"
    files: ["src/layouts/BaseLayout.astro"]
    status: complete
  - id: "02.2"
    name: "Build Header component"
    commit: "def456"
    files: ["src/components/Header.astro"]
    status: complete
  - id: "02.3"
    name: "Implement dark mode toggle"
    commit: "ghi789"
    files: ["src/components/DarkModeToggle.astro"]
    status: complete

# Current checkpoint details
checkpoint_details:
  verification_instructions: |
    1. Run: npm run dev
    2. Visit: http://localhost:4321
    3. Click dark mode toggle
    4. Verify theme persists on refresh
  expected_behavior: |
    - Toggle switches between light/dark
    - Theme persists after page reload
    - No flash of wrong theme (FOUC)

# User action needed
awaiting: "User to test dark mode and respond 'approved' or describe issues"

# Warmth for continuation
warmth:
  codebase_patterns:
    - "Astro components use .astro extension"
    - "Dark mode uses class strategy"
  gotchas:
    - "Must inline theme script in <head> to prevent FOUC"
  user_preferences:
    - "User prefers dark as default"
---
```

---

## State Update Protocol

### When to Update STATE.md

| Event | Update Action |
|-------|---------------|
| Wave starts | Set status: active, update wave position |
| Wave completes | Increment wave, update progress |
| Block completes | Increment block, log completion |
| Checkpoint hit | Set status: checkpoint, write CHECKPOINT.md |
| Session ending | Set status: interrupted, write checkpoint |
| Mission complete | Set status: completed, archive warmth |
| Failure | Set status: failed, write failure details |

### Atomic State Updates

State updates must be atomic to prevent corruption:

```python
import os
from datetime import datetime, timezone


def update_state(updates: dict):
    """Atomically update STATE.md."""

    # 1. Read current state
    current = parse_yaml(read(".grid/STATE.md"))

    # 2. Apply updates (UTC timestamp, matching the "Z" timestamps used elsewhere)
    merged = deep_merge(current, updates)
    merged["updated_at"] = datetime.now(timezone.utc).isoformat()

    # 3. Write to temp file first (same directory, so the rename stays on one filesystem)
    temp_path = ".grid/STATE.md.tmp"
    write(temp_path, to_yaml(merged))

    # 4. Atomic replace (os.replace overwrites the target; atomic on POSIX)
    os.replace(temp_path, ".grid/STATE.md")
```
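
`update_state` depends on a `deep_merge` helper that is not shown. A minimal sketch of one reasonable behaviour follows, assuming nested dicts such as `position` should be merged key by key while every other value is overwritten; the non-mutating design is a choice made for this example.

```python
import copy


def deep_merge(base: dict, updates: dict) -> dict:
    """Return a new dict where `updates` is merged into `base`, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

With this behaviour an update like `{"position": {"wave": 3}}` advances the wave without dropping `wave_total` or the other position fields, and the state that was read from disk is left untouched until the atomic replace.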

---

## Warmth Aggregation

### Warmth Collection

After each block completes, aggregate warmth:

```python
def aggregate_warmth(block_summary_path: str):
    """Aggregate warmth from block summary into WARMTH.md."""

    # 1. Parse block summary
    summary = parse_yaml(read(block_summary_path))
    block_warmth = summary.get("lessons_learned", {})

    # 2. Load existing warmth
    warmth_path = ".grid/WARMTH.md"
    if file_exists(warmth_path):
        existing = parse_yaml(read(warmth_path))
    else:
        existing = {}

    # 3. Merge (deduplicate) every knowledge category in the WARMTH.md schema
    categories = ["codebase_patterns", "gotchas", "user_preferences",
                  "decisions_made", "almost_did", "fragile_areas"]
    for category in categories:
        # setdefault also covers an existing file that lacks a category
        known = existing.setdefault(category, [])
        for item in block_warmth.get(category, []):
            if item not in known:
                known.append(item)

    # 4. Write aggregated warmth
    write(warmth_path, to_yaml(existing))
```

### Warmth Injection

When spawning continuation, inject aggregated warmth:

```python
def build_continuation_prompt(context):
    """Build prompt with warmth for continuation executor."""

    # reconstruct_context() stores plans under "pending_plans"; pick the plan
    # for the block named in the checkpoint
    resume_block = int(context["checkpoint"]["block"])
    pending_plan = next(
        plan for plan in context["pending_plans"]
        if int(plan["block"]) == resume_block
    )

    return f"""
First, read ~/.claude/agents/grid-executor.md for your role.

<warmth>
Accumulated knowledge from prior Programs:

{context["warmth"]}

Apply this warmth. Don't repeat mistakes. Build on discoveries.
</warmth>

<completed_threads>
{format_completed_threads(context["completed_threads"])}
</completed_threads>

<resume_point>
Continue from: Thread {context["checkpoint"]["thread"]}
User feedback: {context.get("user_feedback", "None")}
</resume_point>

<plan>
{pending_plan["content"]}
</plan>

Execute the remaining threads. Verify previous commits exist before starting.
"""
```
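
The prompt builder calls `format_completed_threads`, which is not defined in this document. Since the resume flow speaks of passing a "completed_threads table", one plausible rendering is a markdown table; the sketch below assumes each entry carries the `block`, `commit`, and `files` keys produced by context reconstruction, and the column layout is a guess.

```python
from typing import Iterable, Mapping


def format_completed_threads(threads: Iterable[Mapping]) -> str:
    """Render completed threads as a markdown table for the continuation prompt."""
    rows = ["| Block | Commit | Files |", "|-------|--------|-------|"]
    for thread in threads:
        files = ", ".join(thread.get("files", []))
        rows.append(f"| {thread['block']} | {thread['commit']} | {files} |")
    return "\n".join(rows)
```

An empty list still yields the two header rows, so the `<completed_threads>` section of the prompt is never blank.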

---

## Recovery Scenarios

### Scenario 1: Clean Checkpoint Resume

**Situation:** User approved checkpoint, MC context expired before continuation.

**State Files Present:**
- STATE.md (status: checkpoint)
- CHECKPOINT.md (type: human_verify, user_response: approved)
- WARMTH.md (accumulated)
- 01-SUMMARY.md (complete)

**Resume Action:**
1. Load state, verify checkpoint approved
2. Load warmth
3. Spawn executor with resume_point = next thread after checkpoint
4. Continue execution

### Scenario 2: Session Death Recovery

**Situation:** MC context exhausted mid-execution, no checkpoint written.

**State Files Present:**
- STATE.md (status: active, stale timestamp)
- SCRATCHPAD.md (recent entries)
- 01-SUMMARY.md (complete)
- No CHECKPOINT.md

**Resume Action:**
1. Detect stale state (no updates in 10+ minutes)
2. Read last scratchpad entry for position
3. Check git log for recent commits
4. Reconstruct position from commits + scratchpad
5. Create synthetic checkpoint
6. Resume from last known good state

### Scenario 3: Failure Recovery

**Situation:** Executor failed, partial work exists.

**State Files Present:**
- STATE.md (status: active)
- CHECKPOINT.md (type: failure, partial_work present)
- Uncommitted changes in working directory

**Resume Action:**
1. Present failure report to user
2. Options:
   - Rollback: `git reset --hard` to last commit (untracked files remain and need `git clean -fd`)
   - Retry: Spawn executor with failure context
   - Manual: User fixes, then continue
3. Update state based on choice
4. Continue or restart

### Scenario 4: Corrupted State

**Situation:** STATE.md is invalid or missing.

**Recovery Action:**
1. Scan for SUMMARY.md files to find completed work
2. Scan git log for Grid commits (commit message patterns)
3. Rebuild STATE.md from discovered state
4. Present reconstruction to user for approval
5. Resume or restart based on confidence

```python
def recover_corrupted_state():
    """Attempt to rebuild state from artifacts."""

    recovered = {
        "cluster": "unknown",
        "position": {},
        "completed_blocks": [],
        "recovered": True,
        "confidence": "medium",
    }

    # 1. Find summaries (named NN-SUMMARY.md); skip any that fail to parse
    summaries = glob(".grid/phases/*/*-SUMMARY.md")
    for s in summaries:
        try:
            data = parse_yaml(read(s))
            # Store block numbers as integers so max() and "+ 1" work below
            recovered["completed_blocks"].append(int(data["block"]))
            recovered["cluster"] = data.get("cluster", recovered["cluster"])
        except Exception:
            continue

    # 2. Find commits
    # --grep takes a basic regex, so "(" is literal; multiple --grep patterns are OR-ed
    commits = bash("git log --oneline --grep='feat(' --grep='fix(' -20")
    # Parse commits for block references

    # 3. Calculate position
    if recovered["completed_blocks"]:
        last_block = max(recovered["completed_blocks"])
        recovered["position"]["block"] = last_block + 1

    return recovered
```
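
The "Parse commits for block references" step is left open in the recovery code. A sketch of that parsing is given here, assuming Grid commits follow the `feat(NN): ...` / `fix(NN): ...` subject convention seen in the SUMMARY example; `blocks_from_commits` is a hypothetical helper.

```python
import re
from typing import Iterable, Set

# Matches subjects such as "feat(01): Initialize Astro project"
_COMMIT_RE = re.compile(r"\b(?:feat|fix)\((\d+)\):")


def blocks_from_commits(log_lines: Iterable[str]) -> Set[int]:
    """Collect block numbers referenced in `git log --oneline` output."""
    blocks = set()
    for line in log_lines:
        match = _COMMIT_RE.search(line)
        if match:
            blocks.add(int(match.group(1)))
    return blocks
```

Comparing this set with the blocks found in SUMMARY files is one way to decide the `confidence` value: commits for a block that has no summary suggest the block was started but never recorded as complete.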

---

## State File Schemas

### STATE.md Schema

```yaml
# Required fields
cluster: string           # Cluster name
session_id: string        # Unique session identifier
status: enum              # active | checkpoint | interrupted | completed | failed
created_at: ISO8601
updated_at: ISO8601

# Position tracking
position:
  phase: integer          # Current phase number
  phase_total: integer    # Total phases
  phase_name: string      # Phase name
  block: integer          # Current block number
  block_total: integer    # Total blocks
  wave: integer           # Current wave number
  wave_total: integer     # Total waves
  thread: integer         # Current thread number (optional)
  thread_total: integer   # Total threads in block (optional)

# Progress
progress_percent: integer # 0-100
energy_remaining: integer # Energy budget

# Execution mode
mode: enum                # autopilot | guided | hands_on

# Optional
error: string             # If status is failed
last_checkpoint: string   # Path to last checkpoint file
```
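
A schema like this can be checked right after `STATE.md` is parsed. The sketch below covers only the required fields, the `status` enum, and the `progress_percent` range; `state_errors` and its message strings are hypothetical, and it assumes the file has already been parsed into a dict.

```python
from typing import List

VALID_STATUSES = {"active", "checkpoint", "interrupted", "completed", "failed"}
REQUIRED_FIELDS = ("cluster", "session_id", "status", "created_at", "updated_at")


def state_errors(state: dict) -> List[str]:
    """Return a list of schema problems found in a parsed STATE.md mapping."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in state]
    status = state.get("status")
    if status is not None and status not in VALID_STATUSES:
        errors.append(f"invalid status: {status!r}")
    percent = state.get("progress_percent")
    if percent is not None and not (isinstance(percent, int) and 0 <= percent <= 100):
        errors.append(f"progress_percent out of range: {percent!r}")
    return errors
```

Returning a list instead of raising on the first problem lets the corrupted-state recovery path show the user everything that is wrong with the file at once.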
|
|
807
|
+
|
|
808
|
+
### CHECKPOINT.md Schema
|
|
809
|
+
|
|
810
|
+
```yaml
|
|
811
|
+
# Required
|
|
812
|
+
type: enum # human_verify | decision | human_action | session_death | failure
|
|
813
|
+
timestamp: ISO8601
|
|
814
|
+
cluster: string
|
|
815
|
+
block: string
|
|
816
|
+
thread: integer
|
|
817
|
+
thread_name: string
|
|
818
|
+
|
|
819
|
+
# Progress snapshot
|
|
820
|
+
progress:
|
|
821
|
+
blocks_complete: integer
|
|
822
|
+
blocks_total: integer
|
|
823
|
+
threads_complete: integer
|
|
824
|
+
threads_total: integer
|
|
825
|
+
|
|
826
|
+
# Completed work
|
|
827
|
+
completed_threads:
|
|
828
|
+
- id: string
|
|
829
|
+
name: string
|
|
830
|
+
commit: string
|
|
831
|
+
files: [string]
|
|
832
|
+
status: enum # complete | partial
|
|
833
|
+
|
|
834
|
+
# Checkpoint-specific
|
|
835
|
+
checkpoint_details: object # Type-specific content
|
|
836
|
+
awaiting: string # What user needs to do
|
|
837
|
+
|
|
838
|
+
# Optional
|
|
839
|
+
user_response: string # User's response when resuming
|
|
840
|
+
warmth: object # Accumulated knowledge to hand to the resuming session
|
|
841
|
+
expires: ISO8601 # If time-sensitive
|
|
842
|
+
```
|
|
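Two fields in the CHECKPOINT.md schema lend themselves to a quick pre-resume check: the `type` enum and the optional `expires` timestamp. A minimal sketch follows, assuming the checkpoint has been parsed into a `dict` and that timestamps carry a UTC offset; `checkpoint_is_usable` and `remaining_threads` are hypothetical helper names.

```python
from datetime import datetime, timezone

# Allowed values copied from the `type` comment in the schema above.
CHECKPOINT_TYPES = {"human_verify", "decision", "human_action", "session_death", "failure"}


def checkpoint_is_usable(checkpoint: dict, now: datetime) -> bool:
    """Check the type enum and the optional `expires` field (hypothetical helper)."""
    if checkpoint.get("type") not in CHECKPOINT_TYPES:
        return False
    expires = checkpoint.get("expires")
    if expires is None:
        return True  # `expires` is optional; absent means not time-sensitive
    # Both datetimes must be timezone-aware, otherwise the comparison raises TypeError.
    return now < datetime.fromisoformat(expires)


def remaining_threads(checkpoint: dict) -> int:
    """Threads still to run, computed from the progress snapshot."""
    progress = checkpoint["progress"]
    return progress["threads_total"] - progress["threads_complete"]
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)`, keeps the check deterministic in tests.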
843
|
+
|
|
844
|
+
### WARMTH.md Schema
|
|
845
|
+
|
|
846
|
+
```yaml
|
|
847
|
+
# Metadata
|
|
848
|
+
cluster: string
|
|
849
|
+
accumulated_from: [string] # List of program IDs
|
|
850
|
+
last_updated: ISO8601
|
|
851
|
+
|
|
852
|
+
# Knowledge categories
|
|
853
|
+
codebase_patterns: [string]
|
|
854
|
+
gotchas: [string]
|
|
855
|
+
user_preferences: [string]
|
|
856
|
+
decisions_made: [string]
|
|
857
|
+
almost_did: [string]
|
|
858
|
+
fragile_areas: [string]
|
|
859
|
+
```
|
|
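Because every knowledge category in the WARMTH.md schema is a list of strings, aggregation can be expressed as an order-preserving merge that skips duplicates. The sketch below assumes warmth is held as a plain `dict`; the function name `merge_warmth` and the exact-match deduplication rule are assumptions made for illustration.

```python
# Category names copied from the schema above.
CATEGORIES = (
    "codebase_patterns",
    "gotchas",
    "user_preferences",
    "decisions_made",
    "almost_did",
    "fragile_areas",
)


def merge_warmth(accumulated: dict, program_id: str, new: dict, now: str) -> dict:
    """Return a new warmth mapping with `new` entries appended, skipping exact duplicates."""
    merged = dict(accumulated)  # shallow copy; list values are rebuilt below
    for category in CATEGORIES:
        entries = list(accumulated.get(category, []))
        for entry in new.get(category, []):
            if entry not in entries:
                entries.append(entry)
        merged[category] = entries
    sources = list(accumulated.get("accumulated_from", []))
    if program_id not in sources:
        sources.append(program_id)
    merged["accumulated_from"] = sources
    merged["last_updated"] = now  # ISO 8601 string supplied by the caller
    return merged
```

The input mapping is left untouched, so a failed write can simply be retried with the same arguments.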
860
|
+
|
|
861
|
+
---
|
|
862
|
+
|
|
863
|
+
## Implementation Checklist
|
|
864
|
+
|
|
865
|
+
### Phase 1: Core Persistence (Required)
|
|
866
|
+
|
|
867
|
+
- [ ] STATE.md write on every wave complete
|
|
868
|
+
- [ ] CHECKPOINT.md write on checkpoint/interrupt
|
|
869
|
+
- [ ] WARMTH.md aggregation after block complete
|
|
870
|
+
- [ ] /grid:resume command implementation
|
|
871
|
+
- [ ] Basic context reconstruction
|
|
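Writing STATE.md on every wave only helps if a crash in the middle of a write cannot leave a half-written file behind. One common way to get that property with plain files is to write to a temporary file in the same directory and rename it over the target. This is a sketch of that technique, not the Grid's actual implementation; `write_atomically` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, text: str) -> None:
    """Write `text` to `path` so readers see either the old or the new file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # replaces the target in a single step
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A call such as `write_atomically(Path(".grid/STATE.md"), rendered_state)` would then run after each wave, where `rendered_state` stands for whatever text the caller has produced.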
872
|
+
|
|
873
|
+
### Phase 2: Enhanced Recovery (Recommended)
|
|
874
|
+
|
|
875
|
+
- [ ] Session death detection via scratchpad staleness
|
|
876
|
+
- [ ] Git-based state reconstruction
|
|
877
|
+
- [ ] Corrupted state recovery
|
|
878
|
+
- [ ] Rollback support
|
|
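Session death detection via scratchpad staleness can be approximated by comparing the scratchpad's modification time with the current time whenever the state still claims an active session. The sketch below is written under stated assumptions: the 600-second threshold, the helper name `session_looks_dead`, and the rule that a missing scratchpad counts as dead are all illustrative choices.

```python
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_SECONDS = 600  # hypothetical threshold; tune to the real write cadence


def session_looks_dead(status: str, scratchpad: Path, now: Optional[float] = None) -> bool:
    """True if the state claims an active session but the scratchpad has gone quiet."""
    if status != "active":
        return False  # checkpoints, failures and completions are explicit, not stale
    if not scratchpad.exists():
        return True  # assumption: an active session always has a scratchpad
    now = time.time() if now is None else now
    return now - scratchpad.stat().st_mtime > STALE_AFTER_SECONDS
```

Accepting `now` as a parameter makes the check testable without sleeping.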
879
|
+
|
|
880
|
+
### Phase 3: Advanced Features (Future)
|
|
881
|
+
|
|
882
|
+
- [ ] Multi-cluster support (multiple .grid/ directories)
|
|
883
|
+
- [ ] State diff visualization
|
|
884
|
+
- [ ] Time-travel debugging (restore any checkpoint)
|
|
885
|
+
- [ ] Cross-session analytics
|
|
886
|
+
|
|
887
|
+
---
|
|
888
|
+
|
|
889
|
+
## Testing the Persistence System
|
|
890
|
+
|
|
891
|
+
### Test 1: Clean Resume
|
|
892
|
+
|
|
893
|
+
```bash
|
|
894
|
+
# Start a mission
|
|
895
|
+
/grid
|
|
896
|
+
# Build a blog
|
|
897
|
+
|
|
898
|
+
# Wait for checkpoint
|
|
899
|
+
# User approves checkpoint
|
|
900
|
+
# Close terminal (simulate session death)
|
|
901
|
+
|
|
902
|
+
# New session
|
|
903
|
+
/grid:resume
|
|
904
|
+
# Should continue from approved checkpoint
|
|
905
|
+
```
|
|
906
|
+
|
|
907
|
+
### Test 2: Session Death Recovery
|
|
908
|
+
|
|
909
|
+
```bash
|
|
910
|
+
# Start a mission
|
|
911
|
+
/grid
|
|
912
|
+
# Build a blog
|
|
913
|
+
|
|
914
|
+
# Wait for mid-execution
|
|
915
|
+
# Kill terminal (SIGKILL)
|
|
916
|
+
|
|
917
|
+
# New session
|
|
918
|
+
/grid:resume
|
|
919
|
+
# Should detect stale state
|
|
920
|
+
# Should reconstruct from scratchpad + git
|
|
921
|
+
# Should continue from last known point
|
|
922
|
+
```
|
|
923
|
+
|
|
924
|
+
### Test 3: Failure Recovery
|
|
925
|
+
|
|
926
|
+
```bash
|
|
927
|
+
# Start a mission that will fail
|
|
928
|
+
# (e.g., missing API key)
|
|
929
|
+
|
|
930
|
+
# Executor returns failure
|
|
931
|
+
# MC writes failure checkpoint
|
|
932
|
+
|
|
933
|
+
# New session
|
|
934
|
+
/grid:resume
|
|
935
|
+
# Should present failure report
|
|
936
|
+
# Should offer rollback/retry/manual options
|
|
937
|
+
```
|
|
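The three tests above each expect a different behaviour from `/grid:resume`, which suggests a small dispatch on the persisted status, the checkpoint type, and whether the state looks stale. The following sketch encodes that mapping; the function name and the returned action labels are hypothetical and exist only to make the expectations testable.

```python
from typing import Optional


def choose_resume_action(status: str, checkpoint_type: Optional[str], stale: bool) -> str:
    """Map persisted state to the behaviour each test expects (hypothetical labels)."""
    if status == "failed" or checkpoint_type == "failure":
        return "present_failure_report"  # Test 3: then offer rollback/retry/manual
    if status == "active" and stale:
        return "reconstruct_from_scratchpad_and_git"  # Test 2: session died mid-execution
    if status in ("checkpoint", "interrupted"):
        return "continue_from_checkpoint"  # Test 1: resume from the approved checkpoint
    if status == "completed":
        return "nothing_to_resume"
    return "continue_from_state"
```

Keeping the decision in one pure function makes it straightforward to cover each test scenario with a unit test before exercising the full command.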
938
|
+
|
|
939
|
+
---
|
|
940
|
+
|
|
941
|
+
## References
|
|
942
|
+
|
|
943
|
+
### Research Sources
|
|
944
|
+
|
|
945
|
+
- [Mastering LangGraph Checkpointing: Best Practices for 2025](https://sparkco.ai/blog/mastering-langgraph-checkpointing-best-practices-for-2025) - LangGraph built-in checkpointing patterns
|
|
946
|
+
- [Build durable AI agents with LangGraph and Amazon DynamoDB](https://aws.amazon.com/blogs/database/build-durable-ai-agents-with-langgraph-and-amazon-dynamodb/) - AWS production persistence patterns
|
|
947
|
+
- [LangGraph & Redis: Build smarter AI agents](https://redis.io/blog/langgraph-redis-build-smarter-ai-agents-with-memory-persistence/) - Thread-level persistence
|
|
948
|
+
- [Saga Pattern in Microservices](https://microservices.io/patterns/data/saga.html) - Orchestration saga pattern
|
|
949
|
+
- [Saga Design Pattern - Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/patterns/saga) - Microsoft saga implementation
|
|
950
|
+
- [Saga orchestration pattern - AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/saga-orchestration.html) - AWS saga orchestration
|
|
951
|
+
|
|
952
|
+
### Key Principles Applied
|
|
953
|
+
|
|
954
|
+
1. **Checkpoint-based recovery** (LangGraph) - State saved at each step
|
|
955
|
+
2. **Orchestration saga** (microservices) - Central coordinator with compensation
|
|
956
|
+
3. **Human-in-the-loop** (LangGraph) - Pause for approval, resume without context loss
|
|
957
|
+
4. **Idempotent operations** (saga pattern) - Retryable steps
|
|
958
|
+
5. **File-based persistence** (Grid constraint) - No external services
|
|
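Principle 4 can be illustrated with the `completed_threads` list from the CHECKPOINT.md schema: a retry is safe to repeat if it skips threads already recorded as `complete` and reruns only `partial` or unrecorded ones. This is a minimal sketch under that assumption; `threads_to_run` is a hypothetical helper.

```python
def threads_to_run(all_thread_ids: list[str], completed_threads: list[dict]) -> list[str]:
    """Skip threads recorded as complete; rerun partial or unrecorded ones."""
    done = {t["id"] for t in completed_threads if t.get("status") == "complete"}
    return [thread_id for thread_id in all_thread_ids if thread_id not in done]
```

Calling it twice with the same checkpoint yields the same plan, which is what makes the retry step repeatable.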
959
|
+
|
|
960
|
+
---
|
|
961
|
+
|
|
962
|
+
End of Line.
|