the-grid-cc 1.7.12 → 1.7.14

@@ -0,0 +1,962 @@
1
+ # Grid State Persistence - Technical Design Document
2
+
3
+ **Version:** 1.0
4
+ **Author:** Program 5 (Persistence Orchestration)
5
+ **Date:** 2026-01-23
6
+
7
+ ---
8
+
9
+ ## Executive Summary
10
+
11
+ This document specifies the state persistence system that enables Grid missions to survive session death. When a session ends unexpectedly (context exhaustion, timeout, user disconnect), the Grid preserves complete state in `.grid/` files. A fresh Master Control can resume any in-progress mission from the exact stopping point using the `/grid:resume` command.
12
+
13
+ **Design Principles:**
14
+ - **File-only persistence** - No external databases, Redis, or services
15
+ - **Human-readable state** - All files are markdown/YAML for debugging
16
+ - **Checkpoint-based recovery** - Saga pattern with compensating actions
17
+ - **Warmth preservation** - Institutional knowledge survives across sessions
18
+
19
+ ---
20
+
21
+ ## Architecture Overview
22
+
23
+ ### Saga Pattern Implementation
24
+
25
+ The Grid implements the **orchestration saga pattern** for distributed workflow management:
26
+
27
+ ```
28
+ ┌─────────────────────────────────────────────────────────────────────┐
29
+ │ SAGA ORCHESTRATOR │
30
+ │ (Master Control) │
31
+ ├─────────────────────────────────────────────────────────────────────┤
32
+ │ │
33
+ │ Step 1 Step 2 Step 3 Step N │
34
+ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
35
+ │ │Plan │ ──▶ │Exec │ ──▶ │Verify│ ──▶ │ ... │ │
36
+ │ └──────┘ └──────┘ └──────┘ └──────┘ │
37
+ │ │ │ │ │ │
38
+ │ ▼ ▼ ▼ ▼ │
39
+ │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
40
+ │ │State │ │State │ │State │ │State │ │
41
+ │ │ File │ │ File │ │ File │ │ File │ │
42
+ │ └──────┘ └──────┘ └──────┘ └──────┘ │
43
+ │ │
44
+ │ STATE.md (Central State) │
45
+ └─────────────────────────────────────────────────────────────────────┘
46
+ ```
47
+
48
+ **Recovery Modes:**
49
+ 1. **Forward Recovery (Continuation)** - Resume from last checkpoint
50
+ 2. **Backward Recovery (Compensation)** - Rollback partial work if needed
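
The two recovery modes can be sketched as a small orchestration loop. This is a minimal illustration under stated assumptions, not the Grid's actual implementation: `Step`, `run_saga`, and the in-memory `done` list (standing in for the state files) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One saga step with its compensating action (hypothetical structure)."""
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]


def run_saga(steps: List[Step], done: List[str]) -> List[str]:
    """Run steps not yet listed in `done`; on failure, undo this run's work.

    Forward recovery: steps already named in `done` are skipped.
    Backward recovery: if a step raises, steps completed during this call
    are compensated in reverse order and the exception is re-raised.
    """
    completed_now: List[Step] = []
    for step in steps:
        if step.name in done:
            continue  # forward recovery: resume past finished work
        try:
            step.action()
        except Exception:
            for finished in reversed(completed_now):
                finished.compensate()
                done.remove(finished.name)
            raise
        completed_now.append(step)
        done.append(step.name)  # stands in for writing a state file
    return done
```

Passing a `done` list loaded from a previous session gives continuation; an exception inside a step triggers compensation for the work done in the current call only.
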
51
+
52
+ ### Checkpoint Hierarchy
53
+
54
+ State is captured at multiple granularities:
55
+
56
+ | Level | Frequency | File | Purpose |
57
+ |-------|-----------|------|---------|
58
+ | **Session** | On start/end | `STATE.md` | Top-level position tracking |
59
+ | **Wave** | After each wave | `WAVE_STATUS.md` | Wave completion status |
60
+ | **Block** | After each block | `*-SUMMARY.md` | Detailed work records |
61
+ | **Thread** | On checkpoint | `CHECKPOINT.md` | Interrupted thread state |
62
+ | **Discovery** | During work | `SCRATCHPAD.md` | Live findings |
63
+
64
+ ---
65
+
66
+ ## State That Must Survive
67
+
68
+ ### 1. Execution Position
69
+
70
+ ```yaml
+ # .grid/STATE.md
+ ---
+ cluster: "james-weatherhead-blog"
+ session_id: "2026-01-23T14:30:00Z-blog"
+ status: interrupted  # active | checkpoint | interrupted | completed | failed
+
+ position:
+   phase: 2
+   phase_total: 4
+   phase_name: "Authentication"
+   block: 1
+   block_total: 3
+   wave: 2
+   wave_total: 3
+   thread: 3
+   thread_total: 5
+   thread_name: "Implement JWT refresh"
+
+ progress_percent: 45
+ energy_remaining: 7500
+ ---
+ ```
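
Because the state files mix YAML frontmatter with a markdown body, a reader first has to separate the two. The sketch below shows one way to do that with the standard library only; `split_frontmatter` is a hypothetical helper, and the returned frontmatter text would still need a YAML parser (for example PyYAML's `yaml.safe_load`, assuming that dependency is acceptable).

```python
from typing import Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Split a state file into (frontmatter, body).

    Frontmatter is the text between the first two lines that consist of
    exactly '---'. Lines before the opening delimiter (for example a
    '# .grid/STATE.md' comment) are ignored. Raises ValueError when the
    delimiters are missing.
    """
    lines = text.splitlines()
    fences = [i for i, line in enumerate(lines) if line.strip() == "---"]
    if len(fences) < 2:
        raise ValueError("frontmatter delimiters not found")
    start, end = fences[0], fences[1]
    frontmatter = "\n".join(lines[start + 1:end])
    body = "\n".join(lines[end + 1:]).strip("\n")
    return frontmatter, body
```
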
93
+
94
+ ### 2. Completed Work
95
+
96
+ All completed work persists in SUMMARY files with commit hashes:
97
+
98
+ ```yaml
+ # .grid/phases/01-foundation/01-SUMMARY.md
+ ---
+ cluster: james-weatherhead-blog
+ block: 01
+ status: complete
+ completed_at: "2026-01-23T14:00:00Z"
+
+ commits:
+   - hash: "abc123"
+     message: "feat(01): Initialize Astro project"
+     files: ["package.json", "astro.config.mjs"]
+   - hash: "def456"
+     message: "feat(01): Configure Tailwind dark mode"
+     files: ["tailwind.config.mjs"]
+
+ artifacts_created:
+   - path: "package.json"
+     lines: 45
+     verified: true
+   - path: "astro.config.mjs"
+     lines: 32
+     verified: true
+ ---
+ ```
123
+
124
+ ### 3. Pending Work
125
+
126
+ Plans not yet executed are preserved in their original form:
127
+
128
+ ```
129
+ .grid/plans/
130
+ ├── blog-block-01.md # Contains full PLAN spec
131
+ ├── blog-block-02.md
132
+ ├── blog-block-03.md
133
+ ├── blog-block-04.md
134
+ ├── blog-block-05.md
135
+ ├── blog-block-06.md
136
+ └── blog-PLAN-SUMMARY.md # Master plan with wave structure
137
+ ```
138
+
139
+ ### 4. Interrupted Thread State
140
+
141
+ When a checkpoint or interruption occurs mid-thread:
142
+
143
+ ```yaml
144
+ # .grid/CHECKPOINT.md
145
+ ---
146
+ type: session_death # human_verify | human_action | decision | session_death | failure
147
+ timestamp: "2026-01-23T15:45:00Z"
148
+ block: "02"
149
+ thread: 3
150
+ thread_name: "Implement dark mode toggle"
151
+
152
+ completed_threads:
153
+ - id: "02.1"
154
+ name: "Create BaseLayout"
155
+ commit: "ghi789"
156
+ status: complete
157
+ - id: "02.2"
158
+ name: "Build Header component"
159
+ commit: "jkl012"
160
+ status: complete
161
+
162
+ current_thread:
163
+ id: "02.3"
164
+ name: "Implement dark mode toggle"
165
+ status: in_progress
166
+ partial_work:
167
+ files_created: ["src/components/DarkModeToggle.astro"]
168
+ files_modified: ["src/layouts/BaseLayout.astro"]
169
+ staged_changes: true
170
+ last_action: "Writing localStorage persistence logic"
171
+
172
+ pending_threads:
173
+ - id: "02.4"
174
+ name: "CHECKPOINT: Human verification"
175
+ status: pending
176
+ ---
177
+ ```
178
+
179
+ ### 5. Warmth (Institutional Knowledge)
180
+
181
+ Accumulated learnings from all Programs:
182
+
183
+ ```yaml
184
+ # .grid/WARMTH.md
185
+ ---
186
+ cluster: james-weatherhead-blog
187
+ accumulated_from:
188
+ - executor-01 (block 01)
189
+ - executor-02 (block 02)
190
+ - executor-03 (block 02 parallel)
191
+ ---
192
+
193
+ ## Codebase Patterns
194
+ - "This project uses Astro content collections for blog posts"
195
+ - "Dark mode uses class strategy with localStorage"
196
+ - "Tailwind typography plugin required for prose styling"
197
+
198
+ ## Gotchas
199
+ - "Astro config must use .mjs extension for ESM"
200
+ - "Shiki syntax highlighting is build-time only"
201
+ - "Content collection schema in src/content/config.ts"
202
+
203
+ ## User Preferences
204
+ - "User prefers minimal dependencies"
205
+ - "User wants dark mode as default"
206
+ - "No unnecessary abstractions"
207
+
208
+ ## Decisions Made
209
+ - "Chose serverless adapter over static for future flexibility"
210
+ - "Using Shiki over Prism for better VS Code parity"
211
+
212
+ ## Almost Did (Rejected)
213
+ - "Considered MDX but stuck with plain MD for simplicity"
214
+ - "Considered Zustand for state but localStorage sufficient"
215
+ ```
216
+
217
+ ### 6. User Decisions
218
+
219
+ Critical decisions made by the user during I/O Tower interactions:
220
+
221
+ ```yaml
222
+ # .grid/DECISIONS.md
223
+ ---
224
+ cluster: james-weatherhead-blog
225
+ ---
226
+
227
+ ## Decision 1: 2026-01-23T14:15:00Z
228
+ **Question:** Deploy to Vercel or Netlify?
229
+ **Options Presented:**
230
+ - vercel: "Native Astro support, edge functions"
231
+ - netlify: "Simpler config, build plugins"
232
+ **User Choice:** vercel
233
+ **Rationale:** "Already have Vercel account"
234
+ **Affects:** Block 01 (adapter choice), Block 06 (deploy config)
235
+
236
+ ## Decision 2: 2026-01-23T14:30:00Z
237
+ **Question:** Dark mode default state?
238
+ **Options Presented:**
239
+ - light: "Traditional default"
240
+ - dark: "Modern preference"
241
+ - system: "Respect OS setting"
242
+ **User Choice:** dark
243
+ **Rationale:** "Developer blog, dark is expected"
244
+ **Affects:** Block 02 (toggle default)
245
+ ```
246
+
247
+ ### 7. Blockers Encountered
248
+
249
+ Issues that blocked progress:
250
+
251
+ ```yaml
252
+ # .grid/BLOCKERS.md
253
+ ---
254
+ cluster: james-weatherhead-blog
255
+ ---
256
+
257
+ ## Blocker 1: RESOLVED
258
+ **Block:** 02
259
+ **Thread:** 02.3
260
+ **Type:** dependency_missing
261
+ **Description:** @tailwindcss/typography not installed
262
+ **Resolution:** Added dependency in thread 02.1
263
+ **Resolved At:** 2026-01-23T14:20:00Z
264
+
265
+ ## Blocker 2: ACTIVE
266
+ **Block:** 06
267
+ **Thread:** 06.3
268
+ **Type:** human_action_required
269
+ **Description:** Vercel CLI authentication needed
270
+ **Waiting For:** User to run `vercel login`
271
+ **Created At:** 2026-01-23T16:00:00Z
272
+ ```
273
+
274
+ ---
275
+
276
+ ## Directory Structure
277
+
278
+ ### Complete .grid/ Layout
279
+
280
+ ```
281
+ .grid/
282
+ ├── STATE.md # Central state file (ALWAYS read first)
283
+ ├── CHECKPOINT.md # Current checkpoint if interrupted
284
+ ├── WARMTH.md # Accumulated institutional knowledge
285
+ ├── DECISIONS.md # User decisions log
286
+ ├── BLOCKERS.md # Active and resolved blockers
287
+ ├── SCRATCHPAD.md # Live discoveries during execution
288
+ ├── SCRATCHPAD_ARCHIVE.md # Archived scratchpad entries
289
+ ├── LEARNINGS.md # Cross-project patterns
290
+ ├── config.json # Grid configuration (model tier, etc.)
291
+
292
+ ├── plans/ # Execution plans
293
+ │ ├── {cluster}-PLAN-SUMMARY.md
294
+ │ ├── {cluster}-block-01.md
295
+ │ ├── {cluster}-block-02.md
296
+ │ └── ...
297
+
298
+ ├── phases/ # Execution artifacts
299
+ │ ├── 01-foundation/
300
+ │ │ ├── 01-PLAN.md # Plan (copy from plans/)
301
+ │ │ └── 01-SUMMARY.md # Completion record
302
+ │ ├── 02-auth/
303
+ │ │ ├── 02-PLAN.md
304
+ │ │ └── 02-SUMMARY.md
305
+ │ └── ...
306
+
307
+ ├── debug/ # Debug session state
308
+ │ └── {timestamp}-{slug}/
309
+ │ └── session.md
310
+
311
+ └── refinement/ # Refinement swarm outputs
312
+ ├── screenshots/
313
+ ├── e2e/
314
+ └── personas/
315
+ ```
316
+
317
+ ### File Ownership and Locking
318
+
319
+ | File | Written By | Read By | Update Frequency |
320
+ |------|------------|---------|------------------|
321
+ | `STATE.md` | MC | MC, Resume | Every wave |
322
+ | `CHECKPOINT.md` | Executor | MC, Resume | On interrupt |
323
+ | `WARMTH.md` | MC | Executors | After each block |
324
+ | `*-SUMMARY.md` | Executor | MC, Recognizer | On block complete |
325
+ | `SCRATCHPAD.md` | Executors | MC, Executors | During execution |
326
+ | `DECISIONS.md` | MC | All | On user decision |
327
+
328
+ ---
329
+
330
+ ## Resume Protocol
331
+
332
+ ### Pre-Resume State Validation
333
+
334
+ Before resuming, validate state consistency:
335
+
336
+ ```python
+ def validate_state():
+     """Validate .grid/ state is consistent and resumable."""
+
+     checks = [
+         # 1. STATE.md exists and is parseable
+         ("STATE.md exists", file_exists(".grid/STATE.md")),
+
+         # 2. Claimed commits actually exist
+         ("Commits exist", verify_commits(get_claimed_commits())),
+
+         # 3. Claimed files exist
+         ("Files exist", verify_files(get_claimed_artifacts())),
+
+         # 4. No conflicting checkpoints
+         ("Single checkpoint", count_checkpoints() <= 1),
+
+         # 5. Plan files exist for pending blocks
+         ("Plans available", verify_pending_plans()),
+     ]
+
+     return all(check[1] for check in checks)
+ ```
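
A single boolean hides which check failed, which matters when deciding whether to enter recovery mode. As a hedged variation on the function above, the sketch below returns the names of failing checks instead. `failed_checks` and the lazy `(name, probe)` pairs are assumptions made for illustration, not part of the Grid.

```python
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def failed_checks(checks: List[Check]) -> List[str]:
    """Run every check and return the names of those that failed.

    A check that raises is treated as failed, so one broken probe
    cannot abort the whole validation pass.
    """
    failures = []
    for name, probe in checks:
        try:
            passed = bool(probe())
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures
```

An empty result means the state is resumable; a non-empty one can be shown to the user as the reason for entering recovery mode.
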
359
+
360
+ ### Resume Flow
361
+
362
+ ```
363
+ ┌─────────────────────────────────────────────────────────────────────┐
364
+ │ /grid:resume │
365
+ ├─────────────────────────────────────────────────────────────────────┤
366
+ │ │
367
+ │ 1. READ STATE │
368
+ │ ├─ Load .grid/STATE.md │
369
+ │ ├─ Parse position (phase/block/wave/thread) │
370
+ │ └─ Determine status (interrupted/checkpoint/failed) │
371
+ │ │
372
+ │ 2. VALIDATE STATE │
373
+ │ ├─ Verify commits exist in git │
374
+ │ ├─ Verify claimed files exist │
375
+ │ ├─ Check for state corruption │
376
+ │ └─ If invalid: ENTER RECOVERY MODE │
377
+ │ │
378
+ │ 3. RECONSTRUCT CONTEXT │
379
+ │ ├─ Load WARMTH.md (institutional knowledge) │
380
+ │ ├─ Load DECISIONS.md (user decisions) │
381
+ │ ├─ Load CHECKPOINT.md if exists (interrupted thread) │
382
+ │ ├─ Collect all SUMMARY.md files (completed work) │
383
+ │ └─ Build execution context object │
384
+ │ │
385
+ │ 4. DETERMINE RESUME POINT │
386
+ │ ├─ If CHECKPOINT: Resume from interrupted thread │
387
+ │ ├─ If wave complete: Start next wave │
388
+ │ ├─ If block complete: Start next block in wave │
389
+ │ └─ If session death: Resume from last checkpoint │
390
+ │ │
391
+ │ 5. SPAWN CONTINUATION │
392
+ │ ├─ Pass warmth to fresh executor │
393
+ │ ├─ Pass completed_threads table │
394
+ │ ├─ Pass remaining plan │
395
+ │ └─ Execute from resume point │
396
+ │ │
397
+ └─────────────────────────────────────────────────────────────────────┘
398
+ ```
399
+
400
+ ### Context Reconstruction
401
+
402
+ When resuming, MC must rebuild complete context from files:
403
+
404
+ ```python
+ def reconstruct_context():
+     """Rebuild execution context from .grid/ files."""
+
+     context = {
+         "cluster": None,
+         "position": None,
+         "completed_blocks": [],
+         "completed_threads": [],
+         "pending_plans": [],
+         "warmth": None,
+         "decisions": [],
+         "blockers": [],
+         "checkpoint": None,
+     }
+
+     # 1. Load central state
+     state = parse_yaml(read(".grid/STATE.md"))
+     context["cluster"] = state["cluster"]
+     context["position"] = state["position"]
+
+     # 2. Collect completed work (summary files are named like 01-SUMMARY.md)
+     for summary_path in glob(".grid/phases/*/*-SUMMARY.md"):
+         summary = parse_yaml(read(summary_path))
+         context["completed_blocks"].append(summary)
+
+         # Extract completed threads from summary
+         for commit in summary.get("commits", []):
+             context["completed_threads"].append({
+                 "block": summary["block"],
+                 "commit": commit["hash"],
+                 "files": commit["files"],
+             })
+
+     # 3. Load warmth
+     if file_exists(".grid/WARMTH.md"):
+         context["warmth"] = read(".grid/WARMTH.md")
+
+     # 4. Load decisions
+     if file_exists(".grid/DECISIONS.md"):
+         context["decisions"] = parse_decisions(".grid/DECISIONS.md")
+
+     # 5. Load checkpoint if exists
+     if file_exists(".grid/CHECKPOINT.md"):
+         context["checkpoint"] = parse_yaml(read(".grid/CHECKPOINT.md"))
+
+     # 6. Identify pending plans
+     for plan_path in glob(".grid/plans/*-block-*.md"):
+         block_num = extract_block_number(plan_path)
+         if block_num not in [b["block"] for b in context["completed_blocks"]]:
+             context["pending_plans"].append({
+                 "path": plan_path,
+                 "block": block_num,
+                 "content": read(plan_path),
+             })
+
+     return context
+ ```
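
The pending-plan step depends on `extract_block_number`, which the document does not define. The sketch below is one plausible version, assuming the `{cluster}-block-NN.md` naming shown in the directory layout. `normalize_block` is an additional hypothetical helper: summaries record `block: 01` while checkpoints use `"02"`, so comparing block ids is safer after coercing both sides to integers.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Assumes plan files are named "<cluster>-block-<NN>.md"
_BLOCK_RE = re.compile(r"-block-(\d+)\.md$")


def extract_block_number(plan_path: str) -> Optional[int]:
    """Return the block number encoded in a plan filename, or None."""
    name = PurePosixPath(plan_path).name
    match = _BLOCK_RE.search(name)
    return int(match.group(1)) if match else None


def normalize_block(value) -> int:
    """Coerce block ids such as 1, "01" or "02" to int for comparison."""
    return int(str(value))
```
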
462
+
463
+ ---
464
+
465
+ ## Checkpoint Persistence
466
+
467
+ ### Session Death Detection
468
+
469
+ The Grid cannot always write a checkpoint before a session dies, so session death is detected in one of three ways:
470
+
471
+ 1. **Graceful shutdown** - MC writes checkpoint before context exhaustion
472
+ 2. **Heartbeat timeout** - Scratchpad not updated in 10 minutes
473
+ 3. **Partial state** - CHECKPOINT.md exists but no SUMMARY.md
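
The heartbeat strategy can be approximated by comparing a file's modification time against the 10-minute threshold. This is a sketch only: `is_stale` is a hypothetical helper, and treating a missing file as stale is an assumption made here rather than documented Grid behavior.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 10 * 60  # the 10-minute heartbeat window


def is_stale(path: str, now: Optional[float] = None,
             max_age: float = STALE_AFTER_SECONDS) -> bool:
    """Return True if `path` is missing or older than `max_age` seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # assumption: no scratchpad means no live session
    return (now - mtime) > max_age
```

A resume command could call `is_stale(".grid/SCRATCHPAD.md")` while `STATE.md` still reports `status: active` to decide whether the previous session died.
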
474
+
475
+ ### Checkpoint Types
476
+
477
+ | Type | Trigger | State Captured | Resume Strategy |
478
+ |------|---------|----------------|-----------------|
479
+ | `human_verify` | Thread requires verification | Full thread state | Wait for user approval, continue |
480
+ | `human_action` | External action needed | Blocker details | Wait for user to act, continue |
481
+ | `decision` | Architectural fork | Options, context | Present options, continue on choice |
482
+ | `session_death` | Context exhaustion | Last known state | Validate and continue |
483
+ | `failure` | Unrecoverable error | Error details | Present to user, may require rollback |
484
+
485
+ ### Checkpoint File Format
486
+
487
+ ```yaml
+ # .grid/CHECKPOINT.md
+ ---
+ # Checkpoint metadata
+ type: human_verify
+ timestamp: "2026-01-23T15:45:00Z"
+ expires: null  # or ISO timestamp if time-sensitive
+
+ # Position
+ cluster: james-weatherhead-blog
+ block: "02"
+ wave: 2
+ thread: 4
+ thread_name: "CHECKPOINT: Human verification of layout"
+
+ # Progress
+ progress:
+   blocks_complete: 1
+   blocks_total: 6
+   threads_complete: 3
+   threads_total: 4
+
+ # Completed work (essential for continuation)
+ completed_threads:
+   - id: "02.1"
+     name: "Create BaseLayout"
+     commit: "abc123"
+     files: ["src/layouts/BaseLayout.astro"]
+     status: complete
+   - id: "02.2"
+     name: "Build Header component"
+     commit: "def456"
+     files: ["src/components/Header.astro"]
+     status: complete
+   - id: "02.3"
+     name: "Implement dark mode toggle"
+     commit: "ghi789"
+     files: ["src/components/DarkModeToggle.astro"]
+     status: complete
+
+ # Current checkpoint details
+ checkpoint_details:
+   verification_instructions: |
+     1. Run: npm run dev
+     2. Visit: http://localhost:4321
+     3. Click dark mode toggle
+     4. Verify theme persists on refresh
+   expected_behavior: |
+     - Toggle switches between light/dark
+     - Theme persists after page reload
+     - No flash of wrong theme (FOUC)
+
+ # User action needed
+ awaiting: "User to test dark mode and respond 'approved' or describe issues"
+
+ # Warmth for continuation
+ warmth:
+   codebase_patterns:
+     - "Astro components use .astro extension"
+     - "Dark mode uses class strategy"
+   gotchas:
+     - "Must inline theme script in <head> to prevent FOUC"
+   user_preferences:
+     - "User prefers dark as default"
+ ---
+ ```
553
+
554
+ ---
555
+
556
+ ## State Update Protocol
557
+
558
+ ### When to Update STATE.md
559
+
560
+ | Event | Update Action |
561
+ |-------|---------------|
562
+ | Wave starts | Set status: active, update wave position |
563
+ | Wave completes | Increment wave, update progress |
564
+ | Block completes | Increment block, log completion |
565
+ | Checkpoint hit | Set status: checkpoint, write CHECKPOINT.md |
566
+ | Session ending | Set status: interrupted, write checkpoint |
567
+ | Mission complete | Set status: completed, archive warmth |
568
+ | Failure | Set status: failed, write failure details |
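
The status side of this table can be expressed as a lookup. The sketch below is illustrative: the event names (`wave_started`, `checkpoint_hit`, and so on) are hypothetical identifiers chosen here, and only the status values come from the document.

```python
from enum import Enum


class Status(str, Enum):
    ACTIVE = "active"
    CHECKPOINT = "checkpoint"
    INTERRUPTED = "interrupted"
    COMPLETED = "completed"
    FAILED = "failed"


# Events that change `status`, taken from the table above.
# The event names themselves are hypothetical identifiers.
STATUS_ON_EVENT = {
    "wave_started": Status.ACTIVE,
    "checkpoint_hit": Status.CHECKPOINT,
    "session_ending": Status.INTERRUPTED,
    "mission_complete": Status.COMPLETED,
    "failure": Status.FAILED,
}


def status_update(event: str) -> dict:
    """Return the STATE.md update for an event, or {} if status is unchanged.

    Wave and block completion only move the position counters, so they
    produce no status change here.
    """
    status = STATUS_ON_EVENT.get(event)
    return {"status": status.value} if status is not None else {}
```
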
569
+
570
+ ### Atomic State Updates
571
+
572
+ State updates must be atomic to prevent corruption:
573
+
574
+ ```python
+ from datetime import datetime, timezone
+
+ def update_state(updates: dict):
+     """Atomically update STATE.md."""
+
+     # 1. Read current state
+     current = parse_yaml(read(".grid/STATE.md"))
+
+     # 2. Apply updates (UTC timestamp, matching the ISO8601 values used elsewhere)
+     merged = deep_merge(current, updates)
+     merged["updated_at"] = datetime.now(timezone.utc).isoformat()
+
+     # 3. Write to temp file first (same directory, so the rename stays on one filesystem)
+     temp_path = ".grid/STATE.md.tmp"
+     write(temp_path, to_yaml(merged))
+
+     # 4. Atomic rename (POSIX atomic)
+     rename(temp_path, ".grid/STATE.md")
+ ```
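
The `write` and `rename` helpers above are left abstract. A minimal standard-library sketch of the same write-then-rename technique follows; `atomic_write` is a hypothetical name, and the `fsync` call is an extra durability step assumed here rather than taken from the design.

```python
import os
import tempfile


def atomic_write(path: str, text: str) -> None:
    """Write `text` to `path` so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for os.replace to be an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".md")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

`os.replace` overwrites an existing destination on both POSIX and Windows, which is why it is used here instead of `os.rename`.
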
592
+
593
+ ---
594
+
595
+ ## Warmth Aggregation
596
+
597
+ ### Warmth Collection
598
+
599
+ After each block completes, aggregate warmth:
600
+
601
+ ```python
+ def aggregate_warmth(block_summary_path: str):
+     """Aggregate warmth from block summary into WARMTH.md."""
+
+     # 1. Parse block summary
+     summary = parse_yaml(read(block_summary_path))
+     block_warmth = summary.get("lessons_learned", {})
+
+     # 2. Load existing warmth
+     warmth_path = ".grid/WARMTH.md"
+     if file_exists(warmth_path):
+         existing = parse_yaml(read(warmth_path))
+     else:
+         existing = {
+             "codebase_patterns": [],
+             "gotchas": [],
+             "user_preferences": [],
+             "decisions_made": [],
+             "almost_did": [],
+         }
+
+     # 3. Merge (deduplicate); every category initialized above is merged
+     categories = ["codebase_patterns", "gotchas", "user_preferences",
+                   "decisions_made", "almost_did"]
+     for category in categories:
+         new_items = block_warmth.get(category, [])
+         # An existing WARMTH.md may lack a category, so create it on demand
+         bucket = existing.setdefault(category, [])
+         for item in new_items:
+             if item not in bucket:
+                 bucket.append(item)
+
+     # 4. Write aggregated warmth
+     write(warmth_path, to_yaml(existing))
+ ```
632
+
633
+ ### Warmth Injection
634
+
635
+ When spawning continuation, inject aggregated warmth:
636
+
637
+ ```python
+ def build_continuation_prompt(context):
+     """Build prompt with warmth for continuation executor."""
+
+     # reconstruct_context() stores plans under "pending_plans" (a list);
+     # this assumes the first entry is the block being resumed.
+     pending_plan = context["pending_plans"][0]
+
+     return f"""
+ First, read ~/.claude/agents/grid-executor.md for your role.
+
+ <warmth>
+ Accumulated knowledge from prior Programs:
+
+ {context["warmth"]}
+
+ Apply this warmth. Don't repeat mistakes. Build on discoveries.
+ </warmth>
+
+ <completed_threads>
+ {format_completed_threads(context["completed_threads"])}
+ </completed_threads>
+
+ <resume_point>
+ Continue from: Thread {context["checkpoint"]["thread"]}
+ User feedback: {context.get("user_feedback", "None")}
+ </resume_point>
+
+ <plan>
+ {pending_plan["content"]}
+ </plan>
+
+ Execute the remaining threads. Verify previous commits exist before starting.
+ """
+ ```
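
`format_completed_threads` is referenced above but not defined in this document. One possible shape, assuming each entry carries the `block`, `commit`, and `files` keys that `reconstruct_context()` produces, is a small markdown table:

```python
from typing import Dict, List


def format_completed_threads(threads: List[Dict]) -> str:
    """Render completed threads as a markdown table for the prompt."""
    if not threads:
        return "(none)"
    rows = ["| Block | Commit | Files |", "|-------|--------|-------|"]
    for thread in threads:
        files = ", ".join(thread.get("files", []))
        rows.append(f"| {thread['block']} | {thread['commit']} | {files} |")
    return "\n".join(rows)
```

The table format is a design choice for this sketch; any compact rendering that lets the fresh executor verify commits before starting would serve the same purpose.
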
668
+
669
+ ---
670
+
671
+ ## Recovery Scenarios
672
+
673
+ ### Scenario 1: Clean Checkpoint Resume
674
+
675
+ **Situation:** User approved checkpoint, MC context expired before continuation.
676
+
677
+ **State Files Present:**
678
+ - STATE.md (status: checkpoint)
679
+ - CHECKPOINT.md (type: human_verify, user_response: approved)
680
+ - WARMTH.md (accumulated)
681
+ - 01-SUMMARY.md (complete)
682
+
683
+ **Resume Action:**
684
+ 1. Load state, verify checkpoint approved
685
+ 2. Load warmth
686
+ 3. Spawn executor with resume_point = next thread after checkpoint
687
+ 4. Continue execution
688
+
689
+ ### Scenario 2: Session Death Recovery
690
+
691
+ **Situation:** MC context exhausted mid-execution, no checkpoint written.
692
+
693
+ **State Files Present:**
694
+ - STATE.md (status: active, stale timestamp)
695
+ - SCRATCHPAD.md (recent entries)
696
+ - 01-SUMMARY.md (complete)
697
+ - No CHECKPOINT.md
698
+
699
+ **Resume Action:**
700
+ 1. Detect stale state (no updates in 10+ minutes)
701
+ 2. Read last scratchpad entry for position
702
+ 3. Check git log for recent commits
703
+ 4. Reconstruct position from commits + scratchpad
704
+ 5. Create synthetic checkpoint
705
+ 6. Resume from last known good state
706
+
707
+ ### Scenario 3: Failure Recovery
708
+
709
+ **Situation:** Executor failed, partial work exists.
710
+
711
+ **State Files Present:**
712
+ - STATE.md (status: failed)
713
+ - CHECKPOINT.md (type: failure, partial_work present)
714
+ - Uncommitted changes in working directory
715
+
716
+ **Resume Action:**
717
+ 1. Present failure report to user
718
+ 2. Options:
719
+ - Rollback: `git reset --hard HEAD` to discard uncommitted changes to tracked files (untracked files also need `git clean -fd`)
720
+ - Retry: Spawn executor with failure context
721
+ - Manual: User fixes, then continue
722
+ 3. Update state based on choice
723
+ 4. Continue or restart
724
+
725
+ ### Scenario 4: Corrupted State
726
+
727
+ **Situation:** STATE.md is invalid or missing.
728
+
729
+ **Recovery Action:**
730
+ 1. Scan for SUMMARY.md files to find completed work
731
+ 2. Scan git log for Grid commits (commit message patterns)
732
+ 3. Rebuild STATE.md from discovered state
733
+ 4. Present reconstruction to user for approval
734
+ 5. Resume or restart based on confidence
735
+
736
+ ```python
+ def recover_corrupted_state():
+     """Attempt to rebuild state from artifacts."""
+
+     recovered = {
+         "cluster": "unknown",
+         "position": {},
+         "completed_blocks": [],
+         "recovered": True,
+         "confidence": "medium",
+     }
+
+     # 1. Find summaries (files are named like 01-SUMMARY.md)
+     summaries = glob(".grid/phases/*/*-SUMMARY.md")
+     for s in summaries:
+         try:
+             data = parse_yaml(read(s))
+             # Block ids may be stored as 1 or "01"; normalize so max() and +1 work
+             recovered["completed_blocks"].append(int(data["block"]))
+             recovered["cluster"] = data.get("cluster", recovered["cluster"])
+         except Exception:
+             # Skip unreadable or malformed summaries
+             continue
+
+     # 2. Find commits
+     commits = bash("git log --oneline --grep='feat(' --grep='fix(' -20")
+     # Parse commits for block references
+
+     # 3. Calculate position
+     if recovered["completed_blocks"]:
+         last_block = max(recovered["completed_blocks"])
+         recovered["position"]["block"] = last_block + 1
+
+     return recovered
+ ```
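
The "parse commits for block references" step is left open above. Assuming Grid commits follow the `feat(NN): ...` and `fix(NN): ...` subjects shown in the SUMMARY example, a sketch of that parsing could look like this. `blocks_from_git_log` is a hypothetical helper that takes the lines of `git log --oneline` output.

```python
import re
from typing import Iterable, List

# Matches lines like "abc123 feat(01): Initialize Astro project"
_SUBJECT_RE = re.compile(
    r"^(?P<hash>[0-9a-f]+)\s+(?:feat|fix)\((?P<block>\d+)\):"
)


def blocks_from_git_log(lines: Iterable[str]) -> List[int]:
    """Return sorted, de-duplicated block numbers found in the log lines."""
    blocks = set()
    for line in lines:
        match = _SUBJECT_RE.match(line.strip())
        if match:
            blocks.add(int(match.group("block")))
    return sorted(blocks)
```

Blocks found in git but missing a summary file are a signal that a block was interrupted, which could lower the `confidence` value reported to the user.
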
769
+
770
+ ---
771
+
772
+ ## State File Schemas
773
+
774
+ ### STATE.md Schema
775
+
776
+ ```yaml
+ # Required fields
+ cluster: string  # Cluster name
+ session_id: string  # Unique session identifier
+ status: enum  # active | checkpoint | interrupted | completed | failed
+ created_at: ISO8601
+ updated_at: ISO8601
+
+ # Position tracking
+ position:
+   phase: integer  # Current phase number
+   phase_total: integer  # Total phases
+   phase_name: string  # Phase name
+   block: integer  # Current block number
+   block_total: integer  # Total blocks
+   wave: integer  # Current wave number
+   wave_total: integer  # Total waves
+   thread: integer  # Current thread number (optional)
+   thread_total: integer  # Total threads in block (optional)
+
+ # Progress
+ progress_percent: integer  # 0-100
+ energy_remaining: integer  # Energy budget
+
+ # Execution mode
+ mode: enum  # autopilot | guided | hands_on
+
+ # Optional
+ error: string  # If status is failed
+ last_checkpoint: string  # Path to last checkpoint file
+ ```
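
A schema like this is most useful when it is checked at resume time. The sketch below validates an already-parsed state dictionary against the required fields and enum above. `state_schema_errors` is a hypothetical helper, and treating the `position` block as required is an assumption, since the schema only marks the first group of fields as required.

```python
from typing import Any, Dict, List

VALID_STATUSES = {"active", "checkpoint", "interrupted", "completed", "failed"}
REQUIRED_FIELDS = ("cluster", "session_id", "status", "created_at", "updated_at")
# Assumption: position counters are required; thread fields stay optional.
REQUIRED_POSITION = ("phase", "phase_total", "block", "block_total",
                     "wave", "wave_total")


def state_schema_errors(state: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in state:
            errors.append(f"missing field: {field}")
    if "status" in state and state["status"] not in VALID_STATUSES:
        errors.append(f"invalid status: {state['status']}")
    position = state.get("position")
    if not isinstance(position, dict):
        errors.append("missing field: position")
    else:
        for key in REQUIRED_POSITION:
            if not isinstance(position.get(key), int):
                errors.append(f"position.{key} must be an integer")
    percent = state.get("progress_percent")
    if percent is not None and not (isinstance(percent, int) and 0 <= percent <= 100):
        errors.append("progress_percent must be an integer in 0-100")
    return errors
```
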
807
+
808
+ ### CHECKPOINT.md Schema
809
+
810
+ ```yaml
+ # Required
+ type: enum  # human_verify | decision | human_action | session_death | failure
+ timestamp: ISO8601
+ cluster: string
+ block: string
+ thread: integer
+ thread_name: string
+
+ # Progress snapshot
+ progress:
+   blocks_complete: integer
+   blocks_total: integer
+   threads_complete: integer
+   threads_total: integer
+
+ # Completed work
+ completed_threads:
+   - id: string
+     name: string
+     commit: string
+     files: [string]
+     status: enum  # complete | partial
+
+ # Checkpoint-specific
+ checkpoint_details: object  # Type-specific content
+ awaiting: string  # What user needs to do
+
+ # Optional
+ user_response: string  # User's response when resuming
+ warmth: object  # Warmth for continuation
+ expires: ISO8601  # If time-sensitive
+ ```
843
+
844
+ ### WARMTH.md Schema
845
+
846
+ ```yaml
847
+ # Metadata
848
+ cluster: string
849
+ accumulated_from: [string] # List of program IDs
850
+ last_updated: ISO8601
851
+
852
+ # Knowledge categories
853
+ codebase_patterns: [string]
854
+ gotchas: [string]
855
+ user_preferences: [string]
856
+ decisions_made: [string]
857
+ almost_did: [string]
858
+ fragile_areas: [string]
859
+ ```
860
+
861
+ ---
862
+
863
+ ## Implementation Checklist
864
+
865
+ ### Phase 1: Core Persistence (Required)
866
+
867
+ - [ ] STATE.md write on every wave complete
868
+ - [ ] CHECKPOINT.md write on checkpoint/interrupt
869
+ - [ ] WARMTH.md aggregation after block complete
870
+ - [ ] /grid:resume command implementation
871
+ - [ ] Basic context reconstruction
872
+
873
+ ### Phase 2: Enhanced Recovery (Recommended)
874
+
875
+ - [ ] Session death detection via scratchpad staleness
876
+ - [ ] Git-based state reconstruction
877
+ - [ ] Corrupted state recovery
878
+ - [ ] Rollback support
879
+
880
+ ### Phase 3: Advanced Features (Future)
881
+
882
+ - [ ] Multi-cluster support (multiple .grid/ directories)
883
+ - [ ] State diff visualization
884
+ - [ ] Time-travel debugging (restore any checkpoint)
885
+ - [ ] Cross-session analytics
886
+
887
+ ---
888
+
889
+ ## Testing the Persistence System
890
+
891
+ ### Test 1: Clean Resume
892
+
893
+ ```bash
894
+ # Start a mission
895
+ /grid
896
+ # Build a blog
897
+
898
+ # Wait for checkpoint
899
+ # User approves checkpoint
900
+ # Close terminal (simulate session death)
901
+
902
+ # New session
903
+ /grid:resume
904
+ # Should continue from approved checkpoint
905
+ ```
906
+
907
+ ### Test 2: Session Death Recovery
908
+
909
+ ```bash
910
+ # Start a mission
911
+ /grid
912
+ # Build a blog
913
+
914
+ # Wait for mid-execution
915
+ # Kill terminal (SIGKILL)
916
+
917
+ # New session
918
+ /grid:resume
919
+ # Should detect stale state
920
+ # Should reconstruct from scratchpad + git
921
+ # Should continue from last known point
922
+ ```
923
+
924
+ ### Test 3: Failure Recovery
925
+
926
+ ```bash
927
+ # Start a mission that will fail
928
+ # (e.g., missing API key)
929
+
930
+ # Executor returns failure
931
+ # MC writes failure checkpoint
932
+
933
+ # New session
934
+ /grid:resume
935
+ # Should present failure report
936
+ # Should offer rollback/retry/manual options
937
+ ```
938
+
939
+ ---
940
+
941
+ ## References
942
+
943
+ ### Research Sources
944
+
945
+ - [Mastering LangGraph Checkpointing: Best Practices for 2025](https://sparkco.ai/blog/mastering-langgraph-checkpointing-best-practices-for-2025) - LangGraph built-in checkpointing patterns
946
+ - [Build durable AI agents with LangGraph and Amazon DynamoDB](https://aws.amazon.com/blogs/database/build-durable-ai-agents-with-langgraph-and-amazon-dynamodb/) - AWS production persistence patterns
947
+ - [LangGraph & Redis: Build smarter AI agents](https://redis.io/blog/langgraph-redis-build-smarter-ai-agents-with-memory-persistence/) - Thread-level persistence
948
+ - [Saga Pattern in Microservices](https://microservices.io/patterns/data/saga.html) - Orchestration saga pattern
949
+ - [Saga Design Pattern - Azure Architecture Center](https://learn.microsoft.com/en-us/azure/architecture/patterns/saga) - Microsoft saga implementation
950
+ - [Saga orchestration pattern - AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/saga-orchestration.html) - AWS saga orchestration
951
+
952
+ ### Key Principles Applied
953
+
954
+ 1. **Checkpoint-based recovery** (LangGraph) - State saved at each step
955
+ 2. **Orchestration saga** (microservices) - Central coordinator with compensation
956
+ 3. **Human-in-the-loop** (LangGraph) - Pause for approval, resume without context loss
957
+ 4. **Idempotent operations** (saga pattern) - Retryable steps
958
+ 5. **File-based persistence** (Grid constraint) - No external services
959
+
960
+ ---
961
+
962
+ End of Line.