groundswell 0.0.1

Files changed (120)
  1. package/.claude/settings.local.json +9 -0
  2. package/.claude/system_prompts/task-breakdown.md +100 -0
  3. package/PRPs/001-hierarchical-workflow-engine.md +2438 -0
  4. package/PRPs/PRDs/001-hierarchical-workflow-engine.md +543 -0
  5. package/PRPs/PRDs/002-agent-prompt.md +390 -0
  6. package/PRPs/PRDs/003-agent-prompt.md +943 -0
  7. package/PRPs/PRDs/004-agent-prompt.md +1136 -0
  8. package/PRPs/PRDs/tasks-001.json +492 -0
  9. package/PRPs/README.md +83 -0
  10. package/PRPs/templates/prp_base.md +222 -0
  11. package/README.md +218 -0
  12. package/docs/agent.md +422 -0
  13. package/docs/prompt.md +419 -0
  14. package/docs/workflow.md +600 -0
  15. package/examples/README.md +244 -0
  16. package/examples/examples/01-basic-workflow.ts +100 -0
  17. package/examples/examples/02-decorator-options.ts +217 -0
  18. package/examples/examples/03-parent-child.ts +241 -0
  19. package/examples/examples/04-observers-debugger.ts +340 -0
  20. package/examples/examples/05-error-handling.ts +387 -0
  21. package/examples/examples/06-concurrent-tasks.ts +352 -0
  22. package/examples/examples/07-agent-loops.ts +432 -0
  23. package/examples/examples/08-sdk-features.ts +667 -0
  24. package/examples/examples/09-reflection.ts +573 -0
  25. package/examples/examples/10-introspection.ts +550 -0
  26. package/examples/index.ts +143 -0
  27. package/examples/utils/helpers.ts +57 -0
  28. package/llms_full.txt +5890 -0
  29. package/package.json +63 -0
  30. package/plan/P1P2/PRP.md +527 -0
  31. package/plan/P1P2/research/LRU_CACHE_BEST_PRACTICES.md +1929 -0
  32. package/plan/P1P2/research/LRU_CACHE_CODE_PATTERNS.md +857 -0
  33. package/plan/P1P2/research/LRU_CACHE_INTEGRATION_GUIDE.md +738 -0
  34. package/plan/P1P2/research/LRU_CACHE_RESEARCH_INDEX.md +424 -0
  35. package/plan/P1P2/research/REFLECTION_INDEX.md +291 -0
  36. package/plan/P1P2/research/REFLECTION_RESEARCH_REPORT.md +1342 -0
  37. package/plan/P1P2/research/RESEARCH_SUMMARY.md +342 -0
  38. package/plan/P1P2/research/anthropic-sdk.md +174 -0
  39. package/plan/P1P2/research/async-local-storage.md +200 -0
  40. package/plan/P1P2/research/reflection-code-patterns.md +1205 -0
  41. package/plan/P1P2/research/reflection-decision-matrix.md +421 -0
  42. package/plan/P1P2/research/reflection-implementation-guide.md +1341 -0
  43. package/plan/P1P2/research/reflection-integration-guide.md +834 -0
  44. package/plan/P1P2/research/reflection-patterns.md +1468 -0
  45. package/plan/P1P2/research/reflection-quick-reference.md +558 -0
  46. package/plan/P1P2/research/zod-schema.md +152 -0
  47. package/plan/P3P4/PRP.md +1388 -0
  48. package/plan/P3P4/research/caching-lru.md +116 -0
  49. package/plan/P3P4/research/introspection-tools.md +177 -0
  50. package/plan/P3P4/research/reflection-patterns.md +117 -0
  51. package/plan/P4P5/PRP.md +1136 -0
  52. package/plan/P4P5/research/RESEARCH_SUMMARY.md +151 -0
  53. package/plan/architecture/external_deps.md +358 -0
  54. package/plan/architecture/system_context.md +242 -0
  55. package/plan/backlog.json +867 -0
  56. package/plan/research/INTROSPECTION_RESEARCH_SUMMARY.md +378 -0
  57. package/plan/research/README-INTROSPECTION.md +352 -0
  58. package/plan/research/agent-introspection-patterns.md +1085 -0
  59. package/plan/research/introspection-security-guide.md +928 -0
  60. package/plan/research/introspection-tool-examples.md +875 -0
  61. package/scripts/generate-llms-full.ts +206 -0
  62. package/src/__tests__/integration/agent-workflow.test.ts +256 -0
  63. package/src/__tests__/integration/tree-mirroring.test.ts +114 -0
  64. package/src/__tests__/unit/agent.test.ts +169 -0
  65. package/src/__tests__/unit/cache-key.test.ts +182 -0
  66. package/src/__tests__/unit/cache.test.ts +172 -0
  67. package/src/__tests__/unit/context.test.ts +138 -0
  68. package/src/__tests__/unit/decorators.test.ts +100 -0
  69. package/src/__tests__/unit/introspection-tools.test.ts +277 -0
  70. package/src/__tests__/unit/prompt.test.ts +135 -0
  71. package/src/__tests__/unit/reflection.test.ts +210 -0
  72. package/src/__tests__/unit/tree-debugger.test.ts +85 -0
  73. package/src/__tests__/unit/workflow.test.ts +81 -0
  74. package/src/cache/cache-key.ts +244 -0
  75. package/src/cache/cache.ts +236 -0
  76. package/src/cache/index.ts +8 -0
  77. package/src/core/agent.ts +573 -0
  78. package/src/core/context.ts +119 -0
  79. package/src/core/event-tree.ts +260 -0
  80. package/src/core/factory.ts +123 -0
  81. package/src/core/index.ts +17 -0
  82. package/src/core/logger.ts +87 -0
  83. package/src/core/mcp-handler.ts +184 -0
  84. package/src/core/prompt.ts +150 -0
  85. package/src/core/workflow-context.ts +349 -0
  86. package/src/core/workflow.ts +302 -0
  87. package/src/debugger/index.ts +1 -0
  88. package/src/debugger/tree-debugger.ts +210 -0
  89. package/src/decorators/index.ts +3 -0
  90. package/src/decorators/observed-state.ts +95 -0
  91. package/src/decorators/step.ts +139 -0
  92. package/src/decorators/task.ts +96 -0
  93. package/src/examples/index.ts +2 -0
  94. package/src/examples/tdd-orchestrator.ts +65 -0
  95. package/src/examples/test-cycle-workflow.ts +64 -0
  96. package/src/index.ts +140 -0
  97. package/src/reflection/index.ts +5 -0
  98. package/src/reflection/reflection.ts +407 -0
  99. package/src/tools/index.ts +36 -0
  100. package/src/tools/introspection.ts +464 -0
  101. package/src/types/agent.ts +90 -0
  102. package/src/types/decorators.ts +25 -0
  103. package/src/types/error-strategy.ts +13 -0
  104. package/src/types/error.ts +20 -0
  105. package/src/types/events.ts +74 -0
  106. package/src/types/index.ts +55 -0
  107. package/src/types/logging.ts +24 -0
  108. package/src/types/observer.ts +18 -0
  109. package/src/types/prompt.ts +40 -0
  110. package/src/types/reflection.ts +117 -0
  111. package/src/types/sdk-primitives.ts +128 -0
  112. package/src/types/snapshot.ts +14 -0
  113. package/src/types/workflow-context.ts +163 -0
  114. package/src/types/workflow.ts +37 -0
  115. package/src/utils/id.ts +11 -0
  116. package/src/utils/index.ts +3 -0
  117. package/src/utils/observable.ts +77 -0
  118. package/tasks.json +0 -0
  119. package/tsconfig.json +22 -0
  120. package/vitest.config.ts +16 -0
@@ -0,0 +1,1468 @@
1
+ # AI Agent Reflection & Self-Correction Patterns - Research Summary
2
+
3
+ **Date**: December 2025
4
+ **Focus**: Comprehensive research on reflection and self-correction patterns for AI agent frameworks
5
+
6
+ ## Table of Contents
7
+
8
+ 1. [Reflection in AI Systems](#reflection-in-ai-systems)
9
+ 2. [Reflection Levels & Triggers](#reflection-levels--triggers)
10
+ 3. [Implementation Patterns](#implementation-patterns)
11
+ 4. [Reflection Prompt Templates](#reflection-prompt-templates)
12
+ 5. [Existing Framework Approaches](#existing-framework-approaches)
13
+ 6. [Best Practices & Guardrails](#best-practices--guardrails)
14
+ 7. [When NOT to Reflect](#when-not-to-reflect)
15
+ 8. [State Capture Patterns](#state-capture-patterns)
16
+ 9. [Code Implementation Examples](#code-implementation-examples)
17
+
18
+ ---
19
+
20
+ ## Reflection in AI Systems
21
+
22
+ ### What is Reflection?
23
+
24
+ Reflection in AI agent contexts refers to an agent's ability to **think about its own actions and results in order to self-correct and improve**. It's essentially the AI analog of human introspection or "System 2" deliberative thinking. Rather than merely reacting instinctively, a reflective AI will pause to analyze what it has done, identify errors or suboptimal steps, and adjust its strategy.
25
+
26
+ Key insight: Agents that can check and improve their own output are fundamentally more reliable because they catch mistakes before they compound, self-correct when they drift, and get better as they iterate.
27
+
28
+ ### Core Components of Reflection
29
+
30
+ The reflection pattern typically follows a three-phase cycle:
31
+
32
+ 1. **Generation** - The model creates an initial output based on a prompt
33
+ 2. **Reflection** - The AI critiques its own work, identifying areas for improvement
34
+ 3. **Iteration/Refinement** - The AI refines its output based on feedback and continues until quality thresholds are met
35
+
36
+ ### Why Reflection Matters
37
+
38
+ Research demonstrates significant performance improvements:
39
+ - **Reflexion (Shinn et al., 2023)**: Improved over baseline agents on decision-making (AlfWorld), reasoning (HotPotQA), and programming (HumanEval) tasks
40
+ - **CRITIC (Gou et al., 2024)**: Showed 10-30% improvement in accuracy across multiple domains
41
+ - **Reflexion + GPT-4**: Reached 91% pass@1 on the HumanEval coding benchmark, versus 80% for GPT-4 without reflection
42
+
43
+ ---
44
+
45
+ ## Reflection Levels & Triggers
46
+
47
+ ### Three Levels of Reflection
48
+
49
+ #### 1. **Prompt-Level Reflection**
50
+ - Occurs at the level of a single prompt and its response
51
+ - Model is prompted to "check your work" after generation
52
+ - Lightweight, uses only one additional prompt
53
+ - Good for: Basic quality improvement, simple validations
54
+ - Cost: 1 additional LLM call
55
+
56
+ #### 2. **Agent-Level Reflection**
57
+ - Occurs between tool calls or action sequences
58
+ - Agent pauses after each major step to evaluate progress
59
+ - Can include both self-assessment and external tool feedback
60
+ - Good for: Multi-step tasks, tool-based workflows
61
+ - Cost: Adds latency but improves task success
62
+
63
+ #### 3. **Workflow-Level Reflection**
64
+ - Occurs at the orchestration level
65
+ - Multiple agents or sub-tasks are evaluated together
66
+ - Captures systemic improvements and pattern recognition
67
+ - Good for: Complex multi-agent systems, long-running workflows
68
+ - Cost: Significant additional compute, reserved for high-value tasks
69
+
70
+ ### Trigger Mechanisms
71
+
72
+ #### **Error-Driven Reflection** (Most Common)
73
+ Triggered when:
74
+ - Tool call fails with error
75
+ - Output validation rules fail
76
+ - Test/assertion fails
77
+ - Response status codes indicate problems
78
+
79
+ Pattern:
80
+ ```
81
+ Output → Validate → Error Detected → Reflect → Retry
82
+ ```
83
+
84
+ #### **Low-Confidence Reflection**
85
+ Triggered when:
86
+ - Model expresses uncertainty in output
87
+ - Confidence score below threshold
88
+ - Multiple alternative interpretations exist
89
+ - Ambiguous user input detected
90
+
91
+ Mechanism:
92
+ - Use certainty tokens or confidence metadata
93
+ - Leverage model's own uncertainty assessment
94
+ - Self-Reflection Certainty: Ask model "Does this seem correct to you?"
95
+ - Dynamically adjust confidence as model reasons (chain-of-thought)
96
+
97
+ #### **Manual/Explicit Triggers**
98
+ - User explicitly requests reflection
99
+ - Scheduled checkpoints in workflow
100
+ - Budget-based (after N tokens/steps)
101
+ - Performance-based (when metric drops below threshold)
102
+
103
+ #### **Progress-Based Triggers**
104
+ - No progress after N iterations
105
+ - State duplication (same output returned twice)
106
+ - Repeated error patterns
107
+ - Timeout approaching
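
The trigger categories above can be folded into one decision function. The sketch below is illustrative only: `StepSignals`, `shouldReflect`, and both default thresholds are assumptions made for this example and do not come from any framework.

```typescript
// Hypothetical bundle of signals collected after each agent step.
interface StepSignals {
  error?: string;                    // tool, validation, or test failure message
  confidence?: number;               // self-reported confidence in [0, 1], if available
  userRequested?: boolean;           // manual/explicit trigger
  iterationsWithoutProgress: number; // progress-based trigger input
  output: string;                    // current output
  previousOutputs: string[];         // earlier outputs, used for duplicate detection
}

type Trigger =
  | "error"
  | "manual"
  | "low_confidence"
  | "duplicate_state"
  | "no_progress";

// Returns the first trigger that fires, or null when no reflection is needed.
function shouldReflect(
  s: StepSignals,
  confidenceThreshold = 0.7, // assumed default
  maxStalledIterations = 2   // assumed default
): Trigger | null {
  if (s.error !== undefined) return "error";
  if (s.userRequested) return "manual";
  if (s.confidence !== undefined && s.confidence < confidenceThreshold) {
    return "low_confidence";
  }
  if (s.previousOutputs.indexOf(s.output) !== -1) return "duplicate_state";
  if (s.iterationsWithoutProgress >= maxStalledIterations) return "no_progress";
  return null;
}
```

Checking error-driven triggers first reflects their "most common" status; the ordering of the remaining checks is a design choice, not a requirement.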
108
+
109
+ ---
110
+
111
+ ## Implementation Patterns
112
+
113
+ ### Pattern 1: Error Detection → Reflection → Retry Loop
114
+
115
+ **Core Cycle:**
116
+ ```
117
+ 1. Generate Solution
118
+ 2. Execute/Validate → Capture Error
119
+ 3. Reflect: "What went wrong? Why did this fail?"
120
+ 4. Retry with improved strategy
121
+ 5. Loop until success or max attempts reached
122
+ ```
123
+
124
+ **Key Variables:**
125
+ - `max_retry_limit`: Total retry attempts (default: 3-5)
126
+ - `retry_count`: Current attempt number
127
+ - `error_context`: Captured error message/state
128
+ - `previous_attempts`: History of what was tried
129
+
130
+ **Implementation Considerations:**
131
+ - Always include error message in reflection prompt
132
+ - Maintain history of previous attempts to avoid loops
133
+ - Implement exponential backoff for API calls
134
+ - Track which approaches failed to suggest different strategies
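
One way to carry the key variables through the loop is a small state object plus a prompt builder that embeds the latest error and the attempt history. This is a minimal sketch; `RetryState`, `recordFailure`, `canRetry`, and `buildReflectionPrompt` are hypothetical names, and the prompt wording is only an example.

```typescript
// Illustrative state mirroring the key variables listed above.
interface RetryState {
  maxRetryLimit: number;
  retryCount: number;
  errorContext: string | null;
  previousAttempts: { output: string; error: string }[];
}

// Returns a new state with the failed attempt appended to the history.
function recordFailure(state: RetryState, output: string, error: string): RetryState {
  return {
    ...state,
    retryCount: state.retryCount + 1,
    errorContext: error,
    previousAttempts: [...state.previousAttempts, { output, error }],
  };
}

function canRetry(state: RetryState): boolean {
  return state.retryCount < state.maxRetryLimit;
}

// Builds a reflection prompt that always includes the latest error message
// and the full attempt history, so the model can avoid repeating itself.
function buildReflectionPrompt(task: string, state: RetryState): string {
  const history = state.previousAttempts
    .map((a, i) => `Attempt ${i + 1}: ${a.output}\n  Failed with: ${a.error}`)
    .join("\n");
  return [
    `Original Task: ${task}`,
    `Error encountered: ${state.errorContext ?? "(none)"}`,
    `Previous attempts (do not repeat these approaches):\n${history}`,
    `What went wrong, and what different strategy should be tried next?`,
  ].join("\n\n");
}
```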
135
+
136
+ ### Pattern 2: Instruction-Following Validation (IFE)
137
+
138
+ Treats LLM outputs as **untrusted inputs requiring explicit validation**.
139
+
140
+ **Flow:**
141
+ ```
142
+ Agent Generates Output
143
+
144
+ Validation Checkpoint:
145
+ - Check instruction compliance
146
+ - Verify format requirements
147
+ - Validate output constraints
148
+
149
+ If Violations Detected:
150
+ - Log specific failures
151
+ - Refine prompt with violation details
152
+ - Retry (up to max attempts)
153
+
154
+ If Passed:
155
+ - Accept and proceed
156
+ ```
157
+
158
+ **Example Constraints:**
159
+ ```
160
+ - Time estimates: numeric only, 0-4.0 range, no units
161
+ - Function names: snake_case, no special characters
162
+ - Response format: valid JSON, specific schema
163
+ - Length: within min/max bounds
164
+ ```
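
The example constraints above can be expressed as small composable checks. This is a sketch under stated assumptions: the `Check` type, the validator names, and the violation messages are all invented for illustration.

```typescript
// Each check returns a violation message, or null when the constraint holds.
type Check = (output: string) => string | null;

// "Time estimates: numeric only, 0-4.0 range, no units"
const timeEstimate: Check = (o) => {
  const trimmed = o.trim();
  const ok = /^\d+(\.\d+)?$/.test(trimmed) && Number(trimmed) <= 4.0;
  return ok ? null : "time estimate must be a bare number between 0 and 4.0";
};

// "Function names: snake_case, no special characters"
const snakeCase: Check = (o) =>
  /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/.test(o) ? null : "function name must be snake_case";

// "Response format: valid JSON" (schema validation is out of scope here)
const validJson: Check = (o) => {
  try {
    JSON.parse(o);
    return null;
  } catch {
    return "response must be valid JSON";
  }
};

// "Length: within min/max bounds"
const lengthBetween = (min: number, max: number): Check => (o) =>
  o.length >= min && o.length <= max
    ? null
    : `length must be within ${min}-${max} characters`;

// Runs every check and collects the violations to feed back into the retry prompt.
function validate(output: string, checks: Check[]): string[] {
  return checks.map((c) => c(output)).filter((v): v is string => v !== null);
}
```

Returning messages rather than booleans lets the "refine prompt with violation details" step reuse them directly.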
165
+
166
+ ### Pattern 3: Reflexion Architecture
167
+
168
+ Separates three distinct models/roles:
169
+
170
+ 1. **Actor** - Generates text and actions using Chain-of-Thought or ReAct
171
+ 2. **Evaluator** - Scores outputs by assigning reward signals
172
+ 3. **Self-Reflection** - Generates verbal feedback using rewards and memory
173
+
174
+ **Flow:**
175
+ ```
176
+ Task Definition
177
+
178
+ Generate Initial Trajectory (Actor)
179
+
180
+ Evaluate Outcome (Evaluator assigns reward score)
181
+
182
+ Generate Reflection (Self-Reflection creates verbal feedback)
183
+
184
+ Store in Memory
185
+
186
+ Generate Next Trajectory (with reflection context)
187
+ ```
188
+
189
+ Advantages:
190
+ - Structured feedback mechanism
191
+ - Interpretable reflection output
192
+ - Can learn from feedback across multiple attempts
193
+ - Grounded in external signals (rewards)
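
The Actor / Evaluator / Self-Reflection split can be sketched as a loop over three injected functions. This is not the reference Reflexion implementation: `ReflexionRoles`, `runReflexion`, and the default trial count and success threshold are assumptions, and in practice each role would wrap a model or tool call.

```typescript
// Roles are injected as plain functions so the loop stays model-agnostic.
interface ReflexionRoles {
  actor: (task: string, memory: string[]) => string;           // produces a trajectory
  evaluator: (trajectory: string) => number;                   // reward score in [0, 1]
  selfReflect: (trajectory: string, reward: number) => string; // verbal feedback
}

interface ReflexionResult {
  trajectory: string;
  reward: number;
  trials: number;
  memory: string[];
}

function runReflexion(
  task: string,
  roles: ReflexionRoles,
  maxTrials = 3,       // assumed default
  successThreshold = 1 // assumed default: only a perfect reward stops early
): ReflexionResult {
  const memory: string[] = []; // verbal reflections carried across trials
  let trajectory = "";
  let reward = 0;
  for (let trial = 1; trial <= maxTrials; trial++) {
    trajectory = roles.actor(task, memory);
    reward = roles.evaluator(trajectory);
    if (reward >= successThreshold) {
      return { trajectory, reward, trials: trial, memory };
    }
    // Store the reflection so the next trajectory is generated with it as context.
    memory.push(roles.selfReflect(trajectory, reward));
  }
  return { trajectory, reward, trials: maxTrials, memory };
}
```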
194
+
195
+ ### Pattern 4: Tool-Enhanced Reflection
196
+
197
+ Agent uses external tools to verify correctness before self-reflection.
198
+
199
+ **Tools Used:**
200
+ - Unit tests / test cases
201
+ - Code linters (for TypeScript/Python)
202
+ - Web search to verify facts
203
+ - APIs to validate data
204
+ - Sandbox execution to catch runtime errors
205
+
206
+ **Flow:**
207
+ ```
208
+ Generate Code
209
+
210
+ Run Tests/Linter
211
+
212
+ Capture Feedback
213
+
214
+ Reflect on Specific Failures
215
+
216
+ Revise Based on Concrete Evidence
217
+ ```
218
+
219
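+ **Key Insight**: Statically typed languages (for example, TypeScript rather than plain JavaScript) add compiler errors as another layer of automatic feedback alongside tests and linters, which gives reflection more concrete evidence to work with.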
220
+
221
+ ### Pattern 5: Multi-Agent Reflection
222
+
223
+ Rather than self-reflection, deploy two specialized agents:
224
+ 1. **Generator Agent** - Prompted to produce outputs
225
+ 2. **Critic Agent** - Prompted to provide constructive criticism
226
+
227
+ **Flow:**
228
+ ```
229
+ Generator creates output
230
+
231
+ Critic reviews and provides specific feedback:
232
+ - What works well
233
+ - What's missing or wrong
234
+ - Specific improvement suggestions
235
+
236
+ Generator receives critique
237
+
238
+ Revised output
239
+
240
+ (Loop up to N times or until satisfied)
241
+ ```
242
+
243
+ Benefits:
244
+ - More diverse feedback (different reasoning path)
245
+ - Can leverage specialized critic models
246
+ - Dialogue creates interactive improvement
247
+ - Often produces better results than self-reflection alone
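
A generator-critic exchange might look like the following sketch. The `Critique` shape mirrors the three feedback categories in the flow above; the type names, `generateWithCritic`, and the stopping rule (no remaining issues) are assumptions for illustration.

```typescript
// Structured feedback matching the critic's three categories.
interface Critique {
  worksWell: string[];
  issues: string[];
  suggestions: string[];
}

type Generator = (task: string, critique: Critique | null) => string;
type Critic = (task: string, output: string) => Critique;

function generateWithCritic(
  task: string,
  generate: Generator,
  critique: Critic,
  maxRounds = 3 // assumed default for "loop up to N times"
): { output: string; rounds: number; satisfied: boolean } {
  let feedback: Critique | null = null;
  let output = "";
  for (let round = 1; round <= maxRounds; round++) {
    // The first round has no critique; later rounds revise against the latest one.
    output = generate(task, feedback);
    feedback = critique(task, output);
    if (feedback.issues.length === 0) {
      return { output, rounds: round, satisfied: true };
    }
  }
  return { output, rounds: maxRounds, satisfied: false };
}
```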
248
+
249
+ ---
250
+
251
+ ## Reflection Prompt Templates
252
+
253
+ ### Template 1: Basic Self-Critique (Lightweight)
254
+
255
+ ```
256
+ Original Task: [TASK]
257
+
258
+ Your previous response:
259
+ [RESPONSE]
260
+
261
+ Please review your response for:
262
+ 1. Accuracy - Is the information correct?
263
+ 2. Completeness - Did you address all aspects of the task?
264
+ 3. Clarity - Is it easy to understand?
265
+ 4. Potential improvements - What could be better?
266
+
267
+ Identify any issues and provide a revised response.
268
+ ```
269
+
270
+ **Cost**: Single additional LLM call
271
+ **Best for**: Quick quality improvements, simple tasks
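
Templates like this one can be filled programmatically. The helper below is a hypothetical sketch: `fillTemplate` and `BASIC_SELF_CRITIQUE` are names invented here, and the pattern only matches simple placeholders such as `[TASK]` (placeholders containing options or slashes would need a looser pattern).

```typescript
// Replaces [PLACEHOLDER] tokens with supplied values.
// Throws on a missing value so an unfilled placeholder never reaches the model.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\[([A-Z_]+)\]/g, (match: string, key: string): string => {
    const value = values[key];
    if (value === undefined) {
      throw new Error(`Missing value for placeholder ${match}`);
    }
    return value;
  });
}

// Template 1, stored as a constant.
const BASIC_SELF_CRITIQUE = [
  "Original Task: [TASK]",
  "",
  "Your previous response:",
  "[RESPONSE]",
  "",
  "Please review your response for:",
  "1. Accuracy - Is the information correct?",
  "2. Completeness - Did you address all aspects of the task?",
  "3. Clarity - Is it easy to understand?",
  "4. Potential improvements - What could be better?",
  "",
  "Identify any issues and provide a revised response.",
].join("\n");
```

Usage: `fillTemplate(BASIC_SELF_CRITIQUE, { TASK: "...", RESPONSE: "..." })` yields the prompt for the single additional call.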
272
+
273
+ ---
274
+
275
+ ### Template 2: Error-Context Reflection (With Feedback)
276
+
277
+ ```
278
+ Original Task: [TASK]
279
+
280
+ Your previous attempt:
281
+ [PREVIOUS_RESPONSE]
282
+
283
+ Error encountered: [ERROR_MESSAGE]
284
+
285
+ Analysis: What specifically caused this error?
286
+ - Root cause analysis
287
+ - What assumption was wrong?
288
+ - What information was missing?
289
+
290
+ Revised approach: Provide a corrected solution that addresses the specific error.
291
+ Explain your reasoning for why this approach will work better.
292
+ ```
293
+
294
+ **Cost**: Single additional LLM call with rich context
295
+ **Best for**: Recovery from errors, learning from failures
296
+
297
+ ---
298
+
299
+ ### Template 3: Expert Persona Reflection
300
+
301
+ ```
302
+ Original Task: [TASK]
303
+
304
+ Response to evaluate:
305
+ [RESPONSE]
306
+
307
+ You are now a [EXPERT_ROLE: code reviewer | technical architect | quality assurance specialist].
308
+ Review the above response from the perspective of [EXPERT_ROLE].
309
+
310
+ Specifically evaluate:
311
+ 1. [TECHNICAL_CRITERIA]
312
+ 2. [BEST_PRACTICES]
313
+ 3. [EDGE_CASES]
314
+ 4. [PERFORMANCE/QUALITY_METRICS]
315
+
316
+ Provide your expert assessment and specific improvements.
317
+ ```
318
+
319
+ **Cost**: Single additional LLM call
320
+ **Best for**: Complex technical outputs, code, architectural decisions
321
+
322
+ ---
323
+
324
+ ### Template 4: Structured Reflection with Rubric
325
+
326
+ ```
327
+ Original Task: [TASK]
328
+
329
+ Generated Output:
330
+ [OUTPUT]
331
+
332
+ Evaluation Rubric:
333
+ 1. Requirement A: [DESCRIPTION]
334
+ Status: ✓ Met / ✗ Not Met
335
+ If not met, why?
336
+
337
+ 2. Requirement B: [DESCRIPTION]
338
+ Status: ✓ Met / ✗ Not Met
339
+ If not met, why?
340
+
341
+ [... for each requirement ...]
342
+
343
+ Summary:
344
+ - Which requirements were NOT met?
345
+ - Specific fixes needed for each failure
346
+ - Revised output addressing all requirements
347
+
348
+ Provide corrected output that meets all requirements.
349
+ ```
350
+
351
+ **Cost**: Single additional LLM call
352
+ **Best for**: Tasks with explicit criteria, validation-heavy workflows
353
+
354
+ ---
355
+
356
+ ### Template 5: Confidence-Triggered Reflection
357
+
358
+ ```
359
+ Original Task: [TASK]
360
+
361
+ Your response:
362
+ [RESPONSE]
363
+
364
+ Before we proceed, please evaluate your own confidence:
365
+ 1. How confident are you that this response is correct? (0-100%)
366
+ 2. What aspects are you uncertain about?
367
+ 3. What additional information would increase your confidence?
368
+
369
+ If confidence < 80%:
370
+ - Identify specific sources of uncertainty
371
+ - Provide alternative approaches you considered
372
+ - Suggest how to verify your answer
373
+ - Offer a revised response with higher confidence
374
+ ```
375
+
376
+ **Cost**: Single additional call with conditional branching
377
+ **Best for**: High-stakes decisions, complex problem-solving
378
+
379
+ ---
380
+
381
+ ### Template 6: Multi-Turn Reflection Loop
382
+
383
+ ```
384
+ ROUND 1 - Initial Generation:
385
+ [INITIAL_PROMPT]
386
+
387
+ ROUND 2 - Self-Critique:
388
+ "Review your response for: correctness, completeness, clarity, and efficiency.
389
+ Identify specific issues."
390
+
391
+ [CRITIQUE_FROM_PREVIOUS_ROUND]
392
+
393
+ ROUND 3 - Improvement:
394
+ "Based on the identified issues, provide an improved version.
395
+ Explain what you changed and why."
396
+
397
+ [CONTINUE_FOR_UP_TO_N_ROUNDS]
398
+
399
+ Quality Checkpoint:
400
+ Does current output meet all quality criteria? If yes, finalize.
401
+ If no, continue round [N+1].
402
+ ```
403
+
404
+ **Cost**: Multiple LLM calls (3-5 typically)
405
+ **Best for**: Complex writing, algorithm optimization, architectural design
406
+
407
+ ---
408
+
409
+ ## Existing Framework Approaches
410
+
411
+ ### LangChain/LangGraph Reflection
412
+
413
+ LangChain implements reflection through **LangGraph**, a stateful graph framework.
414
+
415
+ **Three Core Patterns:**
416
+
417
+ #### 1. Basic Reflection (MessageGraph)
418
+ ```
419
+ - State: List of messages
420
+ - Generator Node: Produces initial responses
421
+ - Reflector Node: Acts as "teacher" providing constructive criticism
422
+ - Edges: Loop back up to N times
423
+ ```
424
+
425
+ #### 2. Reflexion Pattern
426
+ ```
427
+ - Generator produces draft
428
+ - Tools are executed
429
+ - Feedback captured
430
+ - Revision happens with reflection context
431
+ - Conditional loop based on iteration count
432
+ ```
433
+
434
+ #### 3. Language Agent Tree Search (LATS)
435
+ ```
436
+ - Combines reflection/evaluation with Monte Carlo tree search
437
+ - Four steps: Select → Expand/Simulate → Reflect+Evaluate → Backpropagate
438
+ - Uses StateGraph with tree-based exploration
439
+ ```
440
+
441
+ **Key Implementation Details:**
442
+ - Uses `add_node()`, `add_edge()`, `add_conditional_edges()`
443
+ - State is shared data structure representing current snapshot
444
+ - Nodes encode logic, perform computation, make LLM calls
445
+ - Edges define next node based on current state
446
+
447
+ **Trade-off**: Reflection requires additional computational time and resources. Each pattern trades latency for higher output quality. Not suitable for low-latency applications.
448
+
449
+ ---
450
+
451
+ ### Reflexion Framework (Shinn et al., 2023)
452
+
453
+ **Design Philosophy**: Keep model frozen, use text-based feedback as reinforcement.
454
+
455
+ **Components**:
456
+ 1. **Actor** - Attempts task using Chain-of-Thought/ReAct with memory
457
+ 2. **Evaluator** - Assigns reward scores to trajectories
458
+ 3. **Self-Reflection** - Generates verbal feedback from rewards
459
+
460
+ **Key Feature**: Reflexion forces explicit grounding in external data:
461
+ - Must cite sources for claims
462
+ - Explicitly enumerate superfluous aspects (what should be removed)
463
+ - Explicitly enumerate missing aspects (what's needed)
464
+
465
+ **Results**:
466
+ - 91% pass@1 on HumanEval, versus 80% for GPT-4 without reflection
467
+ - Strong performance on: AlfWorld (decision-making), HotPotQA (reasoning), HumanEval/MBPP (programming)
468
+
469
+ **Best Use Cases**:
470
+ - Iterative learning from mistakes
471
+ - When traditional RL is impractical
472
+ - Tasks where interpretability matters
473
+ - Systems requiring nuanced feedback
474
+
475
+ ---
476
+
477
+ ### Claude/Anthropic Reflection Patterns
478
+
479
+ **Philosophy**: Simplicity over complexity. Start with simple prompts, optimize through evaluation, add multi-step systems only when necessary.
480
+
481
+ **Core Principles**:
482
+ 1. Maintain simplicity in agent design
483
+ 2. Prioritize transparency (show planning steps explicitly)
484
+ 3. Carefully craft agent-computer interface (ACI) through tool documentation
485
+
486
+ **Evaluator-Optimizer Workflow**:
487
+ ```
488
+ One LLM Call: Generates response
489
+
490
+ Another LLM Call: Provides evaluation and feedback
491
+
492
+ Loop: Iteratively refine
493
+ ```
494
+
495
+ **Most Effective When**:
496
+ - Clear evaluation criteria exist
497
+ - Iterative refinement provides measurable value
498
+ - Not implementing complex internal reasoning
499
+
500
+ **Extended Thinking Integration**:
501
+ - Use extended thinking for complex reasoning within reflection
502
+ - Interleaved mode: tool call → tool result → reflection thinking
503
+ - Strongly prefer thinking block when uncertain
504
+ - Enables "System 2" deliberative thinking in reflection phase
505
+
506
+ **Feedback Approaches**:
507
+ - **Rules-Based Feedback**: Define explicit rules, explain which failed and why
508
+ - **Code Linting**: Type-checked languages (TypeScript) provide automatic feedback layers
509
+ - **Sandbox Execution**: Run code to identify bugs
510
+
511
+ ---
512
+
513
+ ### AutoGPT Self-Correction
514
+
515
+ **Approach**: Analyze feedback from errors and adjust strategy.
516
+
517
+ **Core Mechanism**:
518
+ ```
519
+ Execute Step
520
+
521
+ Evaluate Outcome
522
+
523
+ If Failed:
524
+ - Run reflection process
525
+ - Diagnose failure points
526
+ - Update strategy
527
+ - Proceed
528
+ ```
529
+
530
+ **Key Features**:
531
+ - Flexible automation with error analysis
532
+ - Requires human oversight to prevent infinite loops
533
+ - Handles many errors on client side
534
+
535
+ **Recent Innovation: Retrials Without Feedback**
536
+ Some research suggests that "retrials without feedback" can be effective:
537
+ - Retry whenever incorrect answer identified
538
+ - No explicit self-reflection needed
539
+ - Continue until correct solution found or budget exhausted
540
+ - Simpler than Reflexion, surprisingly effective
541
+
542
+ ---
543
+
544
+ ### Google Agent Development Kit (ADK) - Reflect & Retry
545
+
546
+ **Technical Implementation**:
547
+
548
+ **Core Mechanism**: Intercepts tool failures, provides structured guidance for correction, retries up to configurable limit.
549
+
550
+ **Key Features**:
551
+ - Concurrency-safe with locking mechanisms
552
+ - Failure tracking per-invocation (default) or global across users
553
+ - Custom error extraction by overriding detection methods
554
+ - Supports both transient and logical errors
555
+
556
+ **Configuration**:
557
+ ```
558
+ max_retries: 3 (default)
559
+ throw_on_exceeded: true (default)
560
+ failure_scope: per_invocation or global
561
+ ```
562
+
563
+ **Advanced Pattern**: Custom error detection
564
+ ```
565
+ Override extract_error_from_result() to identify:
566
+ - HTTP status codes
567
+ - Custom response fields
568
+ - Error patterns in normal responses
569
+ ```
570
+
571
+ ---
572
+
573
+ ## Best Practices & Guardrails
574
+
575
+ ### Maximum Reflection Attempts
576
+
577
+ **Common Practice**: Cap reflection at 3-5 attempts
578
+
579
+ **Recommended Configuration**:
580
+ ```
581
+ - Basic Tasks (simple validation): 2 attempts
582
+ - Standard Tasks (tool-based workflows): 3 attempts
583
+ - Complex Reasoning: 4-5 attempts
584
+ - Never exceed: 8 attempts
585
+ ```
586
+
587
+ **Guardrails to Prevent Loops**:
588
+ 1. **Hard iteration limit**: `max_rounds` (fixed ceiling)
589
+ 2. **No-progress detection**: Stop after K rounds with no improvement
590
+ 3. **State-hash deduplication**: Exit if returning to previous state
591
+ 4. **Cost budget**: Total token limit across all attempts
592
+ 5. **Timeout mechanism**: Overall time limit (not just per-request)
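
The five guardrails can be checked together before each new round. This is a minimal sketch: all type and function names are hypothetical, "no improvement" is interpreted as scores that never exceed the score from K rounds earlier, and serialized states are compared directly (a hash of the state could be stored instead).

```typescript
interface GuardrailConfig {
  maxRounds: number;        // hard iteration limit
  noProgressRounds: number; // K rounds without improvement
  tokenBudget: number;      // total tokens across all attempts
  timeoutMs: number;        // overall time limit, not per-request
}

interface LoopState {
  round: number;            // rounds completed so far
  scores: number[];         // one quality score per completed round
  seenStates: Set<string>;  // serialized states already visited
  tokensUsed: number;
  startedAtMs: number;
}

type StopReason =
  | "max_rounds"
  | "token_budget"
  | "timeout"
  | "repeated_state"
  | "no_progress";

// Returns the reason to stop, or null if another reflection round is allowed.
function checkGuardrails(
  cfg: GuardrailConfig,
  state: LoopState,
  currentState: string,
  nowMs: number
): StopReason | null {
  if (state.round >= cfg.maxRounds) return "max_rounds";
  if (state.tokensUsed >= cfg.tokenBudget) return "token_budget";
  if (nowMs - state.startedAtMs >= cfg.timeoutMs) return "timeout";
  if (state.seenStates.has(currentState)) return "repeated_state";
  // Compare the last K scores against the score recorded just before them.
  const recent = state.scores.slice(-(cfg.noProgressRounds + 1));
  if (recent.length === cfg.noProgressRounds + 1) {
    const baseline = recent[0];
    if (recent.slice(1).every((s) => s <= baseline)) return "no_progress";
  }
  return null;
}
```

The caller is expected to add `currentState` to `seenStates` after a passing check, so the next round can detect a return to it.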
593
+
594
+ ### Error Handling Strategy
595
+
596
+ **Distinguish Error Types**:
597
+
598
+ | Error Type | Action | Retry? |
599
+ |-----------|--------|--------|
600
+ | Transient (timeout, rate limit) | Wait with exponential backoff | Yes (2-3x) |
601
+ | Logical (wrong approach) | Reflect, change strategy | Yes (up to 3x) |
602
+ | Invalid input (bad data) | Return error to user | No |
603
+ | Model refusal | Accept result | No |
604
+ | Permanent failure (API down) | Escalate/fallback | No |
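
The table can be encoded as a lookup so retry decisions stay in one place. The sketch below is illustrative: the type and action names are invented, the transient retry count uses the upper bound of the 2-3x range, and escalating once a retry budget is spent is an added assumption.

```typescript
type ErrorType =
  | "transient"
  | "logical"
  | "invalid_input"
  | "model_refusal"
  | "permanent";

type Action =
  | "backoff_and_retry"
  | "reflect_and_retry"
  | "return_error_to_user"
  | "accept_result"
  | "escalate_or_fallback";

interface Policy {
  action: Action;
  maxRetries: number; // 0 means the error type is never retried
}

// One entry per row of the table above.
const ERROR_POLICY: Record<ErrorType, Policy> = {
  transient: { action: "backoff_and_retry", maxRetries: 3 },
  logical: { action: "reflect_and_retry", maxRetries: 3 },
  invalid_input: { action: "return_error_to_user", maxRetries: 0 },
  model_refusal: { action: "accept_result", maxRetries: 0 },
  permanent: { action: "escalate_or_fallback", maxRetries: 0 },
};

function nextAction(type: ErrorType, retriesSoFar: number): Action {
  const policy = ERROR_POLICY[type];
  if (retriesSoFar < policy.maxRetries) return policy.action;
  // Assumption: a retryable error that has used up its budget is escalated.
  // Non-retryable errors keep their single terminal action.
  return policy.maxRetries > 0 ? "escalate_or_fallback" : policy.action;
}
```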
605
+
606
+ **Backoff Strategies**:
607
+ - **Constant Backoff**: Fixed delay (e.g., 1 second)
608
+ - **Exponential Backoff**: Delay doubles each attempt
609
+ - **Jittered Backoff**: Add randomness to prevent thundering herd
610
+
611
+ Example exponential backoff:
612
+ ```
613
+ Attempt 1: Retry immediately
614
+ Attempt 2: Wait 1 second
615
+ Attempt 3: Wait 2 seconds
616
+ Attempt 4: Wait 4 seconds
617
+ Attempt 5: Wait 8 seconds
618
+ ```
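
The schedule above corresponds to a delay of `base * 2^(attempt - 2)` for every attempt after the first. A small sketch follows; the function names, the `maxMs` cap, and the choice of "full jitter" (a uniform draw between zero and the computed delay) are assumptions rather than anything prescribed here.

```typescript
// attempt is 1-based. Attempt 1 retries immediately; later attempts wait
// baseMs * 2^(attempt - 2), capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  if (attempt <= 1) return 0;
  return Math.min(baseMs * Math.pow(2, attempt - 2), maxMs);
}

// Jittered variant: scales the exponential delay by a random factor in [0, 1)
// so that many clients retrying at once do not synchronize.
// The random source is injectable to keep the function testable.
function jitteredDelayMs(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  random: () => number = Math.random
): number {
  return Math.floor(random() * backoffDelayMs(attempt, baseMs, maxMs));
}
```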
619
+
620
+ ### Success Criteria Matter
621
+
622
+ Clear success criteria prevent infinite loops:
623
+
624
+ **Bad**: "Fix the bug", "optimize the database", "improve the response"
625
+ **Good**: "Make test_user_login pass", "reduce query time below 100ms", "increase BLEU score to 0.85+"
626
+
627
+ ### State Capture Before Reflection
628
+
629
+ **What to Capture**:
630
+ 1. **Input Context**: Original request, parameters, user intent
631
+ 2. **Execution Snapshot**: Current state at failure point
632
+ 3. **Error Details**: Exception, error code, message
633
+ 4. **Attempt History**: What was tried before, outcomes
634
+ 5. **Decision Metadata**: Why each choice was made, confidence level
635
+
636
+ **Storage Strategy**:
637
+ - Use lightweight JSON objects
638
+ - Store in Redis with expiration matching workflow duration
639
+ - Separate learned patterns from temporary processing state
640
+ - Keep reasoning chain (why decisions were made) separate
641
+
642
+ ---
643
+
644
+ ## When NOT to Reflect
645
+
646
+ ### Scenarios to Avoid Reflection
647
+
648
+ #### 1. **Low-Stakes, High-Velocity Tasks**
649
+ - Real-time chat responses
650
+ - Autocomplete suggestions
651
+ - Quick lookups
652
+ - Requirements: <100ms latency
653
+
654
+ **Cost/Benefit**: Cost of reflection exceeds value of marginal improvement
655
+
656
+ #### 2. **Well-Understood, Deterministic Workflows**
657
+ - Simple CRUD operations
658
+ - Predictable data transformations
659
+ - Tasks with 99%+ baseline accuracy
660
+
661
+ **Cost/Benefit**: No errors to fix, reflection wastes tokens
662
+
663
+ #### 3. **Clear Model Refusals**
664
+ - User asks model to do something against policies
665
+ - Model refuses for safety reasons
666
+ - No reflection can change this outcome
667
+
668
+ **Cost/Benefit**: Reflection won't help
669
+
670
+ #### 4. **Ambiguous User Input Without Clarification**
671
+ - User request is unclear
672
+ - Model can't determine intent
673
+
674
+ **Better approach**: Ask clarifying questions, don't reflect
675
+
676
+ #### 5. **High-Confidence Outputs with Good Validation**
677
+ - Model is highly confident
678
+ - Output passes all validation checks
679
+ - Tests confirm correctness
680
+
681
+ **Cost/Benefit**: Reflection adds latency with no benefit
682
+
683
+ #### 6. **Token Budget Constraints**
684
+ - Limited tokens remaining in context window
685
+ - Reflection would consume majority of remaining budget
686
+
687
+ **Cost/Benefit**: Can't afford the cost
688
+
689
+ #### 7. **Cascading Failures**
690
+ - Reflection failure causes downstream failures
691
+ - Loop detection shows same error pattern repeating
692
+
693
+ **Better approach**: Escalate to human or fallback
694
+
695
+ ### Performance Impact
696
+
697
+ **Cost of Reflection**:
698
+ - Each reflection attempt = ~1 additional LLM call
699
+ - Latency: +200-2000ms per reflection (depends on model)
700
+ - Cost: roughly +1x-2x the cost of the original call per reflection (depending on output length)
701
+
702
+ **When Cost Justifies Benefit**:
703
+ - High-value decisions (code generation, critical business logic)
704
+ - Complex reasoning tasks
705
+ - Where 10-30% improvement is meaningful
706
+ - Users can accept a 2-5x latency increase
707
+
708
+ ### Confidence-Based Thresholds
709
+
710
+ **Reflection Triggers**:
711
+ - Model confidence < 70%: Trigger reflection
712
+ - Model confidence 70-85%: Optional reflection
713
+ - Model confidence > 85%: Skip reflection
714
+
715
+ **Implementation**:
716
+ - Use model's own uncertainty assessment
717
+ - Leverage confidence tokens from extended thinking
718
+ - Monitor chain-of-thought for hedging language
719
+ - Track prediction confidence scores
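
The three confidence bands map onto a small decision function. The sketch below assumes confidence is normalized to the range 0 to 1; `decideByConfidence`, `hedgingScore`, and the hedge phrase list are hypothetical and only illustrate the idea of monitoring for hedging language.

```typescript
type ReflectionDecision = "reflect" | "optional" | "skip";

// Bands from the thresholds above: < 70% reflect, 70-85% optional, > 85% skip.
function decideByConfidence(confidence: number): ReflectionDecision {
  // The negated range check also rejects NaN.
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError("confidence must be a number between 0 and 1");
  }
  if (confidence < 0.7) return "reflect";
  if (confidence <= 0.85) return "optional";
  return "skip";
}

// Illustrative phrase list; a real deployment would tune this.
const HEDGES = ["i think", "probably", "might", "not sure", "possibly"];

// Counts how many distinct hedge phrases appear in a reasoning trace.
function hedgingScore(text: string): number {
  const lower = text.toLowerCase();
  return HEDGES.filter((h) => lower.indexOf(h) !== -1).length;
}
```

How the "optional" band is resolved is left to the caller, for example by reflecting only on high-stakes tasks.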
720
+
721
+ ---
722
+
723
+ ## State Capture Patterns
724
+
725
+ ### Pre-Reflection State Snapshot
726
+
727
+ Capture critical state **before** attempting reflection:
728
+
729
+ ```json
+ {
+   "attempt_number": 1,
+   "timestamp": "2025-12-08T12:34:56Z",
+   "input": {
+     "user_request": "...",
+     "context": "...",
+     "parameters": {...}
+   },
+   "generation": {
+     "output": "...",
+     "model": "claude-opus-4-5",
+     "tokens_used": 245,
+     "confidence": 0.65
+   },
+   "validation": {
+     "passed": false,
+     "violations": ["format_check_failed", "logic_error"],
+     "error_message": "..."
+   },
+   "error_context": {
+     "type": "logical_error",
+     "details": "..."
+   }
+ }
+ ```
755
+
756
+ ### Reasoning Chain Logging
757
+
758
+ Separate reasoning metadata from content:
759
+
760
+ ```json
+ {
+   "decision_point": "tool_selection",
+   "options_considered": ["approach_a", "approach_b", "approach_c"],
+   "chosen": "approach_a",
+   "reasoning": "Approach A is more efficient because...",
+   "confidence": 0.72,
+   "alternative_rationale": "Approach B would work but...",
+   "risk_factors": ["potential_timeout", "edge_case_handling"]
+ }
+ ```
771
+
772
+ Benefits:
773
+ - Recovery doesn't re-analyze the same information
774
+ - Next attempt picks up decision trail where it left off
775
+ - Provides context for reflection prompts
776
+
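To show how logged decisions can provide context for a reflection prompt, the sketch below renders records of the shape shown above into a compact text block. The `DecisionRecord` interface mirrors the JSON fields, while `formatDecisionTrail` and its output layout are hypothetical.

```typescript
// Mirrors the JSON record above (alternative_rationale omitted for brevity).
interface DecisionRecord {
  decision_point: string;
  options_considered: string[];
  chosen: string;
  reasoning: string;
  confidence: number;
  risk_factors: string[];
}

// Render logged decisions as plain text that can be prepended to a prompt.
function formatDecisionTrail(records: DecisionRecord[]): string {
  return records
    .map((r, i) => {
      const rejected = r.options_considered.filter((o) => o !== r.chosen);
      return [
        `${i + 1}. ${r.decision_point}: chose ${r.chosen} (confidence ${r.confidence.toFixed(2)})`,
        `   reasoning: ${r.reasoning}`,
        `   rejected: ${rejected.join(", ") || "none"}`,
        `   risks: ${r.risk_factors.join(", ") || "none"}`,
      ].join("\n");
    })
    .join("\n");
}
```

Keeping the trail as structured records and formatting it only at prompt time lets the same log serve both debugging and retries.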
777
+ ### Memory State Preservation
778
+
779
+ Distinguish learned patterns from temporary state:
780
+
781
+ ```json
+ {
+   "learned_patterns": {
+     "document_structure_insights": ["..."],
+     "user_preferences": ["..."],
+     "error_recovery_strategies": ["..."]
+   },
+   "temporary_state": {
+     "current_task_context": "...",
+     "current_output": "...",
+     "current_attempt": 2
+   }
+ }
+ ```
795
+
796
+ **Key principle**: When individual tasks fail, preserve learned insights while resetting temporary state.
797
+
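The key principle can be expressed as a pure function that carries learned patterns forward and clears everything else. This is a minimal sketch: the `AgentMemory` type follows the JSON above, but `resetAfterFailure`, the optional `newInsight` argument, and resetting the attempt counter to 0 are assumptions made for illustration.

```typescript
interface AgentMemory {
  learned_patterns: {
    document_structure_insights: string[];
    user_preferences: string[];
    error_recovery_strategies: string[];
  };
  temporary_state: {
    current_task_context: string;
    current_output: string;
    current_attempt: number;
  };
}

// Returns a new memory object: learned patterns are kept (optionally with one
// new recovery insight), temporary state is cleared. The input is not mutated.
function resetAfterFailure(memory: AgentMemory, newInsight?: string): AgentMemory {
  const strategies = [...memory.learned_patterns.error_recovery_strategies];
  if (newInsight) strategies.push(newInsight);
  return {
    learned_patterns: {
      ...memory.learned_patterns,
      error_recovery_strategies: strategies,
    },
    temporary_state: {
      current_task_context: "",
      current_output: "",
      current_attempt: 0,
    },
  };
}
```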
798
+ ### State for Error Recovery
799
+
800
+ Include information needed for intelligent retry:
801
+
802
+ ```json
+ {
+   "failed_attempt": {
+     "approach": "web_search_strategy",
+     "output": "...",
+     "error": "timeout"
+   },
+   "recovery_context": {
+     "what_worked_before": [
+       {"approach": "api_call", "result": "success"},
+       {"approach": "local_cache", "result": "cache_miss"}
+     ],
+     "what_failed": [
+       {"approach": "web_search", "reason": "timeout"}
+     ],
+     "suggestion": "Try API call approach next"
+   }
+ }
+ ```
821
+
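A recovery context of this shape can drive the choice of the next approach mechanically. The sketch below is one hedged interpretation: `suggestNextApproach` is a hypothetical helper, and it assumes that only entries whose `result` is exactly `"success"` count as having worked.

```typescript
interface AttemptRecord {
  approach: string;
  result?: string;
  reason?: string;
}

interface RecoveryContext {
  what_worked_before: AttemptRecord[];
  what_failed: AttemptRecord[];
}

// Pick the first approach that succeeded before and has not failed in the
// current run; return null when nothing qualifies.
function suggestNextApproach(ctx: RecoveryContext): string | null {
  const failed = new Set(ctx.what_failed.map((f) => f.approach));
  const candidate = ctx.what_worked_before.find(
    (w) => w.result === "success" && !failed.has(w.approach)
  );
  return candidate ? candidate.approach : null;
}
```

Applied to the example data above, this selects `api_call`, which matches the recorded suggestion.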
822
+ ---
823
+
824
+ ## Code Implementation Examples
825
+
826
+ ### Example 1: Basic Error-Reflection-Retry Loop (TypeScript)
827
+
828
+ ```typescript
+ import Anthropic from "@anthropic-ai/sdk";
+
+ // Reads ANTHROPIC_API_KEY from the environment by default.
+ const client = new Anthropic();
+
+ interface ReflectionState {
+   attempt: number;
+   maxAttempts: number;
+   lastError: string | null;
+   attemptHistory: Array<{
+     approach: string;
+     result: string;
+     error: string | null;
+   }>;
+ }
+
+ async function executeWithReflection(
+   task: string,
+   maxAttempts: number = 3
+ ): Promise<string> {
+   const state: ReflectionState = {
+     attempt: 0,
+     maxAttempts,
+     lastError: null,
+     attemptHistory: [],
+   };
+
+   while (state.attempt < state.maxAttempts) {
+     state.attempt++;
+
+     try {
+       // Step 1: Generate solution
+       const solution = await generateSolution(task, state.attemptHistory);
+
+       // Step 2: Validate
+       const validation = validateOutput(solution);
+       if (validation.isValid) {
+         return solution;
+       }
+
+       // Step 3: Reflect on failure
+       state.lastError = validation.errors.join("; ");
+       const reflection = await reflectOnFailure(
+         task,
+         solution,
+         validation.errors,
+         state.attemptHistory
+       );
+
+       // Step 4: Update history
+       state.attemptHistory.push({
+         approach: reflection.suggestedApproach,
+         result: solution,
+         error: state.lastError,
+       });
+     } catch (error) {
+       state.lastError = String(error);
+
+       // Attempt recovery reflection
+       const recovery = await reflectOnError(task, error, state.attemptHistory);
+       state.attemptHistory.push({
+         approach: recovery.suggestedApproach,
+         result: "",
+         error: state.lastError,
+       });
+     }
+   }
+
+   throw new Error(
+     `Failed after ${state.maxAttempts} attempts. ` +
+       `Last error: ${state.lastError}`
+   );
+ }
+
+ async function generateSolution(
+   task: string,
+   history: ReflectionState["attemptHistory"]
+ ): Promise<string> {
+   // Summarize earlier attempts so the model can avoid repeating them.
+   const historyContext = history.length > 0
+     ? `Previous attempts:\n${history
+         .map((h, i) => `Attempt ${i + 1} (${h.approach}): ${h.error || "failed"}`)
+         .join("\n")}\n`
+     : "";
+
+   const response = await client.messages.create({
+     model: "claude-opus-4-5",
+     max_tokens: 1024,
+     messages: [
+       {
+         role: "user",
+         content: `${historyContext}\nTask: ${task}\n\nGenerate a solution.`,
+       },
+     ],
+   });
+
+   return response.content[0].type === "text" ? response.content[0].text : "";
+ }
+
+ async function reflectOnFailure(
+   task: string,
+   solution: string,
+   errors: string[],
+   history: ReflectionState["attemptHistory"]
+ ): Promise<{ suggestedApproach: string }> {
+   const response = await client.messages.create({
+     model: "claude-opus-4-5",
+     max_tokens: 512,
+     messages: [
+       {
+         role: "user",
+         content: `Task: ${task}
+
+ Your previous solution failed with these issues:
+ ${errors.map((e) => `- ${e}`).join("\n")}
+
+ Previous solution:
+ ${solution}
+
+ Analyze what went wrong and suggest a different approach that would avoid these issues.`,
+       },
+     ],
+   });
+
+   return {
+     suggestedApproach:
+       response.content[0].type === "text" ? response.content[0].text : "",
+   };
+ }
+
+ async function reflectOnError(
+   task: string,
+   error: unknown,
+   history: ReflectionState["attemptHistory"]
+ ): Promise<{ suggestedApproach: string }> {
+   // Placeholder: a full implementation would call the model here,
+   // as reflectOnFailure does, passing the exception details.
+   return {
+     suggestedApproach: `Error recovery strategy after: ${String(error)}`,
+   };
+ }
+
+ function validateOutput(output: string): {
+   isValid: boolean;
+   errors: string[];
+ } {
+   const errors: string[] = [];
+   const trimmed = output.trim();
+
+   // Report exactly one length problem instead of both for empty output.
+   if (trimmed.length === 0) {
+     errors.push("Output is empty");
+   } else if (trimmed.length < 10) {
+     errors.push("Output is too short");
+   }
+
+   return {
+     isValid: errors.length === 0,
+     errors,
+   };
+ }
+ ```
985
+
986
+ ---
987
+
988
+ ### Example 2: Instruction-Following Validation Pattern
989
+
990
+ ```typescript
+ // Assumes the `client` instance created in Example 1.
+
+ interface ValidationRule {
+   name: string;
+   validate: (value: string) => boolean;
+   errorMessage: string;
+ }
+
+ interface InstructionFollowingEvaluator {
+   rules: ValidationRule[];
+   maxRetries: number;
+ }
+
+ async function validateWithIFE(
+   task: string,
+   evaluator: InstructionFollowingEvaluator
+ ): Promise<string> {
+   let retries = 0;
+
+   while (retries < evaluator.maxRetries) {
+     // Generate output
+     const output = await generateOutput(task);
+
+     // Check each rule
+     const violations: string[] = [];
+     for (const rule of evaluator.rules) {
+       if (!rule.validate(output)) {
+         violations.push(rule.errorMessage);
+       }
+     }
+
+     // If all rules pass, return
+     if (violations.length === 0) {
+       return output;
+     }
+
+     // If violations, refine and retry
+     retries++;
+     if (retries < evaluator.maxRetries) {
+       const refinedTask = await refinePromptWithViolations(
+         task,
+         output,
+         violations
+       );
+       task = refinedTask;
+     } else {
+       throw new Error(
+         `Validation failed after ${evaluator.maxRetries} attempts. ` +
+           `Violations: ${violations.join("; ")}`
+       );
+     }
+   }
+
+   // Only reachable when maxRetries is less than 1.
+   throw new Error("IFE validation requires maxRetries >= 1");
+ }
+
+ async function generateOutput(task: string): Promise<string> {
+   const response = await client.messages.create({
+     model: "claude-opus-4-5",
+     max_tokens: 1024,
+     messages: [{ role: "user", content: task }],
+   });
+
+   return response.content[0].type === "text" ? response.content[0].text : "";
+ }
+
+ async function refinePromptWithViolations(
+   originalTask: string,
+   output: string,
+   violations: string[]
+ ): Promise<string> {
+   const response = await client.messages.create({
+     model: "claude-opus-4-5",
+     max_tokens: 512,
+     messages: [
+       {
+         role: "user",
+         content: `Original task: ${originalTask}
+
+ Your output failed these validation rules:
+ ${violations.map((v) => `- ${v}`).join("\n")}
+
+ Your output was:
+ ${output}
+
+ Revise the task/instructions to ensure the next attempt will satisfy all rules.`,
+       },
+     ],
+   });
+
+   return response.content[0].type === "text" ? response.content[0].text : "";
+ }
+
+ // Example usage with specific validation rules
+ const codeEvaluator: InstructionFollowingEvaluator = {
+   maxRetries: 3,
+   rules: [
+     {
+       name: "valid_syntax",
+       // Placeholder heuristic: a real check would parse or compile the code.
+       validate: (code) => code.includes("function") || code.includes("const"),
+       errorMessage: "Code must have valid TypeScript syntax",
+     },
+     {
+       name: "includes_tests",
+       validate: (code) => code.includes("test") || code.includes("describe"),
+       errorMessage: "Code must include test cases",
+     },
+     {
+       name: "has_comments",
+       validate: (code) => code.includes("//") || code.includes("/*"),
+       errorMessage: "Code must include comments",
+     },
+   ],
+ };
+ ```
1111
+
1112
+ ---
1113
+
1114
+ ### Example 3: Reflexion-Style Architecture
1115
+
1116
+ ```typescript
+ // Assumes `Anthropic` is imported as in Example 1.
+
+ // The original leaves `LLMClient` undefined; this minimal interface for the
+ // actor is an assumption based on how it is called below.
+ interface LLMClient {
+   generate(prompt: string): Promise<string>;
+ }
+
+ interface ReflexionState {
+   task: string;
+   trajectory: string;
+   reward: number;
+   reflection: string;
+   nextAttempt: string;
+ }
+
+ class ReflexionAgent {
+   private memory: ReflexionState[] = [];
+
+   constructor(
+     private actor: LLMClient,
+     private evaluator: (output: string) => number,
+     private reflector: ReflectorModel
+   ) {}
+
+   async runReflexion(task: string, maxIterations: number = 3): Promise<string> {
+     let currentTask = task;
+
+     for (let i = 0; i < maxIterations; i++) {
+       // Step 1: Actor generates trajectory
+       const trajectory = await this.actor.generate(currentTask);
+
+       // Step 2: Evaluator assigns reward
+       const reward = this.evaluator(trajectory);
+
+       // Step 3: Reflector generates feedback
+       const reflection = await this.reflector.generateReflection(
+         task,
+         trajectory,
+         reward,
+         this.memory
+       );
+
+       // Step 4: Store in memory
+       const state: ReflexionState = {
+         task,
+         trajectory,
+         reward,
+         reflection,
+         nextAttempt: "",
+       };
+       this.memory.push(state);
+
+       // Step 5: Use reflection to improve next attempt
+       if (reward > 0.8) {
+         // Good enough, return
+         return trajectory;
+       }
+
+       // Prepare for next iteration with reflection context
+       currentTask = `${task}
+
+ Previous attempt feedback:
+ ${reflection}
+
+ Generate an improved solution that addresses the feedback above.`;
+     }
+
+     // No attempt crossed the reward threshold: return the last one.
+     return this.memory[this.memory.length - 1].trajectory;
+   }
+ }
+
+ class ReflectorModel {
+   constructor(private client: Anthropic) {}
+
+   async generateReflection(
+     task: string,
+     trajectory: string,
+     reward: number,
+     memory: ReflexionState[]
+   ): Promise<string> {
+     // Include only the two most recent attempts to keep the prompt short.
+     const memoryContext =
+       memory.length > 0
+         ? `Previous attempts and feedback:\n${memory
+             .slice(-2)
+             .map((m) => `Reward: ${m.reward}\nFeedback: ${m.reflection}`)
+             .join("\n\n")}\n`
+         : "";
+
+     const response = await this.client.messages.create({
+       model: "claude-opus-4-5",
+       max_tokens: 512,
+       messages: [
+         {
+           role: "user",
+           content: `Task: ${task}
+
+ ${memoryContext}
+
+ Current attempt (reward score: ${reward}):
+ ${trajectory}
+
+ Evaluate this attempt:
+ 1. What did it do well?
+ 2. What are the specific failures or issues?
+ 3. What should be tried differently in the next attempt?
+ 4. What patterns from previous attempts should be avoided?
+
+ Format your response as structured verbal feedback.`,
+         },
+       ],
+     });
+
+     return response.content[0].type === "text" ? response.content[0].text : "";
+   }
+ }
+ ```
1223
+
1224
+ ---
1225
+
1226
+ ### Example 4: Multi-Agent Reflection
1227
+
1228
+ ```typescript
+ // Assumes `Anthropic` is imported as in Example 1.
+
+ class MultiAgentReflection {
+   constructor(
+     private generator: Anthropic,
+     private critic: Anthropic
+   ) {}
+
+   async reflectiveGeneration(
+     task: string,
+     maxRounds: number = 3
+   ): Promise<string> {
+     let currentOutput = await this.generate(task);
+
+     for (let round = 1; round < maxRounds; round++) {
+       // Get critique
+       const critique = await this.critique(task, currentOutput);
+
+       if (critique.isSatisfactory) {
+         return currentOutput;
+       }
+
+       // Generate improvement
+       currentOutput = await this.improve(
+         task,
+         currentOutput,
+         critique.feedback,
+         critique.suggestions
+       );
+     }
+
+     return currentOutput;
+   }
+
+   async generate(task: string): Promise<string> {
+     const response = await this.generator.messages.create({
+       model: "claude-opus-4-5",
+       max_tokens: 1024,
+       messages: [{ role: "user", content: task }],
+     });
+
+     return response.content[0].type === "text" ? response.content[0].text : "";
+   }
+
+   async improve(
+     task: string,
+     currentOutput: string,
+     feedback: string,
+     suggestions: string[]
+   ): Promise<string> {
+     const response = await this.generator.messages.create({
+       model: "claude-opus-4-5",
+       max_tokens: 1024,
+       messages: [
+         {
+           role: "user",
+           content: `Task: ${task}
+
+ Current output:
+ ${currentOutput}
+
+ Feedback from review:
+ ${feedback}
+
+ Specific improvements to make:
+ ${suggestions.map((s) => `- ${s}`).join("\n")}
+
+ Provide an improved version that addresses all feedback.`,
+         },
+       ],
+     });
+
+     return response.content[0].type === "text" ? response.content[0].text : "";
+   }
+
+   async critique(
+     task: string,
+     output: string
+   ): Promise<{
+     isSatisfactory: boolean;
+     feedback: string;
+     suggestions: string[];
+   }> {
+     const response = await this.critic.messages.create({
+       model: "claude-opus-4-5",
+       max_tokens: 512,
+       messages: [
+         {
+           role: "user",
+           content: `You are a critical reviewer. Evaluate this response:
+
+ Task: ${task}
+
+ Response:
+ ${output}
+
+ Provide:
+ 1. Overall assessment (satisfactory or needs improvement)
+ 2. Specific issues with the current response
+ 3. Concrete suggestions for improvement
+
+ Format as JSON: { "isSatisfactory": boolean, "feedback": string, "suggestions": string[] }`,
+         },
+       ],
+     });
+
+     const text =
+       response.content[0].type === "text" ? response.content[0].text : "{}";
+     // Note: JSON.parse throws if the model returns anything other than bare JSON.
+     return JSON.parse(text);
+   }
+ }
+ ```
1337
+
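The `critique` method above passes the model's text straight to `JSON.parse`, which throws if the reply is wrapped in a Markdown fence or is otherwise malformed. The sketch below shows one defensive alternative. `parseCritique` and `stripFence` are hypothetical helpers, and falling back to an "unsatisfactory" critique on failure is an assumption about the desired behavior.

```typescript
interface Critique {
  isSatisfactory: boolean;
  feedback: string;
  suggestions: string[];
}

const FENCE = "`".repeat(3);

// Remove a surrounding Markdown code fence, if present.
function stripFence(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  return text.trim();
}

// Parse a critique reply; on any failure return an unsatisfactory critique
// carrying the raw text as feedback, so the refinement loop keeps going.
function parseCritique(raw: string): Critique {
  const fallback: Critique = { isSatisfactory: false, feedback: raw, suggestions: [] };
  try {
    const parsed: unknown = JSON.parse(stripFence(raw));
    if (typeof parsed !== "object" || parsed === null) return fallback;
    const p = parsed as Record<string, unknown>;
    return {
      isSatisfactory: p.isSatisfactory === true,
      feedback: typeof p.feedback === "string" ? p.feedback : "",
      suggestions: Array.isArray(p.suggestions)
        ? p.suggestions.filter((s): s is string => typeof s === "string")
        : [],
    };
  } catch {
    return fallback;
  }
}
```

Replacing the final `JSON.parse(text)` with `parseCritique(text)` would keep a single malformed reply from aborting the whole loop.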
1338
+ ---
1339
+
1340
+ ### Example 5: Confidence-Based Reflection Trigger
1341
+
1342
+ ```typescript
+ // Assumes the `client` instance created in Example 1.
+
+ interface ConfidenceMetadata {
+   overallConfidence: number;
+   uncertaintyAreas: string[];
+   alternativesConsidered: string[];
+ }
+
+ async function confidenceBasedReflection(
+   task: string,
+   confidenceThreshold: number = 0.75
+ ): Promise<string> {
+   const response = await client.messages.create({
+     model: "claude-opus-4-5",
+     max_tokens: 1024,
+     messages: [
+       {
+         role: "user",
+         content: `${task}
+
+ After your response, provide a JSON block with confidence metadata:
+ {
+   "overallConfidence": <0-1>,
+   "uncertaintyAreas": ["area1", "area2"],
+   "alternativesConsidered": ["alternative1", "alternative2"]
+ }`,
+       },
+     ],
+   });
+
+   const text =
+     response.content[0].type === "text" ? response.content[0].text : "";
+
+   // Extract response and metadata.
+   // The greedy match spans from the first "{" to the last "}", so it assumes
+   // the only braces in the reply belong to the metadata block.
+   const jsonMatch = text.match(/\{[\s\S]*\}/);
+   const metadata: ConfidenceMetadata = jsonMatch
+     ? JSON.parse(jsonMatch[0])
+     : { overallConfidence: 0.5, uncertaintyAreas: [], alternativesConsidered: [] };
+
+   // If confidence too low, reflect
+   if (metadata.overallConfidence < confidenceThreshold) {
+     const reflection = await reflectWithLowConfidence(
+       task,
+       text,
+       metadata
+     );
+     return reflection;
+   }
+
+   return text;
+ }
+
+ async function reflectWithLowConfidence(
+   task: string,
+   initialResponse: string,
+   metadata: ConfidenceMetadata
+ ): Promise<string> {
+   const response = await client.messages.create({
+     model: "claude-opus-4-5",
+     max_tokens: 1024,
+     messages: [
+       {
+         role: "user",
+         content: `Original task: ${task}
+
+ Your previous response (confidence: ${metadata.overallConfidence}):
+ ${initialResponse}
+
+ You indicated uncertainty in these areas:
+ ${metadata.uncertaintyAreas.map((a) => `- ${a}`).join("\n")}
+
+ You considered these alternatives:
+ ${metadata.alternativesConsidered.map((a) => `- ${a}`).join("\n")}
+
+ Given your own identified uncertainties:
+ 1. Identify what specific information would increase your confidence
+ 2. Provide a revised response that addresses these uncertainty areas
+ 3. Explain how your revised response is more robust`,
+       },
+     ],
+   });
+
+   return response.content[0].type === "text" ? response.content[0].text : "";
+ }
+ ```
1426
+
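The regular expression in Example 5 matches from the first `{` to the last `}`, so a reply that itself contains braces (for example generated code) makes `JSON.parse` throw. One possible alternative, sketched here under assumptions, scans backward for the last brace-delimited span that parses and carries a numeric `overallConfidence`. `splitAnswerAndMetadata` and `DEFAULT_METADATA` are hypothetical names.

```typescript
interface ConfidenceMetadata {
  overallConfidence: number;
  uncertaintyAreas: string[];
  alternativesConsidered: string[];
}

const DEFAULT_METADATA: ConfidenceMetadata = {
  overallConfidence: 0.5,
  uncertaintyAreas: [],
  alternativesConsidered: [],
};

// Separate the answer from a trailing JSON metadata block. Candidate spans end
// at the last "}" and start at each "{" before it, tried from right to left.
function splitAnswerAndMetadata(text: string): {
  answer: string;
  metadata: ConfidenceMetadata;
} {
  const end = text.lastIndexOf("}");
  let start = end === -1 ? -1 : text.lastIndexOf("{", end);
  while (start !== -1) {
    try {
      const parsed = JSON.parse(text.slice(start, end + 1)) as Partial<ConfidenceMetadata>;
      if (typeof parsed.overallConfidence === "number") {
        return {
          answer: text.slice(0, start).trim(),
          metadata: { ...DEFAULT_METADATA, ...parsed },
        };
      }
    } catch {
      // Not valid JSON from this brace; try an earlier one.
    }
    // Guard against start === 0, where lastIndexOf would clamp and loop forever.
    start = start > 0 ? text.lastIndexOf("{", start - 1) : -1;
  }
  return { answer: text.trim(), metadata: DEFAULT_METADATA };
}
```

Besides tolerating braces in the answer, this returns the answer without the metadata block, which the original function leaves attached to the returned text.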
1427
+ ---
1428
+
1429
+ ## Summary & Key Takeaways
1430
+
1431
+ ### What Is Reflection?
1432
+ Self-reflection in AI agents enables error detection, analysis, and correction without human intervention. It's a three-phase cycle: generate → analyze → improve.
1433
+
1434
+ ### When to Implement Reflection
1435
+ - **Error recovery**: When outputs fail validation
1436
+ - **Iterative refinement**: Complex tasks needing multiple passes
1437
+ - **High-stakes decisions**: Code generation, critical logic
1438
+ - **Low-confidence outputs**: When the model expresses uncertainty
1439
+
1440
+ ### When NOT to Implement Reflection
1441
+ - Low-latency requirements (<100ms)
1442
+ - Simple, deterministic tasks
1443
+ - Well-understood workflows with 99%+ baseline accuracy
1444
+ - Clear model refusals
1445
+ - Token budget constraints
1446
+
1447
+ ### Best Practices
1448
+ 1. **Limit retries**: 3-5 attempts maximum, never unlimited
1449
+ 2. **Clear success criteria**: Specific, measurable goals (not vague)
1450
+ 3. **State capture**: Preserve context for intelligent retry
1451
+ 4. **Error categorization**: Different strategies for different error types
1452
+ 5. **Backoff strategies**: Exponential backoff for transient errors
1453
+ 6. **Avoid reflection loops**: Use state deduplication and progress detection
1454
+
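Practices 1 and 5 combine naturally: a capped retry count with an exponentially growing delay between attempts. The sketch below is illustrative only. The function names, the 500 ms base, the 8 s cap, and the use of full jitter are assumptions, not values taken from this document.

```typescript
// Delay before retry `attempt` (1-based): baseMs * 2^(attempt - 1), capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 8000): number {
  const exponent = Math.max(0, Math.floor(attempt) - 1);
  return Math.min(maxMs, baseMs * 2 ** exponent);
}

// Full jitter: a random delay in [0, backoffDelayMs(attempt)) so that
// concurrent clients do not retry in lockstep. `random` is injectable for tests.
function jitteredDelayMs(attempt: number, random: () => number = Math.random): number {
  return Math.floor(random() * backoffDelayMs(attempt));
}

console.log([1, 2, 3, 4, 5, 6].map((a) => backoffDelayMs(a)));
// [500, 1000, 2000, 4000, 8000, 8000]
```

Backoff is meant for transient errors such as rate limits or timeouts. Validation failures usually benefit from an immediate reflection pass rather than a wait.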
1455
+ ### Implementation Hierarchy
1456
+ 1. Start simple: Basic self-critique in prompts
1457
+ 2. Add validation: Explicit output rules
1458
+ 3. Multi-attempt: Error-reflection-retry loop (3 attempts)
1459
+ 4. Tool-enhanced: Use linters, tests, execution for feedback
1460
+ 5. Multi-agent: Deploy separate critic for complex tasks
1461
+ 6. Full Reflexion: Only if the simpler approaches above prove insufficient
1462
+
1463
+ ### Framework Selection
1464
+ - **LangChain/LangGraph**: Pre-built reflection patterns, good for graph-based workflows
1465
+ - **Anthropic/Claude**: Emphasis on simplicity, extended thinking for reflection
1466
+ - **Google ADK**: Specialized reflect-and-retry plugin
1467
+ - **Custom**: Lightweight TypeScript patterns for specific needs
1468
+