ifcraftcorpus 1.4.0__py3-none-any.whl → 1.6.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60)
  1. ifcraftcorpus/index.py +5 -1
  2. ifcraftcorpus-1.6.0.data/data/share/ifcraftcorpus/corpus/agent-design/agent_memory_architecture.md +818 -0
  3. ifcraftcorpus-1.6.0.data/data/share/ifcraftcorpus/corpus/agent-design/agent_prompt_engineering.md +1481 -0
  4. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/agent-design/multi_agent_patterns.md +1 -0
  5. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/genre-conventions/fantasy_conventions.md +4 -0
  6. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/narrative_point_of_view.md +3 -4
  7. {ifcraftcorpus-1.4.0.dist-info → ifcraftcorpus-1.6.0.dist-info}/METADATA +1 -1
  8. {ifcraftcorpus-1.4.0.dist-info → ifcraftcorpus-1.6.0.dist-info}/RECORD +59 -58
  9. ifcraftcorpus-1.4.0.data/data/share/ifcraftcorpus/corpus/agent-design/agent_prompt_engineering.md +0 -750
  10. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/audience-and-access/accessibility_guidelines.md +0 -0
  11. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/audience-and-access/audience_targeting.md +0 -0
  12. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/audience-and-access/localization_considerations.md +0 -0
  13. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/audio_visual_integration.md +0 -0
  14. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/collaborative_if_writing.md +0 -0
  15. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/creative_workflow_pipeline.md +0 -0
  16. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/diegetic_design.md +0 -0
  17. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/idea_capture_and_hooks.md +0 -0
  18. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/if_platform_tools.md +0 -0
  19. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/player_analytics_metrics.md +0 -0
  20. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/quality_standards_if.md +0 -0
  21. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/research_and_verification.md +0 -0
  22. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/craft-foundations/testing_interactive_fiction.md +0 -0
  23. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/emotional-design/conflict_patterns.md +0 -0
  24. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/emotional-design/emotional_beats.md +0 -0
  25. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/game-design/mechanics_design_patterns.md +0 -0
  26. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/genre-conventions/children_and_ya_conventions.md +0 -0
  27. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/genre-conventions/historical_fiction.md +0 -0
  28. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/genre-conventions/horror_conventions.md +0 -0
  29. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/genre-conventions/mystery_conventions.md +0 -0
  30. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/genre-conventions/sci_fi_conventions.md +0 -0
  31. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/branching_narrative_construction.md +0 -0
  32. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/branching_narrative_craft.md +0 -0
  33. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/endings_patterns.md +0 -0
  34. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/episodic_serialized_if.md +0 -0
  35. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/nonlinear_structure.md +0 -0
  36. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/pacing_and_tension.md +0 -0
  37. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/romance_and_relationships.md +0 -0
  38. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/scene_structure_and_beats.md +0 -0
  39. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/narrative-structure/scene_transitions.md +0 -0
  40. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/character_voice.md +0 -0
  41. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/dialogue_craft.md +0 -0
  42. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/exposition_techniques.md +0 -0
  43. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/prose_patterns.md +0 -0
  44. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/subtext_and_implication.md +0 -0
  45. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/prose-and-language/voice_register_consistency.md +0 -0
  46. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/scope-and-planning/scope_and_length.md +0 -0
  47. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/world-and-setting/canon_management.md +0 -0
  48. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/world-and-setting/setting_as_character.md +0 -0
  49. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/corpus/world-and-setting/worldbuilding_patterns.md +0 -0
  50. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/README.md +0 -0
  51. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/if_genre_consultant.md +0 -0
  52. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/if_platform_advisor.md +0 -0
  53. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/if_prose_writer.md +0 -0
  54. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/if_quality_reviewer.md +0 -0
  55. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/if_story_architect.md +0 -0
  56. {ifcraftcorpus-1.4.0.data → ifcraftcorpus-1.6.0.data}/data/share/ifcraftcorpus/subagents/if_world_curator.md +0 -0
  57. {ifcraftcorpus-1.4.0.dist-info → ifcraftcorpus-1.6.0.dist-info}/WHEEL +0 -0
  58. {ifcraftcorpus-1.4.0.dist-info → ifcraftcorpus-1.6.0.dist-info}/entry_points.txt +0 -0
  59. {ifcraftcorpus-1.4.0.dist-info → ifcraftcorpus-1.6.0.dist-info}/licenses/LICENSE +0 -0
  60. {ifcraftcorpus-1.4.0.dist-info → ifcraftcorpus-1.6.0.dist-info}/licenses/LICENSE-CONTENT +0 -0
@@ -0,0 +1,1481 @@
1
+ ---
2
+ title: Agent Prompt Engineering
3
+ summary: Techniques for crafting effective LLM agent prompts—attention patterns, tool design, context layering, model size considerations, and testing strategies.
4
+ topics:
5
+ - prompt-engineering
6
+ - llm-agents
7
+ - attention-patterns
8
+ - tool-design
9
+ - context-management
10
+ - small-models
11
+ - chain-of-thought
12
+ - few-shot-learning
13
+ - list-completeness
14
+ - validation-loops
15
+ - external-validation
16
+ - manifest-first
17
+ - scoped-identifiers
18
+ - error-classification
19
+ - sampling-parameters
20
+ - temperature
21
+ - output-diversity
22
+ - creativity-levels
23
+ cluster: agent-design
24
+ ---
25
+
26
+ # Agent Prompt Engineering
27
+
28
+ Techniques for crafting effective prompts for LLM agents—attention patterns, tool design, context layering, and strategies for different model sizes.
29
+
30
+ This document is useful both for agents creating content AND for humans designing agents.
31
+
32
+ ---
33
+
34
+ ## Attention Patterns
35
+
36
+ ### Lost in the Middle
37
+
38
+ LLMs exhibit a U-shaped attention curve: information at the **beginning** and **end** of prompts receives stronger attention than content in the middle.
39
+
40
+ ```
41
+ Position in prompt: [START] -------- [MIDDLE] -------- [END]
42
+ Attention strength: HIGH               LOW              HIGH
43
+ ```
44
+
45
+ Critical instructions placed in the middle of a long prompt may be ignored, even by otherwise capable models.
46
+
47
+ ### The Sandwich Pattern
48
+
49
+ For critical instructions, repeat them at the **start AND end** of the prompt:
50
+
51
+ ```markdown
52
+ ## CRITICAL: You are an orchestrator. NEVER write prose yourself.
53
+
54
+ [... 500+ lines of context ...]
55
+
56
+ ## REMINDER: You are an orchestrator. NEVER write prose yourself.
57
+ ```
58
+
59
+ ### Ordering for Attention
60
+
61
+ Structure prompts strategically given the U-shaped curve:
62
+
63
+ **Recommended order:**
64
+
65
+ 1. **Critical behavioral constraints** (lines 1-20)
66
+ 2. **Role identity and purpose** (lines 21-50)
67
+ 3. **Tool descriptions** (if using function calling)
68
+ 4. **Reference material** (middle—lowest attention)
69
+ 5. **Knowledge summaries** (for retrieval patterns)
70
+ 6. **Critical reminder** (last 10-20 lines)
71
+
72
+ **What goes in the middle:**
73
+
74
+ Lower-priority content that can be retrieved on demand:
75
+
76
+ - Detailed procedures
77
+ - Reference tables
78
+ - Quality criteria details
79
+ - Examples (use retrieval when possible)
80
+
81
+ ---
82
+
83
+ ## List Completeness Patterns
84
+
85
+ When LLMs must process every item in a list (entity decisions, task completions, validation checklists), they frequently skip items—especially in the middle of long lists. This section describes patterns to ensure completeness.
86
+
87
+ ### Numbered Lists vs Checkboxes
88
+
89
+ Numbered lists outperform checkboxes for sequential processing:
90
+
91
+ | Format | Behavior | Reliability |
92
+ |--------|----------|-------------|
93
+ | `- [ ] item` | Treated as optional; often reformatted creatively | Lower |
94
+ | `1. item` | Signals discrete task requiring attention | Higher |
95
+
96
+ **Why it works:** Numbered format implies a sequence of individual tasks. Combined with explicit counts, this creates accountability that the checkbox format cannot match.
97
+
98
+ **Example:**
99
+
100
+ Anti-pattern:
101
+
102
+ ```text
103
+ - [ ] Decide on entity: butler_jameson
104
+ - [ ] Decide on entity: guest_clara
105
+ - [ ] Decide on entity: archive_room
106
+ ```
107
+
108
+ Better:
109
+
110
+ ```text
111
+ Entity Decisions (3 total):
112
+ 1. butler_jameson — [your decision]
113
+ 2. guest_clara — [your decision]
114
+ 3. archive_room — [your decision]
115
+ ```
116
+
117
+ ### Quantity Anchoring
118
+
119
+ State exact counts at both start AND end of prompts (sandwich pattern for quantities):
120
+
121
+ ```markdown
122
+ # REQUIREMENT: Exactly 21 Entity Decisions
123
+
124
+ [numbered list of 21 entities]
125
+
126
+ ...
127
+
128
+ # REMINDER: 21 entity decisions required. You must provide a decision for all 21.
129
+ ```
130
+
131
+ The explicit number creates a concrete, verifiable target. Vague instructions like "all items" or "every entity" are easier to satisfy incompletely.
132
+
133
+ ### Anti-Skipping Statements
134
+
135
+ Direct statements about completeness requirements are effective, especially when combined with the sandwich pattern:
136
+
137
+ | Position | Example |
138
+ |----------|---------|
139
+ | Start | "You must process ALL 21 entities. Skipping any is not acceptable." |
140
+ | End | "Total: 21 entities. Confirm you provided a decision for every single one." |
141
+
142
+ These explicit constraints work because they:
143
+
144
+ - Create a falsifiable claim the model must satisfy
145
+ - Exploit primacy/recency attention patterns
146
+ - Provide a concrete metric (count) rather than vague completeness
147
+
148
+ ### External Validation Required
149
+
150
+ **LLMs cannot reliably self-verify completeness mid-generation.**
151
+
152
+ Research shows that self-verification checklists embedded in prompts are frequently ignored or filled incorrectly. This is a fundamental limitation: LLMs operate via approximate retrieval, not logical verification.
153
+
154
+ **Anti-pattern:**
155
+
156
+ ```markdown
157
+ Before submitting, verify:
158
+ - [ ] I processed all 21 entities
159
+ - [ ] No entity was skipped
160
+ - [ ] Each decision is justified
161
+ ```
162
+
163
+ The model will often check these boxes without actually verifying.
164
+
165
+ **Better approach:**
166
+
167
+ ```text
168
+ 1. Generate output (entity decisions)
169
+ 2. External code counts decisions: found 20, expected 21
170
+ 3. Feedback: "Missing decision for entity 'guest_clara'. Provide decision."
171
+ 4. Model repairs the specific gap
172
+ ```
173
+
174
+ The "Validate → Feedback → Repair" loop (see below) must use **external logic**, not LLM self-assessment.
175
+
176
+ ### Combining Patterns
177
+
178
+ For maximum completeness on list-processing tasks:
179
+
180
+ 1. Use **numbered lists** (not checkboxes)
181
+ 2. State **exact count** at start and end (sandwich)
182
+ 3. Include **anti-skipping statements** at start and end
183
+ 4. Validate **externally** after generation
184
+ 5. Provide **specific feedback** naming missing items
185
+
186
+ This combination addressed a real-world failure where gpt-4o-mini skipped 1 of 21 entities despite an embedded entity checklist.
187
+
188
+ ---
189
+
190
+ ## Tool Design
191
+
192
+ ### Tool Count Effects
193
+
194
+ Tool count strongly affects tool-call compliance, especially for smaller models; compliance drops as the tool count grows:
195
+
196
+ | Tool Count | Compliance Rate (8B model) |
197
+ |------------|---------------------------|
198
+ | 6 tools | ~100% |
199
+ | 12 tools | ~85% |
200
+ | 20 tools | ~70% |
201
+
202
+ **Recommendations:**
203
+
204
+ - **Small models (≤8B)**: Limit to 6-8 tools
205
+ - **Medium models (9B-70B)**: Up to 12 tools
206
+ - **Large models (70B+)**: Can handle 15+ but consider UX
207
+
208
+ ### Tool Schema Overhead
209
+
210
+ Tool schemas sent via function calling are often larger than the system prompt itself:
211
+
212
+ | Component | Typical Size |
213
+ |-----------|--------------|
214
+ | Tool name | ~5 tokens |
215
+ | Description | 50-150 tokens |
216
+ | Parameter schema | 100-300 tokens |
217
+ | **Per tool total** | 150-450 tokens |
218
+ | **13 tools** | **2,000-5,900 tokens** |
219
+
220
+ ### Optimization Strategies
221
+
222
+ **1. Model-Class Filtering**
223
+
224
+ Define reduced tool sets for small models:
225
+
226
+ ```json
227
+ {
228
+ "tools": ["delegate", "communicate", "search", "save", ...],
229
+ "small_model_tools": ["delegate", "communicate", "save"]
230
+ }
231
+ ```
232
+
233
+ **2. Two-Stage Selection**
234
+
235
+ For large tool libraries (20+):
236
+
237
+ 1. Show lightweight menu (name + summary only)
238
+ 2. Agent selects relevant tools
239
+ 3. Load full schema only for selected tools
240
+
241
+ Research shows 50%+ token reduction with 3x accuracy improvement.
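+
+ A minimal selection sketch; the tool library layout, `ask_llm`, and `load_schema` are illustrative placeholders, not a specific framework API:
+
+ ```python
+ TOOL_LIBRARY = {
+     "delegate": "Hand off a task to another agent.",
+     "search_corpus": "Search craft knowledge entries.",
+     # ... 20+ more name -> one-line summary pairs
+ }
+
+ def select_tools(task: str, ask_llm, load_schema) -> list[dict]:
+     # Stage 1: show only the lightweight menu (name + summary)
+     menu = "\n".join(f"- {name}: {summary}" for name, summary in TOOL_LIBRARY.items())
+     reply = ask_llm(f"Task: {task}\nList up to 5 relevant tool names, one per line:\n{menu}")
+     selected = [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]
+     # Stage 2: load full parameter schemas only for the selected tools
+     return [load_schema(name) for name in selected if name in TOOL_LIBRARY]
+ ```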
242
+
243
+ **3. Deferred Loading**
244
+
245
+ Mark specialized tools as discoverable but not pre-loaded. They appear in a search interface rather than being sent to the API upfront.
246
+
247
+ **4. Concise Descriptions**
248
+
249
+ 1-2 sentences max. Move detailed usage guidance to knowledge entries.
250
+
251
+ **Before** (~80 tokens):
252
+
253
+ > "Delegate work to another agent. This hands off control until the agent completes the task. Provide task description, context, expected outputs, and quality criteria. The receiving agent executes and returns control with artifacts and assessment."
254
+
255
+ **After** (~20 tokens):
256
+
257
+ > "Hand off a task to another agent. Control returns when they complete."
258
+
259
+ **5. Minimal Parameter Schemas**
260
+
261
+ For small models, simplify schemas:
262
+
263
+ **Full** (~200 tokens): All optional parameters with descriptions
264
+
265
+ **Minimal** (~50 tokens): Only required parameters
266
+
267
+ Optional parameters can use reasonable defaults.
268
+
269
+ ### Tool Description Biasing
270
+
271
+ Tool descriptions have **higher influence** than system prompt content when models decide which tool to call.
272
+
273
+ **Problem:**
274
+
275
+ If a tool description contains prescriptive language ("ALWAYS use this", "This is the primary method"), models will prefer that tool regardless of system prompt instructions.
276
+
277
+ **Solution:**
278
+
279
+ Use **neutral, descriptive** tool descriptions. Let the **system prompt** dictate when to use tools.
280
+
281
+ **Anti-pattern:**
282
+
283
+ > "ALWAYS use this tool to create story content. This is the primary way to generate text."
284
+
285
+ **Better:**
286
+
287
+ > "Creates story prose from a brief. Produces narrative text with dialogue and descriptions."
288
+
289
+ ---
290
+
291
+ ## Context Architecture
292
+
293
+ ### The Four Layers
294
+
295
+ Organize agent prompts into distinct layers:
296
+
297
+ | Layer | Purpose | Token Priority |
298
+ |-------|---------|----------------|
299
+ | **System** | Core identity, constraints | High (always include) |
300
+ | **Task** | Current instructions | High |
301
+ | **Tool** | Tool descriptions/schemas | Medium (filter for small models) |
302
+ | **Memory** | Historical context | Variable (summarize as needed) |
303
+
304
+ ### Benefits of Layer Separation
305
+
306
+ - **Debugging**: Isolate which layer caused unexpected behavior
307
+ - **Model switching**: System layer stays constant across model sizes
308
+ - **Token management**: Each layer can be independently compressed
309
+ - **Caching**: System and tool layers can be cached between turns
310
+
311
+ ### Menu + Consult Pattern
312
+
313
+ For knowledge that agents need access to but not always in context:
314
+
315
+ **Structure:**
316
+
317
+ ```
318
+ System prompt contains:
319
+ - Summary/menu showing what knowledge exists
320
+ - Tool to retrieve full details
321
+
322
+ System prompt does NOT contain:
323
+ - Full knowledge content
324
+ - Detailed procedures
325
+ - Reference material
326
+ ```
327
+
328
+ **Benefits:**
329
+
330
+ - Smaller initial prompt
331
+ - Agent can "pull" knowledge when needed
332
+ - Works well with small models
333
+
334
+ ### When to Inject vs. Consult
335
+
336
+ | Content Type | Small Model | Large Model |
337
+ |--------------|-------------|-------------|
338
+ | Role identity | Inject | Inject |
339
+ | Behavioral constraints | Inject | Inject |
340
+ | Workflow procedures | Consult | Inject or Consult |
341
+ | Quality criteria | Consult | Inject |
342
+ | Reference material | Consult | Consult |
343
+
344
+ ---
345
+
346
+ ## Model Size Considerations
347
+
348
+ ### Token Budgets
349
+
350
+ | Model Class | Recommended System Prompt |
351
+ |-------------|---------------------------|
352
+ | Small (≤8B) | ≤2,000 tokens |
353
+ | Medium (9B-70B) | ≤6,000 tokens |
354
+ | Large (70B+) | ≤12,000 tokens |
355
+
356
+ Exceeding these budgets leads to:
357
+
358
+ - Ignored instructions (especially in the middle)
359
+ - Reduced tool compliance
360
+ - Hallucinated responses
361
+
362
+ ### Instruction Density
363
+
364
+ Small models struggle with:
365
+
366
+ - Conditional logic: "If X and not Y, then Z unless W"
367
+ - Multiple competing priorities
368
+ - Nuanced edge cases
369
+
370
+ **Simplify for small models:**
371
+
372
+ - "Always call delegate" (not "call delegate unless validating")
373
+ - One instruction per topic
374
+ - Remove edge case handling (accept lower quality)
375
+
376
+ ### Concise Content Pattern
377
+
378
+ Provide two versions of guidance:
379
+
380
+ ```json
381
+ {
382
+ "summary": "Orchestrators delegate tasks to specialists. Before delegating, consult the relevant playbook to understand the workflow. Pass artifact IDs between steps. Monitor completion.",
383
+ "concise_summary": "Delegate to specialists. Consult playbook first."
384
+ }
385
+ ```
386
+
387
+ Runtime selects the appropriate version based on model class.
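+
+ A minimal runtime-selection sketch, assuming each knowledge entry carries both fields and the caller knows the model class:
+
+ ```python
+ def pick_summary(entry: dict, model_class: str) -> str:
+     # Fall back to the full summary when no concise version exists.
+     if model_class == "small" and "concise_summary" in entry:
+         return entry["concise_summary"]
+     return entry["summary"]
+ ```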
388
+
389
+ ### Semantic Ambiguity
390
+
391
+ Avoid instructions that can be interpreted multiple ways.
392
+
393
+ **Anti-pattern:**
394
+
395
+ > "Use your best judgment to determine when validation is needed."
396
+
397
+ Small models may interpret as "never validate" or "always validate."
398
+
399
+ **Better:**
400
+
401
+ > "Call validate after every save."
402
+
403
+ ---
404
+
405
+ ## Sampling Parameters
406
+
407
+ Sampling parameters control the randomness and diversity of LLM outputs. The two most important are **temperature** and **top_p**. These can be set per API call, enabling different settings for different phases of a workflow.
408
+
409
+ ### Temperature
410
+
411
+ Temperature controls the probability distribution over tokens. Lower values make the model more deterministic; higher values increase randomness and creativity.
412
+
413
+ | Temperature | Effect | Use Cases |
414
+ |-------------|--------|-----------|
415
+ | 0.0–0.2 | Highly deterministic, consistent | Structured output, tool calling, factual responses |
416
+ | 0.3–0.5 | Balanced, slight variation | General conversation, summarization |
417
+ | 0.6–0.8 | More creative, diverse | Brainstorming, draft generation |
418
+ | 0.9–1.0+ | High randomness, exploratory | Creative writing, idea exploration, poetry |
419
+
420
+ **How it works:** Temperature scales the logits (pre-softmax scores) before sampling. At T=0, the model always picks the highest-probability token. At T>1, probability differences flatten, making unlikely tokens more probable.
421
+
422
+ **Caveats:**
423
+
424
+ - Even T=0 isn't fully deterministic—hardware concurrency and floating-point variations can introduce tiny differences
425
+ - High temperature increases hallucination risk
426
+ - Temperature interacts with top_p; tuning both simultaneously requires care
427
+
428
+ ### Top_p (Nucleus Sampling)
429
+
430
+ Top_p limits sampling to the smallest set of tokens whose cumulative probability exceeds p. This provides a different control over diversity than temperature.
431
+
432
+ | Top_p | Effect |
433
+ |-------|--------|
434
+ | 0.1–0.3 | Very focused, few token choices |
435
+ | 0.5–0.7 | Moderate diversity |
436
+ | 0.9–1.0 | Wide sampling, more variation |
437
+
438
+ **Temperature vs Top_p:**
439
+
440
+ - Temperature affects *all* token probabilities uniformly
441
+ - Top_p dynamically adjusts the candidate pool based on probability mass
442
+ - For most use cases, adjust one and leave the other at default
443
+ - Common pattern: low temperature (0.0–0.3) with top_p=1.0 for structured tasks
444
+
445
+ ### Provider Temperature Ranges
446
+
447
+ Temperature ranges and behavior vary significantly across providers:
448
+
449
+ | Provider | Range | Default | Notes |
450
+ |----------|-------|---------|-------|
451
+ | OpenAI | 0.0–2.0 | 1.0 | GPT-3.5 was 0.0–1.0; GPT-4+ expanded range |
452
+ | Anthropic | 0.0–1.0 | 1.0 | Hard cap at 1.0; more conservative at same value |
453
+ | Gemini | 0.0–2.0 | 1.0 | Gemini 3 recommends keeping at 1.0 for reasoning |
454
+ | DeepSeek | 0.0–2.0 | 1.0 | API temp 1.0 internally maps to 0.3 (see below) |
455
+ | Ollama | 0.0–1.0+ | 0.7–0.8 | Model-dependent; Qwen recommends 0.7 |
456
+
457
+ **DeepSeek temperature mapping:** DeepSeek implements an internal mapping where API temperature values are transformed before use. Notably, `temperature=1.0` sent via API is internally converted to `0.3`. This means DeepSeek at the "default" temperature behaves significantly more conservatively than other providers at the same nominal value.
458
+
459
+ **Cross-provider equivalence:** Research shows that "optimal settings are wildly model-dependent"—experiments have found nearly opposite configurations work best for the same task on different models. There is no universal cross-provider temperature mapping; the relationship between temperature values and output behavior varies by model architecture and training.
460
+
461
+ ### Semantic Creativity Levels
462
+
463
+ Rather than using raw temperature values directly, consider defining semantic creativity levels that map to provider-appropriate values at runtime:
464
+
465
+ | Level | Intent | OpenAI | Anthropic | DeepSeek | Ollama |
466
+ |-------|--------|--------|-----------|----------|--------|
467
+ | Deterministic | Structured output, JSON, tool calls | 0.0 | 0.0 | 0.0 | 0.0 |
468
+ | Focused | Consistent with minimal variation | 0.3 | 0.3 | 1.0 | 0.3 |
469
+ | Balanced | Default for most tasks | 0.7 | 0.5 | 1.3 | 0.7 |
470
+ | Creative | Brainstorming, exploration | 1.0 | 0.8 | 1.5 | 0.9 |
471
+ | Experimental | Maximum diversity (coherence may degrade) | 1.5 | 1.0 | 2.0 | 1.0+ |
472
+
473
+ This abstraction allows code to express intent (`CreativityLevel.CREATIVE`) while handling provider differences in a single mapping layer.
474
+
475
+ **Implementation pattern:**
476
+
477
+ ```python
478
+ from enum import Enum
479
+
480
+ class CreativityLevel(Enum):
+     DETERMINISTIC = "deterministic"
+     FOCUSED = "focused"
+     BALANCED = "balanced"
+     CREATIVE = "creative"
+     EXPERIMENTAL = "experimental"
+
+ TEMPERATURE_MAP = {
+     "openai": {
+         CreativityLevel.DETERMINISTIC: 0.0,
+         CreativityLevel.FOCUSED: 0.3,
+         CreativityLevel.BALANCED: 0.7,
+         CreativityLevel.CREATIVE: 1.0,
+         CreativityLevel.EXPERIMENTAL: 1.5,
+     },
+     "anthropic": {
+         CreativityLevel.DETERMINISTIC: 0.0,
+         CreativityLevel.FOCUSED: 0.3,
+         CreativityLevel.BALANCED: 0.5,
+         CreativityLevel.CREATIVE: 0.8,
+         CreativityLevel.EXPERIMENTAL: 1.0,
+     },
+     # ... other providers
+ }
504
+ ```
505
+
506
+ ### Provider Parameter Support
507
+
508
+ Not all sampling parameters are available across providers:
509
+
510
+ | Parameter | OpenAI | Anthropic | Gemini | DeepSeek | Ollama |
511
+ |-----------|--------|-----------|--------|----------|--------|
512
+ | temperature | ✓ | ✓ | ✓ | ✓ | ✓ |
513
+ | top_p | ✓ | ✓* | ✓ | ✓ | ✓ |
514
+ | top_k | ✗ | ✓ | ✓ | ✗ | ✓ |
515
+ | frequency_penalty | ✓ | ✗ | ✓ | ✓ | ✗ |
516
+ | presence_penalty | ✓ | ✗ | ✓ | ✓ | ✗ |
517
+ | seed | ✓ | ✗ | ✗ | ✓ | ✓ |
518
+ | min_p | ✗ | ✗ | ✗ | ✗ | ✓ |
519
+ | repeat_penalty | ✗ | ✗ | ✗ | ✗ | ✓ |
520
+
521
+ *Claude 4.5+ enforces mutual exclusion: temperature and top_p cannot both be specified.
522
+
523
+ **Anthropic's constraints:** Claude models lack frequency/presence penalties and seed parameters. Anthropic recommends using temperature alone and leaving other parameters at defaults.
524
+
525
+ **Ollama/local models:** Support additional parameters like min_p, repeat_penalty, and mirostat that aren't available via cloud APIs. These can be valuable for fine-tuning local model behavior.
526
+
527
+ ### Additional Sampling Parameters
528
+
529
+ Beyond temperature and top_p, several other parameters affect output diversity:
530
+
531
+ **Min_p (local models):** Scales the truncation threshold based on the top token's probability. Unlike top_p's fixed threshold, min_p adapts to model confidence—more restrictive when the model is confident, more permissive when uncertain. This maintains coherence even at higher temperatures.
532
+
533
+ | Min_p | Effect |
534
+ |-------|--------|
535
+ | 0.0 | Disabled (Ollama default) |
536
+ | 0.05–0.1 | Recommended range; balances creativity and coherence |
537
+ | 0.2+ | More restrictive; reduces diversity |
538
+
539
+ **Frequency penalty (OpenAI, Gemini, DeepSeek):** Penalizes tokens proportionally to how often they've appeared. A token appearing 10 times receives a higher penalty than one appearing twice. Range: -2.0 to 2.0, default 0.0.
540
+
541
+ **Presence penalty (OpenAI, Gemini, DeepSeek):** Penalizes any repeated token equally, regardless of frequency. Encourages the model to introduce new topics. Range: -2.0 to 2.0, default 0.0.
542
+
543
+ **Repeat penalty (Ollama):** Similar to frequency penalty but operates on a sliding window of recent tokens. Values >1.0 discourage repetition. Start at 1.0–1.05; avoid exceeding 1.2 as it can degrade coherence.
544
+
545
+ **Mirostat (Ollama):** Adaptive algorithm that maintains target perplexity throughout generation. Dynamically adjusts sampling to balance coherence and diversity. When enabled, disable other sampling parameters (set top_p=1.0, top_k=0, min_p=0.0).
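+
+ A hedged sketch of passing these options through the `ollama` Python client; the model name and values are illustrative starting points from the guidance above:
+
+ ```python
+ import ollama
+
+ response = ollama.chat(
+     model="qwen2.5:7b",  # illustrative local model
+     messages=[{"role": "user", "content": "Draft three unusual story hooks."}],
+     options={
+         "temperature": 0.9,
+         "min_p": 0.05,          # keeps coherence at higher temperatures
+         "repeat_penalty": 1.05,
+         # For mirostat, enable it and neutralize the other samplers:
+         # "mirostat": 2, "top_p": 1.0, "top_k": 0, "min_p": 0.0,
+     },
+ )
+ print(response["message"]["content"])
+ ```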
546
+
547
+ ### Understanding Output Diversity
548
+
549
+ Temperature controls *token-level* randomness but does not guarantee *conceptual* diversity. Research has documented consistent patterns in LLM creative output:
550
+
551
+ **The "Echoes" phenomenon:** When generating multiple outputs for the same creative prompt, LLMs produce strikingly similar plot elements, narrative structures, and ideas—even at high temperatures. A study examining GPT-4 story continuations found that across 100 generations:
552
+
553
+ - 50/100 had a policeman giving directions to "take the second left"
554
+ - 18/100 directed to "take the second right"
555
+ - 16/100 mentioned a bakery as a landmark
556
+
557
+ These "echoes" repeat across generations at all semantic levels.
558
+
559
+ **Cross-model homogeneity:** LLM responses are more similar to *other LLM responses* than human responses are to each other—even across different model families. This isn't specific to any single model; it appears to be characteristic of how LLMs function.
560
+
561
+ **Why this happens:**
562
+
563
+ - Temperature affects which *tokens* are selected, not which *concepts* are explored
564
+ - Training data contains statistical regularities that models learn and reproduce
565
+ - Smaller models exhibit this more strongly due to narrower "creative paths" in their weights
566
+ - RLHF/preference tuning may further narrow the distribution of "acceptable" outputs
567
+
568
+ **Implications:**
569
+
570
+ - High temperature alone won't produce fundamentally different ideas
571
+ - Multiple generations with the same prompt will converge on similar structures
572
+ - This behavior is predictable and can be useful (consistency) or limiting (creative tasks)
573
+
574
+ ### Techniques for Increasing Diversity
575
+
576
+ When varied outputs are desired, several techniques can help—used individually or in combination:
577
+
578
+ **Sampling parameter adjustments:**
579
+
580
+ | Technique | Effect | Trade-off |
581
+ |-----------|--------|-----------|
582
+ | Higher temperature (0.8–1.2) | More varied token selection | Coherence degrades at high values |
583
+ | Higher top_p (0.9–0.95) | Larger candidate pool | May introduce unlikely tokens |
584
+ | Min_p (0.05–0.1) | Maintains coherence at high temps | Local models only |
585
+ | Presence penalty (0.3–0.6) | Discourages concept reuse | May force awkward topic changes |
586
+ | Seed variation | Different random paths | Requires seed support |
587
+
588
+ **Prompt-based techniques:**
589
+
590
+ *Explicit diversity requests:*
591
+
592
+ ```text
593
+ Generate an unusual or unexpected approach to this problem.
594
+ Avoid common tropes like [X, Y, Z].
595
+ ```
596
+
597
+ *List prompting:*
598
+
599
+ ```text
600
+ List 5 different approaches to this story opening, then select
601
+ the most unconventional one to develop.
602
+ ```
603
+
604
+ *Verbalized sampling:* Ask the model to generate multiple responses with probabilities, then sample from the distribution's tails:
605
+
606
+ ```text
607
+ Generate 5 possible story openings with estimated probability for each.
608
+ Then select one with probability < 0.10 to develop further.
609
+ ```
610
+
611
+ This technique achieves 2–3× diversity improvement in benchmarks while maintaining quality, and works with any model via prompting.
612
+
613
+ **Structural approaches:**
614
+
615
+ - Vary the prompt itself across generations (different framings, constraints)
616
+ - Use different system prompts that prime different "creative modes"
617
+ - Chain outputs: use one generation's unexpected element as input for the next
618
+
619
+ ### Phase-Specific Temperature
620
+
621
+ Since temperature can be set per API call, use different values for different workflow phases:
622
+
623
+ | Phase | Temperature | Rationale |
624
+ |-------|-------------|-----------|
625
+ | Brainstorming/Discuss | 0.7–1.0 | Encourage diverse ideas, exploration |
626
+ | Planning/Freeze | 0.3–0.5 | Balance creativity with coherence |
627
+ | Serialize/Tool calls | 0.0–0.2 | Maximize format compliance |
628
+ | Validation repair | 0.0–0.2 | Deterministic corrections |
629
+
630
+ This is particularly relevant for the **Discuss → Freeze → Serialize** pattern described below—each stage benefits from different temperature settings.
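+
+ A minimal per-phase sketch using the OpenAI Python client; the phase values and model choice are illustrative:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI()
+ PHASE_TEMPERATURE = {"discuss": 0.9, "freeze": 0.4, "serialize": 0.1, "repair": 0.1}
+
+ def run_phase(phase: str, messages: list[dict]) -> str:
+     # Temperature is a per-request parameter, so each phase can use its own value.
+     response = client.chat.completions.create(
+         model="gpt-4o-mini",  # illustrative model choice
+         messages=messages,
+         temperature=PHASE_TEMPERATURE[phase],
+     )
+     return response.choices[0].message.content
+ ```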
631
+
632
+ ---
633
+
634
+ ## Structured Output Pipelines
635
+
636
+ Many agent tasks end in a **strict artifact**—JSON/YAML configs, story plans, outlines—rather than free-form prose. Trying to get both *conversation* and *perfectly formatted output* from a single response is brittle, especially for small/local models.
637
+
638
+ A more reliable approach is to separate the flow into stages:
639
+
640
+ 1. **Discuss** – messy, human-friendly turns to clarify goals and constraints. No structured output yet.
641
+ 2. **Freeze** – summarize final decisions into a compact, explicit list (facts & constraints).
642
+ 3. **Serialize** – a dedicated call whose only job is to emit the structured artifact, constrained by a schema or tool signature.
643
+
644
+ ### Discuss → Freeze → Serialize
645
+
646
+ **Discuss** (temperature 0.7–1.0): Keep prompts focused on meaning, not field names. Explicitly tell the model *not* to output JSON/YAML during this phase. Higher temperature encourages diverse ideas and creative exploration.
647
+
648
+ **Freeze** (temperature 0.3–0.5): Compress decisions into a short summary:
649
+
650
+ - 10–30 bullets, one decision per line.
651
+ - No open questions, only resolved choices.
652
+ - Structured enough that a smaller model can follow it reliably.
653
+ - Moderate temperature balances coherence with flexibility.
654
+
655
+ **Serialize** (temperature 0.0–0.2): In a separate call:
656
+
657
+ - Provide the schema (JSON Schema, typed model, or tool definition).
658
+ - Instruct: *"Output only JSON that matches this schema. No prose, no markdown fences."*
659
+ - Use constrained decoding/tool calling where available.
660
+ - Low temperature maximizes format compliance.
661
+
662
+ This separates conversational drift from serialization, which significantly improves reliability for structured outputs like story plans, world-bible slices, or configuration objects. The temperature gradient—high for exploration, low for precision—matches each phase's purpose.
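+
+ A compact sketch of the freeze and serialize calls, assuming the discuss turns already live in `transcript` and `chat(messages, temperature)` is a thin helper around your provider's API:
+
+ ```python
+ import json
+
+ def plan_story(chat, transcript: list[dict]) -> dict:
+     # Freeze: compress the discussion into explicit, resolved decisions.
+     freeze = chat(
+         transcript + [{"role": "user", "content":
+             "Summarize the final decisions as 10-30 bullets, one decision per line. "
+             "Resolved choices only, no open questions. Do not output JSON yet."}],
+         temperature=0.4,
+     )
+     # Serialize: a dedicated low-temperature call whose only job is the artifact.
+     raw = chat(
+         [{"role": "user", "content":
+             "Decisions:\n" + freeze +
+             "\n\nOutput only JSON matching the story-plan schema. No prose, no markdown fences."}],
+         temperature=0.1,
+     )
+     return json.loads(raw)
+ ```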
663
+
664
+ ### Tool-Gated Finalization
665
+
666
+ An alternative is to represent structured output as a **single tool call**:
667
+
668
+ - During normal conversation: no tools are called.
669
+ - On FINALIZE: the agent must call a tool such as `submit_plan(plan: PlanSchema)` exactly once.
670
+
671
+ Pros:
672
+
673
+ - Structured data arrives as typed arguments (no text parsing).
674
+ - The runtime can validate arguments immediately.
675
+
676
+ Cons:
677
+
678
+ - Some models occasionally skip the tool call or send partial arguments.
679
+
680
+ Pattern in practice (a tool-definition sketch follows this list):
681
+
682
+ - Prefer tool-gated finalization when your stack treats tools as first-class.
683
+ - Keep a fallback: if the tool call doesn't happen, fall back to a serialize-only call using the freeze summary.
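+
+ A sketch of such a finalization tool in OpenAI function-calling format; the plan fields shown are illustrative, not a fixed `PlanSchema`:
+
+ ```python
+ SUBMIT_PLAN_TOOL = {
+     "type": "function",
+     "function": {
+         "name": "submit_plan",
+         "description": "Submit the final story plan. Call exactly once, on FINALIZE.",
+         "parameters": {
+             "type": "object",
+             "properties": {
+                 "title": {"type": "string"},
+                 "protagonist_name": {"type": "string"},
+                 "estimated_passages": {"type": "integer", "minimum": 1, "maximum": 10},
+             },
+             "required": ["title", "protagonist_name", "estimated_passages"],
+         },
+     },
+ }
+ ```
+
+ Because the arguments arrive typed, the runtime can validate them immediately instead of parsing text.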
684
+
685
+ ### Manifest-First Serialization
686
+
687
+ When LLMs must generate decisions for every item in a known set (entities, tensions, threads), **extraction-style prompts fail**. The manifest-first pattern prevents omission by construction.
688
+
689
+ #### The Problem: Extraction Mindset
690
+
691
+ Extraction prompts ask the LLM to find information in the context:
692
+
693
+ ```markdown
694
+ # WRONG (extraction mindset)
695
+ Based on the discussion above, extract entity decisions.
696
+ Do NOT include entities not listed in the discussion.
697
+ ```
698
+
699
+ **Why it fails:**
700
+
701
+ - Entities discovered via tool calls but not echoed in assistant text are "invisible"
702
+ - The LLM satisfies the prompt by omitting items it can't find
703
+ - Information loss compounds through pipeline stages
704
+
705
+ #### The Solution: Generation Mindset
706
+
707
+ Generation prompts frame each item as a **required output**:
708
+
709
+ ```markdown
710
+ # CORRECT (generation mindset)
711
+ ## Generation Requirements (CRITICAL)
712
+
713
+ You MUST generate a decision for EVERY entity ID listed below.
714
+ Missing items WILL fail validation.
715
+
716
+ Entity IDs (5 total):
717
+ 1. entity::hero
718
+ 2. entity::tavern
719
+ 3. entity::sword
720
+ 4. entity::dragon
721
+ 5. entity::treasure
722
+
723
+ For EACH entity above, provide: [retain|cut] - justification
724
+ ```
725
+
726
+ **Key differences:**
727
+
728
+ | Extraction Mindset | Generation Mindset |
729
+ |-------------------|-------------------|
730
+ | "Extract entities from context" | "Generate decision for EACH entity below" |
731
+ | "Do NOT include unlisted items" | "Missing items WILL fail validation" |
732
+ | Omission is compliance | Omission is failure |
733
+ | Post-hoc validation catches errors | Prevention by construction |
734
+
735
+ #### The Three Gates
736
+
737
+ Structure prompts with explicit prevention at each stage boundary:
738
+
739
+ | Gate | Stage Boundary | Mechanism |
740
+ |------|----------------|-----------|
741
+ | **Prevention** | Before Summarize | Manifest lists all IDs requiring decisions |
742
+ | **Enforcement** | Before Serialize | Count-based language ("Generate EXACTLY N") |
743
+ | **Validation** | After Serialize | Structural count check before semantic check |
744
+
745
+ **Gate 1: Prevention (Summarize Prompt)**
746
+
747
+ ```markdown
748
+ ## Required Decisions
749
+
750
+ You must include a decision for EVERY ID below:
751
+
752
+ Entity IDs: entity::hero | entity::tavern | entity::sword
753
+ Thread IDs: thread::host_motive | thread::butler_fidelity
754
+
755
+ Format: `id: [retain|cut] - justification`
756
+ ```
757
+
758
+ **Gate 2: Enforcement (Serialize Prompt)**
759
+
760
+ ```markdown
761
+ ## Generation Requirements
762
+
763
+ Generate EXACTLY:
764
+ - 3 entity decisions (entity::hero, entity::tavern, entity::sword)
765
+ - 2 thread decisions (thread::host_motive, thread::butler_fidelity)
766
+
767
+ Missing items = validation failure. No exceptions.
768
+ ```
769
+
770
+ **Gate 3: Validation (Code)**
771
+
772
+ ```python
773
+ # Fast structural check BEFORE semantic validation
774
+ def validate_completeness(output, manifest):
+     if len(output.entities) != manifest["entity_count"]:
+         return CompletenessError(
+             f"Expected {manifest['entity_count']} entities, got {len(output.entities)}"
+         )
+     # Only proceed to semantic validation if counts match
+     return validate_semantics(output)
781
+ ```
782
+
783
+ #### Count-Based vs String Parsing
784
+
785
+ Count-based validation is more reliable than parsing natural language:
786
+
787
+ ```python
788
+ # FRAGILE: String parsing
789
+ if "all entities" in response.lower():
790
+ # Did they really include all?
791
+
792
+ # ROBUST: Count validation
793
+ if len(response.entities) == expected_count:
794
+ # Structural guarantee
795
+ ```
796
+
797
+ ### Scoped Identifiers
798
+
799
+ When outputs reference multiple ID types (entities, threads, locations, items), use **scoped identifiers** to prevent confusion and enable precise validation.
800
+
801
+ #### Format: `type::raw_id`
802
+
803
+ ```
804
+ entity::hero
805
+ thread::host_motive
806
+ location::tavern_basement
807
+ item::rusty_sword
808
+ ```
809
+
810
+ **Why scoping matters:**
811
+
812
+ Without scoping, the model (and validators) can confuse IDs across types:
813
+
814
+ ```python
815
+ # AMBIGUOUS: Is "hero" an entity, location, or something else?
816
+ ids = ["hero", "tavern", "host_motive"]
817
+
818
+ # UNAMBIGUOUS: Type is explicit
819
+ ids = ["entity::hero", "location::tavern", "thread::host_motive"]
820
+ ```
821
+
822
+ #### Benefits
823
+
824
+ | Benefit | Explanation |
825
+ |---------|-------------|
826
+ | **Disambiguation** | "tavern" as entity vs location is clear |
827
+ | **Validation** | Can validate against type-specific manifests |
828
+ | **Error messages** | "Unknown entity::hero" is more actionable than "Unknown ID: hero" |
829
+ | **Grep-ability** | `entity::` finds all entity references |
830
+ | **Model clarity** | Explicit types reduce hallucination of cross-type references |
831
+
832
+ #### Implementation
833
+
834
+ ```python
835
+ from dataclasses import dataclass
836
+ from typing import Literal
837
+
838
+ @dataclass
+ class ScopedId:
+     type: Literal["entity", "thread", "location", "item"]
+     raw_id: str
+
+     def __str__(self) -> str:
+         return f"{self.type}::{self.raw_id}"
+
+     @classmethod
+     def parse(cls, scoped: str) -> "ScopedId":
+         if "::" not in scoped:
+             raise ValueError(f"Invalid scoped ID (missing '::'): {scoped}")
+         type_part, raw = scoped.split("::", 1)
+         return cls(type=type_part, raw_id=raw)
+
+ def validate_scoped_ids(output_ids: list[str], manifests: dict[str, set[str]]) -> list[str]:
+     """Validate that all scoped IDs exist in their respective manifests."""
+     errors = []
+     for scoped in output_ids:
+         sid = ScopedId.parse(scoped)
+         if sid.type not in manifests:
+             errors.append(f"Unknown ID type: {sid.type}")
+         elif sid.raw_id not in manifests[sid.type]:
+             errors.append(f"Unknown {sid.type}::{sid.raw_id}")
+     return errors
863
+ ```
864
+
865
+ #### Prompt Integration
866
+
867
+ Include scoping in both manifest and output format instructions:
868
+
869
+ ```markdown
870
+ ## Valid IDs (use exact format)
871
+
872
+ Entities: entity::hero | entity::tavern | entity::sword
873
+ Threads: thread::host_motive | thread::butler_fidelity
874
+
875
+ ## Output Format
876
+
877
+ Each decision must use scoped ID format:
878
+ - `entity::hero: retain - central to plot`
879
+ - `thread::host_motive: cut - resolved in Act 1`
880
+ ```
881
+
882
+ ---
883
+
884
+ ## Validate → Feedback → Repair Loop
885
+
886
+ Even with good prompts, structured output will sometimes be **almost** right. Instead of accepting failures or silently discarding data, use a validate-with-feedback loop (a code sketch follows the steps below):
887
+
888
+ 1. Generate a candidate object (JSON/tool args/text).
889
+ 2. Validate it in code (schema/type checks, domain rules).
890
+ 3. If invalid, feed back the errors and ask the model to repair **only** the problems.
891
+ 4. Repeat for a small, fixed number of attempts.
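+
+ A minimal sketch of the loop, assuming `call_llm` and `validate` helpers, where `validate` returns a list of error strings and an empty list means the candidate passed:
+
+ ```python
+ import json
+
+ def generate_with_repair(call_llm, validate, prompt: str, max_attempts: int = 3) -> dict:
+     feedback = ""
+     for attempt in range(max_attempts):
+         raw = call_llm(prompt + feedback)
+         try:
+             candidate = json.loads(raw)
+         except json.JSONDecodeError as exc:
+             errors = [f"Invalid JSON: {exc}"]
+         else:
+             errors = validate(candidate)  # schema + domain rules, in code
+             if not errors:
+                 return candidate          # success
+         feedback = (
+             "\n\nYour previous output:\n" + raw +
+             "\n\nFix ONLY these errors and output only JSON:\n- " + "\n- ".join(errors)
+         )
+     raise RuntimeError(f"Still invalid after {max_attempts} attempts: {errors}")
+ ```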
892
+
893
+ ### Validation Channels
894
+
895
+ Typical validators:
896
+
897
+ - **Schema/type validation:** JSON Schema, Pydantic/dataclasses, or your own type checks.
898
+ - **Domain rules:** length ranges, allowed enum values, cross-field consistency (e.g., word-count vs estimated playtime).
899
+ - **Link/graph checks:** required references exist, no impossible states.
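+
+ The schema and domain-rule channels can share one validator. A sketch using pydantic (v2-style API), with illustrative fields and an illustrative domain rule:
+
+ ```python
+ from pydantic import BaseModel, Field, ValidationError
+
+ class StoryPlan(BaseModel):
+     title: str = Field(min_length=1, max_length=200)
+     estimated_passages: int = Field(ge=1, le=10)
+     tone: str
+
+ def validate_plan(data: dict) -> list[str]:
+     try:
+         plan = StoryPlan.model_validate(data)
+     except ValidationError as exc:
+         return [f"{'.'.join(map(str, e['loc']))}: {e['msg']}" for e in exc.errors()]
+     errors = []
+     # Domain rule beyond the schema (illustrative).
+     if plan.estimated_passages > 8 and plan.tone == "neutral":
+         errors.append("tone: long plans should specify a distinct tone")
+     return errors
+ ```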
900
+
901
+ ### Designing the Feedback Prompt
902
+
903
+ When a candidate fails validation, the repair prompt should:
904
+
905
+ - Include the previous candidate object verbatim.
906
+ - Include a concise list of validation errors, grouped by field.
907
+ - Give strict instructions, e.g.:
908
+
909
+ > “Return a corrected JSON object that fixes **only** these errors. Do not change fields that are not mentioned. Output only JSON.”
910
+
911
+ For small models, keep error descriptions compact and concrete ("string too long: 345 > max 200") rather than abstract.
912
+
913
+ ### Structured Validation Feedback
914
+
915
+ Rather than returning free-form error messages, use a structured feedback format that leverages attention patterns (status first, action last) and distinguishes error types clearly.
916
+
917
+ **Result Categories**
918
+
919
+ Use a semantic result enum rather than boolean success/failure:
920
+
921
+ | Result | Meaning | Model Action |
922
+ |--------|---------|--------------|
923
+ | `accepted` | Validation passed, artifact stored | Proceed to next step |
924
+ | `validation_failed` | Content issues the model can fix | Repair and resubmit |
925
+ | `tool_error` | Infrastructure failure | Retry unchanged or escalate |
926
+
927
+ This distinction matters: `validation_failed` tells the model its *content* was wrong (fixable), while `tool_error` indicates the tool itself failed (retry or give up).
928
+
929
+ **Error Categorization**
930
+
931
+ Group validation errors by type to help the model understand what went wrong:
932
+
933
+ ```json
934
+ {
935
+ "result": "validation_failed",
936
+ "issues": {
937
+ "invalid": [
938
+ {"field": "estimated_passages", "value": 15, "requirement": "must be 1-10"}
939
+ ],
940
+ "missing": ["protagonist_name", "setting"],
941
+ "unknown": ["passages"]
942
+ },
943
+ "issue_count": {"invalid": 1, "missing": 2, "unknown": 1},
944
+ "action": "Fix the 4 issues above and resubmit. Use exact field names from the schema."
945
+ }
946
+ ```
947
+
948
+ | Category | Meaning | Common Cause |
949
+ |----------|---------|--------------|
950
+ | `invalid` | Field present but value wrong | Constraint violation, wrong type |
951
+ | `missing` | Required field not provided | Omission, incomplete output |
952
+ | `unknown` | Field not in schema | Typo, hallucinated field name |
953
+
954
+ The `unknown` category is particularly valuable—it catches near-misses like `passages` instead of `estimated_passages` that would otherwise appear as "missing" with no hint about the typo.
955
+
956
+ **Field Ordering (Primacy/Recency)**
957
+
958
+ Structure feedback to exploit the U-shaped attention curve:
959
+
960
+ 1. **Result status** (first—immediate orientation)
961
+ 2. **Issues by category** (middle—detailed content)
962
+ 3. **Issue count** (severity summary)
963
+ 4. **Action instructions** (last—what to do next)
964
+
965
+ **What NOT to Include**
966
+
967
+ | Avoid | Why |
968
+ |-------|-----|
969
+ | Full schema | Already in tool definition; wastes tokens in retry loops |
970
+ | Boolean `success` field | Ambiguous; use semantic result categories instead |
971
+ | Generic hints | Replace with actionable, field-specific instructions |
972
+ | Valid fields | Only describe what failed, not what succeeded |
973
+
974
+ **Example: Before and After**
975
+
976
+ Anti-pattern (vague, wastes tokens):
977
+
978
+ ```
979
+ Error: Validation failed. Expected fields: type, title, protagonist_name,
980
+ setting, theme, estimated_passages, tone. Please check your submission
981
+ and ensure all required fields are present with valid values.
982
+ ```
983
+
984
+ Better (specific, actionable):
985
+
986
+ ```json
987
+ {
988
+ "result": "validation_failed",
989
+ "issues": {
990
+ "invalid": [{"field": "type", "value": "story", "requirement": "must be 'dream'"}],
991
+ "missing": ["protagonist_name"],
992
+ "unknown": ["passages"]
993
+ },
994
+ "action": "Fix these 3 issues. Did you mean 'estimated_passages' instead of 'passages'?"
995
+ }
996
+ ```
997
+
998
+ The improved version:
999
+
1000
+ - Names the exact fields that failed
1001
+ - Suggests the likely typo (`passages` → `estimated_passages`)
1002
+ - Doesn't repeat schema information already available to the model
1003
+ - Ends with a clear action instruction (primacy/recency)
1004
+
1005
+ ### Retry Budget and Token Efficiency
1006
+
1007
+ Validation loops consume tokens. Design for efficiency:
1008
+
1009
+ - **Cap retries**: 2-3 attempts is usually sufficient; more indicates a prompt or schema problem
1010
+ - **Escalate gracefully**: After retry budget exhausted, surface a clear failure rather than looping
1011
+ - **Track retry rates**: High retry rates signal opportunities for prompt improvement or schema simplification
1012
+ - **Consider model capability**: Less capable models may need higher retry budgets but with simpler feedback
1013
+
1014
+ ### Best Practices
1015
+
1016
+ - **Independent validator:** Treat validation as a separate layer or service whenever possible; don’t let the same model decide if its own output is valid.
1017
+ - **Retry budget:** Cap the number of repair attempts; surface a clear failure state instead of looping indefinitely.
1018
+ - **Partial success:** Prefer emitting valid-but-partial objects over invalid-but-complete-looking ones; downstream systems can handle missing optional fields more safely than malformed structure.
1019
+
1020
+ Validate → feedback → repair is a general pattern:
1021
+
1022
+ - Works for schema-bound JSON/YAML.
1023
+ - Works for more informal artifacts (e.g., checklists, outlines) when combined with light-weight structural checks.
1024
+ - Plays well with the structured-output patterns above and with the reflection/self-critique patterns below.
1025
+
1026
+ ### Enhanced Error Classification
1027
+
1028
+ Not all validation errors are equal. Categorizing errors by type enables targeted retry strategies and prevents wasted computation.
1029
+
1030
+ #### Error Categories
1031
+
1032
+ | Category | Trigger | Retry Strategy | Context |
1033
+ |----------|---------|----------------|---------|
1034
+ | **INNER** | Schema/format errors (JSON syntax, Pydantic failures) | Inner loop with targeted field hints | Name failing field, suggest fix |
1035
+ | **SEMANTIC** | Invalid references, impossible states | Outer loop—repair source | Valid references list, fuzzy match suggestions |
1036
+ | **COMPLETENESS** | Missing items from manifest | Outer loop with explicit count | Show expected count, list missing IDs |
1037
+ | **FATAL** | Unrecoverable (hallucinated structure, token limit) | Stop, surface error | Clear failure message |
1038
+
1039
+ #### Implementation Pattern
1040
+
1041
+ ```python
1042
+ from __future__ import annotations
+
+ from difflib import get_close_matches
+ from enum import Enum, auto
+ from json import JSONDecodeError
+
+ from pydantic import ValidationError as PydanticError
+
+ # `ValidationError` below is assumed to be the project's own error wrapper,
+ # carrying fields such as `field_name`, `expected_type`, and `invalid_reference`.
+
+ class ErrorCategory(Enum):
+     INNER = auto()         # Schema/format - fast retry
+     SEMANTIC = auto()      # Invalid references - repair source
+     COMPLETENESS = auto()  # Missing manifest items - outer loop
+     FATAL = auto()         # Unrecoverable - stop
+
+ def categorize_error(error: ValidationError, manifest: set[str], output_ids: set[str]) -> ErrorCategory:
+     """Route an error to the appropriate retry strategy."""
+
+     # Schema/format errors are always INNER
+     if isinstance(error, (PydanticError, JSONDecodeError)):
+         return ErrorCategory.INNER
+
+     # Check for completeness against the manifest
+     missing = manifest - output_ids
+     if missing:
+         return ErrorCategory.COMPLETENESS
+
+     # Check for invented IDs (semantic error)
+     invented = output_ids - manifest
+     if invented:
+         return ErrorCategory.SEMANTIC
+
+     # Unknown errors are FATAL
+     return ErrorCategory.FATAL
+
+ def get_retry_context(
+     category: ErrorCategory,
+     error: ValidationError,
+     valid_refs: list[str],
+     manifest: set[str],
+     output_ids: set[str],
+ ) -> dict:
+     """Build appropriate context for retry based on error category."""
+
+     if category == ErrorCategory.INNER:
+         return {
+             "failing_field": error.field_name,
+             "expected_type": error.expected_type,
+             "suggestion": error.suggested_fix,
+         }
+
+     if category == ErrorCategory.SEMANTIC:
+         invalid_id = error.invalid_reference
+         return {
+             "invalid_reference": invalid_id,
+             "valid_references": valid_refs,
+             "did_you_mean": get_close_matches(invalid_id, valid_refs, n=3),
+         }
+
+     if category == ErrorCategory.COMPLETENESS:
+         return {
+             "expected_count": len(manifest),
+             "received_count": len(output_ids),
+             "missing_ids": sorted(manifest - output_ids),
+         }
+
+     # FATAL
+     return {"error": str(error), "action": "surface_to_user"}
1098
+ ```
1099
+
1100
+ #### Category-Specific Prompts
1101
+
1102
+ **For INNER errors:**
1103
+
1104
+ ```text
1105
+ Field `{failing_field}` has wrong type.
1106
+ Expected: {expected_type}
1107
+ Received: {received_value}
1108
+ Fix only this field and regenerate.
1109
+ ```
1110
+
1111
+ **For SEMANTIC errors:**
1112
+
1113
+ ```text
1114
+ Reference `{invalid_reference}` does not exist.
1115
+ Valid references: {valid_references}
1116
+ Did you mean: {did_you_mean}?
1117
+ Regenerate the source with a valid reference.
1118
+ ```
1119
+
1120
+ **For COMPLETENESS errors:**
1121
+
1122
+ ```text
1123
+ Expected {expected_count} decisions, received {received_count}.
1124
+ Missing: {missing_ids}
1125
+ Generate decisions for ALL items, including the missing ones.
1126
+ ```
1127
+
1128
+ ### Two-Level Feedback Architecture
1129
+
1130
+ Simple validation loops assume errors can be fixed by repairing the output. But some errors originate earlier in the pipeline—the output is wrong because the *input* was wrong. A two-level architecture handles both cases.
1131
+
1132
+ #### The Problem: Broken Input Propagation
1133
+
1134
+ Consider a pipeline: `Summarize → Serialize → Validate`
1135
+
1136
+ ```text
1137
+ Summarize → Brief (with invented IDs) → Serialize → Validate → Feedback
1138
+ ↑ ↓
1139
+ └──────── Brief stays the same! ───────────────┘
1140
+ ```
1141
+
1142
+ If the summarize step invents an ID (`archive_access` instead of valid `diary_truth`), the serialize step will use it because it's in the brief. Validation rejects it. The inner repair loop retries serialize with the same broken brief → **0% correction rate**.
1143
+
1144
+ #### Solution: Nested Loops
1145
+
1146
+ ```text
1147
+ ┌─────────────────────────────────────────────────────────────────┐
1148
+ │ OUTER LOOP (max 2) │
1149
+ │ When SEMANTIC validation fails → repair the SOURCE: │
1150
+ │ - Original input + validation errors │
1151
+ │ - Valid references list │
1152
+ │ - Fuzzy replacement suggestions │
1153
+ └─────────────────────────────────────────────────────────────────┘
1154
+ ↓ ↑
1155
+ ┌───────────┐ ┌──────────────┐
1156
+ │ SOURCE │ │ SEMANTIC │
1157
+ │ (brief) │ │ VALIDATION │
1158
+ └───────────┘ └──────────────┘
1159
+ ↓ ↑
1160
+ ┌─────────────────────────────────────────────────────────────────┐
1161
+ │ INNER LOOP (max 3) │
1162
+ │ Handles schema/format errors only │
1163
+ │ (Pydantic failures, JSON syntax, missing fields) │
1164
+ └─────────────────────────────────────────────────────────────────┘
1165
+ ```
1166
+
1167
+ **Inner loop** (fast, cheap): Schema errors, type mismatches, missing required fields. These can be fixed by repairing the serialized output directly.
1168
+
1169
+ **Outer loop** (expensive, rare): Semantic errors—invalid references, invented IDs, impossible states. These require repairing the *source* that caused the problem.
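+
+ A control-flow sketch of the two loops; `serialize_with_inner_repair`, `validate_semantics`, and `repair_source` are assumed stage functions:
+
+ ```python
+ def run_stage(brief: str, manifest: set[str], outer_max: int = 2) -> dict:
+     for outer in range(outer_max + 1):
+         # Inner loop: schema/format repair happens inside this call (max 3 tries).
+         artifact = serialize_with_inner_repair(brief, max_attempts=3)
+         errors = validate_semantics(artifact, manifest)
+         if not errors:
+             return artifact
+         if outer == outer_max:
+             break
+         # Outer loop: repair the SOURCE, not the serialized output.
+         brief = repair_source(brief, errors, valid_refs=sorted(manifest))
+     raise RuntimeError(f"Semantic validation still failing after {outer_max} source repairs: {errors}")
+ ```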
1170
+
+ #### When to Use Each Loop
+
+ | Error Type | Loop | Example |
+ |------------|------|---------|
+ | JSON syntax error | Inner | Missing comma, unclosed brace |
+ | Missing required field | Inner | `protagonist_name` not provided |
+ | Invalid field value | Inner | `estimated_passages: 15` when max is 10 |
+ | Unknown field | Inner | `passages` instead of `estimated_passages` |
+ | Invalid reference ID | **Outer** | `thread: "archive_access"` when ID doesn't exist |
+ | Semantic inconsistency | **Outer** | Character referenced before introduction |
+ | Hallucinated entity | **Outer** | Entity name invented, not from source data |
+
+ #### Fuzzy ID Replacement Suggestions
+
+ When semantic validation finds invalid IDs, generate replacement suggestions using fuzzy matching:
+
+ ```markdown
+ ### Error: Invalid Thread ID
+ - Location: initial_beats.5.threads
+ - You used: `archive_access`
+ - VALID OPTIONS: `butler_fidelity` | `diary_truth` | `host_motive`
+ - SUGGESTED: `diary_truth` (closest match to "archive")
+
+ ### Error: Unknown Entity
+ - Location: scene.3.characters
+ - You used: `mysterious_stranger`
+ - VALID OPTIONS: `butler_jameson` | `guest_clara` | `detective_morse`
+ - SUGGESTED: Remove this reference (no close match)
+ ```
+
+ This gives the model actionable guidance rather than just rejection.
+
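+ A minimal sketch of producing the SUGGESTED line with the standard library's `difflib`; the 0.4 cutoff is an arbitrary illustration, and purely string-based matching will not always find the semantically closest ID:
+
+ ```python
+ from difflib import get_close_matches
+
+ def suggest_replacement(invalid_id: str, valid_ids: list[str]) -> str:
+     """Build the SUGGESTED line for one invalid reference ID."""
+     matches = get_close_matches(invalid_id, valid_ids, n=1, cutoff=0.4)
+     if matches:
+         return f'- SUGGESTED: `{matches[0]}` (closest match to "{invalid_id}")'
+     return "- SUGGESTED: Remove this reference (no close match)"
+ ```
+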
+ #### Source Repair Prompt Pattern
+
+ When the outer loop triggers, the repair prompt should include:
+
+ 1. **Original source** (the brief/summary being repaired)
+ 2. **Validation errors** (what went wrong downstream)
+ 3. **Valid references** (complete list of allowed IDs)
+ 4. **Fuzzy suggestions** (what to replace invalid IDs with)
+ 5. **Full context** (original input data the source was derived from)
+
+ ```markdown
+ ## Repair Required
+
+ Your brief contained invalid references that caused downstream failures.
+
+ ### Original Brief
+ [brief content here]
+
+ ### Validation Errors
+ 1. `archive_access` is not a valid thread ID
+ 2. `clock_distortion` is not a valid thread ID
+
+ ### Valid Thread IDs
+ - butler_fidelity
+ - diary_truth
+ - host_motive
+
+ ### Suggested Replacements
+ - `archive_access` → `diary_truth` (both relate to hidden information)
+ - `clock_distortion` → REMOVE (no matching concept)
+
+ ### Original Discussion (for context)
+ [full source material the brief was derived from]
+
+ Produce a corrected brief that uses only valid IDs.
+ ```
+
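+ A minimal sketch of assembling that prompt from the five ingredients; the argument names are illustrative and the section headers mirror the template above:
+
+ ```python
+ def build_repair_prompt(brief: str, errors: list[str], valid_ids: list[str],
+                         suggestions: list[str], discussion: str) -> str:
+     """Assemble the outer-loop source-repair prompt."""
+     error_lines = "\n".join(f"{i}. {e}" for i, e in enumerate(errors, 1))
+     id_lines = "\n".join(f"- {v}" for v in valid_ids)
+     return "\n\n".join([
+         "## Repair Required",
+         "Your brief contained invalid references that caused downstream failures.",
+         "### Original Brief\n" + brief,
+         "### Validation Errors\n" + error_lines,
+         "### Valid Thread IDs\n" + id_lines,
+         "### Suggested Replacements\n" + "\n".join(suggestions),
+         "### Original Discussion (for context)\n" + discussion,
+         "Produce a corrected brief that uses only valid IDs.",
+     ])
+ ```
+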
+ #### Budget and Applicability
+
+ | Stage Type | Needs Outer Loop? | Reason |
+ |------------|-------------------|--------|
+ | Generation (creates new IDs) | No | Creates IDs, doesn't reference them |
+ | Summarization | **Yes** | May invent or misremember IDs |
+ | Serialization (uses existing IDs) | **Yes** | References IDs from earlier stages |
+ | Expansion (adds detail) | Maybe | References scene/entity IDs |
+
+ **Total budget:** the outer loop is capped at 2 iterations and the inner loop at 3, which keeps the worst case to no more than 12 LLM calls per stage.
+
+ **Success criteria:**
+
+ - >80% correction rate on first outer loop iteration
+ - Clear error messages guide model to correct IDs
+ - Fuzzy matching reduces guesswork
+
+ ---
+
+ ## Prompt-History Conflicts
+
+ When the system prompt says "MUST do X first" but the conversation history shows the model has already done Y, the model is left unsure whether the rule still applies.
+
+ **Problem:**
+
+ ```text
+ System: "You MUST call consult_playbook before any delegation."
+ History: [delegate(...) was called successfully]
+ Model: "But I already delegated... should I undo it?"
+ ```
+
+ **Solutions:**
+
+ 1. **Use present-tense rules**: "Call consult_playbook before delegating" not "MUST call first"
+ 2. **Acknowledge state**: "If you haven't yet consulted the playbook, do so now"
+ 3. **Avoid absolute language** when state may vary
+
+ ---
+
+ ## Chain-of-Thought (CoT)
+
+ For complex logical tasks, forcing the model to articulate its reasoning *before* acting significantly reduces hallucination and logic errors.
+
+ ### The Problem
+
+ Zero-shot tool calling often fails on multi-step problems because the model commits to an action before fully processing constraints.
+
+ ### Implementation
+
+ Require explicit reasoning steps:
+
+ - **Structure**: `<thought>Analysis...</thought>` followed by tool call
+ - **Tooling**: Add a mandatory `reasoning` parameter to critical tools (see the sketch after this list)
+ - **Benefits**: +40-50% improvement on complex reasoning benchmarks
+
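+ A minimal sketch of that mandatory `reasoning` parameter, expressed as a generic JSON-schema tool definition; the tool name and fields are illustrative, not from any specific provider SDK:
+
+ ```python
+ # Requiring "reasoning" forces the model to state its analysis before the
+ # action is accepted; the executor can log or even validate this field.
+ delegate_tool = {
+     "name": "delegate_scene_writing",
+     "description": "Delegate one scene to a writer agent.",
+     "parameters": {
+         "type": "object",
+         "properties": {
+             "reasoning": {
+                 "type": "string",
+                 "description": "Why this scene, this writer, and this order; constraints checked.",
+             },
+             "scene_id": {"type": "string"},
+             "writer": {"type": "string"},
+         },
+         "required": ["reasoning", "scene_id", "writer"],
+     },
+ }
+ ```
+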
+ ### When to Use
+
+ - Multi-step planning decisions
+ - Constraint satisfaction problems
+ - Quality assessments with multiple criteria
+ - Decisions with long-term consequences
+
+ ---
+
+ ## Dynamic Few-Shot Prompting
+
+ Static example lists consume tokens and may not match the current task.
+
+ ### The Pattern
+
+ Use retrieval to inject context-aware examples (a sketch follows the list):
+
+ 1. **Store** a library of high-quality examples as vectors
+ 2. **Query** using the current task description
+ 3. **Inject** top 3-5 most relevant examples into the prompt
+
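+ A minimal in-memory sketch of the store → query → inject cycle. The `embed` function stands in for whatever embedding model is available, and nothing here assumes a particular vector database:
+
+ ```python
+ import numpy as np
+
+ def embed(text: str) -> np.ndarray:
+     """Stand-in for a real embedding model call."""
+     raise NotImplementedError
+
+ class ExampleLibrary:
+     def __init__(self) -> None:
+         self._items: list[tuple[np.ndarray, str]] = []
+
+     def store(self, example: str) -> None:
+         self._items.append((embed(example), example))
+
+     def top_k(self, task: str, k: int = 3) -> list[str]:
+         """Return the k stored examples most similar to the task description."""
+         q = embed(task)
+         def score(item):
+             vec, _ = item
+             return float(np.dot(vec, q) / (np.linalg.norm(vec) * np.linalg.norm(q)))
+         return [text for _, text in sorted(self._items, key=score, reverse=True)[:k]]
+
+ def inject_examples(base_prompt: str, library: ExampleLibrary, task: str) -> str:
+     examples = "\n\n".join(library.top_k(task))
+     return f"{base_prompt}\n\n## Relevant Examples\n\n{examples}"
+ ```
+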
+ ### Benefits
+
+ - Smaller prompts (no static example bloat)
+ - More relevant examples for each task
+ - Examples improve as library grows
+
+ ### When to Use
+
+ - Tasks requiring stylistic consistency
+ - Complex tool usage patterns
+ - Domain-specific formats
+
+ ---
+
+ ## Reflection and Self-Correction
+
+ Models perform significantly better when asked to critique their own work before finalizing.
+
+ ### The Pattern
+
+ Implement a "Draft-Critique-Refine" loop (a sketch follows the list):
+
+ 1. **Draft**: Generate preliminary plan or content
+ 2. **Critique**: Evaluate against constraints
+ 3. **Refine**: Generate final output based on critique
+
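+ A minimal two-turn sketch, assuming a hypothetical `llm(prompt)` callable that returns text:
+
+ ```python
+ def draft_critique_refine(llm, task: str, constraints: str) -> str:
+     """Draft, critique against explicit constraints, then refine."""
+     draft = llm(f"Draft a response to this task.\n\nTask:\n{task}")
+     critique = llm(
+         "Critique the draft strictly against the constraints. "
+         "List every violation; do not rewrite yet.\n\n"
+         f"Constraints:\n{constraints}\n\nDraft:\n{draft}"
+     )
+     return llm(
+         "Produce the final version, fixing every issue raised in the critique.\n\n"
+         f"Constraints:\n{constraints}\n\nDraft:\n{draft}\n\nCritique:\n{critique}"
+     )
+ ```
+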
+ ### Implementation Options
+
+ - **Two-turn**: Separate critique and refinement turns
+ - **Single-turn**: Internal thought step for capable models
+ - **Validator pattern**: Separate agent reviews work
+
+ ### When to Use
+
+ - High-stakes actions (modifying persistent state, finalizing content)
+ - Complex constraint satisfaction
+ - Quality-critical outputs
+
+ ---
+
+ ## Active Context Pruning
+
+ Long-running sessions suffer from "Context Rot"—old, irrelevant details confuse the model even within token limits.
+
+ ### The Problem
+
+ Context is often treated as an append-only log. But stale context:
+
+ - Dilutes attention from current task
+ - May contain outdated assumptions
+ - Wastes token budget
+
+ ### Strategies
+
+ **Semantic Chunking:**
+
+ Group history by episodes or tasks, not just turns.
+
+ **Active Forgetting:**
+
+ When a task completes, summarize it to a high-level outcome and **remove** the raw turns (see the sketch at the end of this section).
+
+ **State-over-History:**
+
+ Prefer providing current *state* (artifacts, flags) over the *history* of how that state was reached.
+
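+ A minimal sketch of active forgetting, assuming each turn is a dict tagged with a `task_id` and a hypothetical `summarize(turns)` helper; the record shape is illustrative:
+
+ ```python
+ def forget_completed_task(history: list[dict], task_id: str, summarize) -> list[dict]:
+     """Replace a finished task's raw turns with a single summary entry."""
+     task_turns = [t for t in history if t.get("task_id") == task_id]
+     kept = [t for t in history if t.get("task_id") != task_id]
+     summary_turn = {
+         "role": "system",
+         "task_id": task_id,
+         "content": f"[Task {task_id} complete] {summarize(task_turns)}",
+     }
+     return kept + [summary_turn]
+ ```
+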
+ ---
+
+ ## Testing Agent Prompts
+
+ ### Test with Target Models
+
+ Before deploying:
+
+ 1. Test with smallest target model
+ 2. Verify first-turn tool calls work
+ 3. Check for unexpected prose generation
+ 4. Measure token count of system prompt
+
+ ### Metrics to Track
+
+ | Metric | What It Measures |
+ |--------|------------------|
+ | Tool compliance rate | % of turns with correct tool calls |
+ | First-turn success | Does the model call a tool on turn 1? |
+ | Prose leakage | Does a coordinator generate content? |
+ | Instruction following | Are critical constraints obeyed? |
+
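+ A minimal sketch of computing the first two metrics from a transcript, assuming each turn record carries `tool_called` and `tool_correct` flags; the record shape is illustrative:
+
+ ```python
+ def tool_compliance_rate(turns: list[dict]) -> float:
+     """Fraction of turns in which the model made a correct tool call."""
+     if not turns:
+         return 0.0
+     return sum(1 for t in turns if t.get("tool_correct")) / len(turns)
+
+ def first_turn_success(turns: list[dict]) -> bool:
+     """Did the model call a tool on turn 1?"""
+     return bool(turns) and bool(turns[0].get("tool_called"))
+ ```
+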
+ ---
+
+ ## Provider-Specific Optimizations
+
+ - **Anthropic**: Use `token-efficient-tools` beta header for up to 70% output token reduction; temperature capped at 1.0
+ - **OpenAI**: Consider fine-tuning for frequently-used patterns; temperature range 0.0–2.0
+ - **Gemini**: Temperature range 0.0–2.0, similar behavior to OpenAI
+ - **Ollama/Local**: Tool retrieval essential—small models struggle with 10+ tools; default temperature varies by model (typically 0.7–0.8)
+
+ See [Sampling Parameters](#sampling-parameters) for detailed temperature guidance by use case.
+
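+ Because providers interpret temperature differently, one option is to express intent as an abstract creativity level and map it per provider (the "semantic creativity levels" entry in the Quick Reference below). A minimal sketch; the numeric values are illustrative defaults, not provider recommendations:
+
+ ```python
+ # Illustrative mapping from abstract creativity intent to provider temperature.
+ CREATIVITY_TEMPERATURE: dict[str, dict[str, float]] = {
+     "anthropic": {"precise": 0.1, "balanced": 0.5, "creative": 1.0},  # capped at 1.0
+     "openai":    {"precise": 0.1, "balanced": 0.7, "creative": 1.2},  # range 0.0-2.0
+     "gemini":    {"precise": 0.1, "balanced": 0.7, "creative": 1.2},  # range 0.0-2.0
+     "ollama":    {"precise": 0.2, "balanced": 0.7, "creative": 1.0},  # model-dependent
+ }
+
+ def temperature_for(provider: str, level: str = "balanced") -> float:
+     return CREATIVITY_TEMPERATURE[provider][level]
+ ```
+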
+ ---
+
+ ## Quick Reference
+
+ | Pattern | Problem It Solves | Key Technique |
+ |---------|-------------------|---------------|
+ | Sandwich | Lost in the middle | Repeat critical instructions at start AND end |
+ | Tool filtering | Small model tool overload | Limit tools by model class |
+ | Two-stage selection | Large tool libraries | Menu → select → load |
+ | Concise descriptions | Schema token overhead | 1-2 sentences, details in knowledge |
+ | Neutral descriptions | Tool preference bias | Descriptive not prescriptive |
+ | Menu + consult | Context explosion | Summaries in prompt, retrieve on demand |
+ | Concise content | Small model budgets | Dual-length summaries |
+ | CoT | Complex reasoning failures | Require reasoning before action |
+ | Dynamic few-shot | Static example bloat | Retrieve relevant examples |
+ | Reflection | Quality failures | Draft → critique → refine |
+ | Context pruning | Context rot | Summarize and remove stale turns |
+ | Structured feedback | Vague validation errors | Categorize issues (invalid/missing/unknown) |
+ | Phase-specific temperature | Format errors in structured output | High temp for discuss, low for serialize |
+ | Numbered lists | Checkbox skipping | Use 1. 2. 3. format, not checkboxes |
+ | Quantity anchoring | Incomplete list processing | State exact count at start AND end |
+ | Anti-skipping statements | Middle items ignored | Explicit "process all N" constraints |
+ | Two-level validation | Broken input propagation | Outer loop repairs source, inner repairs output |
+ | Manifest-first | Extraction failures, missing items | Provide complete ID list before generation |
+ | Three gates | Item omission in generation | Prevention → Enforcement → Validation |
+ | Error classification | Wasted retry computation | Route INNER/SEMANTIC/COMPLETENESS/FATAL separately |
+ | Scoped identifiers | ID confusion across types | `type::raw_id` format (entity::hero) |
+ | Semantic creativity levels | Cross-provider temperature differences | Abstract intent, map to provider values |
+ | Verbalized sampling | LLM output homogeneity | Generate multiple + sample from tails |
+
+ | Model Class | Max Prompt | Max Tools | Strategy |
+ |-------------|------------|-----------|----------|
+ | Small (≤8B) | 2,000 tokens | 6-8 | Aggressive filtering, concise content |
+ | Medium (9B-70B) | 6,000 tokens | 12 | Selective filtering, menu+consult |
+ | Large (70B+) | 12,000 tokens | 15+ | Full content where beneficial |
+
+ ---
+
+ ## Research Basis
+
+ | Source | Key Finding |
+ |--------|-------------|
+ | Stanford "Lost in the Middle" | U-shaped attention curve; middle content ignored |
+ | "Less is More" (2024) | Tool count inversely correlates with compliance |
+ | RAG-MCP (2025) | Two-stage selection reduces tokens 50%+, improves accuracy 3x |
+ | Anthropic Token-Efficient Tools | Schema optimization reduces output tokens 70% |
+ | Reflexion research | Self-correction improves quality on complex tasks |
+ | STROT Framework (2025) | Structured feedback loops achieve 95% first-attempt success |
+ | AWS Evaluator-Optimizer | Semantic reflection enables self-improving validation |
+ | LLM Self-Verification Limitations (2024) | LLMs cannot reliably self-verify; external validation required |
+ | Spotify Verification Loops (2025) | Inner/outer loop architecture; deterministic + semantic validation |
+ | LLMLOOP (ICSME 2025) | First feedback iteration has highest impact (up to 24% improvement) |
+ | QuestFoundry #211 | Manifest-first serialization; extraction mindset fails on generation tasks |
+ | Renze & Guven (EMNLP 2024) | Temperature 0.0–1.0 has no significant impact on problem-solving performance |
+ | "Echoes in AI" (PNAS 2025) | LLM outputs contain repeated plot elements across generations (Sui Generis score) |
+ | "Creative Homogeneity" (2025) | LLM responses more similar to each other than human responses |
+ | Min-p Sampling (2024) | Adaptive truncation maintains coherence at high temperatures |
+ | Verbalized Sampling (CHATS-lab) | 2–3× diversity improvement via prompting technique |
+ | DeepSeek API Docs | Temperature mapping: API 1.0 → internal 0.3 |
+
+ ---
+
+ ## See Also
+
+ - [Agent Memory Architecture](agent_memory_architecture.md) — State-managed memory, checkpointers, history management
+ - [Branching Narrative Construction](../narrative-structure/branching_narrative_construction.md) — LLM generation strategies for narratives
+ - [Multi-Agent Patterns](multi_agent_patterns.md) — Team coordination and delegation