ifcraftcorpus 1.4.0__tar.gz → 1.5.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/PKG-INFO +1 -1
- ifcraftcorpus-1.5.0/corpus/agent-design/agent_memory_architecture.md +765 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/agent-design/agent_prompt_engineering.md +247 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/agent-design/multi_agent_patterns.md +1 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/pyproject.toml +1 -1
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/.gitignore +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/LICENSE +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/LICENSE-CONTENT +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/README.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/audience-and-access/accessibility_guidelines.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/audience-and-access/audience_targeting.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/audience-and-access/localization_considerations.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/audio_visual_integration.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/collaborative_if_writing.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/creative_workflow_pipeline.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/diegetic_design.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/idea_capture_and_hooks.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/if_platform_tools.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/player_analytics_metrics.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/quality_standards_if.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/research_and_verification.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/craft-foundations/testing_interactive_fiction.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/emotional-design/conflict_patterns.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/emotional-design/emotional_beats.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/game-design/mechanics_design_patterns.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/genre-conventions/children_and_ya_conventions.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/genre-conventions/fantasy_conventions.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/genre-conventions/historical_fiction.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/genre-conventions/horror_conventions.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/genre-conventions/mystery_conventions.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/genre-conventions/sci_fi_conventions.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/branching_narrative_construction.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/branching_narrative_craft.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/endings_patterns.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/episodic_serialized_if.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/nonlinear_structure.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/pacing_and_tension.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/romance_and_relationships.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/scene_structure_and_beats.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/narrative-structure/scene_transitions.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/character_voice.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/dialogue_craft.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/exposition_techniques.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/narrative_point_of_view.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/prose_patterns.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/subtext_and_implication.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/prose-and-language/voice_register_consistency.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/scope-and-planning/scope_and_length.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/world-and-setting/canon_management.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/world-and-setting/setting_as_character.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/world-and-setting/worldbuilding_patterns.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/__init__.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/cli.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/embeddings.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/index.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/logging_utils.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/mcp_server.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/parser.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/providers.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/py.typed +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/src/ifcraftcorpus/search.py +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/README.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/if_genre_consultant.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/if_platform_advisor.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/if_prose_writer.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/if_quality_reviewer.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/if_story_architect.md +0 -0
- {ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/subagents/if_world_curator.md +0 -0
{ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ifcraftcorpus
-Version: 1.4.0
+Version: 1.5.0
 Summary: Interactive fiction craft corpus with search library and MCP server
 Project-URL: Homepage, https://pvliesdonk.github.io/if-craft-corpus
 Project-URL: Repository, https://github.com/pvliesdonk/if-craft-corpus
ifcraftcorpus-1.5.0/corpus/agent-design/agent_memory_architecture.md

@@ -0,0 +1,765 @@
---
title: Agent Memory Architecture
summary: Framework-independent patterns for managing agent conversation history and long-term memory—why prompt stuffing fails, state-managed alternatives, memory types, and multi-agent sharing.
topics:
- memory-architecture
- conversation-history
- state-management
- checkpointers
- context-engineering
- multi-agent
- langgraph
- openai-agents
cluster: agent-design
---

# Agent Memory Architecture

Patterns for managing agent conversation history and long-term memory. This guide explains why manual prompt concatenation fails, how to use state-managed memory correctly, and how to share context between agents.

This document is framework-independent in principles but includes concrete examples for LangGraph and OpenAI Agents SDK.

---

## The Anti-Pattern: Manual Prompt Concatenation

When building agents, developers (and AI coding assistants) often default to manually concatenating conversation history into prompts. This is the most common mistake in agent development.

### What It Looks Like

**Anti-pattern: Naive history concatenation**

```python
# DON'T DO THIS
class NaiveAgent:
    def __init__(self, model):
        self.model = model
        self.history = []  # Manual history list

    def chat(self, user_message: str) -> str:
        self.history.append({"role": "user", "content": user_message})

        # Stuffing full history into every call
        response = self.model.chat(
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                *self.history  # Growing unboundedly
            ]
        )

        self.history.append({"role": "assistant", "content": response})
        return response
```

**Problems:**

1. **No persistence**: History lost on restart
2. **Unbounded growth**: Eventually exceeds context window
3. **No thread isolation**: Can't run multiple conversations
4. **Attention degradation**: Middle content gets ignored
5. **Token waste**: Paying for stale context every call

**Anti-pattern: String concatenation**

```python
# DON'T DO THIS
def build_prompt(history: list[dict], new_message: str) -> str:
    history_text = "\n".join([
        f"{msg['role']}: {msg['content']}"
        for msg in history
    ])

    return f"""Previous conversation:
{history_text}

User: {new_message}
Assistant:"""
```

**Problems:**

1. **Format fragility**: Role formatting can confuse the model
2. **No structure**: Loses message boundaries
3. **Injection risk**: History content can break prompt structure
4. **No tool call preservation**: Loses function call context

### Why AI Coding Assistants Default to This

Training data contains many examples of this pattern because:

- It's the simplest implementation
- It works for demos and tutorials
- Framework-specific patterns require API knowledge
- Most code examples don't show production patterns

This is why you have to repeatedly explain you want proper memory management.

### Why It Fails: The Evidence

**"Lost in the Middle" Research (Liu et al., 2023)**

LLMs exhibit a U-shaped attention curve—content at the start and end of context receives attention, middle content is systematically ignored. Stuffing history into the middle of a prompt means important context gets lost.

**The 75% Rule (Claude Code, Anthropic)**

When Claude Code operated above 90% context utilization, output quality degraded significantly. Implementing auto-compaction at 75% produced dramatic quality improvements. The lesson: **capacity ≠ capability**. Empty headroom enables reasoning, not just retrieval.

**Context Rot**

Old, irrelevant details don't just waste tokens—they actively confuse the model. A discussion about error handling from 50 turns ago can distract from the current task, even if technically within the context window.

---

## The Correct Model: State-Managed Memory

Memory should be **first-class state**, not prompt injection. The framework handles storage, retrieval, trimming, and injection—your code focuses on logic.

### Core Principles

**1. Separation of Concerns**

| Concern | Responsibility | Your Code |
|---------|----------------|-----------|
| Storage | Persist messages to durable store | Configure checkpointer |
| Retrieval | Load relevant history for thread | Provide thread_id |
| Trimming | Keep context within limits | Set thresholds |
| Injection | Add history to model calls | Automatic |

**2. Thread Isolation**

Each conversation gets a unique `thread_id`. The framework maintains separate history per thread, enabling concurrent conversations without interference.

**3. Resumability**

Conversations can be paused and resumed—even across process restarts. The checkpointer persists state to durable storage.

**4. Automatic Management**

You don't manually append messages or manage context length. The framework handles this based on configuration.

### LangGraph: Checkpointer Pattern

```python
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver
from langgraph.graph import StateGraph, MessagesState

# Development: in-memory
checkpointer = InMemorySaver()

# Production: persistent storage
# checkpointer = SqliteSaver.from_conn_string("conversations.db")

# Define your graph
builder = StateGraph(MessagesState)
builder.add_node("agent", call_model)
builder.add_edge("__start__", "agent")

# Compile WITH checkpointer
graph = builder.compile(checkpointer=checkpointer)

# Each conversation gets a thread_id
config = {"configurable": {"thread_id": "user-123-session-1"}}

# Framework handles history automatically
response = graph.invoke(
    {"messages": [{"role": "user", "content": "Hello!"}]},
    config
)

# Same thread_id = conversation continues
response = graph.invoke(
    {"messages": [{"role": "user", "content": "What did I just say?"}]},
    config  # Same config = same thread
)
```

**What the framework does:**

1. Before invoke: Loads existing messages for thread_id
2. Prepends history to new messages
3. Calls model with full context
4. After invoke: Persists new messages to checkpointer
5. Handles context limits based on configuration

### OpenAI Agents SDK: Session Pattern

```python
from agents import Agent, Runner
from agents.sessions import SQLiteSession

# Create persistent session storage
session = SQLiteSession("conversations.db")

agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant.",
    model="gpt-4o"
)

runner = Runner()

# Session handles history automatically
response = await runner.run(
    agent,
    "Hello!",
    session=session,
    session_id="user-123-session-1"
)

# Same session_id = conversation continues
response = await runner.run(
    agent,
    "What did I just say?",
    session=session,
    session_id="user-123-session-1"
)
```

**What the session does:**

1. Before run: Retrieves conversation history for session_id
2. Prepends history to input items
3. Executes agent with full context
4. After run: Stores new items (user input, responses, tool calls)
5. Handles continuity across runs

---

## Memory Types

Agent memory isn't monolithic. Different types serve different purposes and have different scopes.

### Short-Term Memory (Thread-Scoped)

**Scope**: Single conversation thread
**Purpose**: Maintain context within an ongoing session
**Lifetime**: Duration of conversation (or until explicitly cleared)

| Framework | Implementation |
|-----------|----------------|
| LangGraph | Checkpointer with `thread_id` |
| OpenAI SDK | Session with `session_id` |
| General | Thread-isolated message store |

**What belongs in short-term memory:**

- User messages and assistant responses
- Tool calls and results
- Reasoning traces (if using chain-of-thought)
- Current task state

### Long-Term Memory (Cross-Session)

**Scope**: Across multiple conversations
**Purpose**: Persist facts, preferences, learned patterns
**Lifetime**: Indefinite (or until explicitly deleted)

#### Structured Long-Term Memory

Facts, relationships, and decisions stored in queryable format.

```python
# LangGraph Store pattern
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Store user preference (persists across threads)
store.put(
    namespace=("users", "user-123", "preferences"),
    key="timezone",
    value={"timezone": "America/New_York", "updated": "2025-01-17"}
)

# Retrieve in any thread
prefs = store.get(("users", "user-123", "preferences"), "timezone")
```

#### Semantic Long-Term Memory

Embedding-based retrieval for finding relevant past context.

```python
# Conceptual pattern (framework-independent)
from your_vector_store import VectorStore

memory_store = VectorStore()

# Store interaction summary with embedding
memory_store.add(
    text="User prefers concise responses without code comments",
    metadata={"user_id": "user-123", "type": "preference"},
    embedding=embed("User prefers concise responses...")
)

# Retrieve relevant memories for new context
relevant = memory_store.search(
    query="How should I format code for this user?",
    filter={"user_id": "user-123"}
)
```

### Episodic Memory

**Scope**: Cross-session, timestamped
**Purpose**: Record past interactions for learning and audit
**Lifetime**: Configurable retention

```python
# Record interaction outcome
episodic_store.add({
    "timestamp": "2025-01-17T10:30:00Z",
    "user_id": "user-123",
    "thread_id": "session-456",
    "task": "debug authentication error",
    "outcome": "resolved",
    "approach": "checked token expiration, found clock skew",
    "user_feedback": "positive"
})

# Query past approaches for similar tasks
past_successes = episodic_store.query(
    task_type="debug authentication",
    outcome="resolved",
    user_id="user-123"
)
```

### Memory Layers Summary

| Layer | Scope | Storage | Retrieval | Example Use |
|-------|-------|---------|-----------|-------------|
| Short-term | Thread | Checkpointer/Session | By thread_id | Conversation context |
| Long-term (Structured) | User/Global | Key-value store | By namespace + key | User preferences |
| Long-term (Semantic) | User/Global | Vector store | By similarity | Relevant past context |
| Episodic | User/Global | Event log | By query + time | Past task outcomes |

---

## State-Over-History Principle

A key insight for efficient memory management: **prefer passing current state over full history**.

### The Problem with Full History

```python
# Anti-pattern: Passing full transcript to sub-agent
sub_agent_prompt = f"""
Here's the full conversation so far:
{format_messages(all_300_messages)}

Now help with: {current_task}
"""
```

**Problems:**

- Token explosion
- Attention dilution
- Irrelevant context pollution
- Latency increase

### State-Over-History Pattern

```python
# Better: Pass current state, not history
current_state = {
    "user_goal": "Build a REST API for user management",
    "completed_steps": ["schema design", "database setup"],
    "current_step": "implement CRUD endpoints",
    "decisions_made": {
        "database": "PostgreSQL",
        "framework": "FastAPI",
        "auth": "JWT tokens"
    },
    "open_questions": [],
    "artifacts": ["schema.sql", "models.py"]
}

sub_agent_prompt = f"""
Current project state:
{json.dumps(current_state, indent=2)}

Task: {current_task}
"""
```

**Benefits:**

- Minimal tokens
- Focused attention
- No stale context
- Faster inference

### What Belongs in State vs History

| State (Pass Forward) | History (Store, Don't Pass) |
|---------------------|------------------------------|
| Current goal | How goal was established |
| Decisions made | Discussion leading to decisions |
| Artifacts created | Iterations and revisions |
| Open questions | Resolved questions |
| Error context (if debugging) | Successful operations |

### Implementing State Extraction

```python
# LangGraph: Custom state schema
from typing import TypedDict, Annotated
from langgraph.graph import add_messages

class ProjectState(TypedDict):
    messages: Annotated[list, add_messages]  # Short-term (auto-managed)

    # Extracted state (you manage)
    current_goal: str
    decisions: dict
    artifacts: list[str]
    phase: str

# Update state after significant events
def extract_state(messages: list, current_state: ProjectState) -> ProjectState:
    """Extract/update state from recent messages."""
    # Use LLM or rules to identify:
    # - New decisions made
    # - Artifacts created
    # - Phase transitions
    return updated_state
```

---

## Managing History Growth

Even with proper memory architecture, history grows. You need strategies to keep it bounded.

### Strategy 1: Trimming

Keep only the last N turns, drop the rest.

**LangGraph: trim_messages**

```python
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import trim_messages

def trim_to_recent(messages: list) -> list:
    """Keep system message + last 10 messages."""
    return trim_messages(
        messages,
        max_tokens=4000,
        strategy="last",
        token_counter=len,  # Or use tiktoken
        include_system=True,
        allow_partial=False
    )

# Apply before model call
agent = create_react_agent(
    model,
    tools,
    state_modifier=trim_to_recent
)
```

**When to use trimming:**

- Short, transactional conversations
- Tasks where old context is truly irrelevant
- When latency is critical

**Anti-patterns with trimming:**

- Losing critical decisions from early in conversation
- Trimming mid-tool-call (orphaned tool results)
- Using for planning tasks that need long-range context

### Strategy 2: Summarization

Compress older messages into a synthetic summary.

**LangGraph: SummarizationMiddleware**

```python
from langchain.agents import create_agent, SummarizationMiddleware

agent = create_agent(
    model="gpt-4o",
    tools=tools,
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",  # Cheaper model for summarization
            trigger={"tokens": 4000},  # Trigger when context exceeds
            keep={"messages": 10}  # Keep last 10 verbatim
        )
    ]
)
```

**What summarization produces:**

```
[Summary of turns 1-50]:
- User requested help building a REST API
- Decided on FastAPI + PostgreSQL
- Completed: schema design, database models
- Current focus: authentication implementation
- User prefers concise code without excessive comments

[Recent messages 51-60 kept verbatim]
```

**When to use summarization:**

- Long-running planning conversations
- Support threads spanning multiple issues
- Tasks requiring long-range continuity

**Anti-patterns with summarization:**

- **Summary drift**: Facts get reinterpreted incorrectly
- **Context poisoning**: Errors in summary propagate indefinitely
- **Over-compression**: Losing critical details
- **Summarizing too frequently**: Latency overhead

### Strategy 3: Hybrid (Recommended)

Combine summarization for old context + trimming for recent.

```python
class HybridMemoryConfig:
    # Summarize when total exceeds this
    summarize_threshold_tokens: int = 8000

    # Keep this many recent messages verbatim
    keep_recent_messages: int = 20

    # Maximum summary length
    max_summary_tokens: int = 500

    # Model for summarization (use cheaper model)
    summary_model: str = "gpt-4o-mini"
```

**Flow:**

1. Check total token count
2. If under threshold: no action
3. If over threshold:
   - Keep last N messages verbatim
   - Summarize older messages
   - Replace older messages with summary
   - Continue with bounded context

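A minimal sketch of this flow, building on the `HybridMemoryConfig` above. The `count_tokens` and `summarize` callables are assumptions (stand-ins for your own token counter and summarization call), not framework APIs:

```python
from typing import Callable

# Sketch of the hybrid flow described above; helper callables are supplied
# by the caller rather than by any framework.
def hybrid_compact(
    messages: list[dict],
    config: HybridMemoryConfig,
    count_tokens: Callable[[list[dict]], int],
    summarize: Callable[[list[dict], str, int], str],
) -> list[dict]:
    # Steps 1-2: check total token count; under threshold means no action
    if count_tokens(messages) <= config.summarize_threshold_tokens:
        return messages

    # Step 3a: keep the last N messages verbatim
    recent = messages[-config.keep_recent_messages:]
    older = messages[:-config.keep_recent_messages]

    # Step 3b: summarize older messages with the cheaper model
    summary = summarize(older, config.summary_model, config.max_summary_tokens)

    # Step 3c: replace older messages with the summary; continue with bounded context
    summary_message = {"role": "system",
                       "content": f"Summary of earlier conversation:\n{summary}"}
    return [summary_message, *recent]
```
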
---

## Multi-Agent Memory Sharing

When multiple agents collaborate, memory sharing becomes critical.

### Pattern 1: Shared State Object

Agents read from and write to a common state.

```python
# LangGraph: Shared state across nodes
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, add_messages

class SharedState(TypedDict):
    messages: Annotated[list, add_messages]

    # Shared across all agents
    research_findings: list[str]
    draft_content: str
    review_feedback: list[str]
    final_output: str

def researcher(state: SharedState) -> SharedState:
    """Research agent adds findings to shared state."""
    findings = do_research(state["messages"][-1])
    return {"research_findings": state["research_findings"] + findings}

def writer(state: SharedState) -> SharedState:
    """Writer agent reads research, produces draft."""
    draft = write_draft(state["research_findings"])
    return {"draft_content": draft}

def reviewer(state: SharedState) -> SharedState:
    """Reviewer reads draft, adds feedback."""
    feedback = review(state["draft_content"])
    return {"review_feedback": feedback}

# Wire agents together
graph = StateGraph(SharedState)
graph.add_node("researcher", researcher)
graph.add_node("writer", writer)
graph.add_node("reviewer", reviewer)
```

### Pattern 2: Artifact Passing (Not Transcript Passing)

**Anti-pattern: Context telephone**

```python
# DON'T DO THIS
def orchestrator_delegates_to_specialist(conversation_history):
    # Passing full history degrades information
    specialist_result = specialist.run(
        f"Here's the conversation:\n{conversation_history}\n\nDo task X"
    )
    return specialist_result
```

**Problems:**

- Information degrades through each handoff
- Irrelevant context pollutes specialist focus
- Token waste compounds at each level

**Better: Pass artifacts and state**

```python
# DO THIS
def orchestrator_delegates_to_specialist(task_state):
    # Pass only what specialist needs
    specialist_result = specialist.run(
        task_description=task_state["current_task"],
        input_artifacts=task_state["relevant_artifacts"],
        constraints=task_state["constraints"],
        # NOT the full conversation history
    )
    return specialist_result
```

### Pattern 3: Memory Isolation vs Sharing

| Scenario | Memory Strategy |
|----------|-----------------|
| Agents working on same task | Shared state object |
| Agents with different domains | Isolated memory, share artifacts |
| Parallel independent tasks | Fully isolated threads |
| Validator reviewing creator's work | Read-only access to creator's output |

**LangGraph: Isolated sub-agents**

```python
# Each specialist gets its own thread
def delegate_to_specialist(state, specialist_graph, task):
    # Create isolated thread for specialist
    specialist_thread_id = f"{state['thread_id']}-{specialist_graph.name}-{uuid4()}"

    result = specialist_graph.invoke(
        {"messages": [{"role": "user", "content": task}]},
        {"configurable": {"thread_id": specialist_thread_id}}
    )

    # Return only the result, not specialist's internal history
    return result["final_output"]
```

### Pattern 4: Namespace-Based Sharing

For long-term memory that should be shared across agents:

```python
# Shared user preferences (all agents can read)
user_namespace = ("users", user_id, "preferences")

# Agent-specific learned patterns (isolated)
agent_namespace = ("agents", agent_id, "patterns")

# Project-specific context (shared within project)
project_namespace = ("projects", project_id, "context")
```

---

## The 75% Rule

Never fill context to capacity. Reserve headroom for reasoning.

### Why Headroom Matters

| Context Usage | Effect |
|---------------|--------|
| < 50% | Optimal reasoning space |
| 50-75% | Good balance |
| 75-90% | Degraded quality, trigger compaction |
| > 90% | Significant quality loss |

### Implementation

```python
def should_compact(messages: list, model_context_limit: int) -> bool:
    """Check if context needs compaction."""
    current_tokens = count_tokens(messages)
    threshold = model_context_limit * 0.75
    return current_tokens > threshold

def auto_compact_middleware(state: AgentState) -> AgentState:
    """Middleware that triggers compaction at 75%."""
    if should_compact(state["messages"], MODEL_CONTEXT_LIMIT):
        state["messages"] = summarize_and_trim(state["messages"])
    return state
```

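The `count_tokens` helper above is left abstract. One possible implementation, assuming the `tiktoken` tokenizer as an approximation of the model's own tokenizer (exact counts also depend on message framing overhead):

```python
import tiktoken

# Approximate token counting for chat messages; the per-message overhead
# constant is a rough assumption, not an exact figure for any provider.
_ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(messages: list[dict]) -> int:
    total = 0
    for msg in messages:
        total += len(_ENCODING.encode(msg.get("role", "")))
        total += len(_ENCODING.encode(str(msg.get("content", ""))))
        total += 4  # rough per-message framing overhead
    return total
```
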
---

## Implementation Checklist

When building agents, verify:

- [ ] **No manual history concatenation** in prompt building
- [ ] **Checkpointer/Session configured** for conversation persistence
- [ ] **Thread IDs assigned** for conversation isolation
- [ ] **Trimming or summarization** configured for long conversations
- [ ] **State-over-history** for sub-agent delegation
- [ ] **Artifacts passed**, not transcripts, between agents
- [ ] **75% threshold** for context compaction
- [ ] **Long-term memory** separated from short-term (if needed)

---

## Quick Reference

### Pattern Selection

| Situation | Pattern | Framework Feature |
|-----------|---------|-------------------|
| Basic conversation persistence | Checkpointer/Session | LangGraph: `InMemorySaver`, OpenAI: `SQLiteSession` |
| Long conversations | Summarization middleware | LangGraph: `SummarizationMiddleware` |
| Multi-agent shared context | Shared state schema | LangGraph: `StateGraph` with shared `TypedDict` |
| Cross-session user data | Long-term store | LangGraph: `InMemoryStore`, MongoDB Store |
| Semantic memory retrieval | Vector store integration | External: Pinecone, Chroma, pgvector |

### Anti-Pattern Recognition

| If you see... | It's wrong because... | Replace with... |
|---------------|----------------------|-----------------|
| `history.append(msg)` | Manual management | Checkpointer |
| `prompt += history` | String concatenation | Session with auto-injection |
| Full transcript to sub-agent | Context telephone | Artifact/state passing |
| No thread_id | No isolation | Explicit thread management |
| No trimming/summarization | Unbounded growth | Memory middleware |

---

## Research Basis

| Source | Key Finding |
|--------|-------------|
| "Lost in the Middle" (Liu et al., 2023) | U-shaped attention; middle content ignored |
| Claude Code 75% Rule (Anthropic) | Quality degrades above 75% context usage |
| LangChain Short-Term Memory Guide | Checkpointer + summarization patterns |
| OpenAI Agents SDK Session Docs | Session-based auto-persistence |
| AWS Memory-Augmented Agents | Memory layer architecture patterns |
| A-Mem (2025) | Dynamic vs predefined memory access |

---

## See Also

- [Agent Prompt Engineering](agent_prompt_engineering.md) — Context architecture, active pruning, state-over-history principle
- [Multi-Agent Patterns](multi_agent_patterns.md) — Delegation, context passing, artifact handoffs

{ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/agent-design/agent_prompt_engineering.md

@@ -10,6 +10,9 @@ topics:
 - small-models
 - chain-of-thought
 - few-shot-learning
+- list-completeness
+- validation-loops
+- external-validation
 cluster: agent-design
 ---
 
@@ -70,6 +73,113 @@ Lower-priority content that can be retrieved on demand:

---

## List Completeness Patterns

When LLMs must process every item in a list (entity decisions, task completions, validation checklists), they frequently skip items—especially in the middle of long lists. This section describes patterns to ensure completeness.

### Numbered Lists vs Checkboxes

Numbered lists outperform checkboxes for sequential processing:

| Format | Behavior | Reliability |
|--------|----------|-------------|
| `- [ ] item` | Treated as optional; often reformatted creatively | Lower |
| `1. item` | Signals discrete task requiring attention | Higher |

**Why it works:** Numbered format implies a sequence of individual tasks. Combined with explicit counts, this creates accountability that checkbox format cannot.

**Example:**

Anti-pattern:

```text
- [ ] Decide on entity: butler_jameson
- [ ] Decide on entity: guest_clara
- [ ] Decide on entity: archive_room
```

Better:

```text
Entity Decisions (3 total):
1. butler_jameson — [your decision]
2. guest_clara — [your decision]
3. archive_room — [your decision]
```

### Quantity Anchoring

State exact counts at both start AND end of prompts (sandwich pattern for quantities):

```markdown
# REQUIREMENT: Exactly 21 Entity Decisions

[numbered list of 21 entities]

...

# REMINDER: 21 entity decisions required. You must provide a decision for all 21.
```

The explicit number creates a concrete, verifiable target. Vague instructions like "all items" or "every entity" are easier to satisfy incompletely.

### Anti-Skipping Statements

Direct statements about completeness requirements are effective, especially when combined with the sandwich pattern:

| Position | Example |
|----------|---------|
| Start | "You must process ALL 21 entities. Skipping any is not acceptable." |
| End | "Total: 21 entities. Confirm you provided a decision for every single one." |

These explicit constraints work because they:

- Create a falsifiable claim the model must satisfy
- Exploit primacy/recency attention patterns
- Provide a concrete metric (count) rather than vague completeness

### External Validation Required

**LLMs cannot reliably self-verify completeness mid-generation.**

Research shows that self-verification checklists embedded in prompts are frequently ignored or filled incorrectly. This is a fundamental limitation: LLMs operate via approximate retrieval, not logical verification.

**Anti-pattern:**

```markdown
Before submitting, verify:
- [ ] I processed all 21 entities
- [ ] No entity was skipped
- [ ] Each decision is justified
```

The model will often check these boxes without actually verifying.

**Better approach:**

```text
1. Generate output (entity decisions)
2. External code counts decisions: found 20, expected 21
3. Feedback: "Missing decision for entity 'guest_clara'. Provide decision."
4. Model repairs the specific gap
```

The "Validate → Feedback → Repair" loop (see below) must use **external logic**, not LLM self-assessment.

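A minimal sketch of such an external check, assuming decisions are emitted one per line in the numbered format shown above; the entity IDs and the regex are illustrative, not a fixed interface:

```python
import re

# External completeness check (code, not LLM self-assessment).
# Assumes each decision appears on a line like "3. guest_clara — accept".
def find_missing_decisions(output: str, expected_ids: list[str]) -> list[str]:
    """Return one feedback line per expected entity that has no decision."""
    feedback = []
    for entity_id in expected_ids:
        pattern = rf"\b{re.escape(entity_id)}\b\s*[-:—]"
        if not re.search(pattern, output):
            feedback.append(
                f"Missing decision for entity '{entity_id}'. Provide a decision."
            )
    return feedback

# Usage: count externally, feed back the specific gaps, let the model repair.
# missing = find_missing_decisions(model_output, expected_ids)
# if missing:
#     repair_prompt = "Your output was incomplete:\n" + "\n".join(missing)
```
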
### Combining Patterns

For maximum completeness on list-processing tasks:

1. Use **numbered lists** (not checkboxes)
2. State **exact count** at start and end (sandwich)
3. Include **anti-skipping statements** at start and end
4. Validate **externally** after generation
5. Provide **specific feedback** naming missing items

This combination addressed a real-world failure where gpt-4o-mini skipped 1 of 21 entities despite an embedded entity checklist.

---

## Tool Design

### Tool Count Effects
@@ -544,6 +654,135 @@ Validate → feedback → repair is a general pattern:

- Works for more informal artifacts (e.g., checklists, outlines) when combined with light-weight structural checks.
- Plays well with the structured-output patterns above and with the reflection/self-critique patterns below.

### Two-Level Feedback Architecture

Simple validation loops assume errors can be fixed by repairing the output. But some errors originate earlier in the pipeline—the output is wrong because the *input* was wrong. A two-level architecture handles both cases.

#### The Problem: Broken Input Propagation

Consider a pipeline: `Summarize → Serialize → Validate`

```text
Summarize → Brief (with invented IDs) → Serialize → Validate → Feedback
                 ↑                                                 ↓
                 └──────────── Brief stays the same! ──────────────┘
```

If the summarize step invents an ID (`archive_access` instead of valid `diary_truth`), the serialize step will use it because it's in the brief. Validation rejects it. The inner repair loop retries serialize with the same broken brief → **0% correction rate**.

#### Solution: Nested Loops

```text
┌────────────────────────────────────────────────────────────┐
│ OUTER LOOP (max 2)                                         │
│ When SEMANTIC validation fails → repair the SOURCE:        │
│   - Original input + validation errors                     │
│   - Valid references list                                  │
│   - Fuzzy replacement suggestions                          │
└────────────────────────────────────────────────────────────┘
        ↓                                            ↑
  ┌───────────┐                            ┌──────────────┐
  │  SOURCE   │                            │   SEMANTIC   │
  │  (brief)  │                            │  VALIDATION  │
  └───────────┘                            └──────────────┘
        ↓                                            ↑
┌────────────────────────────────────────────────────────────┐
│ INNER LOOP (max 3)                                         │
│ Handles schema/format errors only                          │
│ (Pydantic failures, JSON syntax, missing fields)           │
└────────────────────────────────────────────────────────────┘
```

**Inner loop** (fast, cheap): Schema errors, type mismatches, missing required fields. These can be fixed by repairing the serialized output directly.

**Outer loop** (expensive, rare): Semantic errors—invalid references, invented IDs, impossible states. These require repairing the *source* that caused the problem.

#### When to Use Each Loop

| Error Type | Loop | Example |
|------------|------|---------|
| JSON syntax error | Inner | Missing comma, unclosed brace |
| Missing required field | Inner | `protagonist_name` not provided |
| Invalid field value | Inner | `estimated_passages: 15` when max is 10 |
| Unknown field | Inner | `passages` instead of `estimated_passages` |
| Invalid reference ID | **Outer** | `thread: "archive_access"` when ID doesn't exist |
| Semantic inconsistency | **Outer** | Character referenced before introduction |
| Hallucinated entity | **Outer** | Entity name invented, not from source data |

#### Fuzzy ID Replacement Suggestions

When semantic validation finds invalid IDs, generate replacement suggestions using fuzzy matching:

```markdown
### Error: Invalid Thread ID
- Location: initial_beats.5.threads
- You used: `archive_access`
- VALID OPTIONS: `butler_fidelity` | `diary_truth` | `host_motive`
- SUGGESTED: `diary_truth` (closest match to "archive")

### Error: Unknown Entity
- Location: scene.3.characters
- You used: `mysterious_stranger`
- VALID OPTIONS: `butler_jameson` | `guest_clara` | `detective_morse`
- SUGGESTED: Remove this reference (no close match)
```

This gives the model actionable guidance rather than just rejection.

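A sketch of generating such suggestions with string-level fuzzy matching (Python's `difflib`); the cutoff and message wording are illustrative. String similarity covers near-miss spellings, while semantically related replacements like `archive_access` → `diary_truth` may need an embedding comparison instead:

```python
from difflib import get_close_matches

# Suggest a replacement for an invalid ID, or recommend removal.
def suggest_replacement(invalid_id: str, valid_ids: list[str]) -> str:
    matches = get_close_matches(invalid_id, valid_ids, n=1, cutoff=0.6)
    if matches:
        return f"- SUGGESTED: `{matches[0]}` (closest match to \"{invalid_id}\")"
    return "- SUGGESTED: Remove this reference (no close match)"

# A near-miss spelling gets a concrete suggestion:
# suggest_replacement("buttler_fidelity",
#                     ["butler_fidelity", "diary_truth", "host_motive"])
# -> "- SUGGESTED: `butler_fidelity` (closest match to \"buttler_fidelity\")"
```
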
#### Source Repair Prompt Pattern

When the outer loop triggers, the repair prompt should include:

1. **Original source** (the brief/summary being repaired)
2. **Validation errors** (what went wrong downstream)
3. **Valid references** (complete list of allowed IDs)
4. **Fuzzy suggestions** (what to replace invalid IDs with)
5. **Full context** (original input data the source was derived from)

```markdown
## Repair Required

Your brief contained invalid references that caused downstream failures.

### Original Brief
[brief content here]

### Validation Errors
1. `archive_access` is not a valid thread ID
2. `clock_distortion` is not a valid thread ID

### Valid Thread IDs
- butler_fidelity
- diary_truth
- host_motive

### Suggested Replacements
- `archive_access` → `diary_truth` (both relate to hidden information)
- `clock_distortion` → REMOVE (no matching concept)

### Original Discussion (for context)
[full source material the brief was derived from]

Produce a corrected brief that uses only valid IDs.
```

#### Budget and Applicability

| Stage Type | Needs Outer Loop? | Reason |
|------------|-------------------|--------|
| Generation (creates new IDs) | No | Creates IDs, doesn't reference them |
| Summarization | **Yes** | May invent or misremember IDs |
| Serialization (uses existing IDs) | **Yes** | References IDs from earlier stages |
| Expansion (adds detail) | Maybe | References scene/entity IDs |

**Total budget:** Outer loop max 2 iterations × Inner loop max 3 iterations = ≤12 LLM calls per stage worst case.

**Success criteria:**

- >80% correction rate on first outer loop iteration
- Clear error messages guide model to correct IDs
- Fuzzy matching reduces guesswork

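A sketch of how the two budgets might be wired together; every stage function here (serializer, validators, repairers) is a hypothetical stand-in for your own pipeline, not a library API:

```python
from typing import Callable

# Nested repair loops: inner loop fixes schema/format errors in the output,
# outer loop repairs the source when semantic validation fails.
def run_with_two_level_repair(
    source: str,
    original_input: str,
    serialize: Callable[[str], str],
    validate_schema: Callable[[str], list[str]],
    validate_semantics: Callable[[str], list[str]],
    repair_output: Callable[[str, list[str]], str],
    repair_source: Callable[[str, list[str], str], str],
    outer_budget: int = 2,
    inner_budget: int = 3,
) -> str:
    for _ in range(outer_budget):                # OUTER LOOP
        output = serialize(source)
        for _ in range(inner_budget):            # INNER LOOP: schema/format only
            errors = validate_schema(output)
            if not errors:
                break
            output = repair_output(output, errors)

        semantic_errors = validate_semantics(output)
        if not semantic_errors:
            return output
        # Semantic failure: repair the SOURCE, then rerun the inner cycle
        source = repair_source(source, semantic_errors, original_input)

    raise RuntimeError("Two-level repair budget exhausted")
```
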
---

## Prompt-History Conflicts
@@ -721,6 +960,10 @@ See [Sampling Parameters](#sampling-parameters) for detailed temperature guidance.
 | Context pruning | Context rot | Summarize and remove stale turns |
 | Structured feedback | Vague validation errors | Categorize issues (invalid/missing/unknown) |
 | Phase-specific temperature | Format errors in structured output | High temp for discuss, low for serialize |
+| Numbered lists | Checkbox skipping | Use 1. 2. 3. format, not checkboxes |
+| Quantity anchoring | Incomplete list processing | State exact count at start AND end |
+| Anti-skipping statements | Middle items ignored | Explicit "process all N" constraints |
+| Two-level validation | Broken input propagation | Outer loop repairs source, inner repairs output |
 
 | Model Class | Max Prompt | Max Tools | Strategy |
 |-------------|------------|-----------|----------|
@@ -741,10 +984,14 @@ See [Sampling Parameters](#sampling-parameters) for detailed temperature guidance.
 | Reflexion research | Self-correction improves quality on complex tasks |
 | STROT Framework (2025) | Structured feedback loops achieve 95% first-attempt success |
 | AWS Evaluator-Optimizer | Semantic reflection enables self-improving validation |
+| LLM Self-Verification Limitations (2024) | LLMs cannot reliably self-verify; external validation required |
+| Spotify Verification Loops (2025) | Inner/outer loop architecture; deterministic + semantic validation |
+| LLMLOOP (ICSME 2025) | First feedback iteration has highest impact (up to 24% improvement) |
 
 ---
 
 ## See Also
 
+- [Agent Memory Architecture](agent_memory_architecture.md) — State-managed memory, checkpointers, history management
 - [Branching Narrative Construction](../narrative-structure/branching_narrative_construction.md) — LLM generation strategies for narratives
 - [Multi-Agent Patterns](multi_agent_patterns.md) — Team coordination and delegation
{ifcraftcorpus-1.4.0 → ifcraftcorpus-1.5.0}/corpus/agent-design/multi_agent_patterns.md

@@ -492,5 +492,6 @@ This enables:
 
 ## See Also
 
+- [Agent Memory Architecture](agent_memory_architecture.md) — Memory sharing, state-over-history, context passing
 - [Agent Prompt Engineering](agent_prompt_engineering.md) — Prompt design for individual agents
 - [Branching Narrative Construction](../narrative-structure/branching_narrative_construction.md) — Decomposition strategies for complex generation