npm - @bastani/atomic - Versions diffs - 0.5.11-0 → 0.5.12-0 - Mend

@bastani/atomic 0.5.11-0 → 0.5.12-0

Files changed (506)

package/.agents/skills/context-compression/tests/test_compression_evaluator.py ADDED

@@ -0,0 +1,56 @@
+import importlib.util
+import unittest
+from pathlib import Path
+MODULE_PATH = (
+    Path(__file__).resolve().parents[1] / "scripts" / "compression_evaluator.py"
+)
+MODULE_SPEC = importlib.util.spec_from_file_location(
+    "compression_evaluator", MODULE_PATH
+)
+if MODULE_SPEC is None or MODULE_SPEC.loader is None:
+    raise RuntimeError(f"Unable to load compression_evaluator.py from {MODULE_PATH}")
+COMPRESSION_EVALUATOR = importlib.util.module_from_spec(MODULE_SPEC)
+MODULE_SPEC.loader.exec_module(COMPRESSION_EVALUATOR)
+class CompressionEvaluatorTests(unittest.TestCase):
+    def test_json_ground_truth_terms_score_when_response_mentions_artifacts(
+        self,
+    ) -> None:
+        evaluator = COMPRESSION_EVALUATOR.CompressionEvaluator()
+        rich_score = evaluator._heuristic_score(
+            {"id": "artifact_files_modified"},
+            "We modified src/app.py and updated README.md during the session.",
+            '[{"path": "src/app.py", "operation": "modified"}, {"path": "README.md", "operation": "updated"}]',
+        )
+        poor_score = evaluator._heuristic_score(
+            {"id": "artifact_files_modified"},
+            "We changed some files but I do not remember which ones.",
+            '[{"path": "src/app.py", "operation": "modified"}, {"path": "README.md", "operation": "updated"}]',
+        )
+        self.assertGreater(rich_score, poor_score)
+        self.assertGreaterEqual(rich_score, 4.0)
+    def test_plain_text_ground_truth_still_uses_substring_match(self) -> None:
+        evaluator = COMPRESSION_EVALUATOR.CompressionEvaluator()
+        exact_score = evaluator._heuristic_score(
+            {"id": "continuity_work_state"},
+            "Next: fix the websocket timeout before rerunning tests.",
+            "fix the websocket timeout",
+        )
+        missing_score = evaluator._heuristic_score(
+            {"id": "continuity_work_state"},
+            "Next: inspect logs again.",
+            "fix the websocket timeout",
+        )
+        self.assertGreater(exact_score, missing_score)
+if __name__ == "__main__":
+    unittest.main()
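
Reviewer note (not part of the package): the diff shows only the tests, not `scripts/compression_evaluator.py`, so the scoring logic is not visible here. The sketch below is one hypothetical scorer that would satisfy both test cases, assuming a 1.0 to 5.0 scale, case-insensitive matching, and that JSON ground truth is scored by the fraction of its string values mentioned in the response. The names `heuristic_score` and `string_leaves` are illustrative and do not come from the package.

```python
import json
from typing import Any, Iterator


def string_leaves(node: Any) -> Iterator[str]:
    """Yield every string value nested inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from string_leaves(value)
    elif isinstance(node, list):
        for item in node:
            yield from string_leaves(item)


def heuristic_score(response: str, ground_truth: str) -> float:
    """Hypothetical 1.0-5.0 score; NOT the package's _heuristic_score."""
    text = response.lower()
    try:
        parsed = json.loads(ground_truth)
    except json.JSONDecodeError:
        parsed = None
    # Only JSON objects/arrays contribute terms; anything else is plain text.
    terms = sorted(set(string_leaves(parsed))) if isinstance(parsed, (dict, list)) else []
    if not terms:
        # Plain-text ground truth: fall back to a substring match.
        return 5.0 if ground_truth.lower() in text else 1.0
    hits = sum(1 for term in terms if term.lower() in text)
    return 1.0 + 4.0 * hits / len(terms)
```

With the JSON ground truth from the first test, the detailed response mentions all four string values and the vague response mentions none, which is the ordering the assertions check.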

package/.agents/skills/context-degradation/SKILL.md ADDED

@@ -0,0 +1,206 @@
+---
+name: context-degradation
+description: This skill should be used when the user asks to "diagnose context problems", "fix lost-in-middle issues", "debug agent failures", "understand context poisoning", or mentions context degradation, attention patterns, context clash, context confusion, or agent performance degradation. A core context engineering skill — also activates when the user mentions "context engineering" or "context-engineering" in the context of diagnosing and mitigating context failures.
+---
+# Context Degradation Patterns
+Diagnose and fix context failures before they cascade. Context degradation is not binary — it is a continuum that manifests through five distinct, predictable patterns: lost-in-middle, poisoning, distraction, confusion, and clash. Each pattern has specific detection signals and mitigation strategies. Treat degradation as an engineering problem with measurable thresholds, not an unpredictable failure mode.
+## When to Activate
+Activate this skill when:
+- Agent performance degrades unexpectedly during long conversations
+- Debugging cases where agents produce incorrect or irrelevant outputs
+- Designing systems that must handle large contexts reliably
+- Evaluating context engineering choices for production systems
+- Investigating "lost in middle" phenomena in agent outputs
+- Analyzing context-related failures in agent behavior
+## Core Concepts
+Structure context placement around the attention U-curve: beginning and end positions receive reliable attention, while middle positions suffer 10-40% reduced recall accuracy (Liu et al., 2023). This is not a model bug but a consequence of attention mechanics — the first token (often BOS) acts as an "attention sink" that absorbs disproportionate attention budget, leaving middle tokens under-attended as context grows.
+Treat context poisoning as a circuit breaker problem. Once a hallucination, tool error, or incorrect retrieved fact enters context, it compounds through repeated self-reference. A poisoned goals section causes every downstream decision to reinforce incorrect assumptions. Detection requires tracking claim provenance; recovery requires truncating to before the poisoning point or restarting with verified-only context.
+Filter aggressively before loading context — even a single irrelevant document measurably degrades performance on relevant tasks. Models cannot "skip" irrelevant context; they must attend to everything provided, creating attention competition between relevant and irrelevant content. Move information that might be needed but is not immediately relevant behind tool calls instead of pre-loading it.
+Isolate task contexts to prevent confusion. When context contains multiple task types or switches between objectives, models incorporate constraints from the wrong task, call tools appropriate for a different context, or blend requirements from multiple sources. Explicit task segmentation with separate context windows eliminates cross-contamination.
+Resolve context clash through priority rules, not accumulation. When multiple correct-but-contradictory sources appear in context (version conflicts, perspective conflicts, multi-source retrieval), models cannot determine which applies. Mark contradictions explicitly, establish source precedence, and filter outdated versions before they enter context.
+## Detailed Topics
+### Lost-in-Middle: Detection and Placement Strategy
+Place critical information at the beginning and end of context, never in the middle. The U-shaped attention curve means middle-positioned information suffers 10-40% reduced recall accuracy. For contexts over 4K tokens, this effect becomes significant.
+Use summary structures that surface key findings at attention-favored positions. Add explicit section headers and structural markers — these help models navigate long contexts by creating attention anchors. When a document must be included in full, prepend a summary of its key points and append the critical conclusions.
+Monitor for lost-in-middle symptoms: correct information exists in context but the model ignores it, responses contradict provided data, or the model "forgets" instructions given earlier in a long prompt.
+### Context Poisoning: Prevention and Recovery
+Validate all external inputs before they enter context. Tool outputs, retrieved documents, and model-generated summaries are the three primary poisoning vectors. Each introduces unverified claims that subsequent reasoning treats as ground truth.
+Detect poisoning through these signals: degraded output quality on previously-successful tasks, tool misalignment (wrong tools or parameters), and hallucinations that persist despite explicit correction. When these cluster, suspect poisoning rather than model capability issues.
+Recover by removing poisoned content, not by adding corrections on top. Truncate to before the poisoning point, restart with clean context preserving only verified information, or explicitly mark the poisoned section and request re-evaluation from scratch. Layering corrections over poisoned context rarely works — the original errors retain attention weight.
+### Context Distraction: Curation Over Accumulation
+Curate what enters context rather than relying on models to ignore irrelevant content. Research shows even a single distractor document triggers measurable performance degradation — the effect follows a step function, not a linear curve. Multiple distractors compound the problem.
+Apply relevance filtering before loading retrieved documents. Use namespacing and structural organization to make section boundaries clear. Prefer tool-call-based access over pre-loading: store reference material behind retrieval tools so it enters context only when directly relevant to the current reasoning step.
+### Context Confusion: Task Isolation
+Segment different tasks into separate context windows. Context confusion is distinct from distraction — it concerns the model applying wrong-context constraints to the current task, not just attention dilution. Signs include responses addressing the wrong aspect of a query, tool calls appropriate for a different task, and outputs mixing requirements from multiple sources.
+Implement clear transitions between task contexts. Use state management that isolates objectives, constraints, and tool definitions per task. When task-switching within a single session is unavoidable, use explicit "context reset" markers that signal which constraints apply to the current segment.
+### Context Clash: Conflict Resolution Protocols
+Establish source priority rules before conflicts arise. Context clash differs from poisoning — multiple pieces of information are individually correct but mutually contradictory (version conflicts, perspective differences, multi-source retrieval with divergent facts).
+Implement version filtering to exclude outdated information before it enters context. When contradictions are unavoidable, mark them explicitly with structured conflict annotations: state what conflicts, which source each claim comes from, and which source takes precedence. Without explicit priority rules, models resolve contradictions unpredictably.
+### Empirical Benchmarks and Thresholds
+Use these benchmarks to set design constraints — not as universal truths. The RULER benchmark found only 50% of models claiming 32K+ context maintain satisfactory performance at that length. Near-perfect needle-in-haystack scores do not predict real-world long-context performance.
+**Model-Specific Degradation Thresholds**
+Degradation onset varies significantly by model family and task type. As a general rule, expect degradation to begin at 60-70% of the advertised context window for complex retrieval tasks. Key patterns:
+- **Models with extended thinking** reduce hallucination through step-by-step verification but at higher latency and token cost
+- **Models optimized for agents/coding** tend to have better attention management for tool-output-heavy contexts
+- **Models with very large context windows (1M+)** handle more raw context but still follow U-shaped degradation curves — bigger windows do not eliminate the problem, they delay it
+Always benchmark degradation thresholds with your specific workload rather than relying on published benchmarks. Model-specific thresholds go stale with each model update (see Gotcha 2).
+### Counterintuitive Findings
+Account for these research-backed surprises when designing context strategies:
+**Shuffled context can outperform coherent context.** Studies found incoherent (shuffled) haystacks produce better retrieval performance than logically ordered ones. Coherent context creates false associations that confuse retrieval; incoherent context forces exact matching. Do not assume that better-organized context always yields better results — test both arrangements.
+**Single distractors have outsized impact.** The performance hit from one irrelevant document is disproportionately large compared to adding more distractors after the first. Treat distractor prevention as binary: either keep context clean or accept significant degradation.
+**Low needle-question similarity accelerates degradation.** Tasks requiring inference across dissimilar content degrade faster with context length than tasks with high surface-level similarity. Design retrieval to maximize semantic overlap between queries and retrieved content.
+### When Larger Contexts Hurt
+Do not assume larger context windows improve performance. Performance remains stable up to a model-specific threshold, then degrades rapidly — the curve is non-linear with a cliff edge, not a gentle slope. For many models, meaningful degradation begins at 8K-16K tokens even when windows support much larger sizes.
+Factor in cost: processing a 400K token context can cost more than twice as much as 200K in both time and compute, because standard self-attention scales roughly quadratically with sequence length rather than linearly. For many applications, this makes large-context processing economically impractical.
+Recognize the cognitive bottleneck: even with infinite context, asking a single model to maintain quality across dozens of independent tasks creates degradation that more context cannot solve. Split tasks across sub-agents instead of expanding context.
+## Practical Guidance
+### The Four-Bucket Mitigation Framework
+Apply these four strategies based on which degradation pattern is active:
+**Write** — Save context outside the window using scratchpads, file systems, or external storage. Use when context utilization exceeds 70% of the window. This keeps active context lean while preserving information access through tool calls.
+**Select** — Pull only relevant context into the window through retrieval, filtering, and prioritization. Use when distraction or confusion symptoms appear. Apply relevance scoring before loading; exclude anything below threshold rather than including everything available.
+**Compress** — Reduce tokens while preserving information through summarization, abstraction, and observation masking. Use when context is growing but all content is relevant. Replace verbose tool outputs with compact structured summaries; abstract repeated patterns into single references.
+**Isolate** — Split context across sub-agents or sessions to prevent any single context from growing past its degradation threshold. Use when confusion or clash symptoms appear, or when tasks are independent. This is the most aggressive strategy but often the most effective for complex multi-task systems.
+### Architectural Patterns for Resilience
+Implement just-in-time context loading: retrieve information only when the current reasoning step needs it, not preemptively. Use observation masking to replace verbose tool outputs with compact references after processing. Deploy sub-agent architectures where each agent holds only task-relevant context. Trigger compaction before context exceeds the model-specific degradation onset threshold — not after symptoms appear.
+## Examples
+**Example 1: Detecting Degradation**
+```yaml
+# Context grows during long conversation
+turn_1: 1000 tokens
+turn_5: 8000 tokens
+turn_10: 25000 tokens
+turn_20: 60000 tokens (degradation begins)
+turn_30: 90000 tokens (significant degradation)
+```
+**Example 2: Mitigating Lost-in-Middle**
+```markdown
+# Organize context with critical info at edges
+[CURRENT TASK]                      # At start
+- Goal: Generate quarterly report
+- Deadline: End of week
+[DETAILED CONTEXT]                  # Middle (less attention)
+- 50 pages of data
+- Multiple analysis sections
+- Supporting evidence
+[KEY FINDINGS]                     # At end
+- Revenue up 15%
+- Costs down 8%
+- Growth in Region A
+```
+## Guidelines
+1. Monitor context length and performance correlation during development
+2. Place critical information at beginning or end of context
+3. Implement compaction triggers before degradation becomes severe
+4. Validate retrieved documents for accuracy before adding to context
+5. Use versioning to prevent outdated information from causing clash
+6. Segment tasks to prevent context confusion across different objectives
+7. Design for graceful degradation rather than assuming perfect conditions
+8. Test with progressively larger contexts to find degradation thresholds
+## Gotchas
+1. **Normal variance looks like degradation**: Model output quality fluctuates naturally across runs. Do not diagnose degradation from a single drop in quality — establish a baseline over multiple runs and look for sustained, correlated decline tied to context growth. A 5-10% quality dip on one run is noise; the same dip consistently appearing after 40K tokens is signal.
+2. **Model-specific thresholds go stale**: The degradation onset values in published benchmarks reflect specific model versions. Provider updates, fine-tuning changes, and infrastructure shifts can move thresholds by 20-50% in either direction. Re-benchmark quarterly and after any major model update rather than treating published thresholds as permanent.
+3. **Needle-in-haystack scores create false confidence**: A model scoring 99% on needle-in-haystack does not mean it handles 128K tokens well in production. Needle tests measure single-fact retrieval from passive context — real workloads require multi-fact reasoning, instruction following, and synthesis across the full window. Use task-specific benchmarks that mirror actual workload patterns.
+4. **Contradictory retrieved documents poison silently**: When a RAG pipeline retrieves two documents that disagree on a fact, the model may silently pick one without signaling the conflict. This looks like a correct response but is effectively random. Implement contradiction detection in the retrieval layer before documents enter context.
+5. **Prompt quality problems masquerade as degradation**: Poor prompt structure (ambiguous instructions, missing constraints, unclear task framing) produces symptoms identical to context degradation — inconsistent outputs, ignored instructions, wrong tool usage. Before diagnosing degradation, verify the same prompt works correctly at low context lengths. If it fails at 2K tokens, the problem is the prompt, not the context.
+6. **Degradation is non-linear with a cliff edge**: Performance does not degrade gradually — it holds steady until a model-specific threshold, then drops sharply. Systems designed for "graceful degradation" often miss this pattern because monitoring checks assume linear decline. Set compaction triggers well before the cliff (at 70% of known onset), not at the onset itself.
+7. **Over-organizing context can backfire**: Intuitively, well-structured and coherent context should outperform disorganized content. Research shows shuffled haystacks sometimes outperform coherent ones for retrieval tasks because coherent context creates false associations. Test whether heavy structural formatting actually helps for the specific task — do not assume it does.
+## Integration
+This skill builds on context-fundamentals and should be studied after understanding basic context concepts. It connects to:
+- context-optimization - Techniques for mitigating degradation
+- multi-agent-patterns - Using isolation to prevent degradation
+- evaluation - Measuring and detecting degradation in production
+## References
+Internal reference:
+- [Degradation Patterns Reference](./references/patterns.md) - Read when: debugging a specific degradation pattern and needing implementation-level detection code (attention analysis, poisoning tracking, relevance scoring, recovery procedures)
+Related skills in this collection:
+- context-fundamentals - Read when: lacking foundational understanding of context windows, token budgets, or placement mechanics
+- context-optimization - Read when: degradation is diagnosed and specific mitigation techniques (compaction, compression, masking) are needed
+- evaluation - Read when: setting up production monitoring to detect degradation before it impacts users
+External resources:
+- Liu et al., 2023 "Lost in the Middle" - Read when: needing primary research backing for U-shaped attention claims or designing position-aware context layouts
+- RULER benchmark documentation - Read when: evaluating model claims about long-context support or comparing models for context-heavy workloads
+- Production engineering guides from AI labs - Read when: implementing context management in production infrastructure
+---
+## Skill Metadata
+**Created**: 2025-12-20
+**Last Updated**: 2026-03-17
+**Author**: Agent Skills for Context Engineering Contributors
+**Version**: 2.0.0
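
Reviewer note (not part of the package): two rules in this skill are concrete enough to sketch, namely "set compaction triggers at 70% of known onset" (Gotcha 6) and the edge-placement layout of Example 2. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, the 60000-token onset is read off Example 1 rather than measured, and context size is treated as a plain token count.

```python
from typing import Dict, List, Optional


def compaction_trigger(onset_tokens: int, safety_percent: int = 70) -> int:
    """Token count at which to compact, placed before the degradation onset."""
    # Integer arithmetic avoids float rounding on the percentage.
    return onset_tokens * safety_percent // 100


def first_turn_over_trigger(turn_sizes: Dict[int, int], onset_tokens: int) -> Optional[int]:
    """Return the earliest turn whose context size reaches the trigger, if any."""
    trigger = compaction_trigger(onset_tokens)
    for turn in sorted(turn_sizes):
        if turn_sizes[turn] >= trigger:
            return turn
    return None


def layout_with_edges(task: List[str], details: List[str], findings: List[str]) -> str:
    """Put the task first and key findings last; bulk detail goes in the middle."""
    sections = [
        ("[CURRENT TASK]", task),
        ("[DETAILED CONTEXT]", details),
        ("[KEY FINDINGS]", findings),
    ]
    lines = []
    for header, items in sections:
        lines.append(header)
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


# Turn sizes from Example 1; the 60000-token onset is an assumption taken from it.
turns = {1: 1000, 5: 8000, 10: 25000, 20: 60000, 30: 90000}
print(compaction_trigger(60000))              # 42000
print(first_turn_over_trigger(turns, 60000))  # 20
```

With these numbers the 42000-token trigger falls between the sampled turns 10 and 20, so a monitor that only checks at those turns would first fire at the onset itself; checking size on every turn keeps the safety margin meaningful.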

package/.agents/skills/context-degradation/references/patterns.md ADDED

@@ -0,0 +1,314 @@
+# Context Degradation Patterns: Technical Reference
+This document provides technical details on diagnosing and measuring context degradation.
+## Attention Distribution Analysis
+### U-Shaped Curve Measurement
+Measure attention distribution across context positions:
+```python
+def measure_attention_distribution(model, context_tokens, query):
+    """
+    Measure how attention varies across context positions.
+    Returns distribution showing attention weight by position.
+    """
+    attention_by_position = []
+    for position in range(len(context_tokens)):
+        # Measure model's attention to this position
+        attention = get_attention_weights(model, context_tokens, query, position)
+        attention_by_position.append({
+            "position": position,
+            "attention": attention,
+            "is_beginning": position < len(context_tokens) * 0.1,
+            "is_end": position > len(context_tokens) * 0.9,
+            # Middle is everything outside the first and last 10% of positions
+            "is_middle": len(context_tokens) * 0.1 <= position <= len(context_tokens) * 0.9,
+        })
+    # Classify positions
+    for item in attention_by_position:
+        if item["is_beginning"] or item["is_end"]:
+            item["region"] = "attention_favored"
+        else:
+            item["region"] = "attention_degraded"
+    return attention_by_position
+```
+### Lost-in-Middle Detection
+Detect when critical information falls in degraded attention regions:
+```python
+def detect_lost_in_middle(critical_positions, attention_distribution):
+    """
+    Check if critical information is in attention-favored positions.
+    Args:
+        critical_positions: List of positions containing critical info
+        attention_distribution: Output from measure_attention_distribution
+    Returns:
+        Dictionary with detection results and recommendations
+    """
+    results = {
+        "at_risk": [],
+        "safe": [],
+        "recommendations": []
+    }
+    for pos in critical_positions:
+        region = attention_distribution[pos]["region"]
+        if region == "attention_degraded":
+            results["at_risk"].append(pos)
+        else:
+            results["safe"].append(pos)
+    # Generate recommendations
+    if results["at_risk"]:
+        results["recommendations"].extend([
+            "Move critical information to attention-favored positions",
+            "Use explicit markers to highlight critical information",
+            "Consider splitting context to reduce middle section"
+        ])
+    return results
+```
+## Context Poisoning Detection
+### Hallucination Tracking
+Track potential hallucinations across conversation turns:
+```python
+class HallucinationTracker:
+    def __init__(self):
+        self.claims = []
+        self.verifications = []
+    def add_claims(self, text):
+        """Extract claims from text for later verification."""
+        claims = extract_claims(text)
+        self.claims.extend([{"text": c, "verified": None} for c in claims])
+    def verify_claims(self, ground_truth):
+        """Verify claims against ground truth."""
+        for claim in self.claims:
+            if claim["verified"] is None:
+                claim["verified"] = check_claim(claim["text"], ground_truth)
+    def get_poisoning_indicators(self):
+        """
+        Return indicators of potential context poisoning.
+        High ratio of unverified claims suggests poisoning risk.
+        """
+        # Count only claims not yet checked; assumes check_claim returns True or False
+        unverified = sum(1 for c in self.claims if c["verified"] is None)
+        verified_false = sum(1 for c in self.claims if c["verified"] is False)
+        return {
+            "unverified_count": unverified,
+            "false_count": verified_false,
+            "poisoning_risk": verified_false > 0 or unverified > len(self.claims) * 0.3
+        }
+```
+### Error Propagation Analysis
+Track how errors flow through context:
+```python
+def analyze_error_propagation(context, error_points):
+    """
+    Analyze how errors at specific points affect downstream context.
+    Returns an impact map of the error spread and a severity assessment.
+    """
+    impact_map = {}
+    for error_point in error_points:
+        # Find all references to content after error point
+        downstream_refs = find_references(context, after=error_point)
+        for ref in downstream_refs:
+            if ref not in impact_map:
+                impact_map[ref] = []
+            impact_map[ref].append({
+                "source": error_point,
+                "type": classify_error_type(context[error_point])
+            })
+    # Assess severity
+    high_impact_areas = [k for k, v in impact_map.items() if len(v) > 3]
+    return {
+        "impact_map": impact_map,
+        "high_impact_areas": high_impact_areas,
+        "requires_intervention": len(high_impact_areas) > 0
+    }
+```
+## Distraction Metrics
+### Relevance Scoring
+Score relevance of context elements to current task:
+```python
+def score_context_relevance(context_elements, task_description):
+    """
+    Score each context element for relevance to current task.
+    Returns scores and identifies high-distraction elements.
+    """
+    task_embedding = embed(task_description)
+    scored_elements = []
+    for i, element in enumerate(context_elements):
+        element_embedding = embed(element)
+        relevance = cosine_similarity(task_embedding, element_embedding)
+        scored_elements.append({
+            "index": i,
+            "content_preview": element[:100],
+            "relevance_score": relevance
+        })
+    # Sort by relevance
+    scored_elements.sort(key=lambda x: x["relevance_score"], reverse=True)
+    # Identify potential distractors
+    threshold = calculate_relevance_threshold(scored_elements)
+    distractors = [e for e in scored_elements if e["relevance_score"] < threshold]
+    return {
+        "scored_elements": scored_elements,
+        "distractors": distractors,
+        "recommendation": f"Consider removing {len(distractors)} low-relevance elements"
+    }
+```
+## Degradation Monitoring System
+### Context Health Dashboard
+Implement continuous monitoring of context health:
+```python
+class ContextHealthMonitor:
+    def __init__(self, model, context_window_limit):
+        self.model = model
+        self.limit = context_window_limit
+        self.metrics = []
+    def assess_health(self, context, task):
+        """
+        Assess overall context health for current task.
+        Returns composite score and component metrics.
+        """
+        metrics = {
+            "token_count": len(context),
+            "utilization_ratio": len(context) / self.limit,
+            "attention_distribution": measure_attention_distribution(self.model, context, task),
+            "relevance_scores": score_context_relevance(context, task),
+            "age_tokens": count_recent_tokens(context)
+        }
+        # Calculate composite health score
+        health_score = self._calculate_composite(metrics)
+        result = {
+            "health_score": health_score,
+            "metrics": metrics,
+            "status": self._interpret_score(health_score),
+            "recommendations": self._generate_recommendations(metrics)
+        }
+        self.metrics.append(result)
+        return result
+    def _calculate_composite(self, metrics):
+        """Calculate composite health score from components."""
+        # Weighted combination of metrics
+        utilization_penalty = min(metrics["utilization_ratio"] * 0.5, 0.3)
+        attention_penalty = self._calculate_attention_penalty(metrics["attention_distribution"])
+        relevance_penalty = self._calculate_relevance_penalty(metrics["relevance_scores"])
+        base_score = 1.0
+        score = base_score - utilization_penalty - attention_penalty - relevance_penalty
+        return max(0, score)
+    def _interpret_score(self, score):
+        """Interpret health score and return status."""
+        if score > 0.8:
+            return "healthy"
+        elif score > 0.6:
+            return "warning"
+        elif score > 0.4:
+            return "degraded"
+        else:
+            return "critical"
+```
+### Alert Thresholds
+Configure appropriate alert thresholds:
+```python
+CONTEXT_ALERTS = {
+    "utilization_warning": 0.7,      # 70% of context limit
+    "utilization_critical": 0.9,     # 90% of context limit
+    "attention_degraded_ratio": 0.3, # 30% in middle region
+    "relevance_threshold": 0.3,      # Below 30% relevance
+    "consecutive_warnings": 3        # Three warnings triggers alert
+}
+```
+## Recovery Procedures
+### Context Truncation Strategy
+When context degrades beyond recovery, truncate strategically:
+```python
+def truncate_context_for_recovery(context, preserved_elements, target_size):
+    """
+    Truncate context while preserving critical elements.
+    Strategy:
+    1. Preserve system prompt and tool definitions
+    2. Preserve recent conversation turns
+    3. Preserve critical retrieved documents
+    4. Summarize older content if needed
+    5. Truncate from middle if still over target
+    """
+    truncated = []
+    # Category 1: Critical system elements (preserve always)
+    system_elements = extract_system_elements(context)
+    truncated.extend(system_elements)
+    # Category 2: Recent conversation (preserve more)
+    recent_turns = extract_recent_turns(context, num_turns=10)
+    truncated.extend(recent_turns)
+    # Category 3: Critical documents (preserve key ones)
+    critical_docs = extract_critical_documents(context, preserved_elements)
+    truncated.extend(critical_docs)
+    # Check size and summarize if needed
+    while len(truncated) > target_size:
+        size_before = len(truncated)
+        # Summarize oldest category 3 elements
+        truncated = summarize_oldest(truncated, category="documents")
+        # If still too large, truncate oldest turns
+        if len(truncated) > target_size:
+            truncated = truncate_oldest_turns(truncated, keep_recent=5)
+        # Stop if neither step shrank the context, to avoid looping forever
+        if len(truncated) >= size_before:
+            break
+    return truncated
+```
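
Reviewer note (not part of the package): the `CONTEXT_ALERTS` table above defines thresholds, but the reference does not show how they are evaluated. The sketch below is one possible reading, assuming that utilization is `token_count / context_limit`, that a run of `consecutive_warnings` warning-level checks escalates to an alert, and that any non-warning check resets the run. `AlertTracker` and its status strings are hypothetical names; only the threshold values are taken from the file.

```python
from typing import Dict, Optional

# Threshold values copied from the reference; only three are used in this sketch.
CONTEXT_ALERTS = {
    "utilization_warning": 0.7,
    "utilization_critical": 0.9,
    "consecutive_warnings": 3,
}


class AlertTracker:
    """Hypothetical evaluator for the utilization thresholds above."""

    def __init__(self, thresholds: Optional[Dict[str, float]] = None) -> None:
        self.thresholds = thresholds or CONTEXT_ALERTS
        self.warning_streak = 0

    def check(self, token_count: int, context_limit: int) -> str:
        """Return 'ok', 'warning', 'alert', or 'critical' for one observation."""
        utilization = token_count / context_limit
        if utilization >= self.thresholds["utilization_critical"]:
            self.warning_streak = 0
            return "critical"
        if utilization >= self.thresholds["utilization_warning"]:
            self.warning_streak += 1
            if self.warning_streak >= self.thresholds["consecutive_warnings"]:
                return "alert"
            return "warning"
        # Below the warning level: the run of warnings is broken.
        self.warning_streak = 0
        return "ok"


tracker = AlertTracker()
for tokens in (50_000, 75_000, 75_000, 75_000, 95_000):
    print(tokens, tracker.check(tokens, context_limit=100_000))
```

Requiring several consecutive warnings before alerting is in line with the skill's advice to separate sustained decline from single-run noise; whether a critical reading should also reset the streak is a design choice left open here.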