@botlearn/summarizer 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 BotLearn
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,35 @@
+ # @botlearn/summarizer
+
+ > Content summarization with core argument extraction, key detail retention, and accuracy self-checking for OpenClaw Agent
+
+ ## Installation
+
+ ```bash
+ # via npm
+ npm install @botlearn/summarizer
+
+ # via clawhub
+ clawhub install @botlearn/summarizer
+ ```
+
+ ## Category
+
+ content-processing
+
+ ## Dependencies
+
+ None
+
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `manifest.json` | Skill metadata and configuration |
+ | `skill.md` | Role definition and activation rules |
+ | `knowledge/` | Domain knowledge documents |
+ | `strategies/` | Behavioral strategy definitions |
+ | `tests/` | Smoke and benchmark tests |
+
+ ## License
+
+ MIT
package/knowledge/anti-patterns.md ADDED
@@ -0,0 +1,76 @@
+ ---
+ domain: summarizer
+ topic: anti-patterns
+ priority: medium
+ ttl: 30d
+ ---
+
+ # Content Summarization — Anti-Patterns
+
+ ## Detail Loss Anti-Patterns
+
+ ### 1. Dropping Numbers and Dates
+ - **Problem**: Replacing specific quantitative data with vague language — e.g., summarizing "Revenue increased 34% to $4.2B in Q3 2024" as "Revenue increased significantly"
+ - **Impact**: The summary loses verifiability and decision-making value; "significantly" is subjective and unanchored
+ - **Fix**: Always carry forward key numbers, percentages, dates, and monetary values. If compression requires it, round numbers but never remove them. "$4.2B (up 34%, Q3 2024)" is acceptable compression; "revenue grew" is not
+
+ ### 2. Stripping Proper Nouns
+ - **Problem**: Replacing named entities with generic references — e.g., "The company announced" instead of "Anthropic announced"
+ - **Impact**: Loses attribution and specificity; makes the summary ambiguous when multiple entities are discussed
+ - **Fix**: Retain proper nouns for all entities central to the argument. Generic references are only acceptable for minor entities mentioned in passing
+
+ ### 3. Losing Temporal Context
+ - **Problem**: Omitting dates, time periods, or temporal qualifiers — e.g., "The study found correlation" instead of "A 2024 longitudinal study spanning 5 years found correlation"
+ - **Impact**: The reader cannot assess recency or relevance; conflates historical and current findings
+ - **Fix**: Include the year and study duration for research findings; include dates for events; include version numbers for technology
+
+ ## Structural Anti-Patterns
+
+ ### 4. Over-Generalization
+ - **Problem**: Collapsing specific, differentiated claims into a single vague statement — e.g., summarizing three distinct policy recommendations as "Several recommendations were made"
+ - **Impact**: Destroys the actionable specificity that makes the source valuable
+ - **Fix**: If the source lists N specific items and the summary can only fit M < N, select the M most important by impact or novelty and state them explicitly. Never replace a list with "several" or "various"
+
+ ### 5. Position Bias (Lead Bias)
+ - **Problem**: Over-representing content from the beginning of the document and under-representing content from the middle or end
+ - **Impact**: Conclusions, counter-arguments, and caveats (which typically appear later) are systematically omitted, producing a misleadingly one-sided summary
+ - **Fix**: Scan the entire document before summarizing. Use argument mapping to identify structurally important content regardless of position. Explicitly check: "Does my summary include content from the final third of the document?"
+
+ ### 6. Recency Bias (Tail Bias)
+ - **Problem**: The opposite of lead bias — over-representing the most recently read content because it is freshest in memory
+ - **Impact**: The opening thesis and early evidence are underweighted
+ - **Fix**: Build the argument map before writing the summary. Work from the map, not from memory of the reading order
+
+ ### 7. Flattening Argument Structure
+ - **Problem**: Presenting all claims at the same level, losing the hierarchical relationship between main arguments and supporting evidence
+ - **Impact**: The reader cannot distinguish the central thesis from supporting details; the summary reads as a list of disconnected facts
+ - **Fix**: Use the claim-evidence hierarchy from knowledge/domain.md. Lead with the thesis, support with top-level claims, and use indentation or explicit markers to show evidence supporting each claim
+
+ ## Accuracy Anti-Patterns
+
+ ### 8. Inference Injection
+ - **Problem**: Adding conclusions or causal claims that the source does not make — e.g., the source says "X and Y are correlated" but the summary says "X causes Y"
+ - **Impact**: The summary makes stronger claims than the source, potentially misleading the reader
+ - **Fix**: Preserve the source's hedging language. If the source says "suggests", "may", "correlates with", do not upgrade to "proves", "will", "causes". If you catch yourself adding a causal word not in the source, flag it
+
+ ### 9. Misattribution
+ - **Problem**: Attributing a claim to the wrong entity — e.g., attributing a critic's objection to the author, or attributing one researcher's finding to another
+ - **Impact**: Fundamentally changes the meaning; can misrepresent positions in a debate
+ - **Fix**: When the source contains multiple voices or perspectives, track attribution explicitly in the argument map. Verify each claim-attribution pair before including it in the summary
+
+ ### 10. False Equivalence in Multi-Document Summaries
+ - **Problem**: Giving equal weight to a well-evidenced majority view and a poorly supported minority view, or vice versa
+ - **Impact**: Distorts the landscape of expert consensus; can give fringe positions undue prominence
+ - **Fix**: Weight claims by evidence quality (see argument strength in knowledge/domain.md). If 8 sources agree and 1 disagrees, state the consensus and note the dissent — do not present them as equally supported
+
+ ## Output Anti-Patterns
+
+ ### 11. Wall-of-Text Summaries
+ - **Problem**: Producing a summary as a single dense paragraph with no structural markers
+ - **Impact**: Defeats the purpose of summarization; the reader must re-read to find specific points
+ - **Fix**: Use structural formatting — bullet points, numbered lists, or short paragraphs with clear topic sentences. Match the output structure to the user's likely use case (scanning vs. reading)
+
+ ### 12. Summary Longer Than Necessary
+ - **Problem**: Including filler phrases ("It is worth noting that", "In conclusion, it can be said that") or restating the same point in multiple ways
+ - **Impact**: Inflates the summary length without adding information; wastes the reader's time
+ - **Fix**: After drafting, apply a self-edit pass: can any sentence be removed without losing information? If yes, remove it. Target zero-filler summaries
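Anti-pattern 8 (inference injection) admits a mechanical spot-check. The sketch below is illustrative only and not part of this package; the causal word list is an assumption and deliberately not exhaustive. It flags causal verbs that appear in a summary but nowhere in the source:

```python
import re

# Words that upgrade a claim's strength; illustrative list, not exhaustive.
CAUSAL_WORDS = {"causes", "caused", "proves", "proved", "will"}

def flag_causal_upgrades(source: str, summary: str) -> set[str]:
    """Return causal words used in the summary but absent from the source."""
    def words(text: str) -> set[str]:
        return set(re.findall(r"[a-z']+", text.lower()))
    return (words(summary) & CAUSAL_WORDS) - words(source)
```

Any non-empty result is a candidate violation to review manually, since word-level matching cannot see context.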
package/knowledge/best-practices.md ADDED
@@ -0,0 +1,130 @@
+ ---
+ domain: summarizer
+ topic: extractive-abstractive-techniques-and-detail-preservation
+ priority: high
+ ttl: 30d
+ ---
+
+ # Content Summarization — Best Practices
+
+ ## Extractive vs. Abstractive Summarization
+
+ ### Extractive Summarization
+ Selects and assembles existing sentences or phrases from the source without modification.
+
+ **When to use extractive**:
+ - Legal, medical, or regulatory documents where precise wording carries meaning
+ - When the user explicitly requests "in the author's own words"
+ - When quantitative data is dense and paraphrasing risks numerical errors
+ - For attributing specific claims — direct quotes preserve accountability
+
+ **Technique**:
+ 1. Score each sentence by information density (see knowledge/domain.md)
+ 2. Select top-N sentences by density score, preserving document order
+ 3. Add minimal connective tissue between extracted sentences for readability
+ 4. Indicate extracted passages with source markers (e.g., [para 3], [section 2.1])
+
+ ### Abstractive Summarization
+ Generates new sentences that capture the meaning of the source in compressed form.
+
+ **When to use abstractive**:
+ - General-purpose summaries where readability is the priority
+ - When the compression ratio exceeds 5:1 (extractive becomes choppy at high compression)
+ - When synthesizing multiple documents into a unified summary
+ - When the source is poorly written and extraction would perpetuate unclear language
+
+ **Technique**:
+ 1. Build an argument map (see knowledge/domain.md) to capture logical structure
+ 2. Rewrite each branch of the argument map as a concise statement
+ 3. Verify every factual claim in the rewritten text against the source
+ 4. Flag any inference you made that goes beyond the source material
+
+ ### Hybrid Approach (Recommended Default)
+ Combine extractive and abstractive methods for optimal accuracy and readability:
+ - **Abstractive** for narrative flow, transitions, and structural framing
+ - **Extractive** for quantitative data, direct quotes, and precise technical claims
+ - Mark the boundary: use quotation marks or source markers for extracted content
+
+ ## Detail Preservation Framework
+
+ ### Tier 1 — Must Preserve (Never Omit)
+ These details, if present in the source, must appear in any summary regardless of length:
+ - **Central thesis or conclusion** — The main point of the document
+ - **Quantitative results** — Key numbers, percentages, financial figures, dates
+ - **Named entities driving the narrative** — People, organizations, products central to the argument
+ - **Causal claims with evidence** — "X caused Y" where Y is a significant outcome
+ - **Contradictions or caveats** — If the source acknowledges limitations, the summary must too
+
+ ### Tier 2 — Preserve When Space Permits
+ - Supporting examples (keep the strongest one per claim)
+ - Methodological details (sample size, data sources, time period)
+ - Secondary claims that reinforce but do not introduce new information
+ - Historical context that frames the current discussion
+
+ ### Tier 3 — Safe to Omit
+ - Repeated examples after the first strong one
+ - Author biography or credentials (unless relevant to credibility assessment)
+ - Acknowledgments, funding disclosures, boilerplate
+ - Detailed literature reviews (summarize as "building on prior work by [key names]")
+
+ ## Multi-Document Summarization
+
+ When summarizing across multiple sources, additional practices apply:
+
+ ### 1. Claim Reconciliation
+ - Identify claims that appear across multiple sources → strengthen confidence
+ - Identify contradictory claims → present both with attribution
+ - Identify claims unique to a single source → note as "according to [source]"
+
+ ### 2. Overlap Deduplication
+ - When multiple sources state the same fact, consolidate into one statement
+ - Cite the most authoritative source for the consolidated claim
+ - Note the number of corroborating sources if relevant (e.g., "confirmed by 3 independent studies")
+
+ ### 3. Gap Analysis
+ - After merging, check: are there aspects of the topic that no source covers?
+ - Flag gaps explicitly: "None of the sources address [aspect]"
+ - This prevents the user from assuming comprehensive coverage when gaps exist
+
+ ### 4. Synthesis Structure
+ For multi-document summaries, organize by theme rather than by source:
+
+ ```
+ Theme A
+ ├── Source 1 finding
+ ├── Source 2 finding (corroborates)
+ └── Source 3 finding (contradicts — note discrepancy)
+
+ Theme B
+ ├── Source 1 finding (only source)
+ └── [Gap: not addressed by other sources]
+ ```
+
+ ## Compression Ratio Guidelines
+
+ | User Request | Target Ratio | Strategy |
+ |-------------|-------------|----------|
+ | "TLDR" / "one-liner" | 20:1 or higher | Central thesis + single most important finding |
+ | "Key points" / "brief summary" | 10:1 | Thesis + top 3-5 claims with headline evidence |
+ | "Summary" (default) | 5:1 | Full argument map preserved; Tier 1 + select Tier 2 details |
+ | "Detailed summary" | 3:1 | All Tier 1 + Tier 2 details; methodology included |
+ | "Executive briefing" | 5:1 | Thesis + implications + recommended actions; omit methodology |
+
+ ## Output Formatting Best Practices
+
+ ### For Structured Output
+ - Lead with a one-sentence thesis summary
+ - Use bullet points for key findings (one bullet per claim)
+ - Group related findings under thematic headings
+ - End with implications or open questions
+
+ ### For Prose Output
+ - Open with the central finding or conclusion (inverted pyramid)
+ - Support with evidence in descending order of importance
+ - Close with caveats, limitations, or open questions
+ - Keep paragraphs to 3-4 sentences maximum
+
+ ### Source Traceability
+ - Annotate key claims with source references: [Source: section/paragraph]
+ - For multi-document summaries, attribute each claim to its source
+ - This enables the user to verify any claim against the original material
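The compression ratio table above maps directly to a word budget. A minimal illustrative sketch, not part of this package (the function name and the request-to-ratio mapping are hypothetical; the ratios mirror the table):

```python
# Target compression ratios from the guidelines table above.
TARGET_RATIOS = {
    "tldr": 20,
    "key points": 10,
    "summary": 5,
    "detailed summary": 3,
}

def target_summary_words(source_words: int, request: str) -> int:
    """Rough word budget for a summary, given the user's request type."""
    ratio = TARGET_RATIOS.get(request.lower(), 5)  # default "summary" -> 5:1
    return max(1, source_words // ratio)
```

For a 1,000-word article, a "TLDR" request budgets roughly 50 words, while the default "summary" budgets roughly 200.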
package/knowledge/domain.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ domain: summarizer
+ topic: discourse-structure-and-argument-mapping
+ priority: high
+ ttl: 30d
+ ---
+
+ # Content Summarization — Discourse Types, Argument Mapping & Information Density
+
+ ## Discourse Types
+
+ Understanding the discourse type of a source document is the foundation of effective summarization. Each type has a distinct internal structure that dictates where core information resides and how it should be compressed.
+
+ ### 1. Argumentative Discourse
+ - **Structure**: Thesis → Supporting claims → Evidence → Counter-arguments → Conclusion
+ - **Key signal words**: "therefore", "however", "because", "despite", "in contrast", "evidence suggests"
+ - **Summarization focus**: Preserve the thesis, the strongest supporting claims, key evidence (especially quantitative), and any concessions or counter-arguments acknowledged by the author
+ - **Compression target**: Thesis and top 2-3 supporting claims with their strongest evidence
+
+ ### 2. Expository Discourse
+ - **Structure**: Topic introduction → Categorical breakdown → Details per category → Synthesis
+ - **Key signal words**: "first", "second", "in addition", "for example", "such as", "specifically"
+ - **Summarization focus**: Preserve the categorical structure; for each category, retain the defining characteristic and one concrete example
+ - **Compression target**: Topic + category headings + key differentiator per category
+
+ ### 3. Narrative Discourse
+ - **Structure**: Setting → Rising action → Climax → Falling action → Resolution
+ - **Key signal words**: "then", "next", "suddenly", "as a result", "finally", "meanwhile"
+ - **Summarization focus**: Preserve the causal chain — what happened, why, and what resulted; retain key actors and turning points
+ - **Compression target**: Who + initiating event + key turning points + outcome
+
+ ### 4. Descriptive Discourse
+ - **Structure**: Overview → Feature-by-feature detail → Comparative context
+ - **Key signal words**: "characterized by", "consists of", "appears as", "resembles", "differs from"
+ - **Summarization focus**: Capture the entity's defining features and any comparative positioning; omit redundant descriptive detail
+ - **Compression target**: Entity + 3-5 distinguishing features + comparative positioning
+
+ ### 5. Procedural Discourse
+ - **Structure**: Goal statement → Prerequisites → Sequential steps → Verification
+ - **Key signal words**: "first", "then", "next", "ensure that", "before", "after", "warning"
+ - **Summarization focus**: Preserve the goal, critical prerequisites, step order, and any warnings or failure conditions; minor sub-steps can be collapsed
+ - **Compression target**: Goal + prerequisites + collapsed step sequence + critical warnings
+
+ ## Argument Mapping
+
+ Argument mapping is the process of identifying the logical structure within a document. This is critical for producing summaries that preserve the author's reasoning rather than just surface-level facts.
+
+ ### Claim-Evidence Hierarchy
+
+ ```
+ Central Thesis
+ ├── Claim 1 (supports thesis)
+ │   ├── Evidence 1a (data, statistic, citation)
+ │   ├── Evidence 1b (example, case study)
+ │   └── Qualifier (conditions under which claim holds)
+ ├── Claim 2 (supports thesis)
+ │   ├── Evidence 2a
+ │   └── Counter-evidence (acknowledged weakness)
+ ├── Counter-Argument (opposing view)
+ │   ├── Evidence for counter-argument
+ │   └── Author's rebuttal
+ └── Conclusion (restated thesis + implications)
+ ```
+
+ ### Identifying Claims vs. Evidence
+
+ | Type | Characteristics | Examples |
+ |------|----------------|---------|
+ | **Claim** | Assertive, debatable, evaluative | "Remote work increases productivity", "Policy X is ineffective" |
+ | **Evidence** | Factual, verifiable, specific | "A 2024 Stanford study found 13% productivity increase", "$2.3 billion in losses" |
+ | **Qualifier** | Conditional, limiting scope | "In knowledge-work sectors", "When implemented with proper tooling" |
+ | **Warrant** | Implicit assumption linking evidence to claim | "Because productivity correlates with employee satisfaction" |
+
+ ### Argument Strength Assessment
+
+ When multiple claims compete for inclusion in a summary, prioritize by argument strength:
+
+ 1. **Strong**: Claim + multiple independent evidence sources + acknowledged qualifiers
+ 2. **Moderate**: Claim + single evidence source, or claim + logical reasoning without empirical data
+ 3. **Weak**: Claim with no evidence, appeal to authority alone, or anecdotal support
+ 4. **Contested**: Claim with counter-evidence presented — must include both sides
+
+ ## Information Density Analysis
+
+ Not all sentences carry equal information value. Information density determines which content survives compression.
+
+ ### High-Density Signals (Preserve)
+ - **Quantitative data**: Percentages, dollar amounts, dates, counts, measurements
+   - Example: "Revenue grew 34% year-over-year to $4.2B in Q3 2024"
+ - **Proper nouns**: People, organizations, places, product names
+   - Example: "CEO Maria Chen announced the partnership with Anthropic"
+ - **Causal relationships**: Explicit cause-effect statements
+   - Example: "The tariff increase led to a 15% drop in imports within 6 months"
+ - **Novel claims**: Information the reader likely does not already know
+   - Example: "Contrary to prior research, the study found no correlation between X and Y"
+ - **Definitions**: First introduction of a key term or concept
+   - Example: "Retrieval-augmented generation (RAG) combines a retriever with a generator"
+
+ ### Low-Density Signals (Compress or Remove)
+ - **Redundant restatements**: Same idea expressed in different words
+ - **Background context**: Common knowledge or widely known facts
+ - **Throat-clearing phrases** (distinct from genuine hedging, which must be preserved): "It is important to note that", "One might argue"
+ - **Transitional padding**: "Having examined X, let us now turn to Y"
+ - **Excessive examples**: When one example suffices to illustrate a point, additional examples can be dropped
+
+ ### Density Scoring Heuristic
+
+ For each sentence, assign a density score (0-3):
+ - **3**: Contains quantitative data + causal claim + proper noun
+ - **2**: Contains a novel claim or key evidence
+ - **1**: Provides context or elaboration on a higher-density sentence
+ - **0**: Transitional, redundant, or common knowledge
+
+ Sentences scoring 2-3 are primary candidates for summary inclusion. Sentences scoring 1 are included only if the summary length permits. Sentences scoring 0 are excluded.
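The scoring heuristic above can be sketched in code. The following is an illustrative Python sketch, not part of this package: the regexes are crude stand-ins for real entity and causality detection, and the score-2 "novel claim or key evidence" signal is approximated here by the presence of a number or causal marker alone.

```python
import re

# Crude signal detectors; a real scorer would use NER and novelty detection.
NUM_RE = re.compile(r"\d")
CAUSAL_RE = re.compile(r"\b(led to|caused|because|resulted in|drove)\b", re.I)
PROPER_RE = re.compile(r"\b[A-Z][a-z]+")  # proper-noun proxy

def density_score(sentence: str) -> int:
    """Score a sentence 0-3 per the density heuristic above (sketch only)."""
    has_num = bool(NUM_RE.search(sentence))
    has_causal = bool(CAUSAL_RE.search(sentence))
    # Skip the first word so a sentence-initial capital doesn't count.
    rest = sentence.split(" ", 1)[1] if " " in sentence else ""
    has_proper = bool(PROPER_RE.search(rest))
    if has_num and has_causal and has_proper:
        return 3
    if has_num or has_causal:
        return 2
    if has_proper:
        return 1
    return 0
```

Sentences scoring 2 or higher would survive compression; transitional padding like "Let us now turn to the next topic" scores 0.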
package/manifest.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "name": "@botlearn/summarizer",
+   "version": "0.1.0",
+   "description": "Content summarization with core argument extraction, key detail retention, and accuracy self-checking for OpenClaw Agent",
+   "category": "content-processing",
+   "author": "BotLearn",
+   "benchmarkDimension": "content-understanding",
+   "expectedImprovement": 45,
+   "dependencies": {},
+   "compatibility": {
+     "openclaw": ">=0.5.0"
+   },
+   "files": {
+     "skill": "skill.md",
+     "knowledge": [
+       "knowledge/domain.md",
+       "knowledge/best-practices.md",
+       "knowledge/anti-patterns.md"
+     ],
+     "strategies": [
+       "strategies/main.md"
+     ],
+     "smokeTest": "tests/smoke.json",
+     "benchmark": "tests/benchmark.json"
+   }
+ }
package/package.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "name": "@botlearn/summarizer",
+   "version": "0.1.0",
+   "description": "Content summarization with core argument extraction, key detail retention, and accuracy self-checking for OpenClaw Agent",
+   "type": "module",
+   "main": "manifest.json",
+   "files": [
+     "manifest.json",
+     "skill.md",
+     "knowledge/",
+     "strategies/",
+     "tests/",
+     "README.md"
+   ],
+   "keywords": [
+     "botlearn",
+     "openclaw",
+     "skill",
+     "content-processing"
+   ],
+   "author": "BotLearn",
+   "license": "MIT",
+   "repository": {
+     "type": "git",
+     "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
+     "directory": "packages/skills/summarizer"
+   },
+   "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/summarizer",
+   "bugs": {
+     "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
+   },
+   "publishConfig": {
+     "access": "public"
+   }
+ }
package/skill.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ name: summarizer
+ role: Content Summarization Specialist
+ version: 1.0.0
+ triggers:
+   - "summarize"
+   - "summary"
+   - "key points"
+   - "TLDR"
+   - "digest"
+   - "main ideas"
+   - "boil down"
+ ---
+
+ # Role
+
+ You are a Content Summarization Specialist. When activated, you analyze documents, articles, and multi-source content to extract core arguments, retain critical details (numbers, dates, names, causal claims), and produce accurate, well-structured summaries calibrated to the user's desired length and depth.
+
+ # Capabilities
+
+ 1. Identify discourse structure (narrative, argumentative, expository, descriptive, procedural) and adapt extraction strategy accordingly
+ 2. Extract and map core arguments, distinguishing claims from evidence, and preserving the logical chain between them
+ 3. Prioritize details by information density — retain quantitative data, proper nouns, causal relationships, and novel insights while compressing redundant or low-signal passages
+ 4. Synthesize multi-document inputs into unified summaries, reconciling overlapping claims and surfacing contradictions
+ 5. Perform accuracy self-checks by verifying that every key claim in the summary traces back to a specific passage in the source material
+
+ # Constraints
+
+ 1. Never fabricate details — every fact, number, date, and name in the summary must appear in the source material
+ 2. Never omit quantitative data (percentages, dollar amounts, dates, counts) that support core arguments
+ 3. Never flatten nuance — if the source presents competing viewpoints, the summary must reflect that tension
+ 4. Always preserve attribution — if a claim is attributed to a specific person or organization, maintain that attribution
+ 5. Always state the summary's compression ratio and flag if the requested length risks losing critical information
+
+ # Activation
+
+ WHEN the user requests a summary, key points, TLDR, or digest of content:
+ 1. Determine target length and depth (brief / standard / detailed) from user cues
+ 2. Analyze document structure following strategies/main.md
+ 3. Apply discourse type recognition from knowledge/domain.md
+ 4. Extract arguments and details using knowledge/best-practices.md
+ 5. Verify against knowledge/anti-patterns.md to avoid common summarization errors
+ 6. Output a structured summary with source traceability annotations
@@ -0,0 +1,85 @@
1
+ ---
2
+ strategy: summarizer
3
+ version: 1.0.0
4
+ steps: 5
5
+ ---
6
+
7
+ # Content Summarization Strategy
8
+
9
+ ## Step 1: Structure Identification
10
+ - Read the full source material before beginning summarization — never summarize while reading for the first time
11
+ - Classify the discourse type: argumentative / expository / narrative / descriptive / procedural (see knowledge/domain.md)
12
+ - Identify the document's structural skeleton:
13
+ - IF argumentative THEN locate thesis, claims, evidence blocks, counter-arguments, conclusion
14
+ - IF expository THEN locate topic statement, category headings, detail sections, synthesis
15
+ - IF narrative THEN locate setting, key actors, initiating event, turning points, resolution
16
+ - IF descriptive THEN locate subject overview, feature enumerations, comparative sections
17
+ - IF procedural THEN locate goal statement, prerequisites, step sequence, warnings, verification
18
+ - Determine the total length and estimate a target compression ratio based on user intent:
19
+ - "TLDR" / "one-liner" → 20:1+
20
+ - "key points" / "brief" → 10:1
21
+ - "summary" (default) → 5:1
22
+ - "detailed summary" → 3:1
23
+ - IF the source is multi-document THEN note the number of documents and scan for thematic overlap before proceeding
24
+
25
+ ## Step 2: Argument Extraction
26
+ - Build a claim-evidence hierarchy using the argument mapping framework from knowledge/domain.md:
27
+ - Central thesis → supporting claims → evidence for each claim → qualifiers and counter-arguments
28
+ - For each claim identified, record:
29
+ - The claim statement
30
+ - The type and strength of its evidence (quantitative data, case study, expert opinion, logical reasoning)
31
+ - Any qualifiers or conditions limiting the claim's scope
32
+ - Any counter-evidence or objections acknowledged by the source
33
+ - Rank claims by argument strength:
34
+ - **Strong**: Multiple independent evidence sources + qualifiers acknowledged
35
+ - **Moderate**: Single evidence source or logical reasoning only
36
+ - **Weak**: Assertion without evidence or anecdotal support only
37
+ - **Contested**: Counter-evidence presented — both sides must appear in summary
38
+ - IF the source is multi-document THEN merge argument maps:
39
+ - Corroborated claims (multiple sources) → high confidence
40
+ - Contradictory claims → flag with attribution to each source
41
+ - Unique claims (single source) → note as "according to [source]"
42
+
43
+ ## Step 3: Detail Prioritization
44
+ - Score each extracted detail by information density (see knowledge/domain.md):
45
+ - **Score 3**: Quantitative data + causal claim + named entity (always include)
46
+ - **Score 2**: Novel claim or key evidence (include in standard+ summaries)
47
+ - **Score 1**: Contextual elaboration (include only in detailed summaries)
48
+ - **Score 0**: Transitional, redundant, or common knowledge (exclude)
49
+ - Apply the detail preservation tiers from knowledge/best-practices.md:
50
+ - **Tier 1 — Must Preserve**: Central thesis, quantitative results, key named entities, causal claims with evidence, contradictions and caveats
51
+ - **Tier 2 — Space Permitting**: Strongest example per claim, methodology details, secondary reinforcing claims, historical context
52
+ - **Tier 3 — Safe to Omit**: Repeated examples, author credentials, acknowledgments, detailed literature reviews
53
+ - Cross-check against anti-patterns from knowledge/anti-patterns.md:
54
+ - Am I dropping any numbers or dates? → Restore them
55
+ - Am I stripping proper nouns? → Restore attribution
56
+ - Am I over-generalizing specific lists? → Name the top items explicitly
57
+ - Am I showing position bias? → Verify coverage spans the full document
58
+
59
+ ## Step 4: Synthesis
60
+ - Construct the summary using the hybrid approach from knowledge/best-practices.md:
61
+ - **Abstractive**: Use for narrative flow, transitions, and structural framing
62
+ - **Extractive**: Use for quantitative data, direct quotes, and precise technical claims (mark with source references)
63
+ - Structure the output based on user intent:
64
+ - IF "TLDR" THEN one sentence: thesis + most important finding
65
+ - IF "key points" THEN bullet list: thesis + top 3-5 claims with headline evidence
66
+ - IF "summary" THEN structured prose: full argument skeleton with Tier 1 details
67
+ - IF "detailed summary" THEN section-by-section: all Tier 1 + Tier 2 details with source annotations
68
+ - IF multi-document THEN organize by theme, not by source
69
+ - Apply formatting best practices:
70
+ - Lead with the central conclusion (inverted pyramid)
71
+ - One idea per bullet point or paragraph
72
+ - Use headings or numbered lists for multi-part summaries
73
+ - Add source traceability markers for key claims
74
+
75
+ ## Step 5: Accuracy Self-Check
76
+ - Perform a systematic verification pass before outputting the summary:
77
+ 1. **Fact check**: For every number, date, percentage, and proper noun in the summary, verify it appears in the source material — if it does not, remove or correct it
78
+ 2. **Claim fidelity**: For every causal or evaluative statement, verify the source makes the same claim at the same strength — if the summary says "causes" but the source says "correlates with", downgrade the language
79
+ 3. **Attribution check**: For every attributed claim ("X said", "according to Y"), verify the attribution is correct and not confused with another entity
80
+ 4. **Completeness check**: Review the argument map — is any strong or contested claim missing from the summary? If yes and space permits, add it
81
+ 5. **Position bias check**: Does the summary draw from all sections of the source, or only the beginning? If the final third is unrepresented, reassess
82
+ 6. **Anti-pattern scan**: Run through the 12 anti-patterns in knowledge/anti-patterns.md — does the summary violate any?
83
+ - IF any check fails THEN revise the summary and re-run the verification
84
+ - State the compression ratio in the output: "[Original: ~X words → Summary: ~Y words, compression ratio Z:1]"
85
+ - IF the target compression risks losing Tier 1 details THEN warn the user: "Note: The requested length requires omitting [specific detail]. Expand to [length] for full coverage."
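The compression-ratio marker in Step 5 can be produced by a short helper. A minimal sketch, assuming whitespace-delimited word counts and one-decimal rounding (neither is mandated by the skill):

```python
def compression_note(original: str, summary: str) -> str:
    """Render the '[Original: ~X words -> Summary: ~Y words, ...]' marker."""
    x = len(original.split())
    y = max(len(summary.split()), 1)  # guard against an empty summary
    ratio = round(x / y, 1)
    return f"[Original: ~{x} words → Summary: ~{y} words, compression ratio {ratio}:1]"
```

For example, a 1,000-word article compressed to a 100-word summary reports a 10.0:1 ratio.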
@@ -0,0 +1,476 @@
1
+ {
2
+ "version": "0.0.1",
3
+ "dimension": "content-understanding",
4
+ "tasks": [
5
+ {
6
+ "id": "bench-easy-01",
7
+ "difficulty": "easy",
8
+ "description": "Summarize a short factual news article with key statistics",
9
+ "input": "Summarize the following news article:\n\n---\nGlobal Renewable Energy Hits Record in 2024\n\nThe International Energy Agency (IEA) reported on January 15, 2025 that global renewable energy capacity additions reached 507 gigawatts (GW) in 2024, a 25% increase from the 405 GW added in 2023. Solar photovoltaic (PV) accounted for 75% of new additions at 380 GW, while wind power contributed 117 GW.\n\nChina led installations with 292 GW (57% of global total), followed by the United States at 52 GW, India at 29 GW, and Germany at 18 GW. The IEA's Executive Director Fatih Birol stated that renewables are now 'the cheapest source of new electricity in virtually every country.'\n\nDespite the record additions, the IEA warned that the pace must accelerate to 1,000 GW annually by 2030 to meet Paris Agreement targets. Grid infrastructure remains a bottleneck, with approximately 1,500 GW of renewable projects globally awaiting grid connection.\n---",
10
+ "rubric": [
11
+ {
12
+ "criterion": "Key Fact Retention",
13
+ "weight": 0.4,
14
+ "scoring": {
15
+ "5": "Retains: 507 GW total, 25% increase, solar 380 GW/wind 117 GW split, China 292 GW, 1,000 GW 2030 target, 1,500 GW backlog",
16
+ "3": "Retains total and top 2-3 statistics but misses some",
17
+ "1": "Only retains 1-2 numbers",
18
+ "0": "No quantitative data preserved"
19
+ }
20
+ },
21
+ {
22
+ "criterion": "Structure Accuracy",
23
+ "weight": 0.3,
24
+ "scoring": {
25
+ "5": "Captures the three-part structure: record achievement, country breakdown, and gap/warning",
26
+ "3": "Captures achievement and one other aspect",
27
+ "1": "Only captures the headline finding",
28
+ "0": "Misrepresents the article structure"
29
+ }
30
+ },
31
+ {
32
+ "criterion": "Conciseness",
33
+ "weight": 0.3,
34
+ "scoring": {
35
+ "5": "Summary is 40-80 words, well-compressed, no filler",
36
+ "3": "Summary is reasonable length but contains some unnecessary detail",
37
+ "1": "Summary is nearly as long as the original or extremely truncated",
38
+ "0": "No meaningful compression"
39
+ }
40
+ }
41
+ ],
42
+ "expectedScoreWithout": 45,
43
+ "expectedScoreWith": 85
44
+ },
45
+ {
46
+ "id": "bench-easy-02",
47
+ "difficulty": "easy",
48
+ "description": "Extract key points from a procedural/instructional text",
49
+ "input": "Give me the key points from this guide:\n\n---\nSetting Up a Production Kubernetes Cluster: Essential Steps\n\n1. Capacity Planning: Before provisioning, calculate expected workload. A typical starting configuration for a medium SaaS application is 3 control plane nodes (4 vCPU, 16 GB RAM each) and 5-10 worker nodes (8 vCPU, 32 GB RAM each). Over-provisioning by 30% is recommended for burst capacity.\n\n2. Network Configuration: Use a CNI plugin (Calico or Cilium recommended). Ensure pod CIDR does not overlap with existing VPC ranges. Allocate at least a /16 for pod networking (65,536 addresses). Enable network policies from day one — retrofitting is significantly harder.\n\n3. Security Hardening: Enable RBAC (mandatory since Kubernetes 1.6). Disable anonymous authentication. Use Pod Security Admission (replacing PodSecurityPolicy since v1.25). Rotate certificates every 90 days. Store secrets in an external manager (HashiCorp Vault or AWS Secrets Manager), not in etcd.\n\n4. Monitoring Stack: Deploy Prometheus + Grafana for metrics, Loki for logs, and Jaeger for distributed tracing. Set up alerts for: node CPU > 80%, pod restart count > 3 in 5 minutes, and persistent volume usage > 85%.\n\n5. Disaster Recovery: Back up etcd every 6 hours using etcdctl snapshot. Test restoration quarterly. Maintain infrastructure-as-code (Terraform or Pulumi) so the entire cluster can be recreated from scratch in under 2 hours. Document the runbook.\n---",
50
+ "rubric": [
51
+ {
52
+ "criterion": "Completeness",
53
+ "weight": 0.4,
54
+ "scoring": {
55
+ "5": "All 5 sections represented with their most critical actionable detail from each",
56
+ "3": "4 of 5 sections represented or some sections lack specifics",
57
+ "1": "Only 2-3 sections captured",
58
+ "0": "Major sections missing"
59
+ }
60
+ },
61
+ {
62
+ "criterion": "Technical Precision",
63
+ "weight": 0.3,
64
+ "scoring": {
65
+ "5": "Preserves specific technical details: node specs, CIDR range, tool names (Calico/Cilium, Vault), alert thresholds, backup intervals",
66
+ "3": "Preserves tool names but loses specific configurations",
67
+ "1": "Generic summary without technical specifics",
68
+ "0": "Technical details incorrect or fabricated"
69
+ }
70
+ },
71
+ {
72
+ "criterion": "Format Appropriateness",
73
+ "weight": 0.3,
74
+ "scoring": {
75
+ "5": "Uses bullet or numbered list matching the procedural source; each point is concise and actionable",
76
+ "3": "Structured but some points are verbose or unclear",
77
+ "1": "Prose format that buries the procedural structure",
78
+ "0": "Unstructured output"
79
+ }
80
+ }
81
+ ],
82
+ "expectedScoreWithout": 40,
83
+ "expectedScoreWith": 82
84
+ },
85
+ {
86
+ "id": "bench-easy-03",
87
+ "difficulty": "easy",
88
+ "description": "TLDR-style ultra-brief summary of a product announcement",
89
+ "input": "TLDR this announcement:\n\n---\nAnthropic Launches Claude 4 with Expanded Capabilities\n\nSan Francisco, March 15, 2025 — Anthropic today announced Claude 4, the latest version of its AI assistant. The new model features a 500,000-token context window (up from 200,000 in Claude 3.5), improved reasoning capabilities scoring 92% on the GPQA benchmark (up from 65%), and native multimodal support for images, PDFs, and audio.\n\nKey pricing: Claude 4 Opus is available at $15 per million input tokens and $75 per million output tokens through the API. A new 'Projects' feature allows enterprise users to create persistent knowledge bases that Claude can reference across conversations.\n\nThe model was trained using Anthropic's Constitutional AI 2.0 framework, which the company claims reduces harmful outputs by 60% compared to the previous version while maintaining helpfulness scores. Claude 4 is available immediately through the API and web interface, with mobile apps launching in April 2025.\n---",
90
+ "rubric": [
91
+ {
92
+ "criterion": "Brevity",
93
+ "weight": 0.35,
94
+ "scoring": {
95
+ "5": "Summary is 1-2 sentences (under 40 words) capturing the essential announcement",
96
+ "3": "Summary is 3-4 sentences, slightly verbose for a TLDR",
97
+ "1": "Summary exceeds 5 sentences",
98
+ "0": "No meaningful compression"
99
+ }
100
+ },
101
+ {
102
+ "criterion": "Essential Information",
103
+ "weight": 0.4,
104
+ "scoring": {
105
+ "5": "Captures: what (Claude 4), who (Anthropic), and at least 2 key improvements (context window, GPQA score, multimodal, or pricing)",
106
+ "3": "Captures product and company but only 1 specific improvement",
107
+ "1": "Generic statement without specifics",
108
+ "0": "Misidentifies the product or company"
109
+ }
110
+ },
111
+ {
112
+ "criterion": "Accuracy",
113
+ "weight": 0.25,
114
+ "scoring": {
115
+ "5": "All stated facts match the source exactly — no number errors or misattributions",
116
+ "3": "Minor imprecision (e.g., rounded numbers that were already round)",
117
+ "1": "Contains a factual error",
118
+ "0": "Multiple factual errors or fabricated details"
119
+ }
120
+ }
121
+ ],
122
+ "expectedScoreWithout": 45,
123
+ "expectedScoreWith": 88
124
+ },
125
+ {
126
+ "id": "bench-med-01",
127
+ "difficulty": "medium",
128
+ "description": "Summarize an argumentative text with competing viewpoints and nuance",
129
+ "input": "Summarize the following policy analysis, preserving the competing viewpoints:\n\n---\nThe Universal Basic Income Debate: Evidence from Recent Pilots\n\nThe concept of Universal Basic Income (UBI) has moved from theoretical discussion to empirical testing, with over 50 pilot programs conducted worldwide since 2017. The evidence is both encouraging and cautionary, with results varying significantly based on program design, cultural context, and measurement methodology.\n\nThe strongest evidence for UBI comes from Finland's 2017-2018 experiment, which provided 2,000 unemployed citizens with EUR 560/month unconditionally. The final report, published by Kela in 2020, found that recipients experienced a 0.5 percentage point increase in employment compared to the control group, a statistically significant but modest effect. More notably, recipients reported substantially higher life satisfaction (7.3 vs 6.8 on a 10-point scale) and lower stress levels, with a 37% reduction in self-reported depression symptoms.\n\nStockton, California's SEED program (2019-2021) provided 125 residents with $500/month. UC Berkeley researchers found that full-time employment among recipients rose from 28% to 40%, contradicting the hypothesis that cash transfers reduce work incentive. However, critics note the small sample size and the presence of a Hawthorne effect.\n\nSkeptical perspectives come from several directions. Economist Gregory Mankiw argues that UBI at a meaningful level ($1,000/month per adult in the US) would cost approximately $3.1 trillion annually — roughly 75% of the current federal budget — making it fiscally unsustainable without massive tax increases or program consolidation. The Congressional Budget Office estimated that funding UBI through income tax alone would require a flat rate increase of 39 percentage points.\n\nA different criticism comes from labor economists like Daron Acemoglu, who contends that UBI addresses the symptom (income insufficiency) rather than the cause (labor market disruption from automation). Acemoglu advocates instead for wage subsidies and retraining programs that maintain the dignity and social integration benefits of employment.\n\nThe Kenya GiveDirectly study, the largest randomized controlled trial of basic income (20,000 recipients, 12-year program), has published interim results showing a 27% increase in household assets and a 13% increase in small business revenues among recipients. However, the study also found that recipients in villages where only some people received transfers experienced increased social tension, raising questions about targeting versus universality.\n\nProponents like economist Guy Standing argue that UBI should not be evaluated solely as an employment intervention but as a freedom-enhancing policy that gives individuals bargaining power and the ability to refuse exploitative work. Standing points to India's Madhya Pradesh pilot, where UBI recipients, particularly women, showed increased likelihood of starting small businesses (32% vs 19% in control) and girls' school attendance increased by 25%.\n\nThe emerging view among policy researchers is that UBI works best as a complement to, not a replacement for, existing social safety nets — a position articulated by Stanford's Basic Income Lab. The optimal design appears to involve smaller, unconditional payments ($300-500/month) paired with maintained access to healthcare, housing assistance, and education support.\n---",
130
+ "rubric": [
131
+ {
132
+ "criterion": "Viewpoint Balance",
133
+ "weight": 0.3,
134
+ "scoring": {
135
+ "5": "Presents all major perspectives: pro-UBI evidence (Finland, Stockton, Kenya, India), fiscal criticism (Mankiw/CBO), structural criticism (Acemoglu), and the emerging synthesis position — without favoring any",
136
+ "3": "Covers pro and con but misses 1-2 perspectives or shows implicit bias",
137
+ "1": "Presents mostly one side, mentions opposition superficially",
138
+ "0": "One-sided summary or misrepresents positions"
139
+ }
140
+ },
141
+ {
142
+ "criterion": "Quantitative Fidelity",
143
+ "weight": 0.25,
144
+ "scoring": {
145
+ "5": "Retains key numbers from both sides: EUR 560/month, employment effect (0.5pp or 28%→40%), $3.1T cost, 39pp tax increase, 27% asset increase, 32% vs 19% business starts",
146
+ "3": "Retains 4-5 key statistics with correct attribution",
147
+ "1": "Retains 1-2 numbers or uses vague quantifiers",
148
+ "0": "No numbers preserved or numbers are incorrect"
149
+ }
150
+ },
151
+ {
152
+ "criterion": "Nuance Preservation",
153
+ "weight": 0.25,
154
+ "scoring": {
155
+ "5": "Captures qualifications: Finland effect was 'modest', Stockton had small sample/Hawthorne, Kenya showed social tension, complement-not-replacement conclusion",
156
+ "3": "Captures 2-3 qualifications",
157
+ "1": "Presents findings as unqualified",
158
+ "0": "Strips all nuance, presents black-and-white conclusions"
159
+ }
160
+ },
161
+ {
162
+ "criterion": "Structural Coherence",
163
+ "weight": 0.2,
164
+ "scoring": {
165
+ "5": "Organized thematically (evidence → criticisms → synthesis) with clear flow; appropriate length (150-250 words)",
166
+ "3": "Logical but some transitions are weak or organization is source-by-source rather than thematic",
167
+ "1": "Disjointed list of facts without narrative structure",
168
+ "0": "Incoherent structure"
169
+ }
170
+ }
171
+ ],
172
+ "expectedScoreWithout": 35,
173
+ "expectedScoreWith": 80
174
+ },
175
+ {
176
+ "id": "bench-med-02",
177
+ "difficulty": "medium",
178
+ "description": "Summarize a technical document preserving methodology and causal reasoning",
179
+ "input": "Summarize this research findings section, preserving the methodology and causal chain:\n\n---\nEffects of Sleep Deprivation on Cognitive Performance: A Meta-Analysis\n\nMethodology: We conducted a systematic review and meta-analysis of 147 studies (total N = 12,438 participants) published between 2010 and 2024, examining the relationship between sleep deprivation (defined as <6 hours of sleep per night for >=3 consecutive nights) and cognitive performance across five domains. Studies were identified through PubMed, PsycINFO, and Cochrane Library databases. Random-effects models were used to pool effect sizes, and heterogeneity was assessed using the I-squared statistic.\n\nFindings:\n\n1. Attention and Vigilance: The largest effect was observed in sustained attention tasks, with a pooled Cohen's d of -1.42 (95% CI: -1.58 to -1.26, I² = 34%). Psychomotor vigilance test (PVT) lapses increased by 287% after 3 nights of restricted sleep (<6h). Recovery required 2 nights of unrestricted sleep (mean 8.2 hours) to return to baseline.\n\n2. Working Memory: Moderate impairment was observed (d = -0.86, 95% CI: -1.01 to -0.71, I² = 42%). N-back task performance declined linearly: 2-back accuracy dropped from 89% to 67% after 5 nights of restriction. Notably, subjective awareness of impairment plateaued after night 3, while objective performance continued to decline — a phenomenon termed 'sleep debt denial.'\n\n3. Executive Function: Significant but variable effects (d = -0.73, 95% CI: -0.95 to -0.51, I² = 67%). High heterogeneity was driven by task type: planning tasks (Tower of London) showed d = -1.11, while cognitive flexibility tasks (Wisconsin Card Sort) showed only d = -0.38. This suggests that sleep deprivation disproportionately affects top-down, effortful processing.\n\n4. Emotional Regulation: Large effects on emotional processing (d = -1.18, 95% CI: -1.39 to -0.97, I² = 29%). Amygdala reactivity to negative stimuli increased by 60% (fMRI data from 23 studies), while prefrontal cortex regulatory activity decreased by 33%. This neural mechanism explains the consistent finding of increased irritability and impulsive decision-making.\n\n5. Long-term Memory Consolidation: Moderate effects (d = -0.91, 95% CI: -1.08 to -0.74, I² = 38%). Declarative memory (fact recall) was impaired by 23%, while procedural memory (motor skill learning) showed only 8% impairment. Sleep spindle density during NREM Stage 2, reduced by 41% under sleep restriction, correlated with declarative memory decline (r = 0.67, p < 0.001).\n\nDose-Response Relationship: Performance decline followed a nonlinear curve. The critical threshold was identified at 6.5 hours: participants sleeping 6-6.5 hours showed minimal impairment (d = -0.15), while those sleeping 5.5-6 hours showed moderate impairment (d = -0.62), and those below 5 hours showed severe impairment (d = -1.34). This nonlinear relationship has implications for public health guidelines.\n\nLimitations: 78% of studies used young adult samples (age 18-30), limiting generalizability. Only 12 studies examined chronic restriction beyond 7 days. Publication bias was detected (Egger's test p = 0.03), suggesting the true effects may be slightly smaller.\n---",
180
+ "rubric": [
181
+ {
182
+ "criterion": "Methodology Preservation",
183
+ "weight": 0.25,
184
+ "scoring": {
185
+ "5": "Retains: 147 studies, N=12,438, sleep deprivation definition (<6h for >=3 nights), 5 cognitive domains, meta-analytic approach",
186
+ "3": "Mentions it is a meta-analysis with study count but loses definition or scope details",
187
+ "1": "Vaguely mentions 'research' without methodology",
188
+ "0": "No methodology mentioned"
189
+ }
190
+ },
191
+ {
192
+ "criterion": "Finding Hierarchy",
193
+ "weight": 0.3,
194
+ "scoring": {
195
+ "5": "Preserves the ranked structure: attention most affected (d=-1.42), emotional regulation (d=-1.18), long-term memory (d=-0.91), working memory (d=-0.86), executive function (d=-0.73); includes at least 3 effect sizes",
196
+ "3": "Captures 3-4 domains with some effect sizes but loses ranking",
197
+ "1": "Lists domains without quantitative comparison",
198
+ "0": "Domains not differentiated"
199
+ }
200
+ },
201
+ {
202
+ "criterion": "Causal Mechanisms",
203
+ "weight": 0.25,
204
+ "scoring": {
205
+ "5": "Preserves key causal explanations: amygdala reactivity increase explaining emotional effects, sleep spindle reduction explaining memory decline, 'sleep debt denial' phenomenon, nonlinear dose-response with 6.5h threshold",
206
+ "3": "Captures 2 causal mechanisms",
207
+ "1": "Only reports effects without mechanisms",
208
+ "0": "No causal reasoning preserved"
209
+ }
210
+ },
211
+ {
212
+ "criterion": "Limitations Inclusion",
213
+ "weight": 0.2,
214
+ "scoring": {
215
+ "5": "Notes at least 2 of: young adult sample bias, limited chronic studies, publication bias",
216
+ "3": "Mentions limitations exist but lacks specifics",
217
+ "1": "No limitations mentioned",
218
+ "0": "Implies findings are unqualified/universal"
219
+ }
220
+ }
221
+ ],
222
+ "expectedScoreWithout": 35,
223
+ "expectedScoreWith": 82
224
+ },
225
+ {
226
+ "id": "bench-med-03",
227
+ "difficulty": "medium",
228
+ "description": "Multi-document summarization requiring claim reconciliation",
229
+ "input": "Synthesize the following three source excerpts into a unified summary:\n\nSource A (The Economist, January 2025):\n\"India's GDP growth slowed to 5.4% in Q3 2024, down from 6.7% in Q2, marking the weakest quarterly performance in two years. Manufacturing output contracted for the first time since 2020, falling 2.2% year-over-year. The Reserve Bank of India (RBI) maintained its repo rate at 6.5%, resisting calls for rate cuts despite the slowdown, citing persistent food inflation at 9.6%.\"\n\nSource B (IMF World Economic Outlook Update, January 2025):\n\"India remains the fastest-growing major economy, with projected full-year 2024 GDP growth of 6.5%, revised down from the October forecast of 7.0%. The medium-term outlook is robust, driven by demographic dividends, digitization, and infrastructure investment (National Infrastructure Pipeline: $1.4 trillion through 2027). However, the IMF flagged risks from elevated household debt (rising from 32% to 40% of GDP between 2019-2024) and a widening current account deficit.\"\n\nSource C (Reserve Bank of India Monetary Policy Statement, December 2024):\n\"The Indian economy continues to demonstrate resilience. While Q3 growth moderated due to temporary factors including monsoon disruptions and inventory adjustments in manufacturing, the RBI projects H2 FY2025 growth at 6.7-7.0%, supported by robust services sector performance and recovering rural demand. Food inflation, currently at 9.6%, is expected to moderate to 5.2% by Q1 FY2026 as supply-side pressures ease.\"\n\n---\nProvide a unified summary that reconciles these sources.",
230
+ "rubric": [
231
+ {
232
+ "criterion": "Claim Reconciliation",
233
+ "weight": 0.35,
234
+ "scoring": {
235
+ "5": "Identifies corroborated facts (5.4% Q3, 9.6% food inflation), contrasting framings (Economist: 'weakest', RBI: 'temporary'), and unique claims per source (IMF: household debt, RBI: H2 recovery projection)",
236
+ "3": "Reconciles some claims but misses key contradictions or agreements",
237
+ "1": "Presents sources separately without reconciliation",
238
+ "0": "Confuses or conflates source claims"
239
+ }
240
+ },
241
+ {
242
+ "criterion": "Source Attribution",
243
+ "weight": 0.25,
244
+ "scoring": {
245
+ "5": "Every claim correctly attributed to its source; framing differences noted with attribution (e.g., 'The Economist characterizes this as... while the RBI frames it as...')",
246
+ "3": "Most claims attributed but 1-2 unattributed or misattributed",
247
+ "1": "Generic 'sources say' without specific attribution",
248
+ "0": "No attribution or incorrect attribution"
249
+ }
250
+ },
251
+ {
252
+ "criterion": "Data Accuracy",
253
+ "weight": 0.2,
254
+ "scoring": {
255
+ "5": "Key numbers preserved accurately: 5.4% Q3, 6.5% full-year, -2.2% manufacturing, 6.5% repo rate, $1.4T infrastructure, 40% household debt",
256
+ "3": "4-5 key numbers correct",
257
+ "1": "1-2 numbers or numbers with errors",
258
+ "0": "Numbers fabricated or systematically wrong"
259
+ }
260
+ },
261
+ {
262
+ "criterion": "Thematic Organization",
263
+ "weight": 0.2,
264
+ "scoring": {
265
+ "5": "Organized by theme (current performance, outlook, risks) rather than by source; flows as a coherent narrative",
266
+ "3": "Partially thematic but some source-by-source structure remains",
267
+ "1": "Purely source-by-source with no integration",
268
+ "0": "Disorganized"
269
+ }
270
+ }
271
+ ],
272
+ "expectedScoreWithout": 30,
273
+ "expectedScoreWith": 78
274
+ },
275
+ {
276
+ "id": "bench-med-04",
277
+ "difficulty": "medium",
278
+ "description": "Summarize a narrative text preserving causal chain and key actors",
279
+ "input": "Summarize this case study, preserving the causal chain and key decisions:\n\n---\nThe Collapse of FTX: A Timeline of Decisions\n\nFTX, founded by Sam Bankman-Fried (SBF) in 2019, grew to become the world's third-largest cryptocurrency exchange with a peak valuation of $32 billion in January 2022. Its collapse in November 2022 represents one of the fastest destructions of value in financial history.\n\nThe seeds of collapse were planted in the organizational structure itself. Alameda Research, SBF's quantitative trading firm, maintained an undisclosed special relationship with FTX. Internal documents revealed that Alameda held a $65 billion line of credit from FTX — funded by customer deposits — with no collateral requirements, no independent risk oversight, and no board approval.\n\nThe trigger event occurred on November 2, 2022, when CoinDesk published a leaked Alameda balance sheet showing that $5.8 billion of its $14.6 billion in assets consisted of FTT tokens — FTX's own exchange token. This circular dependency meant that Alameda's solvency was contingent on the market price of a token controlled by its sister company.\n\nBinance CEO Changpeng Zhao (CZ) announced on November 6 that Binance would liquidate its $580 million FTT holdings, triggering a bank run. Within 72 hours, FTX customers attempted to withdraw $6 billion. FTX halted withdrawals on November 8 when it became clear that approximately $8 billion in customer funds had been transferred to Alameda and could not be returned.\n\nBinance signed a non-binding letter of intent to acquire FTX on November 8, but withdrew within 24 hours after due diligence revealed the scale of the shortfall. FTX filed for Chapter 11 bankruptcy on November 11, 2022. SBF was arrested in the Bahamas on December 12, 2022, and was convicted on all seven federal charges — including wire fraud and money laundering — on November 2, 2023. He was sentenced to 25 years in prison on March 28, 2024.\n\nThe bankruptcy proceedings, led by restructuring specialist John Ray III (who previously oversaw the Enron liquidation), recovered approximately $14.5 billion in assets — enough to repay all customers at the time of filing with interest. Ray described the corporate governance at FTX as the worst he had seen in his 40-year career.\n---",
280
+ "rubric": [
281
+ {
282
+ "criterion": "Causal Chain Preservation",
283
+ "weight": 0.35,
284
+ "scoring": {
285
+ "5": "Preserves the full chain: Alameda-FTX relationship (undisclosed credit line) → CoinDesk leak (FTT concentration) → CZ liquidation announcement → bank run → withdrawal halt → bankruptcy → conviction. Each link causally connected to the next",
286
+ "3": "Captures 4-5 links but some causal connections are implied rather than stated",
287
+ "1": "Lists events without causal linkage",
288
+ "0": "Causal chain broken or fabricated"
289
+ }
290
+ },
291
+ {
292
+ "criterion": "Key Figure Retention",
293
+ "weight": 0.2,
294
+ "scoring": {
295
+ "5": "Names SBF, CZ, and John Ray III with their roles; correctly attributes actions to each",
296
+ "3": "Names SBF and one other; actions mostly attributed correctly",
297
+ "1": "Only SBF named; others generic",
298
+ "0": "Key actors omitted or confused"
299
+ }
300
+ },
301
+ {
302
+ "criterion": "Critical Numbers",
303
+ "weight": 0.25,
304
+ "scoring": {
305
+ "5": "Retains: $32B valuation, $65B credit line, $5.8B FTT exposure, $6B withdrawal attempt, $8B shortfall, 25-year sentence, $14.5B recovery",
306
+ "3": "Retains 4-5 of these numbers",
307
+ "1": "Retains 1-2 numbers",
308
+ "0": "No numbers or numbers are wrong"
309
+ }
310
+ },
311
+ {
312
+ "criterion": "Compression Quality",
313
+ "weight": 0.2,
314
+ "scoring": {
315
+ "5": "Summary is 100-180 words, chronologically coherent, no filler",
316
+ "3": "Reasonable length but some chronological confusion or filler",
317
+ "1": "Too long (>250 words) or too short (<50 words) losing essential content",
318
+ "0": "Poor compression"
319
+ }
320
+ }
321
+ ],
322
+ "expectedScoreWithout": 40,
323
+ "expectedScoreWith": 85
324
+ },
325
+ {
326
+ "id": "bench-hard-01",
327
+ "difficulty": "hard",
328
+ "description": "Summarize a dense technical paper with complex argumentation and multiple competing frameworks",
329
+ "input": "Summarize the following research paper excerpt, preserving the methodological debate and the authors' contribution:\n\n---\nRethinking Attention: Sparse Mixture-of-Experts vs. Dense Transformers for Long-Context Understanding\n\nAbstract: We present SparseCtx, a novel architecture that replaces dense self-attention with a learned sparse routing mechanism for long-context tasks. On documents exceeding 100K tokens, SparseCtx achieves 94.2% of dense transformer accuracy while using only 18% of the compute (measured in FLOPs). We evaluate on five benchmarks: NarrativeQA, QuALITY, ScrollsQA, RULER-128K, and a new proprietary legal document analysis task.\n\nRelated Work and Methodological Debate: The efficient attention literature has bifurcated into two camps. Linear attention methods (Katharopoulos et al., 2020; Choromanski et al., 2021) replace softmax attention with kernel approximations, achieving O(n) complexity but sacrificing up to 15% accuracy on retrieval-intensive tasks (Tay et al., 2022). Sparse attention methods (Beltagy et al., 2020; Zaheer et al., 2020; Ainslie et al., 2023) maintain exact attention over selected token subsets, preserving accuracy but requiring careful selection heuristics that may miss critical long-range dependencies.\n\nOur work departs from both camps. Rather than approximating or subsetting attention, SparseCtx learns a routing function g(x) that maps each query token to k=32 expert attention heads (from a pool of 256), where each expert specializes in different dependency patterns: local context (window ≤512 tokens), medium-range (512-8K), long-range (8K-64K), and global (full context). The routing function is trained end-to-end with a load-balancing auxiliary loss (coefficient λ=0.01) to prevent expert collapse.\n\nKey Results:\n- NarrativeQA (F1): SparseCtx 71.8 vs. Dense 73.2 vs. Longformer 64.1 vs. Linear Attention 62.7\n- RULER-128K (accuracy): SparseCtx 89.4 vs. Dense 91.1 vs. Longformer 72.3 vs. Linear Attention 68.9\n- Inference latency at 128K tokens: SparseCtx 2.3s vs. Dense 12.8s vs. Longformer 4.1s\n- Training cost (GPU-hours for equivalent performance): SparseCtx 3,200 vs. Dense 18,400\n\nCritical Limitation: We observe a 'routing collapse' phenomenon in 3 of 20 training runs where the router degenerates to always selecting the same 4-5 experts regardless of input. While the load-balancing loss mitigates this, it does not eliminate it. We hypothesize this is an optimization landscape issue analogous to mode collapse in GANs and propose but do not validate a curriculum-based routing warmup strategy.\n---",
330
+ "rubric": [
331
+ {
332
+ "criterion": "Technical Accuracy",
333
+ "weight": 0.3,
334
+ "scoring": {
335
+ "5": "Correctly explains: sparse routing mechanism, the two existing camps (linear attention vs sparse attention) and their tradeoffs, how SparseCtx differs (learned routing to expert heads), and the routing collapse limitation",
336
+ "3": "Captures the main contribution but loses the methodological debate or specific mechanism details",
337
+ "1": "Vaguely describes it as an 'efficient attention method' without differentiation",
338
+ "0": "Technical content incorrect or fabricated"
339
+ }
340
+ },
341
+ {
342
+ "criterion": "Quantitative Results Fidelity",
343
+ "weight": 0.25,
344
+ "scoring": {
345
+ "5": "Preserves benchmark comparisons showing SparseCtx's position relative to dense and existing efficient methods; includes both accuracy and efficiency metrics (94.2% accuracy at 18% compute, latency comparison)",
346
+ "3": "Preserves some benchmark numbers but loses the comparative context",
347
+ "1": "Vague performance claims without numbers",
348
+ "0": "Numbers wrong or fabricated"
349
+ }
350
+ },
351
+ {
352
+ "criterion": "Contribution Clarity",
353
+ "weight": 0.25,
354
+ "scoring": {
355
+ "5": "Clearly articulates the paper's unique contribution: learned routing to specialized expert attention heads as a third approach beyond linear approximation and sparse subsetting, achieving near-dense accuracy at fraction of compute",
356
+ "3": "Describes what the paper does but doesn't clearly position it against alternatives",
357
+ "1": "Contribution unclear or lost in technical details",
358
+ "0": "Contribution misrepresented"
359
+ }
360
+ },
361
+ {
362
+ "criterion": "Limitation Honesty",
363
+ "weight": 0.2,
364
+ "scoring": {
365
+ "5": "Includes routing collapse issue (3/20 runs), notes it is not fully solved, mentions proposed but unvalidated mitigation",
366
+ "3": "Mentions limitations exist but lacks specifics",
367
+ "1": "No limitations mentioned, implying the approach is problem-free",
368
+ "0": "Ignores or obscures limitations"
369
+ }
370
+ }
371
+ ],
372
+ "expectedScoreWithout": 30,
373
+ "expectedScoreWith": 78
374
+ },
375
+ {
+ "id": "bench-hard-02",
+ "difficulty": "hard",
+ "description": "Summarize a long text with position bias trap — critical information buried in the middle and end",
+ "input": "Summarize the following report. Important: the most decision-relevant information is in sections 3 and 4, not the introduction.\n\n---\nQ4 2024 Market Analysis: Enterprise SaaS Sector\n\nSection 1 — Market Overview:\nThe global enterprise SaaS market reached $232 billion in 2024, growing 18% year-over-year. North America accounted for 52% of revenue ($120.6B), followed by Europe at 26% ($60.3B) and Asia-Pacific at 16% ($37.1B). Total venture capital investment in SaaS declined for the third consecutive year, falling to $41.2 billion from $52.8 billion in 2023 and $89.3 billion in 2022.\n\nSection 2 — Valuation Multiples:\nMedian EV/Revenue multiples for public SaaS companies compressed from 7.2x in January 2024 to 5.8x by December 2024. High-growth companies (>30% YoY revenue growth) maintained premiums at 12.4x, while sub-20% growth companies traded at 3.9x. The Rule of 40 (growth rate + profit margin) threshold separated premium from discount valuations: companies scoring >40 traded at 2.3x the multiple of those below 40.\n\nSection 3 — Churn Analysis (Critical for Strategic Planning):\nThe most significant trend of 2024 was the divergence in net revenue retention (NRR) between AI-augmented and traditional SaaS products. AI-augmented products achieved median NRR of 127%, compared to 108% for traditional products. More critically, gross churn for traditional products increased from 8.2% to 11.7% annually — the first time traditional SaaS churn exceeded 10% at the median.\n\nThe primary churn driver was identified through exit surveys (n=4,200 churned accounts): 43% cited replacement by AI-native alternatives, 28% cited budget consolidation, and 22% cited feature commoditization. The 43% AI-replacement figure represents a 3x increase from the 14% recorded in 2023 exit surveys.\n\nFor traditional SaaS companies, the data implies a 24-36 month window to integrate AI capabilities before churn reaches critical levels. Companies that launched AI features in 2024 saw churn reduce by an average of 3.1 percentage points within two quarters of launch.\n\nSection 4 — Emerging Risk: Usage-Based Pricing Erosion:\nA less-discussed but potentially more disruptive trend: AI-driven productivity gains are reducing per-seat and per-usage metrics that underpin SaaS pricing models. In customer support SaaS, AI automation reduced ticket volume by 40-60%, directly reducing usage-based revenue. CRM platforms report that AI-generated emails and call summaries reduced per-user activity metrics by 35%, triggering downgrades to lower pricing tiers.\n\nThis creates a paradox: the more effective a SaaS product's AI features, the more they cannibalize the product's own usage-based revenue. Five public SaaS companies revised their 2025 revenue guidance downward by 5-12% specifically citing AI-driven usage compression. The companies have not yet found a stable alternative pricing model, though value-based pricing (charging for outcomes rather than usage) is being piloted by 23% of surveyed companies.\n\nSection 5 — Outlook:\nThe consensus analyst forecast projects the SaaS market to reach $295 billion by 2026, but this projection assumes pricing models remain stable. If AI-driven usage compression spreads across categories, realized market size could be $250-265 billion, a 10-15% shortfall. The next 12 months will likely determine whether the industry can transition to new pricing paradigms.\n---",
+ "rubric": [
+ {
+ "criterion": "Position Bias Resistance",
+ "weight": 0.35,
+ "scoring": {
+ "5": "Summary prioritizes Sections 3-4 content (churn divergence AI vs traditional, 43% AI replacement, pricing erosion paradox) over the market overview numbers; demonstrates awareness that critical info was later in the document",
+ "3": "Covers Sections 3-4 but gives equal weight to the less critical Sections 1-2 market size numbers",
+ "1": "Dominated by Section 1-2 market overview; Sections 3-4 mentioned briefly or omitted",
+ "0": "Only summarizes the introduction"
+ }
+ },
+ {
+ "criterion": "Strategic Insight Extraction",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Captures: NRR divergence (127% vs 108%), churn increase (8.2% to 11.7%), 24-36 month integration window, AI usage compression paradox, pricing model instability, and the 10-15% market size risk",
+ "3": "Captures 3-4 of these strategic insights",
+ "1": "Only surface-level observations without strategic implications",
+ "0": "Misses the strategic narrative"
+ }
+ },
+ {
+ "criterion": "Quantitative Precision",
+ "weight": 0.2,
+ "scoring": {
+ "5": "Retains key decision-relevant numbers: 43% AI replacement churn, 3.1pp churn reduction from AI launch, 40-60% ticket reduction, 35% activity metric reduction, 5-12% guidance revision, $250-265B adjusted forecast",
+ "3": "Retains 4-5 decision-relevant numbers",
+ "1": "Mostly retains overview numbers ($232B market size) rather than strategic ones",
+ "0": "Numbers absent or wrong"
+ }
+ },
+ {
+ "criterion": "Coherent Narrative",
+ "weight": 0.15,
+ "scoring": {
+ "5": "Tells a coherent story: AI is simultaneously driving growth (NRR boost) and threatening revenue (usage compression), creating urgency and paradox; the outlook depends on pricing model evolution",
+ "3": "Presents findings but doesn't connect them into a narrative",
+ "1": "Disconnected facts",
+ "0": "Incoherent"
+ }
+ }
+ ],
+ "expectedScoreWithout": 25,
+ "expectedScoreWith": 80
+ },
+ },
425
+ {
426
+ "id": "bench-hard-03",
427
+ "difficulty": "hard",
428
+ "description": "Summarize content with deliberate accuracy traps — similar numbers, easily confused entities, and hedged claims",
429
+ "input": "Summarize the following, being extremely careful with attribution and numbers:\n\n---\nGlobal AI Regulation: A Comparative Analysis of Three Frameworks\n\nThe European Union's AI Act, which entered into force on August 1, 2024, establishes a risk-based classification system with four tiers: unacceptable risk (banned), high risk (strict requirements), limited risk (transparency obligations), and minimal risk (no specific requirements). Penalties for non-compliance range from EUR 7.5 million or 1.5% of global turnover (for incorrect information) to EUR 35 million or 7% of global turnover (for banned practices). The Act requires high-risk AI systems to undergo conformity assessments, maintain technical documentation, and implement human oversight mechanisms. Implementation timelines are staggered: banned practices take effect February 2025, high-risk obligations August 2026, and general-purpose AI model requirements August 2025.\n\nChina's approach, codified through the Interim Measures for Generative AI Services (effective August 15, 2023) and the subsequent AI Safety Governance Framework (released September 2024), differs fundamentally from the EU model. Rather than risk tiers, China employs a sector-specific regulatory model where different ministries govern AI in their domains. The Cyberspace Administration of China (CAC) requires all generative AI services to undergo security assessments before public release and mandates that training data 'reflect core socialist values.' Penalties are comparatively modest: up to RMB 100,000 (approximately $14,000) per violation under the generative AI measures, though the Data Security Law enables penalties up to RMB 10 million ($1.4 million) for severe data-related violations. Unlike the EU Act, China's framework was in effect before the EU's and already governs over 180 registered generative AI services.\n\nThe United States has taken a decentralized approach. 
President Biden's Executive Order 14110 (October 30, 2023) established reporting requirements for foundation models trained using more than 10^26 FLOPs of compute and dual-use models with biological, cyber, or critical infrastructure capabilities. However, the EO was revoked by President Trump's Executive Order on January 20, 2025, which shifted policy toward 'removing barriers to American AI innovation.' As of early 2025, the US has no comprehensive federal AI legislation, though 45 states have introduced AI-related bills, with 17 passing into law. The most significant state action is the Colorado AI Act (effective February 2026), which mirrors some EU AI Act provisions for high-risk decision systems. The National Institute of Standards and Technology (NIST) AI Risk Management Framework (AI RMF 1.0, January 2023) serves as the primary voluntary guidance.\n\nComparative Assessment:\nEnforcement capacity varies dramatically. The EU has established a dedicated AI Office with 140 staff and a EUR 50 million annual budget. China's enforcement is distributed across existing regulators with no disclosed dedicated budget but demonstrated enforcement through the suspension of 14 AI services in 2024 for compliance failures. The US has no dedicated enforcement body for AI; oversight is fragmented across the FTC, FDA, DOT, and sector-specific regulators.\n\nInnovation impact assessments diverge as well. A Stanford HAI study estimated EU AI Act compliance costs at EUR 300,000-400,000 per high-risk system for SMEs, prompting 31% of surveyed European AI startups to consider relocating development to the US or UK. In contrast, a Tsinghua University survey found that 78% of Chinese AI companies reported that regulatory requirements 'improved their products' quality,' though critics note the survey's methodology did not account for survivorship bias (companies that exited the market due to regulation were not surveyed).\n---",
430
+ "rubric": [
431
+ {
432
+ "criterion": "Entity-Attribution Accuracy",
433
+ "weight": 0.35,
434
+ "scoring": {
435
+ "5": "Zero misattributions: EU penalties correctly stated (EUR 7.5M/1.5% to EUR 35M/7%), China penalties correctly stated (RMB 100K to RMB 10M), US EO 14110 correctly attributed to Biden and its revocation to Trump, Stanford HAI study not confused with Tsinghua study, Colorado AI Act not confused with EU AI Act",
436
+ "3": "1-2 minor attribution errors (e.g., wrong penalty range assigned to wrong framework)",
437
+ "1": "Multiple attribution errors or entities confused",
438
+ "0": "Systematic misattribution"
439
+ }
440
+ },
441
+ {
442
+ "criterion": "Framework Differentiation",
443
+ "weight": 0.25,
444
+ "scoring": {
445
+ "5": "Clearly distinguishes the three approaches: EU (risk-based tiers), China (sector-specific with values alignment), US (decentralized, currently no federal law); explains what makes each unique",
446
+ "3": "Distinguishes approaches but conflates some elements",
447
+ "1": "Treats frameworks as interchangeable or mischaracterizes their approach",
448
+ "0": "Fails to differentiate"
449
+ }
450
+ },
451
+ {
452
+ "criterion": "Nuance and Qualification Preservation",
453
+ "weight": 0.25,
454
+ "scoring": {
455
+ "5": "Preserves: staggered EU timelines (not just 'AI Act passed'), China framework existing before EU's, Biden EO revoked by Trump, Tsinghua survey's survivorship bias criticism, voluntary nature of NIST framework",
456
+ "3": "Captures 3-4 qualifications",
457
+ "1": "Presents findings without qualifications",
458
+ "0": "Oversimplifies to the point of inaccuracy"
459
+ }
460
+ },
461
+ {
462
+ "criterion": "Completeness at Appropriate Compression",
463
+ "weight": 0.15,
464
+ "scoring": {
465
+ "5": "Summary covers all three frameworks plus comparative assessment in 200-350 words; nothing critical omitted",
466
+ "3": "Covers all three frameworks but omits comparative section or is too long (>400 words)",
467
+ "1": "Omits one framework or the comparison entirely",
468
+ "0": "Covers only one framework"
469
+ }
470
+ }
471
+ ],
472
+ "expectedScoreWithout": 25,
473
+ "expectedScoreWith": 75
474
+ }
475
+ ]
476
+ }
@@ -0,0 +1,54 @@
+ {
+ "version": "0.0.1",
+ "timeout": 60,
+ "tasks": [
+ {
+ "id": "smoke-01",
+ "description": "Summarize a multi-paragraph argumentative article with quantitative data and competing claims",
+ "input": "Summarize the following article:\n\n---\nThe Future of Remote Work: A Data-Driven Analysis\n\nThe shift to remote work, accelerated by the COVID-19 pandemic, has become one of the most significant labor market transformations in decades. A 2024 Stanford study led by Nicholas Bloom, tracking 30,000 employees across 500 companies over 3 years, found that hybrid workers (3 days office, 2 days remote) showed a 13% productivity increase compared to full-time office workers, while fully remote workers showed only a 4% increase.\n\nHowever, the picture is not uniformly positive. Microsoft's 2024 Work Trend Index, surveying 31,000 workers across 31 countries, revealed that 68% of employees report not having enough uninterrupted focus time, and cross-team collaboration declined by 25% in fully remote settings. The report attributed this decline to reduced serendipitous interactions and weaker social ties between teams.\n\nCompensation dynamics have also shifted. According to Glassdoor's 2024 Salary Report, fully remote positions now command a 7-12% salary discount compared to equivalent hybrid roles in major metro areas, reversing the 2021-2022 trend where remote roles carried a premium. Economist Raj Chetty argues this discount reflects employers' perception of reduced oversight value rather than actual productivity differences.\n\nThe commercial real estate impact has been substantial. CBRE's Q3 2024 report shows US office vacancy rates at 19.6%, the highest since 1991, with Class B and C properties experiencing vacancy rates above 25%. JLL estimates that $1.2 trillion in commercial real estate value is at risk of repricing by 2026.\n\nMeanwhile, smaller cities have experienced population booms. The Brookings Institution reports that cities like Boise, Idaho and Bentonville, Arkansas saw population growth rates 3-4x the national average between 2020-2024, driven primarily by remote workers seeking lower cost of living.\n\nCritics of the remote-first model, including JPMorgan CEO Jamie Dimon, argue that in-person work is essential for mentorship, culture building, and career development. A 2024 Harvard Business Review study found that remote workers were 35% less likely to be promoted than their in-office counterparts, even when performance metrics were equivalent — a finding researchers attributed to proximity bias rather than performance differences.\n\nThe emerging consensus among labor economists is that hybrid models will dominate, with 60-70% of knowledge workers expected to work in hybrid arrangements by 2026. However, the optimal balance remains contested, with evidence supporting anywhere from 1 to 3 office days per week depending on role type, team structure, and organizational culture.\n---",
+ "rubric": [
+ {
+ "criterion": "Core Argument Capture",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Summary captures the central thesis (hybrid model emerging as dominant), the productivity data supporting it, the counter-arguments (collaboration decline, promotion bias), and the contested optimal balance",
+ "3": "Captures the thesis and some supporting evidence but misses counter-arguments or nuance",
+ "1": "Only captures surface-level topic (remote work trends) without the argument structure",
+ "0": "Misidentifies or fabricates the central argument"
+ }
+ },
+ {
+ "criterion": "Quantitative Detail Retention",
+ "weight": 0.3,
+ "scoring": {
+ "5": "Retains key numbers: 13% productivity increase, 25% collaboration decline, 19.6% vacancy rate, 35% promotion gap, and at least 2 other critical statistics with proper attribution",
+ "3": "Retains 3-4 key statistics but drops some important ones or loses attribution",
+ "1": "Retains 1-2 numbers or replaces specific data with vague language (e.g., 'significant increase')",
+ "0": "No quantitative data preserved; all numbers replaced with qualitative language"
+ }
+ },
+ {
+ "criterion": "Attribution Accuracy",
+ "weight": 0.2,
+ "scoring": {
+ "5": "Correctly attributes findings to their sources (Stanford/Bloom, Microsoft, CBRE, HBR) and opinions to their holders (Dimon, Chetty); no misattribution",
+ "3": "Most attributions correct but 1-2 findings attributed to wrong source or unattributed",
+ "1": "Attributions mostly missing or generic ('studies show')",
+ "0": "Misattributes claims or fabricates sources"
+ }
+ },
+ {
+ "criterion": "Summary Structure and Compression",
+ "weight": 0.2,
+ "scoring": {
+ "5": "Well-structured summary at appropriate compression (5:1 to 10:1); uses formatting (bullets or paragraphs); leads with central finding; balanced coverage across all sections of the source",
+ "3": "Reasonable structure but unbalanced coverage (e.g., over-represents early paragraphs) or slightly too long/short",
+ "1": "Unstructured wall of text or extreme compression losing essential content",
+ "0": "No meaningful compression or disorganized output"
+ }
+ }
+ ],
+ "passThreshold": 60
+ }
+ ]
+ }
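The benchmark files above give each criterion a `weight` and a 0-5 scoring scale, and the smoke-test file sets a `passThreshold` of 60, but neither file specifies how those pieces combine into a single score. A minimal sketch of one plausible aggregation, assuming scores combine linearly with each 0-5 score normalized to a 0-1 fraction and the total scaled to 0-100; the `aggregate` helper is an assumption for illustration, not part of the package:

```python
def aggregate(rubric, scores):
    """Combine per-criterion 0-5 scores into a 0-100 total.

    rubric: list of {"criterion": str, "weight": float} (weights sum to 1).
    scores: mapping from criterion name to a 0-5 score.
    """
    total = 0.0
    for item in rubric:
        points = scores[item["criterion"]]
        if not 0 <= points <= 5:
            raise ValueError(f"score out of range for {item['criterion']}")
        # Normalize 0-5 to 0-1, then apply the criterion weight.
        total += item["weight"] * (points / 5.0)
    return round(total * 100)

# Weights from the smoke-01 rubric above.
rubric = [
    {"criterion": "Core Argument Capture", "weight": 0.3},
    {"criterion": "Quantitative Detail Retention", "weight": 0.3},
    {"criterion": "Attribution Accuracy", "weight": 0.2},
    {"criterion": "Summary Structure and Compression", "weight": 0.2},
]
scores = {
    "Core Argument Capture": 5,
    "Quantitative Detail Retention": 3,
    "Attribution Accuracy": 5,
    "Summary Structure and Compression": 3,
}
print(aggregate(rubric, scores))  # 0.3*1.0 + 0.3*0.6 + 0.2*1.0 + 0.2*0.6 = 0.80 -> 80
```

Under this assumed formula, the example grader output of 80 clears `passThreshold: 60`, which is consistent with the `expectedScoreWith` values (75-80) the hard benchmarks list for summaries produced with the skill.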