@botlearn/keyword-extractor 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +35 -0
- package/knowledge/anti-patterns.md +74 -0
- package/knowledge/best-practices.md +114 -0
- package/knowledge/domain.md +131 -0
- package/manifest.json +26 -0
- package/package.json +35 -0
- package/skill.md +45 -0
- package/strategies/main.md +113 -0
- package/tests/benchmark.json +476 -0
- package/tests/smoke.json +54 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 BotLearn

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,35 @@
# @botlearn/keyword-extractor

> Semantic-level keyword extraction, topic clustering, and domain-aware term ranking for OpenClaw Agent

## Installation

```bash
# via npm
npm install @botlearn/keyword-extractor

# via clawhub
clawhub install @botlearn/keyword-extractor
```

## Category

content-processing

## Dependencies

None

## Files

| File | Description |
|------|-------------|
| `manifest.json` | Skill metadata and configuration |
| `skill.md` | Role definition and activation rules |
| `knowledge/` | Domain knowledge documents |
| `strategies/` | Behavioral strategy definitions |
| `tests/` | Smoke and benchmark tests |

## License

MIT
package/knowledge/anti-patterns.md
ADDED
@@ -0,0 +1,74 @@
---
domain: keyword-extractor
topic: anti-patterns
priority: medium
ttl: 30d
---

# Keyword Extraction — Anti-Patterns

## Extraction Anti-Patterns

### 1. Surface-Only Extraction
- **Problem**: Extracting only explicitly stated words while ignoring implied concepts, synonyms, and latent topics
- **Symptom**: A text about "convolutional neural networks" yields keywords like "convolutional", "neural", "networks" but misses "deep learning", "image recognition", "computer vision"
- **Fix**: Apply semantic-level extraction using embeddings to discover implicit topic keywords; use topic modeling to surface latent themes; check that extracted keywords cover at least 80% of the text's identifiable topics

### 2. Frequency-Only Ranking
- **Problem**: Ranking keywords solely by how often they appear, causing common-but-generic terms to dominate
- **Symptom**: Words like "system", "data", "process", "method" rank highest even though they carry minimal topic-specific information
- **Fix**: Use TF-IDF or BM25 to discount terms that are frequent across many documents; apply the composite scoring formula from knowledge/best-practices.md that combines statistical, semantic, positional, and domain signals

### 3. Over-Extraction (Keyword Flooding)
- **Problem**: Extracting too many keywords, diluting the signal and overwhelming the consumer of the output
- **Symptom**: A 300-word paragraph produces 40+ keywords; every noun and adjective is treated as a keyword
- **Fix**: Apply the optimal keyword count guidelines from knowledge/best-practices.md; enforce primary/secondary/tertiary classification; default to outputting only primary + secondary keywords

### 4. Stopword Leakage
- **Problem**: Stopwords and function words slip through extraction and appear as keywords
- **Symptom**: Keywords include "the", "however", "various", "also", "including", "such"
- **Fix**: Apply comprehensive stopword filtering in preprocessing; extend standard stopword lists with domain-specific function words (e.g., "et al.", "respectively", "furthermore" in academic text)

### 5. Entity Fragmentation
- **Problem**: Multi-word named entities are split into individual tokens instead of being preserved as single keyword units
- **Symptom**: "United Nations" becomes two keywords: "United" and "Nations"; "New York Times" becomes "New", "York", "Times"
- **Fix**: Run NER before tokenization-based extraction; lock named entity spans as atomic units; apply multi-word expression detection using PMI thresholds

## Ranking Anti-Patterns

### 6. Ignoring Positional Signals
- **Problem**: Treating all keyword occurrences equally regardless of where they appear in the text
- **Symptom**: A term mentioned once in the title is scored lower than a term mentioned three times in footnotes
- **Fix**: Apply positional weight multipliers from knowledge/best-practices.md; title and heading terms receive 2-3x weight; footnote terms receive 0.5x

### 7. Flat Output (No Hierarchy)
- **Problem**: Returning keywords as an unstructured flat list with no grouping, ranking, or topic assignment
- **Symptom**: Output is a comma-separated list: "learning, neural, network, training, data, model, accuracy"
- **Fix**: Always cluster keywords into topic groups; rank within each cluster by score; provide cluster labels and hierarchical structure

### 8. Score Opacity
- **Problem**: Presenting keywords without explaining why they were selected or how they were scored
- **Symptom**: User receives a list of keywords with no scores, no confidence levels, and no extraction method attribution
- **Fix**: Include normalized scores (0-100), confidence levels (high/medium/low), and extraction level (lexical/phrasal/semantic) for each keyword

## Semantic Anti-Patterns

### 9. Synonym Duplication
- **Problem**: Treating synonyms and abbreviations as separate keywords, inflating keyword count without adding information
- **Symptom**: Output includes both "artificial intelligence" and "AI", both "natural language processing" and "NLP", both "machine learning" and "ML"
- **Fix**: Apply semantic deduplication using embedding similarity (threshold > 0.85); merge synonyms under the canonical (most frequent or most specific) form; note alternative forms as aliases
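The merge rule in #9 can be sketched in Python. This is illustrative only; the package ships no code, and `deduplicate`, the toy 2-d embeddings, and the frequency table stand in for a real embedding model and corpus statistics.

```python
from itertools import combinations

def cosine(a, b):
    # plain cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(keywords, embeddings, freq, threshold=0.85):
    """Merge keywords whose embedding similarity exceeds the threshold;
    the most frequent form becomes canonical, the rest become aliases."""
    parent = {k: k for k in keywords}  # union-find forest

    def find(k):
        while parent[k] != k:
            k = parent[k]
        return k

    for a, b in combinations(keywords, 2):
        if cosine(embeddings[a], embeddings[b]) > threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                # keep the more frequent form as the canonical root
                keep, drop = (ra, rb) if freq[ra] >= freq[rb] else (rb, ra)
                parent[drop] = keep

    merged = {}
    for k in keywords:
        merged.setdefault(find(k), []).append(k)
    return {canon: [a for a in forms if a != canon]
            for canon, forms in merged.items()}
```

A "most specific form wins" policy would only change how `keep` is chosen.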
### 10. Over-Generalization
- **Problem**: Replacing specific, informative terms with vague parent categories during taxonomy mapping
- **Symptom**: "TensorFlow" is generalized to "software"; "BERT" becomes "language model"; "gradient descent" becomes "optimization"
- **Fix**: Preserve the original specific term as the keyword; use taxonomy categories only as metadata labels, not as replacements; never lose specificity in the primary keyword

### 11. Context-Blind Extraction
- **Problem**: Extracting keywords without considering the document's domain or the user's intent
- **Symptom**: A legal document's keyword list looks the same as a technical blog post's; domain-specific terms are not prioritized
- **Fix**: Detect the document domain in preprocessing; boost domain-specific terminology using DomScore; suppress cross-domain generic terms that are not informative within the detected domain

### 12. Negation Blindness
- **Problem**: Extracting keywords from negated or hypothetical statements as if they were affirmative
- **Symptom**: "This approach does NOT use reinforcement learning" produces "reinforcement learning" as a positive keyword
- **Fix**: Detect negation patterns ("not", "no", "without", "lacks", "unlike") and either exclude negated terms or flag them as "negated-context" keywords with reduced scores
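A minimal sketch of the negation check in #12, assuming cue-word matching within the clause before the keyword (a real implementation would use a parser for negation scope; the function name and 0.5 multiplier are illustrative):

```python
import re

NEGATION_CUES = ("not", "no", "never", "without", "lacks", "unlike")

def flag_negated(sentence, keywords):
    """Return {keyword: score_multiplier}: 0.5 when the keyword follows a
    negation cue in the same clause fragment, else 1.0."""
    flags = {}
    lowered = sentence.lower()
    for kw in keywords:
        idx = lowered.find(kw.lower())
        if idx == -1:
            flags[kw] = 1.0
            continue
        # look only at the clause fragment before the keyword
        prefix = re.split(r"[,;]", lowered[:idx])[-1]
        tokens = re.findall(r"[a-z']+", prefix)
        flags[kw] = 0.5 if any(t in NEGATION_CUES for t in tokens) else 1.0
    return flags
```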
package/knowledge/best-practices.md
ADDED
@@ -0,0 +1,114 @@
---
domain: keyword-extractor
topic: multi-level-extraction-and-scoring
priority: high
ttl: 30d
---

# Keyword Extraction — Best Practices

## Multi-Level Extraction

### 1. Lexical Level (Surface Terms)
- Extract individual tokens after stopword removal and normalization
- Apply lemmatization to group inflected forms: "running", "runs", "ran" -> "run"
- Preserve case for proper nouns and acronyms: "API", "JavaScript", "NATO"
- Retain domain-specific compound terms: "machine-learning", "open-source"

### 2. Phrasal Level (Multi-Word Expressions)
- Extract n-grams (bigrams, trigrams) that co-occur more frequently than expected by chance
- Use Pointwise Mutual Information (PMI) to identify statistically significant phrases:
  - PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))
  - PMI > 3.0 indicates a meaningful phrase
- Preserve noun phrases identified by POS patterns: ADJ* NOUN+ (e.g., "deep neural network")
- Named entities are automatically phrasal keywords — never split them
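The PMI filter above can be sketched as follows (an illustrative sketch over adjacent bigrams; the package itself ships only markdown, so `pmi_bigrams` is a hypothetical name):

```python
import math
from collections import Counter

def pmi_bigrams(tokens, threshold=3.0):
    """Score adjacent bigrams with PMI(x, y) = log2(P(x,y) / (P(x)*P(y)))
    and keep those above the threshold."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = {}
    for (x, y), count in bigrams.items():
        p_xy = count / (n - 1)            # bigram probability
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        score = math.log2(p_xy / (p_x * p_y))
        if score > threshold:
            phrases[f"{x} {y}"] = round(score, 2)
    return phrases
```

Rare words that appear together score high; pairs built from frequent words are discounted, which is exactly the "more often than chance" criterion.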
### 3. Semantic Level (Latent Concepts)
- Identify concepts implied but not explicitly stated
- Example: a text discussing "gradient descent", "loss function", and "epochs" implies the concept "model training"
- Use embedding similarity to surface these implicit topic markers
- Assign lower confidence scores to inferred concepts vs. explicit terms

## Contextual Scoring

### Composite Score Formula
```
Score(keyword) = w1*StatScore + w2*SemScore + w3*PosScore + w4*DomScore
```

| Component | Weight (w) | Description |
|-----------|-----------|-------------|
| StatScore | 0.30 | TF-IDF or BM25 statistical significance |
| SemScore | 0.30 | Semantic centrality in the keyword graph |
| PosScore | 0.20 | Positional weight (title > first paragraph > body > footnotes) |
| DomScore | 0.20 | Domain relevance — how central is this term to the detected domain? |

### Positional Weight Table

| Position | Weight Multiplier | Rationale |
|----------|------------------|-----------|
| Title / Heading | 3.0x | Titles contain the most important terms |
| Section headings | 2.5x | Sub-topics are signaled by headings |
| First paragraph / Abstract | 2.0x | Introductions state the core topic |
| Lists and enumerations | 1.5x | Structured content highlights key items |
| Captions and labels | 1.2x | Annotated content carries keyword signal |
| Body text | 1.0x | Baseline weight |
| Footnotes and references | 0.5x | Supporting material, lower priority |

### Score Normalization
- Normalize final scores to 0-100 scale for readability
- Top keyword = 100; all others scaled proportionally
- Report scores rounded to 1 decimal place
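The weighted combination and 0-100 normalization can be sketched together (a minimal sketch; it assumes each component score has already been normalized to 0-1):

```python
WEIGHTS = {"stat": 0.30, "sem": 0.30, "pos": 0.20, "dom": 0.20}

def composite_scores(candidates):
    """candidates: {term: {"stat": .., "sem": .., "pos": .., "dom": ..}}.
    Returns {term: score} on a 0-100 scale, top keyword pinned at 100."""
    raw = {
        term: sum(WEIGHTS[k] * comps[k] for k in WEIGHTS)
        for term, comps in candidates.items()
    }
    top = max(raw.values())
    # scale so the best keyword reads 100.0, rounded to 1 decimal place
    return {term: round(100 * s / top, 1) for term, s in raw.items()}
```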
## Cluster Formation

### Topic Cluster Construction
1. Compute pairwise semantic similarity between all extracted keywords
2. Apply agglomerative clustering with a similarity threshold of 0.65
3. Name each cluster using the highest-scoring keyword within the cluster
4. Limit cluster count based on text length (see knowledge/domain.md, Topic Modeling section)

### Cluster Quality Checks
- **Minimum size**: Clusters with only 1 keyword should be merged into the nearest cluster or flagged as standalone terms
- **Maximum size**: Clusters with 15+ keywords should be examined for sub-topic splitting
- **Coherence**: All keywords in a cluster should have pairwise similarity > 0.5; outliers should be reassigned
- **Coverage**: The union of all cluster keywords should cover at least 80% of the text's core content
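The agglomerative step can be sketched with average linkage over a precomputed similarity table (illustrative; real pipelines would use a clustering library, and `sim` here is a hand-built stand-in for embedding cosine similarities):

```python
def agglomerative_cluster(terms, sim, threshold=0.65):
    """Average-linkage agglomerative clustering over a symmetric
    similarity dict sim[(a, b)]. Merges the most similar pair of
    clusters until no pair's average similarity exceeds the threshold."""
    def s(a, b):
        return sim.get((a, b), sim.get((b, a), 0.0))

    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = sum(s(a, b) for a in clusters[i] for b in clusters[j])
                avg /= len(clusters[i]) * len(clusters[j])
                if avg > best:
                    best, pair = avg, (i, j)
        if pair is None:
            break  # no remaining pair above the threshold
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters
```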
### Cluster Hierarchy
- Support 2-level hierarchy: primary clusters (broad topics) and sub-clusters (specific aspects)
- Example:
  - Primary: "Machine Learning"
    - Sub-cluster: "Neural Networks" (deep learning, CNN, RNN, transformer)
    - Sub-cluster: "Training" (gradient descent, loss function, optimizer, epoch)
    - Sub-cluster: "Evaluation" (accuracy, F1, precision, recall, ROC)

## Optimal Keyword Count

| Text Length | Recommended Keywords | Max Topics |
|------------|---------------------|------------|
| < 200 words | 5-8 | 2-3 |
| 200-500 words | 8-12 | 3-5 |
| 500-1500 words | 12-20 | 4-7 |
| 1500-5000 words | 15-30 | 5-10 |
| 5000+ words | 20-40 | 7-12 |

### Primary vs. Secondary Keywords
- **Primary keywords** (top 30%): Directly represent the text's main thesis or topic — always include in output
- **Secondary keywords** (next 40%): Supporting concepts that elaborate on primary topics — include by default
- **Tertiary keywords** (bottom 30%): Peripheral or contextual terms — include only when comprehensive extraction is requested
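The length table and the 30/40/30 split can be sketched as lookup helpers (illustrative names; the topic number returned is the upper bound of each "Max Topics" range):

```python
def keyword_budget(word_count):
    """Map text length to (recommended keyword range, max topics)."""
    bands = [
        (200,  (5, 8),   3),
        (500,  (8, 12),  5),
        (1500, (12, 20), 7),
        (5000, (15, 30), 10),
    ]
    for upper, kw_range, max_topics in bands:
        if word_count < upper:
            return kw_range, max_topics
    return (20, 40), 12  # 5000+ words

def split_tiers(ranked_terms):
    """Split a ranked keyword list into primary (top 30%),
    secondary (next 40%), and tertiary (bottom 30%)."""
    n = len(ranked_terms)
    p = max(1, round(n * 0.3))
    s = round(n * 0.4)
    return ranked_terms[:p], ranked_terms[p:p + s], ranked_terms[p + s:]
```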
## Output Format Best Practices

### Structured Output
Each keyword entry should include:
- **term**: The keyword or phrase
- **score**: Normalized relevance score (0-100)
- **level**: lexical / phrasal / semantic
- **cluster**: Topic cluster assignment
- **type**: entity type (if NER-detected) or "concept"
- **confidence**: Extraction confidence (high / medium / low)

### Grouping
- Present keywords grouped by cluster, not as a flat list
- Within each cluster, sort by score descending
- Provide a cluster summary sentence describing the sub-topic
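One possible shape for such an entry, plus the grouping rule, sketched as a dataclass (hypothetical; the spec above defines fields, not a concrete schema):

```python
from dataclasses import dataclass

@dataclass
class Keyword:
    term: str
    score: float              # normalized 0-100
    level: str                # "lexical" | "phrasal" | "semantic"
    cluster: str
    type: str = "concept"     # entity type if NER-detected
    confidence: str = "high"  # "high" | "medium" | "low"

def group_by_cluster(keywords):
    """Group entries by cluster, each group sorted by score descending."""
    groups = {}
    for kw in keywords:
        groups.setdefault(kw.cluster, []).append(kw)
    for members in groups.values():
        members.sort(key=lambda k: k.score, reverse=True)
    return groups
```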
package/knowledge/domain.md
ADDED
@@ -0,0 +1,131 @@
---
domain: keyword-extractor
topic: extraction-methods-and-theory
priority: high
ttl: 30d
---

# Keyword Extraction — Methods, Techniques & Theory

## 1. TF-IDF (Term Frequency-Inverse Document Frequency)

### Core Formula
- **TF(t, d)** = (Number of times term t appears in document d) / (Total number of terms in d)
- **IDF(t, D)** = log(Total number of documents in corpus D / Number of documents containing t)
- **TF-IDF(t, d, D)** = TF(t, d) x IDF(t, D)

### Interpretation
- High TF-IDF = term is frequent in the document but rare across the corpus — strong keyword signal
- Low TF-IDF = term is either rare in the document or common across many documents — weak signal
- Zero TF-IDF = term does not appear in the document

### Variants
- **Sublinear TF**: `1 + log(TF)` — dampens the effect of high-frequency terms
- **Smoothed IDF**: `log(1 + N/df)` — prevents division by zero for new terms
- **BM25 weighting**: Adds document length normalization — preferred for variable-length texts

### Single-Document Application
When no corpus is available (single-document extraction):
- Use a general-purpose reference corpus IDF (e.g., Wikipedia, Common Crawl frequencies)
- Alternatively, use sentence-level TF-IDF: treat each sentence as a "document" within the text
- Compare term frequencies against expected frequencies from domain language models
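The sentence-level fallback can be sketched directly from the formulas above (an illustrative sketch with naive sentence splitting; terms that recur across many sentences are discounted by the IDF factor):

```python
import math
import re
from collections import Counter

def sentence_tfidf(text):
    """Single-document TF-IDF where each sentence is a 'document'.
    Returns the best TF-IDF score each term achieves in any sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    docs = [re.findall(r"[a-z][a-z-]+", s) for s in sentences]
    n_docs = len(docs)
    # document frequency: number of sentences containing the term
    df = Counter(t for doc in docs for t in set(doc))
    scores = {}
    for doc in docs:
        tf = Counter(doc)
        for term, count in tf.items():
            tfidf = (count / len(doc)) * math.log(n_docs / df[term])
            scores[term] = max(scores.get(term, 0.0), tfidf)
    return scores
```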
## 2. Named Entity Recognition (NER)

### Entity Categories for Keyword Extraction

| Entity Type | Examples | Keyword Priority |
|------------|---------|-----------------|
| PERSON | "Elon Musk", "Ada Lovelace" | High — often central to topic |
| ORGANIZATION | "OpenAI", "United Nations" | High — key actors or subjects |
| LOCATION | "Silicon Valley", "European Union" | Medium — contextual anchors |
| TECHNOLOGY | "Kubernetes", "GPT-4", "React" | High — core in technical texts |
| EVENT | "COP28", "Black Friday" | Medium-High — temporal anchors |
| CONCEPT | "machine learning", "supply chain" | High — abstract topic markers |
| PRODUCT | "iPhone 15", "Tesla Model S" | Medium — specific references |
| DATE/TIME | "Q3 2024", "2023 fiscal year" | Low — usually metadata, not keywords |

### NER as Keyword Signal
- Named entities are almost always keywords — they carry high information density
- Multi-word entities (e.g., "natural language processing") should be extracted as single keyword units, not split into individual words
- Entity type provides automatic categorization for topic clustering

### Entity Disambiguation
- "Apple" (company) vs "apple" (fruit) — resolve using surrounding context
- "Python" (language) vs "python" (snake) — use co-occurring terms as disambiguation signals
- When ambiguous, include the entity with its most probable interpretation noted

## 3. Semantic Similarity & Embeddings

### Concept
- Represent words/phrases as dense vectors in a high-dimensional space
- Semantically similar terms have vectors that are close together (high cosine similarity)
- Enables discovery of keywords that are conceptually important but may not appear frequently

### Applications to Keyword Extraction

#### Semantic Deduplication
- "machine learning" and "ML" — cosine similarity > 0.9 → merge as single keyword
- "artificial intelligence" and "AI" — same concept, different surface forms
- Group synonyms and abbreviations under a canonical keyword

#### Concept Expansion
- A text about "neural networks" likely also relates to "deep learning", "backpropagation", "gradient descent"
- Use embedding proximity to identify implicit keywords not explicitly stated in the text
- Threshold: cosine similarity > 0.7 for concept expansion candidates

#### Centrality-Based Ranking
- Build a keyword graph where edge weights = semantic similarity between terms
- Keywords with high graph centrality (many strong connections) are core topics
- Peripheral keywords (few weak connections) are supporting or tangential
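Centrality-based ranking can be sketched with power iteration, the standard way to approximate eigenvector centrality (illustrative; a graph library would normally do this, and the `sim` table is a stand-in for embedding similarities):

```python
def eigenvector_centrality(terms, sim, iterations=50):
    """Power-iteration eigenvector centrality on a keyword graph whose
    edge weights are semantic similarities (symmetric sim[(a, b)])."""
    def w(a, b):
        return sim.get((a, b), sim.get((b, a), 0.0))

    scores = {t: 1.0 for t in terms}
    for _ in range(iterations):
        # each node's new score is the weighted sum of its neighbors'
        nxt = {t: sum(w(t, u) * scores[u] for u in terms if u != t)
               for t in terms}
        norm = max(nxt.values()) or 1.0
        scores = {t: v / norm for t, v in nxt.items()}
    return scores
```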
### Embedding Models for Keyword Tasks

| Model Type | Best For | Trade-off |
|-----------|---------|-----------|
| Word-level (Word2Vec, GloVe) | Individual term similarity | Fast, but misses phrase-level meaning |
| Sentence-level (SBERT, E5) | Phrase and concept similarity | Better semantics, moderate speed |
| Document-level (Doc2Vec) | Overall topic similarity | Good for clustering, less granular |

## 4. Topic Modeling

### Latent Topic Discovery
- Discover hidden thematic structures that may not be obvious from individual keywords
- Maps documents to a mixture of topics; maps topics to distributions over words

### LDA (Latent Dirichlet Allocation)
- Probabilistic model: each document is a mixture of K topics
- Each topic is a distribution over vocabulary terms
- Top terms per topic become topic-level keywords
- Hyperparameters: K (number of topics), alpha (document-topic density), beta (topic-word density)

### Topic-Keyword Relationship
- Topic model outputs complement TF-IDF extraction:
  - TF-IDF finds **statistically distinctive** terms
  - Topic modeling finds **thematically coherent** term groups
- Use topic assignments to cluster TF-IDF keywords into meaningful groups

### Dynamic Topic Allocation
- For short texts (< 500 words): 2-4 topics maximum
- For medium texts (500-2000 words): 3-7 topics
- For long texts (2000+ words): 5-12 topics
- IF topics overlap significantly (shared top terms > 50%) THEN reduce K
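The allocation rules above reduce to two small helpers (illustrative; in practice the overlap check would trigger a re-fit of the topic model with smaller K):

```python
def recommended_k(word_count):
    """Map text length to the topic-count cap from the guidelines above."""
    if word_count < 500:
        return 4
    if word_count < 2000:
        return 7
    return 12

def topics_overlap(topic_a, topic_b):
    """True when two topics (sets of top terms) share more than 50% of
    their top terms, signaling that K should be reduced."""
    shared = len(topic_a & topic_b)
    return shared / min(len(topic_a), len(topic_b)) > 0.5
```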
## 5. Domain Taxonomy Mapping

### Purpose
- Map extracted keywords to standardized category hierarchies
- Provides consistent labeling across different texts in the same domain
- Enables cross-document keyword comparison and aggregation

### Common Taxonomies

| Domain | Taxonomy | Example Path |
|--------|----------|-------------|
| Technology | ACM Computing Classification | Computing > AI > Machine Learning > Deep Learning |
| Science | Library of Congress Subject Headings | Science > Computer Science > Algorithms |
| Business | NAICS Industry Codes | Information > Software Publishers |
| Academic | MESH (Medical), JEL (Economics) | Domain-specific controlled vocabularies |

### Mapping Strategy
1. Extract raw keywords from text
2. For each keyword, find the closest match in the domain taxonomy (exact match > partial match > semantic match)
3. Assign the taxonomy path as metadata to the keyword
4. IF no taxonomy match exists THEN flag as "uncategorized" and suggest the nearest taxonomy node
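The exact > partial > fallback cascade can be sketched as follows (illustrative only: the taxonomy is a flat dict stand-in, and shared-token counting is a crude substitute for the semantic match in step 2):

```python
def map_to_taxonomy(keyword, taxonomy):
    """taxonomy: {term: path}. Tries exact match, then substring match,
    then flags as uncategorized with the nearest node suggested."""
    kw = keyword.lower()
    if kw in taxonomy:
        return taxonomy[kw], "exact"
    for term, path in taxonomy.items():
        if term in kw or kw in term:
            return path, "partial"

    # crude nearest node: most shared tokens (semantic match stand-in)
    def overlap(term):
        return len(set(term.split()) & set(kw.split()))

    nearest = max(taxonomy, key=overlap)
    return f"uncategorized (nearest: {taxonomy[nearest]})", "none"
```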
package/manifest.json
ADDED
@@ -0,0 +1,26 @@
{
  "name": "@botlearn/keyword-extractor",
  "version": "0.1.0",
  "description": "Semantic-level keyword extraction, topic clustering, and domain-aware term ranking for OpenClaw Agent",
  "category": "content-processing",
  "author": "BotLearn",
  "benchmarkDimension": "content-understanding",
  "expectedImprovement": 40,
  "dependencies": {},
  "compatibility": {
    "openclaw": ">=0.5.0"
  },
  "files": {
    "skill": "skill.md",
    "knowledge": [
      "knowledge/domain.md",
      "knowledge/best-practices.md",
      "knowledge/anti-patterns.md"
    ],
    "strategies": [
      "strategies/main.md"
    ],
    "smokeTest": "tests/smoke.json",
    "benchmark": "tests/benchmark.json"
  }
}
package/package.json
ADDED
@@ -0,0 +1,35 @@
{
  "name": "@botlearn/keyword-extractor",
  "version": "0.1.0",
  "description": "Semantic-level keyword extraction, topic clustering, and domain-aware term ranking for OpenClaw Agent",
  "type": "module",
  "main": "manifest.json",
  "files": [
    "manifest.json",
    "skill.md",
    "knowledge/",
    "strategies/",
    "tests/",
    "README.md"
  ],
  "keywords": [
    "botlearn",
    "openclaw",
    "skill",
    "content-processing"
  ],
  "author": "BotLearn",
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/readai-team/botlearn-awesome-skills.git",
    "directory": "packages/skills/keyword-extractor"
  },
  "homepage": "https://github.com/readai-team/botlearn-awesome-skills/tree/main/packages/skills/keyword-extractor",
  "bugs": {
    "url": "https://github.com/readai-team/botlearn-awesome-skills/issues"
  },
  "publishConfig": {
    "access": "public"
  }
}
package/skill.md
ADDED
@@ -0,0 +1,45 @@
---
name: keyword-extractor
role: Keyword Extraction Specialist
version: 1.0.0
triggers:
  - "extract keywords"
  - "key terms"
  - "topic extraction"
  - "keyword analysis"
  - "find keywords"
  - "identify topics"
  - "extract key phrases"
---

# Role

You are a Keyword Extraction Specialist. When activated, you analyze text at multiple linguistic levels to extract semantically meaningful keywords, cluster them into coherent topics, and rank them by relevance, covering both surface-level terms and deep semantic concepts to achieve 90%+ topic coverage.

# Capabilities

1. Extract keywords at multiple levels: lexical (individual terms), phrasal (multi-word expressions), and semantic (latent concepts inferred from context)
2. Apply statistical methods (TF-IDF, co-occurrence analysis) and semantic similarity to identify high-signal terms beyond simple frequency counting
3. Recognize named entities (people, organizations, locations, technologies) and classify them as domain-specific keywords
4. Cluster related keywords into coherent topic groups using semantic proximity and co-occurrence patterns
5. Rank keywords by composite scoring that combines statistical significance, semantic centrality, positional weight, and domain relevance
6. Contextualize extracted keywords against domain taxonomies to map terms to standardized topic hierarchies

# Constraints

1. Never rely solely on term frequency to rank keywords — always incorporate semantic and positional signals
2. Never extract stopwords, boilerplate phrases, or formatting artifacts as keywords
3. Never return an unranked flat list — always provide scored, ordered results with topic cluster assignments
4. Always distinguish between primary keywords (core to the text's thesis) and secondary keywords (supporting concepts)
5. Always preserve the original semantic intent — do not generalize specific terms into vague categories without retaining the original term
6. Never exceed the optimal keyword density for the text length — follow the guidelines in knowledge/best-practices.md

# Activation

WHEN the user requests keyword extraction, topic identification, or key term analysis:
1. Preprocess the input text following strategies/main.md Step 1
2. Apply multi-level extraction using knowledge/domain.md techniques (TF-IDF, NER, semantic similarity)
3. Cluster and rank keywords following strategies/main.md Steps 2-4
4. Contextualize results against the domain taxonomy using knowledge/domain.md
5. Verify output against knowledge/anti-patterns.md to avoid common extraction mistakes
6. Output ranked keywords with scores, cluster assignments, and domain context
package/strategies/main.md
ADDED
@@ -0,0 +1,113 @@
---
strategy: keyword-extractor
version: 1.0.0
steps: 5
---

# Keyword Extraction Strategy

## Step 1: Preprocessing & Domain Detection
- Receive the input text and determine its **length**, **structure** (plain text / structured document / code), and **language**
- Detect the document domain by analyzing high-frequency terms and named entities against known domain vocabularies:
  - IF technical terms dominate (API, function, deploy) THEN domain = "technology"
  - IF financial terms dominate (revenue, equity, margin) THEN domain = "finance"
  - IF medical terms dominate (diagnosis, treatment, symptom) THEN domain = "healthcare"
  - IF no clear domain signal THEN domain = "general"
- Normalize the text:
  - Convert to consistent Unicode encoding (NFC normalization)
  - Preserve original casing for NER but create a lowercased copy for statistical analysis
  - Remove boilerplate elements: headers, footers, navigation text, disclaimers, copyright notices
  - Segment into sentences for positional analysis
- Build a stopword filter:
  - Start with standard stopword list (language-specific)
  - EXTEND with domain-specific function words (e.g., "et al.", "ibid." for academic; "herein", "whereas" for legal)
  - Never include domain-specific technical terms in the stopword list
- Identify text structure:
  - Map title, headings, paragraphs, lists, captions, footnotes
  - Assign positional weight multipliers from knowledge/best-practices.md
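The domain-detection rules in Step 1 can be sketched as a vocabulary-hit counter (a minimal sketch; the vocabularies and the minimum-hit cutoff are illustrative assumptions, not values from this package):

```python
DOMAIN_VOCAB = {
    "technology": {"api", "function", "deploy", "server", "code"},
    "finance": {"revenue", "equity", "margin", "profit", "asset"},
    "healthcare": {"diagnosis", "treatment", "symptom", "patient", "dosage"},
}

def detect_domain(tokens, min_hits=3):
    """Count vocabulary hits per domain; fall back to 'general' when no
    domain reaches the minimum signal."""
    counts = {
        domain: sum(1 for t in tokens if t.lower() in vocab)
        for domain, vocab in DOMAIN_VOCAB.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] >= min_hits else "general"
```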
## Step 2: Multi-Level Extraction
- **Lexical extraction**:
  - Tokenize the text and remove stopwords
  - Apply lemmatization to group inflected forms
  - Calculate TF-IDF scores for each unique term using knowledge/domain.md formulas
  - IF no reference corpus is available THEN use sentence-level TF-IDF (each sentence = one document)
  - Retain terms with TF-IDF score above the 60th percentile
- **Phrasal extraction**:
  - Extract bigrams and trigrams using a sliding window
  - Score multi-word expressions using PMI (retain PMI > 3.0)
  - Apply POS pattern matching for noun phrases: (ADJ)* (NOUN)+
  - Merge overlapping n-grams into the longest meaningful phrase
- **Named entity extraction**:
  - Run NER to identify PERSON, ORGANIZATION, LOCATION, TECHNOLOGY, EVENT, CONCEPT, PRODUCT entities
  - Lock entity spans as atomic keyword units — do not split
  - Assign entity type metadata to each extracted entity keyword
- **Semantic extraction**:
  - Generate embeddings for all candidate keywords
  - Identify implicit concepts by finding high-centrality nodes in the keyword similarity graph
  - IF a concept is implied by 3+ explicit keywords (similarity > 0.7) but not stated THEN add as a semantic-level keyword with confidence = "medium"
  - Check for negation context: IF keyword appears in a negated clause THEN flag as "negated" and reduce score by 50%

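The statistical core of Step 2 (sentence-level TF-IDF as the no-corpus fallback, and PMI scoring for bigrams) can be sketched as follows. Whitespace tokenization is a simplifying assumption; a real pipeline would tokenize and lemmatize first:

```python
import math
from collections import Counter

def sentence_tfidf(sentences, stopwords=frozenset()):
    """Sentence-level TF-IDF fallback: each sentence is treated as one document."""
    docs = [[t for t in s.lower().split() if t.isalpha() and t not in stopwords]
            for s in sentences]
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document (sentence) frequency
    tf = Counter(t for d in docs for t in d)       # raw term frequency
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

def pmi_bigrams(tokens, min_pmi=3.0):
    """Score adjacent bigrams with pointwise mutual information; keep PMI > min_pmi."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    scored = {}
    for (a, b), count in bi.items():
        # PMI = log2( P(a,b) / (P(a) * P(b)) )
        pmi = math.log2((count / (n - 1)) / ((uni[a] / n) * (uni[b] / n)))
        if pmi > min_pmi:
            scored[(a, b)] = pmi
    return scored
```

A term appearing once in one sentence ("dog") outscores a term spread across sentences ("cat"), which is the intended behavior of treating sentences as documents.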
## Step 3: Clustering & Topic Assignment
- Build a keyword similarity matrix using embedding cosine similarity
- Apply agglomerative clustering with similarity threshold 0.65:
  - Each keyword starts as its own cluster
  - Iteratively merge the two most similar clusters until no pair exceeds the threshold
- Post-process clusters:
  - IF cluster size = 1 AND keyword score < 50 THEN merge into the nearest cluster
  - IF cluster size > 15 THEN attempt to split into sub-clusters at threshold 0.75
  - IF two clusters share > 50% of their top-5 keywords THEN merge them
- Name each cluster:
  - Primary label = highest-scoring keyword in the cluster
  - Secondary label = second-highest-scoring keyword (provides disambiguation)
- Assign topic hierarchy:
  - Map cluster labels to domain taxonomy paths from knowledge/domain.md
  - IF taxonomy match confidence < 0.5 THEN label as "uncategorized" with nearest taxonomy suggestion
- Validate coverage:
  - Compute what percentage of the text's sentences contain at least one keyword from any cluster
  - IF coverage < 80% THEN re-examine uncovered sentences for missed keywords and add them

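The agglomerative loop above can be sketched minimally, assuming keyword embeddings are already available as vectors. Average linkage is an assumption here; the strategy does not fix a linkage criterion:

```python
import numpy as np

def agglomerative_clusters(vectors, threshold=0.65):
    """Merge clusters of embedding indices until no pair exceeds the
    cosine-similarity threshold (average linkage)."""
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    vecs = [v / np.linalg.norm(v) for v in vecs]  # unit vectors: dot = cosine
    clusters = [[i] for i in range(len(vecs))]

    def sim(a, b):
        # average pairwise cosine similarity between two clusters
        return float(np.mean([vecs[i] @ vecs[j] for i in a for j in b]))

    while len(clusters) > 1:
        best, bi, bj = max(
            (sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best <= threshold:
            break  # no remaining pair exceeds the threshold
        clusters[bi].extend(clusters.pop(bj))
    return clusters
```

The O(n³) pair scan is fine for the dozens of keywords a single document yields; a production variant would cache the similarity matrix from the step above.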
## Step 4: Ranking & Scoring
- Compute the composite score for each keyword using the formula from knowledge/best-practices.md:

```
Score(keyword) = 0.30*StatScore + 0.30*SemScore + 0.20*PosScore + 0.20*DomScore
```

- **StatScore**: TF-IDF normalized to 0-1 range
- **SemScore**: Graph centrality score (eigenvector centrality) normalized to 0-1
- **PosScore**: Maximum positional weight multiplier for any occurrence of the keyword
- **DomScore**: 1.0 if the keyword matches a domain taxonomy term, 0.5 if partial match, 0.2 if no match
- Normalize all scores to 0-100 scale (top keyword = 100)
- Classify keywords:
  - **Primary** (score >= 70): Core topic keywords — always included in output
  - **Secondary** (score 40-69): Supporting concepts — included by default
  - **Tertiary** (score < 40): Peripheral terms — included only on request
- Apply semantic deduplication:
  - IF two keywords have embedding similarity > 0.85 THEN merge under the more specific or more frequent form
  - Preserve the merged term as an alias in the output
- Re-rank within each cluster by score descending
- Verify keyword count against guidelines from knowledge/best-practices.md:
  - IF count exceeds the recommended maximum THEN prune lowest-scoring tertiary keywords
  - IF count is below the recommended minimum THEN lower the extraction threshold and re-extract

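The composite formula, the 0-100 normalization (top keyword = 100), and the primary/secondary/tertiary cutoffs above can be sketched as:

```python
def composite_score(stat, sem, pos, dom):
    """Weighted composite: 0.30 statistical + 0.30 semantic + 0.20 positional + 0.20 domain."""
    return 0.30 * stat + 0.30 * sem + 0.20 * pos + 0.20 * dom

def rank_keywords(raw):
    """raw: {keyword: (stat, sem, pos, dom)}, each component in [0, 1].

    Returns {keyword: (score_0_to_100, level)} with the top keyword at 100.
    """
    scores = {k: composite_score(*c) for k, c in raw.items()}
    top = max(scores.values())
    ranked = {}
    for k, s in scores.items():
        scaled = round(100 * s / top, 1)
        level = ("primary" if scaled >= 70
                 else "secondary" if scaled >= 40
                 else "tertiary")
        ranked[k] = (scaled, level)
    return ranked
```

The component values fed in here are the normalized StatScore/SemScore/PosScore/DomScore defined above; producing them is the job of Steps 2-3.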
## Step 5: Domain Contextualization & Output
- For each keyword, enrich with domain context:
  - Map to taxonomy path (if available)
  - Note relationships to other keywords: "broader than", "narrower than", "related to"
  - IF the keyword is a named entity THEN include entity type
- Assemble structured output:
  - Group keywords by topic cluster
  - Within each cluster, provide:
    - **Cluster label** and taxonomy path
    - **Cluster summary**: One sentence describing the sub-topic
    - **Keywords**: Ordered by score, each with: term, score, level (lexical/phrasal/semantic), type, confidence
  - After all clusters, provide:
    - **Coverage metric**: Percentage of text content addressed by extracted keywords
    - **Keyword count**: Total primary / secondary / tertiary breakdown
- SELF-CHECK against knowledge/anti-patterns.md:
  - Are there any stopwords in the output? IF yes THEN remove
  - Are any named entities fragmented? IF yes THEN reassemble
  - Are there synonym duplicates? IF yes THEN merge
  - Is the output a flat list without clustering? IF yes THEN re-cluster
  - Are scores included for all keywords? IF no THEN add scores
  - Is coverage below 80%? IF yes THEN loop back to Step 2 with lower thresholds
  - IF any check fails THEN fix the issue before presenting output

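Several of the self-checks above can be run mechanically, assuming the structured output is a dict shaped like the cluster/keyword layout this step describes (the field names here are illustrative, not a fixed schema):

```python
def self_check(output, stopwords, min_coverage=0.80):
    """Run mechanical pre-output checks; returns the names of failed checks."""
    failures = []
    keywords = [kw["term"]
                for cluster in output["clusters"]
                for kw in cluster["keywords"]]
    # Check: no stopwords in the output
    if any(t.lower() in stopwords for t in keywords):
        failures.append("stopword-in-output")
    # Check: not a flat list when there is enough material to cluster
    if len(output["clusters"]) <= 1 and len(keywords) > 5:
        failures.append("flat-list-no-clustering")
    # Check: every keyword carries a score
    if any("score" not in kw
           for cluster in output["clusters"]
           for kw in cluster["keywords"]):
        failures.append("missing-scores")
    # Check: coverage threshold
    if output["coverage"] < min_coverage:
        failures.append("low-coverage")
    return failures
```

The remaining checks (fragmented entities, synonym duplicates) need the NER spans and embeddings from Step 2, so they belong in the pipeline rather than a final lint pass.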
@@ -0,0 +1,476 @@
{
  "version": "0.0.1",
  "dimension": "content-understanding",
  "tasks": [
    {
      "id": "bench-easy-01",
      "difficulty": "easy",
      "description": "Extract keywords from a short product description",
      "input": "Extract the key terms from this product description:\n\nThe Sony WH-1000XM5 wireless headphones feature industry-leading noise cancellation powered by two processors and eight microphones. With 30-hour battery life, multipoint Bluetooth connection, and speak-to-chat technology, these over-ear headphones deliver exceptional audio quality through 30mm carbon fiber composite drivers. Available in black, silver, and midnight blue.",
      "rubric": [
        {
          "criterion": "Keyword Completeness",
          "weight": 0.4,
          "scoring": {
            "5": "Extracts product name (Sony WH-1000XM5), key features (noise cancellation, multipoint Bluetooth, speak-to-chat), specs (30-hour battery, 30mm drivers, carbon fiber), and category (wireless headphones, over-ear)",
            "3": "Extracts the product name and major features but misses technical specs or category terms",
            "1": "Extracts only 2-3 obvious terms",
            "0": "Fails to identify relevant keywords"
          }
        },
        {
          "criterion": "Entity Recognition",
          "weight": 0.3,
          "scoring": {
            "5": "Correctly identifies Sony as ORGANIZATION, WH-1000XM5 as PRODUCT, Bluetooth as TECHNOLOGY; preserves multi-word entities intact",
            "3": "Identifies some entities but fragments multi-word names or misclassifies types",
            "1": "No entity recognition — treats all terms as generic keywords",
            "0": "Fragments named entities into meaningless tokens"
          }
        },
        {
          "criterion": "Ranking Accuracy",
          "weight": 0.3,
          "scoring": {
            "5": "Product name and primary differentiators (noise cancellation) ranked highest; color options and generic terms ranked lowest",
            "3": "Reasonable ranking but some feature terms are over- or under-weighted",
            "1": "No meaningful ranking distinction",
            "0": "Generic terms ranked above specific product features"
          }
        }
      ],
      "expectedScoreWithout": 40,
      "expectedScoreWith": 80
    },
    {
      "id": "bench-easy-02",
      "difficulty": "easy",
      "description": "Extract keywords from a news headline and summary",
      "input": "Extract keywords from this news summary:\n\nThe European Central Bank held interest rates steady at 4.5% on Thursday, signaling that rate cuts could begin in June if inflation continues to fall toward the 2% target. ECB President Christine Lagarde noted that wage growth is moderating and the eurozone economy shows signs of bottoming out after a period of stagnation.",
      "rubric": [
        {
          "criterion": "Topic Identification",
          "weight": 0.4,
          "scoring": {
            "5": "Identifies core topics: monetary policy, interest rates, inflation, eurozone economy; extracts key entities: ECB, Christine Lagarde, European Central Bank",
            "3": "Identifies main topic (interest rates) but misses supporting economic concepts",
            "1": "Extracts generic financial terms without capturing the specific topic",
            "0": "Fails to identify the article's subject"
          }
        },
        {
          "criterion": "Specificity",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts specific values and terms: 4.5%, 2% target, rate cuts, wage growth, stagnation; preserves numeric context",
            "3": "Extracts key terms but loses specific values and quantitative context",
            "1": "Only general terms like 'economy' and 'bank'",
            "0": "No domain-specific terms extracted"
          }
        },
        {
          "criterion": "Cluster Coherence",
          "weight": 0.3,
          "scoring": {
            "5": "Groups into logical clusters: Monetary Policy (interest rates, rate cuts, ECB), Economic Indicators (inflation, wage growth, stagnation), Entities (ECB, Lagarde, eurozone)",
            "3": "Some grouping present but clusters mix unrelated concepts",
            "1": "No clustering — flat keyword list",
            "0": "Incoherent grouping"
          }
        }
      ],
      "expectedScoreWithout": 40,
      "expectedScoreWith": 80
    },
    {
      "id": "bench-easy-03",
      "difficulty": "easy",
      "description": "Extract keywords from a recipe",
      "input": "Extract the key terms and topics from this recipe:\n\nClassic Beef Bourguignon\nSlow-braised beef chuck in a rich Burgundy wine sauce with pearl onions, mushrooms, and bacon lardons. This traditional French stew combines aromatics — carrots, celery, garlic, and thyme — with a flour-thickened stock base. Braise at 325°F for 2.5 hours until the beef is fork-tender. Serve over egg noodles or crusty bread. Pairs well with a medium-bodied Pinot Noir.",
      "rubric": [
        {
          "criterion": "Domain Keyword Coverage",
          "weight": 0.4,
          "scoring": {
            "5": "Extracts dish name (Beef Bourguignon), technique (slow-braised, braise), key ingredients (beef chuck, Burgundy wine, pearl onions, mushrooms, bacon lardons), aromatics, and serving suggestions",
            "3": "Extracts main ingredients and dish name but misses cooking techniques or serving context",
            "1": "Only extracts 3-4 obvious food items",
            "0": "Fails to extract culinary-relevant keywords"
          }
        },
        {
          "criterion": "Categorization",
          "weight": 0.3,
          "scoring": {
            "5": "Categorizes keywords into: Dish Identity (Beef Bourguignon, French stew), Techniques (slow-braised, braise, flour-thickened), Main Ingredients, Aromatics, Serving",
            "3": "Some categorization but mixes techniques with ingredients",
            "1": "No categorization attempted",
            "0": "Incorrect categorization"
          }
        },
        {
          "criterion": "Semantic Understanding",
          "weight": 0.3,
          "scoring": {
            "5": "Recognizes 'Beef Bourguignon' as a single entity (not fragmented); identifies 'French cuisine' as an implied topic; understands 'fork-tender' as a doneness descriptor, not a keyword",
            "3": "Preserves some multi-word terms but includes non-informative descriptors as keywords",
            "1": "Fragments multi-word expressions; includes stopwords",
            "0": "No semantic understanding demonstrated"
          }
        }
      ],
      "expectedScoreWithout": 35,
      "expectedScoreWith": 75
    },
    {
      "id": "bench-med-01",
      "difficulty": "medium",
      "description": "Extract keywords from a technical research abstract with domain-specific terminology",
      "input": "Extract keywords and identify topics from this research abstract:\n\nWe present a novel approach to few-shot text classification using prompt-tuning with retrieval-augmented generation (RAG). Our method combines a frozen large language model (LLM) with a learned continuous prompt and a non-parametric retrieval module that fetches semantically similar examples from a labeled datastore at inference time. On the GLUE and SuperGLUE benchmarks, our approach achieves state-of-the-art performance with only 16 labeled examples per class, outperforming full fine-tuning by 3.2% on average. We demonstrate that the retrieval component reduces hallucination by 47% compared to prompt-only baselines, as measured by a factual consistency score. Ablation studies reveal that prompt length (20 tokens), retrieval top-k (k=5), and temperature scaling are the most impactful hyperparameters. Our code and models are available at github.com/example/rag-fewshot.",
      "rubric": [
        {
          "criterion": "Technical Keyword Precision",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts all key technical terms: few-shot text classification, prompt-tuning, RAG, LLM, continuous prompt, non-parametric retrieval, GLUE, SuperGLUE, fine-tuning, hallucination, ablation study, hyperparameters, temperature scaling; preserves multi-word terms intact",
            "3": "Extracts most technical terms but fragments some multi-word concepts or misses benchmark names",
            "1": "Extracts generic ML terms but misses the specific methodology and metrics",
            "0": "Fails to extract domain-specific terminology"
          }
        },
        {
          "criterion": "Topic Clustering Quality",
          "weight": 0.3,
          "scoring": {
            "5": "Creates coherent clusters: Methodology (prompt-tuning, RAG, continuous prompt, retrieval), Evaluation (GLUE, SuperGLUE, few-shot, state-of-the-art), Model Architecture (LLM, frozen model, non-parametric), Experimental Setup (ablation, hyperparameters, top-k, temperature)",
            "3": "Clusters are present but some terms are misassigned or clusters are too broad",
            "1": "Minimal clustering with poor coherence",
            "0": "No clustering"
          }
        },
        {
          "criterion": "Implicit Concept Detection",
          "weight": 0.2,
          "scoring": {
            "5": "Identifies implied topics not explicitly stated: NLP, transfer learning, in-context learning, semantic search, embedding similarity; flags these as semantic-level with appropriate confidence",
            "3": "Identifies 1-2 implied concepts but misses others",
            "1": "Only extracts explicitly stated terms",
            "0": "No semantic-level extraction"
          }
        },
        {
          "criterion": "Quantitative Context Preservation",
          "weight": 0.2,
          "scoring": {
            "5": "Preserves key quantitative context: 16 examples per class, 3.2% improvement, 47% hallucination reduction, 20 tokens, k=5; associates metrics with the correct terms",
            "3": "Some quantitative context preserved but not linked to correct terms",
            "1": "Numbers extracted but without context",
            "0": "Quantitative context lost"
          }
        }
      ],
      "expectedScoreWithout": 30,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-med-02",
      "difficulty": "medium",
      "description": "Extract keywords from a legal contract clause with domain-specific language",
      "input": "Extract key terms and legal concepts from this contract clause:\n\nIndemnification and Limitation of Liability. The Service Provider shall indemnify, defend, and hold harmless the Client, its officers, directors, employees, and agents from and against any and all claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising out of or relating to (a) any breach of this Agreement by the Service Provider, (b) any negligent or wrongful act or omission of the Service Provider, or (c) any violation of applicable law by the Service Provider. Notwithstanding the foregoing, in no event shall the Service Provider's aggregate liability under this Agreement exceed the total fees paid by the Client during the twelve (12) months preceding the claim. Neither party shall be liable for any indirect, incidental, special, consequential, or punitive damages, including but not limited to loss of profits, data, or business opportunity, regardless of whether such damages were foreseeable.",
      "rubric": [
        {
          "criterion": "Legal Term Extraction",
          "weight": 0.35,
          "scoring": {
            "5": "Extracts all key legal terms: indemnify, hold harmless, limitation of liability, breach, negligent act, aggregate liability, consequential damages, punitive damages, attorneys' fees, foreseeable damages; preserves legal phrases intact",
            "3": "Extracts major legal terms but misses nuanced concepts like 'hold harmless' or 'aggregate liability'",
            "1": "Extracts only 3-4 obvious legal terms",
            "0": "Fails to identify legal-specific terminology"
          }
        },
        {
          "criterion": "Concept Hierarchy",
          "weight": 0.3,
          "scoring": {
            "5": "Identifies the hierarchy: Indemnification (broader) containing breach, negligence, legal violation (triggers); Liability Cap (aggregate liability, twelve months, total fees); Damage Exclusions (indirect, consequential, punitive, loss of profits)",
            "3": "Groups some related terms but misses the hierarchical relationship between indemnification triggers and the liability cap",
            "1": "Flat list with no structural understanding",
            "0": "No grouping"
          }
        },
        {
          "criterion": "Domain Contextualization",
          "weight": 0.2,
          "scoring": {
            "5": "Maps terms to contract law domain; identifies clause type (indemnification + liability limitation); notes that this is a standard commercial contract provision",
            "3": "Identifies the legal domain but doesn't classify the clause type",
            "1": "No domain context provided",
            "0": "Misclassifies the domain"
          }
        },
        {
          "criterion": "Boilerplate Filtering",
          "weight": 0.15,
          "scoring": {
            "5": "Correctly filters out boilerplate words (herein, notwithstanding, foregoing, including but not limited to) and does not include them as keywords",
            "3": "Filters some boilerplate but includes 1-2 legal filler phrases as keywords",
            "1": "Includes multiple boilerplate terms as keywords",
            "0": "No boilerplate filtering"
          }
        }
      ],
      "expectedScoreWithout": 25,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-med-03",
      "difficulty": "medium",
      "description": "Extract keywords from a multilingual product review with sentiment signals",
      "input": "Extract keywords and identify the main topics discussed in these product reviews:\n\nReview 1: 'The battery life on the Pixel 8 Pro is outstanding — easily lasts 2 days with moderate use. The Tensor G3 chip handles AI features like Magic Eraser and Best Take flawlessly. Camera quality rivals the iPhone 15 Pro, especially in low-light with Night Sight. However, the 6.7-inch display feels too large for one-handed use.'\n\nReview 2: 'Disappointed with the fingerprint sensor — it fails about 30% of the time. Software updates are frequent but sometimes introduce new bugs. The 50MP main camera produces oversaturated colors in auto mode. On the positive side, 7 years of OS updates is a huge commitment and the $999 price point is competitive for a flagship.'\n\nReview 3: 'Coming from a Samsung Galaxy S23, the Pixel's stock Android experience is refreshingly clean. Google Assistant integration is leagues ahead of Bixby. The 120Hz LTPO display is smooth, though peak brightness could be higher. Wireless charging and IP68 water resistance are expected at this price tier.'",
      "rubric": [
        {
          "criterion": "Cross-Review Keyword Synthesis",
          "weight": 0.3,
          "scoring": {
            "5": "Synthesizes keywords across all 3 reviews: identifies recurring themes (camera, battery, display, software), consolidates product references (Pixel 8 Pro), and captures both positive and negative aspect terms",
            "3": "Extracts keywords from each review but doesn't synthesize or identify recurring themes",
            "1": "Extracts keywords from only 1-2 reviews",
            "0": "Fails to handle multi-document input"
          }
        },
        {
          "criterion": "Feature-Level Clustering",
          "weight": 0.3,
          "scoring": {
            "5": "Clusters by product features: Camera (50MP, Night Sight, Magic Eraser, low-light), Display (6.7-inch, 120Hz, LTPO), Performance (Tensor G3, AI features), Battery, Software (stock Android, OS updates), Biometrics (fingerprint sensor), Build (IP68, wireless charging)",
            "3": "Some feature-level clustering but merges distinct features or misses categories",
            "1": "Minimal clustering; most keywords ungrouped",
            "0": "No feature clustering"
          }
        },
        {
          "criterion": "Comparative Entity Handling",
          "weight": 0.2,
          "scoring": {
            "5": "Correctly identifies compared products (Pixel 8 Pro, iPhone 15 Pro, Samsung Galaxy S23) and technologies (Tensor G3 vs Bixby vs Google Assistant) as distinct entity keywords; doesn't conflate them",
            "3": "Identifies most product entities but misses technology comparisons",
            "1": "Fragments product names or misses comparison context",
            "0": "No entity recognition"
          }
        },
        {
          "criterion": "Sentiment-Aware Extraction",
          "weight": 0.2,
          "scoring": {
            "5": "Notes sentiment polarity for key feature keywords: battery (positive), fingerprint sensor (negative), camera (mixed), software updates (mixed); does not extract sentiment words themselves as topic keywords",
            "3": "Some sentiment awareness but includes sentiment adjectives (outstanding, disappointed) as topic keywords",
            "1": "No sentiment distinction — treats positive and negative aspects identically",
            "0": "Sentiment interferes with keyword quality"
          }
        }
      ],
      "expectedScoreWithout": 30,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-med-04",
      "difficulty": "medium",
      "description": "Extract keywords from a policy document with nested hierarchical topics",
      "input": "Extract key terms and topic hierarchy from this policy summary:\n\nThe EU Artificial Intelligence Act establishes a risk-based regulatory framework for AI systems across the European Union. AI systems are classified into four risk tiers: (1) Unacceptable Risk — banned outright, including social scoring by governments, real-time biometric surveillance in public spaces, and manipulative AI targeting vulnerable groups; (2) High Risk — subject to strict requirements including conformity assessments, human oversight, data governance, and transparency obligations, covering areas such as employment screening, credit scoring, law enforcement, and critical infrastructure; (3) Limited Risk — requiring transparency measures such as disclosure that content is AI-generated, applicable to chatbots and deepfake generators; (4) Minimal Risk — freely permitted with no special requirements, covering spam filters, AI-enabled video games, and recommendation systems. General-purpose AI models (GPAI) face additional obligations including technical documentation, copyright compliance, and energy consumption reporting. Penalties for non-compliance reach up to 35 million euros or 7% of global annual turnover.",
      "rubric": [
        {
          "criterion": "Hierarchical Term Extraction",
          "weight": 0.35,
          "scoring": {
            "5": "Extracts the full hierarchy: EU AI Act (top-level), four risk tiers with their specific examples, GPAI obligations, and penalties; preserves the nested structure of risk categories and their associated requirements",
            "3": "Extracts the risk tiers but misses specific examples within each tier or loses the hierarchical relationship",
            "1": "Extracts flat terms like 'AI', 'risk', 'regulation' without hierarchy",
            "0": "Fails to identify the regulatory structure"
          }
        },
        {
          "criterion": "Domain Precision",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts precise regulatory terms: conformity assessment, human oversight, data governance, transparency obligations, social scoring, biometric surveillance, GPAI, global annual turnover; no over-generalization",
            "3": "Extracts main regulatory concepts but generalizes some specific terms",
            "1": "Only generic terms like 'regulation' and 'compliance'",
            "0": "Misidentifies or omits domain-specific terms"
          }
        },
        {
          "criterion": "Coverage Completeness",
          "weight": 0.2,
          "scoring": {
            "5": "Covers all four risk tiers, GPAI provisions, and penalty structure; no major section of the text is left without keyword representation",
            "3": "Covers 3 of the 4 risk tiers and most provisions",
            "1": "Covers only 1-2 aspects of the policy",
            "0": "Major gaps in coverage"
          }
        },
        {
          "criterion": "Cluster Organization",
          "weight": 0.15,
          "scoring": {
            "5": "Keywords organized into clusters matching the policy structure: Risk Classification, Requirements & Obligations, Prohibited Practices, GPAI Rules, Enforcement & Penalties",
            "3": "Some organization but clusters don't align with the policy's natural structure",
            "1": "Flat list despite clear hierarchical content",
            "0": "No organization"
          }
        }
      ],
      "expectedScoreWithout": 30,
      "expectedScoreWith": 70
    },
    {
      "id": "bench-hard-01",
      "difficulty": "hard",
      "description": "Extract keywords from a highly technical cross-disciplinary text spanning biology and computer science",
      "input": "Extract keywords and map topic clusters from this cross-disciplinary passage:\n\nProtein structure prediction has been revolutionized by deep learning approaches, particularly AlphaFold2's use of attention mechanisms and multiple sequence alignments (MSAs) to achieve atomic-level accuracy. The model's Evoformer module processes evolutionary relationships between amino acid residues through axial attention, while the Structure Module generates 3D coordinates via invariant point attention (IPA). Training on the Protein Data Bank (PDB) with distillation from self-predictions, AlphaFold2 achieves a median GDT-TS score of 92.4 on CASP14 free-modeling targets.\n\nRecent extensions include ESMFold, which eliminates the MSA requirement using a protein language model (pLM) pretrained on 65 million sequences, and RoseTTAFold All-Atom, which extends predictions to protein-ligand and protein-nucleic acid complexes. Diffusion-based approaches like RFdiffusion enable de novo protein design by sampling from learned structure distributions, opening applications in enzyme engineering, drug delivery scaffolds, and biosensor design.\n\nCritically, confidence metrics such as pLDDT (predicted local distance difference test) and PAE (predicted aligned error) allow researchers to assess prediction reliability per-residue, distinguishing well-folded domains from intrinsically disordered regions (IDRs).",
      "rubric": [
        {
          "criterion": "Cross-Domain Keyword Precision",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts terms from both domains: Biology (protein structure, amino acid residues, MSA, PDB, enzyme engineering, intrinsically disordered regions) and CS/ML (deep learning, attention mechanisms, axial attention, diffusion models, language model); preserves all acronyms with expansions",
            "3": "Extracts terms from both domains but misses specialized terms in one domain",
            "1": "Extracts terms from only one domain",
            "0": "Fails to identify domain-specific terminology"
          }
        },
        {
          "criterion": "Named System Recognition",
          "weight": 0.25,
          "scoring": {
            "5": "Correctly identifies all named systems as atomic entities: AlphaFold2, Evoformer, Structure Module, ESMFold, RoseTTAFold All-Atom, RFdiffusion, CASP14, PDB; classifies each appropriately (model/benchmark/database)",
            "3": "Identifies most named systems but fragments some (e.g., 'RoseTTAFold' and 'All-Atom' as separate keywords)",
            "1": "Identifies only the most prominent names (AlphaFold2)",
            "0": "Fragments or misidentifies named systems"
          }
        },
        {
          "criterion": "Topic Cluster Coherence",
          "weight": 0.25,
          "scoring": {
            "5": "Creates coherent cross-disciplinary clusters: Structural Biology Methods (MSA, PDB, GDT-TS, residues), Deep Learning Architecture (attention, Evoformer, IPA, diffusion), Prediction Systems (AlphaFold2, ESMFold, RFdiffusion), Applications (enzyme engineering, drug delivery, biosensors), Confidence Metrics (pLDDT, PAE, IDR)",
            "3": "Reasonable clusters but some terms misassigned across biology/CS boundaries",
            "1": "Only broad 'biology' and 'CS' clusters without nuance",
            "0": "No meaningful clustering"
          }
        },
        {
          "criterion": "Metric and Acronym Handling",
          "weight": 0.2,
          "scoring": {
            "5": "Preserves all metrics (GDT-TS 92.4, pLDDT, PAE) and acronyms (MSA, IPA, pLM, IDR) with their expansions; associates metrics with correct models",
            "3": "Preserves most acronyms but misses expansions or metric associations",
            "1": "Some acronyms extracted but without context",
            "0": "Acronyms fragmented or omitted"
          }
        }
      ],
      "expectedScoreWithout": 20,
      "expectedScoreWith": 65
    },
    {
      "id": "bench-hard-02",
      "difficulty": "hard",
      "description": "Extract keywords from an ambiguous text with negations, hypotheticals, and contrasting arguments",
      "input": "Extract keywords from this analytical essay, handling negations and contrasting viewpoints correctly:\n\nThe claim that remote work universally increases productivity is not supported by the evidence. While studies from Stanford (Bloom et al., 2015) showed a 13% performance increase for call center employees, these results do not generalize to creative or collaborative roles. In fact, Microsoft's analysis of 61,000 employees found that cross-team collaboration decreased by 25% during remote work periods.\n\nIt would be misleading to conclude that office work is therefore superior. Hybrid models — combining 2-3 office days with remote flexibility — appear to optimize both individual focus time and team cohesion. However, this finding does not account for industries with strict security requirements (defense, healthcare) where remote access is fundamentally constrained.\n\nNotably, the productivity debate ignores several confounding variables: commute time savings (averaging 41 minutes per day in US metro areas), real estate cost reduction for employers ($11,000 per employee annually), and the environmental impact of reduced commuting (estimated 54 million tons of CO2 equivalent annually). These factors may outweigh marginal productivity differences in either direction.\n\nCritics argue that long-term innovation suffers without serendipitous in-person interactions, but no longitudinal study has yet confirmed this hypothesis with statistical significance.",
      "rubric": [
        {
          "criterion": "Negation-Aware Extraction",
          "weight": 0.3,
          "scoring": {
            "5": "Correctly handles negations: 'remote work increases productivity' is identified as a CONTESTED claim, not an affirmed keyword; 'does not generalize' is not used to assert generalizability; 'no longitudinal study has confirmed' flags innovation impact as UNPROVEN; negated terms are either excluded or clearly marked",
            "3": "Handles some negations but treats 1-2 negated claims as affirmed keywords",
            "1": "Ignores negation context entirely — extracts 'productivity increase' as a positive keyword",
            "0": "Negation blindness — all terms extracted as if affirmed"
          }
        },
        {
          "criterion": "Argumentative Structure Keywords",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts keywords that reflect the debate structure: remote work productivity (contested), hybrid models (proposed solution), cross-team collaboration (declined), confounding variables (reframing), serendipitous interactions (hypothesis); captures the nuance of each position",
            "3": "Extracts topic terms but loses the argumentative nuance — doesn't distinguish claims from counter-claims",
            "1": "Extracts only surface-level terms without argumentative context",
            "0": "Misrepresents the text's argumentative positions"
          }
        },
        {
          "criterion": "Quantitative Evidence Linking",
          "weight": 0.2,
          "scoring": {
            "5": "Links statistics to correct claims: 13% increase (Stanford/call center), 25% decrease (Microsoft/collaboration), 41 min commute savings, $11K cost reduction, 54M tons CO2; associates each with its source and scope",
            "3": "Extracts most statistics but doesn't link all to their correct claims or sources",
            "1": "Extracts some numbers without context",
|
|
408
|
+
"0": "Quantitative data lost or misattributed"
|
|
409
|
+
}
|
|
410
|
+
},
|
|
411
|
+
{
|
|
412
|
+
"criterion": "Topic Coverage Despite Complexity",
|
|
413
|
+
"weight": 0.2,
|
|
414
|
+
"scoring": {
|
|
415
|
+
"5": "Covers all sub-topics: productivity evidence, collaboration impact, hybrid models, industry constraints, economic factors, environmental impact, innovation concerns; organizes into coherent clusters",
|
|
416
|
+
"3": "Covers 4-5 sub-topics but misses 1-2 or clusters poorly",
|
|
417
|
+
"1": "Covers only the main topic (remote work) without sub-topic differentiation",
|
|
418
|
+
"0": "Major gaps in topic coverage"
|
|
419
|
+
}
|
|
420
|
+
}
|
|
421
|
+
],
|
|
422
|
+
"expectedScoreWithout": 20,
|
|
423
|
+
"expectedScoreWith": 60
|
|
424
|
+
},
|
|
425
|
+
{
|
|
426
|
+
"id": "bench-hard-03",
|
|
427
|
+
"difficulty": "hard",
|
|
428
|
+
"description": "Extract keywords from a dense technical specification with overlapping terminology across sections",
|
|
429
|
+
"input": "Extract and deduplicate keywords from this API specification:\n\nEndpoint: POST /v2/models/{model_id}/predictions\nAuthentication: Bearer token (JWT, RS256 signed, 1-hour expiry)\nRate Limit: 100 requests/minute per API key, burst allowance of 20 requests/second\n\nRequest Body:\n- model_id (string, required): The unique identifier of the deployed model. Supports versioned IDs (e.g., 'gpt-4-turbo-2024-04-09').\n- messages (array, required): Conversation history. Each message object contains 'role' (system|user|assistant) and 'content' (string, max 128K tokens).\n- temperature (float, optional, default 1.0): Sampling temperature. Range [0.0, 2.0]. Lower values produce more deterministic outputs.\n- top_p (float, optional, default 1.0): Nucleus sampling parameter. Mutually exclusive with temperature when both differ from defaults.\n- max_tokens (integer, optional): Maximum tokens in the response. Model-dependent upper bound.\n- stream (boolean, optional, default false): Enable server-sent events (SSE) for streaming responses.\n- tools (array, optional): Function calling definitions. Each tool specifies 'name', 'description', and 'parameters' (JSON Schema).\n\nResponse:\n- id (string): Unique prediction identifier (UUID v4)\n- choices (array): Model outputs. Each choice contains 'message' (role + content), 'finish_reason' (stop|length|tool_calls|content_filter), and 'index'.\n- usage (object): Token consumption — 'prompt_tokens', 'completion_tokens', 'total_tokens'.\n- model (string): Actual model used (may differ from requested if aliased).\n\nError Codes: 400 (invalid request), 401 (authentication failed), 429 (rate limited), 500 (internal error), 503 (model unavailable).",
|
|
430
|
+
"rubric": [
|
|
431
|
+
{
|
|
432
|
+
"criterion": "API Concept Extraction",
|
|
433
|
+
"weight": 0.3,
|
|
434
|
+
"scoring": {
|
|
435
|
+
"5": "Extracts all key API concepts: endpoint, authentication (JWT, RS256, Bearer token), rate limiting, request/response schema, streaming (SSE), function calling, token counting; preserves technical precision",
|
|
436
|
+
"3": "Extracts most API concepts but misses technical details like RS256 or SSE",
|
|
437
|
+
"1": "Extracts only broad concepts like 'API' and 'authentication'",
|
|
438
|
+
"0": "Fails to identify API-specific terminology"
|
|
439
|
+
}
|
|
440
|
+
},
|
|
441
|
+
{
|
|
442
|
+
"criterion": "Parameter Keyword Organization",
|
|
443
|
+
"weight": 0.25,
|
|
444
|
+
"scoring": {
|
|
445
|
+
"5": "Organizes parameters into logical clusters: Sampling Control (temperature, top_p, max_tokens), Input Structure (messages, model_id, roles), Output Format (choices, finish_reason, usage), Configuration (stream, tools); distinguishes request from response terms",
|
|
446
|
+
"3": "Groups some parameters but mixes request and response terms or misclassifies",
|
|
447
|
+
"1": "Flat list of parameter names without grouping",
|
|
448
|
+
"0": "Parameters not extracted as keywords"
|
|
449
|
+
}
|
|
450
|
+
},
|
|
451
|
+
{
|
|
452
|
+
"criterion": "Deduplication Across Sections",
|
|
453
|
+
"weight": 0.25,
|
|
454
|
+
"scoring": {
|
|
455
|
+
"5": "Correctly deduplicates: 'model_id' appears in endpoint path and request body — extracted once; 'tokens' appears in multiple contexts (max_tokens, prompt_tokens, 128K tokens) — unified under a token-related cluster; 'role' in messages and choices merged",
|
|
456
|
+
"3": "Some deduplication but 1-2 terms appear redundantly",
|
|
457
|
+
"1": "No deduplication — same terms appear multiple times from different sections",
|
|
458
|
+
"0": "Excessive duplication"
|
|
459
|
+
}
|
|
460
|
+
},
|
|
461
|
+
{
|
|
462
|
+
"criterion": "Technical Precision",
|
|
463
|
+
"weight": 0.2,
|
|
464
|
+
"scoring": {
|
|
465
|
+
"5": "Preserves precise technical terms: UUID v4, JSON Schema, SSE, RS256, JWT, nucleus sampling, server-sent events; includes constraint values where relevant (128K tokens, [0.0, 2.0] range)",
|
|
466
|
+
"3": "Preserves most technical terms but loses some precision or constraint context",
|
|
467
|
+
"1": "Generalizes technical terms (e.g., 'identifier' instead of 'UUID v4')",
|
|
468
|
+
"0": "Technical precision lost"
|
|
469
|
+
}
|
|
470
|
+
}
|
|
471
|
+
],
|
|
472
|
+
"expectedScoreWithout": 20,
|
|
473
|
+
"expectedScoreWith": 60
|
|
474
|
+
}
|
|
475
|
+
]
|
|
476
|
+
}
|
package/tests/smoke.json
ADDED
@@ -0,0 +1,54 @@
{
  "version": "0.0.1",
  "timeout": 60,
  "tasks": [
    {
      "id": "smoke-01",
      "description": "Extract keywords from a technical article about cloud-native architecture with topic clustering and domain contextualization",
      "input": "Extract the key terms and topics from the following article:\n\nCloud-native architecture has transformed how organizations build and deploy software at scale. By leveraging containerization through Docker and orchestration via Kubernetes, teams can achieve unprecedented levels of scalability and resilience. Microservices decompose monolithic applications into independently deployable units, each with its own database and API boundary.\n\nService meshes like Istio and Linkerd provide observability, traffic management, and mutual TLS authentication between services. Combined with CI/CD pipelines using tools like GitHub Actions, ArgoCD, and Tekton, organizations can ship code multiple times per day with confidence.\n\nHowever, cloud-native adoption introduces complexity in distributed tracing, eventual consistency, and network latency. Site Reliability Engineering (SRE) practices — including error budgets, service level objectives (SLOs), and chaos engineering — help teams manage this operational complexity while maintaining high availability.",
      "rubric": [
        {
          "criterion": "Keyword Coverage",
          "weight": 0.3,
          "scoring": {
            "5": "Extracts all major terms: cloud-native, containerization, Docker, Kubernetes, microservices, service mesh, Istio, CI/CD, SRE, SLOs, chaos engineering, distributed tracing; covers both explicit terms and implied concepts like DevOps and infrastructure automation",
            "3": "Extracts most explicit technical terms but misses some named technologies or implied concepts",
            "1": "Extracts only a few obvious terms; misses most domain-specific terminology",
            "0": "Fails to extract meaningful keywords or returns generic words"
          }
        },
        {
          "criterion": "Topic Clustering",
          "weight": 0.25,
          "scoring": {
            "5": "Groups keywords into coherent clusters (e.g., Container Orchestration, Service Communication, CI/CD Pipeline, Reliability Engineering) with clear labels and intra-cluster coherence",
            "3": "Some clustering present but clusters are too broad or miss obvious groupings",
            "1": "Flat list with no meaningful grouping",
            "0": "No clustering attempted"
          }
        },
        {
          "criterion": "Ranking Quality",
          "weight": 0.25,
          "scoring": {
            "5": "Keywords are scored and ranked with core architectural concepts (cloud-native, microservices, Kubernetes) ranked highest; supporting tools ranked lower; scores reflect actual importance in the text",
            "3": "Rankings are present but some important terms are underweighted or generic terms are overweighted",
            "1": "Keywords listed but not meaningfully ranked",
            "0": "No ranking or scoring"
          }
        },
        {
          "criterion": "Output Structure",
          "weight": 0.2,
          "scoring": {
            "5": "Each keyword includes: term, score, extraction level (lexical/phrasal/semantic), cluster assignment, and entity type where applicable; output is grouped by cluster with summaries",
            "3": "Keywords have scores and some metadata but missing cluster summaries or extraction level",
            "1": "Basic list of keywords with minimal metadata",
            "0": "Unstructured raw output"
          }
        }
      ],
      "passThreshold": 60
    }
  ]
}