npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0


Files changed (415)

package/skills/domains/ai-ml/ai-agent-papers-guide/SKILL.md ADDED

@@ -0,0 +1,146 @@
+---
+name: ai-agent-papers-guide
+description: "Curated 2024-2026 AI agent research papers collection"
+metadata:
+  openclaw:
+    emoji: "📑"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["AI agents", "agent papers", "2024 research", "LLM agents", "agent frameworks", "survey"]
+    source: "https://github.com/VoltAgent/awesome-ai-agent-papers"
+---
+# AI Agent Papers Guide (2024-2026)
+## Overview
+A focused collection of AI agent research papers from 2024-2026 that tracks developments in LLM-based agent systems. Unlike broader collections, it concentrates on recent work: new architectures, benchmarks, multi-agent coordination, and real-world applications. The source list is updated frequently because the field moves quickly.
+## Paper Categories
+```
+Recent AI Agent Research
+├── Agent Architectures
+│   ├── Planning (o1-style reasoning, search-augmented)
+│   ├── Memory (long-term, episodic, working)
+│   └── Tool use (function calling, code execution)
+├── Multi-Agent Systems
+│   ├── Collaboration (task decomposition, debate)
+│   ├── Competition (red team, adversarial)
+│   └── Emergence (self-organization, culture)
+├── Evaluation
+│   ├── Benchmarks (SWE-bench, WebArena, GAIA)
+│   ├── Safety (jailbreak, misuse, alignment)
+│   └── Reliability (error recovery, hallucination)
+├── Applications
+│   ├── Software engineering (coding agents)
+│   ├── Scientific research (lab automation)
+│   ├── Web automation (browsing, form-filling)
+│   └── Enterprise (workflow, data analysis)
+└── Infrastructure
+    ├── Frameworks (LangGraph, CrewAI, AutoGen)
+    ├── Protocols (MCP, A2A, tool standards)
+    └── Deployment (scaling, monitoring, cost)
+```
+## Highlighted Papers (2024-2025)
+| Paper | Venue | Key Contribution |
+|-------|-------|-----------------|
+| SWE-agent | NeurIPS 2024 | Agent interface design for SE |
+| OpenHands | 2024 | Open platform for coding agents |
+| AgentBench | ICLR 2024 | Multi-environment agent benchmark |
+| GAIA | ICLR 2024 | General AI assistant benchmark |
+| Voyager | TMLR 2024 | Lifelong learning in Minecraft |
+| OS-Copilot | 2024 | Self-improving computer agent |
+| AutoGen | 2024 | Multi-agent conversation framework |
+| Agent-FLAN | ACL 2024 | Agent fine-tuning methodology |
+## Tracking New Papers
+```python
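+import arxiv
+from datetime import datetime, timedelta, timezone
+def find_recent_agent_papers(days=14):
+    """Print agent papers submitted to arXiv within the last `days` days."""
+    # Multi-word phrases are quoted so arXiv treats them as phrases
+    queries = [
+        'ti:agent AND (ti:LLM OR ti:"language model")',
+        'abs:"autonomous agent" AND abs:"tool use"',
+        'ti:multi-agent AND abs:"large language"',
+        'abs:"coding agent" OR abs:"software agent"',
+    ]
+    # r.published is timezone-aware (UTC), so the cutoff must be as well
+    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
+    client = arxiv.Client()
+    seen = set()
+    papers = []
+    for q in queries:
+        search = arxiv.Search(
+            query=q, max_results=15,
+            sort_by=arxiv.SortCriterion.SubmittedDate,
+        )
+        # Client.results() replaces the deprecated Search.results()
+        for r in client.results(search):
+            if r.published < cutoff:
+                break  # results are newest-first, so the rest are older
+            if r.entry_id not in seen:
+                seen.add(r.entry_id)
+                papers.append({
+                    "title": r.title,
+                    "date": r.published.strftime("%Y-%m-%d"),
+                    "url": r.entry_id,
+                })
+    # ISO dates sort correctly as strings
+    papers.sort(key=lambda x: x["date"], reverse=True)
+    for p in papers[:20]:
+        print(f"[{p['date']}] {p['title']}")
+        print(f"  {p['url']}")
+find_recent_agent_papers()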
+```
+## Framework Comparison
+```python
+frameworks = {
+    "LangGraph": {
+        "paradigm": "Graph-based workflows",
+        "persistence": "Built-in checkpointing",
+        "multi_agent": "Yes",
+        "language": "Python/JS",
+    },
+    "CrewAI": {
+        "paradigm": "Role-based agents",
+        "persistence": "Memory module",
+        "multi_agent": "Yes (crew)",
+        "language": "Python",
+    },
+    "AutoGen": {
+        "paradigm": "Conversational agents",
+        "persistence": "Chat history",
+        "multi_agent": "Yes (group chat)",
+        "language": "Python/.NET",
+    },
+    "OpenHands": {
+        "paradigm": "Computer use agent",
+        "persistence": "Workspace state",
+        "multi_agent": "No",
+        "language": "Python",
+    },
+}
+for name, info in frameworks.items():
+    print(f"\n{name}:")
+    for k, v in info.items():
+        print(f"  {k}: {v}")
+```
+## Use Cases
+1. **Literature tracking**: Stay current on agent research
+2. **Framework selection**: Compare agent development tools
+3. **Research planning**: Identify open problems and trends
+4. **Course material**: Teach cutting-edge agent systems
+5. **Benchmark tracking**: Compare agent capabilities
+## References
+- [awesome-ai-agent-papers](https://github.com/VoltAgent/awesome-ai-agent-papers)
+- [VoltAgent Framework](https://github.com/VoltAgent/voltagent)

package/skills/domains/ai-ml/ai-model-benchmarking/SKILL.md ADDED

@@ -0,0 +1,209 @@
+---
+name: ai-model-benchmarking
+description: "Benchmark AI models across 60+ academic evaluation suites and metrics"
+metadata:
+  openclaw:
+    emoji: "📊"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["benchmarking", "evaluation", "LLM", "MMLU", "leaderboard", "metrics", "lm-eval"]
+    source: "https://github.com/EleutherAI/lm-evaluation-harness"
+---
+# AI Model Benchmarking Guide
+## Overview
+Rigorous evaluation is the backbone of machine learning research. A model is only as credible as its evaluation protocol: which benchmarks were used, how metrics were computed, whether results are reproducible, and how they compare to baselines. The proliferation of LLMs has made this both more important and more complex, with over 60 established benchmarks and a rapidly evolving landscape.
+This guide covers the practical side of model benchmarking: how to use the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), how to select benchmarks for different research claims, how to avoid common evaluation pitfalls, and how to present results for publication. The focus is on academic rigor rather than leaderboard chasing.
+Whether you are evaluating a fine-tuned model for a paper, comparing architectures for an ablation study, or reviewing a submitted manuscript's evaluation section, these patterns will help ensure the evaluation is sound.
+## The lm-evaluation-harness
+The EleutherAI lm-evaluation-harness is the de facto standard for LLM evaluation in academic research, supporting 60+ tasks and used by most major LLM papers.
+### Installation and Basic Usage
+```bash
+# Install
+pip install lm-eval
+# Run a single benchmark
+lm_eval --model hf \
+    --model_args pretrained=meta-llama/Llama-2-7b-hf \
+    --tasks mmlu \
+    --batch_size auto \
+    --output_path results/llama2-7b/
+# Run multiple benchmarks
+lm_eval --model hf \
+    --model_args pretrained=meta-llama/Llama-2-7b-hf \
+    --tasks mmlu,hellaswag,arc_challenge,winogrande,truthfulqa_mc2 \
+    --batch_size auto \
+    --num_fewshot 5 \
+    --output_path results/llama2-7b/
+```
+### Programmatic API
+```python
+import lm_eval
+results = lm_eval.simple_evaluate(
+    model="hf",
+    model_args="pretrained=meta-llama/Llama-2-7b-hf",
+    tasks=["mmlu", "hellaswag", "arc_challenge"],
+    num_fewshot=5,
+    batch_size="auto",
+    device="cuda",
+)
+# Access results
+for task, metrics in results["results"].items():
+    print(f"{task}: {metrics}")
+```
+## Benchmark Selection by Research Claim
+| Research Claim | Required Benchmarks | Why |
+|---------------|--------------------|----|
+| General knowledge | MMLU, ARC, TriviaQA | Broad factual coverage |
+| Reasoning | GSM8K, BBH, ARC-Challenge | Multi-step logical reasoning |
+| Coding | HumanEval, MBPP, DS-1000 | Code generation and understanding |
+| Instruction following | MT-Bench, AlpacaEval, IFEval | Open-ended instruction quality |
+| Safety | TruthfulQA, ToxiGen, BBQ | Truthfulness, toxicity, bias |
+| Multilingual | MGSM, XWinograd, FLORES | Cross-lingual transfer |
+| Long context | SCROLLS, LongBench, RULER | Long document understanding |
+| Domain-specific | MedQA, LegalBench, SciQ | Professional domain knowledge |
+## Core Benchmarks Deep Dive
+### MMLU (Massive Multitask Language Understanding)
+```
+- 57 subjects: STEM, humanities, social sciences, professional
+- 14,042 test questions, multiple choice (4 options)
+- Standard: 5-shot evaluation
+- Metric: Accuracy (macro-averaged across subjects)
+- Citation: Hendrycks et al., 2021
+Score interpretation:
+  < 25%: Below random chance (4 choices = 25% baseline); check prompt format and answer parsing
+  25-40%: Near random
+  40-60%: Basic knowledge
+  60-70%: Strong general knowledge
+  70-80%: Expert-level for most subjects
+  > 80%: State-of-the-art (as of 2024)
+```
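The MMLU block above names macro-averaged accuracy as the metric. The sketch below shows how macro averaging (mean of per-subject accuracies) can differ from micro averaging (pooled over all questions). The subject names and counts are hypothetical, chosen only to make the gap visible; evaluation frameworks differ in which average they report, so it is worth stating the choice explicitly.

```python
def macro_and_micro_accuracy(per_subject):
    """Return (macro, micro) accuracy.

    per_subject maps a subject name to (num_correct, num_questions).
    Macro: every subject counts equally. Micro: every question counts equally.
    """
    macro = sum(c / n for c, n in per_subject.values()) / len(per_subject)
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(n for _, n in per_subject.values())
    micro = total_correct / total_questions
    return macro, micro


# Hypothetical per-subject counts (not real MMLU results).
per_subject = {
    "subject_a": (90, 100),
    "subject_b": (150, 300),
    "subject_c": (20, 100),
}
macro, micro = macro_and_micro_accuracy(per_subject)
print(f"macro = {macro:.4f}, micro = {micro:.4f}")
```

The two numbers diverge whenever subjects have different sizes and different accuracies, which is the case for MMLU's 57 subjects.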
+### GSM8K (Grade School Math)
+```
+- 8,792 grade school math word problems
+- Requires multi-step arithmetic reasoning
+- Standard: 8-shot chain-of-thought
+- Metric: Exact match on final numerical answer
+- Citation: Cobbe et al., 2021
+Common pitfalls:
+  - Regex matching for final answer extraction
+  - Calculator use vs. pure model computation
+  - Reporting with vs. without chain-of-thought
+```
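The first pitfall above, regex-based answer extraction, is easier to see in code. This is a minimal sketch, not the harness's own extraction filter: it assumes the final answer is the last number in the text, prefers whatever follows the `####` marker that GSM8K reference solutions use, and normalizes thousands separators and trailing zeros. The function names are illustrative.

```python
import re
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(num: str) -> str:
    """Drop thousands separators and trailing decimal zeros ("1,250.00" -> "1250")."""
    num = num.replace(",", "")
    if "." in num:
        num = num.rstrip("0").rstrip(".")
    return num


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number in text (after a '####' marker if present), else None."""
    marker = text.rfind("####")
    if marker != -1:
        text = text[marker + 4:]
    matches = NUMBER.findall(text)
    return normalize(matches[-1]) if matches else None


def exact_match(prediction: str, reference: str) -> bool:
    """Compare normalized final answers; a prediction with no number is wrong."""
    pred = extract_final_number(prediction)
    return pred is not None and pred == extract_final_number(reference)


print(exact_match("So she pays 18 dollars.", "9*2 = 18\n#### 18"))
```

Small changes to this pattern (for example, taking the first number instead of the last) can shift reported accuracy, so the extraction rule belongs in the paper's evaluation details.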
+### HumanEval (Code Generation)
+```
+- 164 Python programming problems
+- Function signature + docstring -> implementation
+- Metric: pass@k (k=1 standard, k=10 and k=100 also reported)
+- Citation: Chen et al., 2021
+pass@k computation (unbiased estimator):
+  pass@k = 1 - C(n-c, k) / C(n, k)
+  where n = total samples, c = correct samples
+```
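The estimator above translates directly into code. The sketch below computes it per problem with `math.comb` and averages over problems; the `(n, c)` pairs are hypothetical sample counts, and the helper names are illustrative rather than taken from any evaluation library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so every size-k subset then contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical counts for three problems: (samples drawn, samples that passed).
counts = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {mean_pass_at_k(counts, 1):.3f}")
print(f"pass@5 = {mean_pass_at_k(counts, 5):.3f}")
```

For very large `n`, a product-form computation avoids big intermediate integers, but exact integer arithmetic is fine at the sample sizes typically used.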
+## Evaluation Pitfalls
+| Pitfall | Problem | Solution |
+|---------|---------|----------|
+| Data contamination | Benchmark data in training set | Use canary strings, report contamination analysis |
+| Prompt sensitivity | Results vary with prompt format | Report results across 3+ prompt variants |
+| Few-shot selection | Cherry-picked examples boost scores | Use fixed random seed for example selection |
+| Metric gaming | Optimizing for specific metrics | Report multiple metrics, include calibration |
+| Incomplete reporting | Only showing best results | Report mean and std across seeds |
+| Version mismatch | Different benchmark versions | Pin exact dataset version and commit hash |
+### Contamination Detection
+```python
+def check_contamination(training_data: list, benchmark_data: list, n: int = 13) -> dict:
+    """
+    Check for n-gram overlap between training data and benchmark.
+    13-gram overlap follows the contamination analysis in the GPT-3 paper
+    (Brown et al., 2020).
+    """
+    def extract_ngrams(text, n):
+        words = text.lower().split()
+        return set(tuple(words[i:i+n]) for i in range(len(words) - n + 1))
+    # Build training n-gram index
+    train_ngrams = set()
+    for text in training_data:
+        train_ngrams.update(extract_ngrams(text, n))
+    # Check benchmark items
+    contaminated = []
+    for i, item in enumerate(benchmark_data):
+        item_ngrams = extract_ngrams(item, n)
+        overlap = item_ngrams & train_ngrams
+        if overlap:
+            contaminated.append({
+                "index": i,
+                "overlap_count": len(overlap),
+                "overlap_ratio": len(overlap) / max(len(item_ngrams), 1),
+            })
+    return {
+        "total_items": len(benchmark_data),
+        "contaminated_items": len(contaminated),
+        # max(..., 1) avoids division by zero on an empty benchmark
+        "contamination_rate": len(contaminated) / max(len(benchmark_data), 1),
+        "details": contaminated,
+    }
+```
+## Reporting Results for Publication
+### Standard Results Table Format
+```markdown
+| Model | Params | MMLU | GSM8K | HumanEval | ARC-C | HellaSwag | Avg |
+|-------|--------|------|-------|-----------|-------|-----------|-----|
+| Baseline | 7B | 45.2 | 12.3 | 15.8 | 42.1 | 72.3 | 37.5 |
+| Ours | 7B | 52.1 (+6.9) | 28.7 (+16.4) | 22.0 (+6.2) | 48.9 (+6.8) | 76.1 (+3.8) | 45.6 |
+| Ours (ablation A) | 7B | 49.8 | 24.1 | 19.5 | 46.2 | 74.8 | 42.9 |
+All results: 5-shot for MMLU, 8-shot CoT for GSM8K, 0-shot for HumanEval,
+25-shot for ARC-C, 10-shot for HellaSwag. Mean of 3 seeds reported.
+```
+## Best Practices
+- **Always report the exact evaluation framework version** (e.g., `lm-eval v0.4.2`).
+- **Use the same number of few-shot examples** as the original benchmark paper.
+- **Report standard deviations** across at least 3 random seeds.
+- **Include a contamination analysis** for any new model trained on web data.
+- **Compare against published numbers using the same evaluation code** -- do not mix results from different frameworks.
+- **Report inference details:** precision (fp16/bf16/int8), context length, decoding strategy.
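The advice to report standard deviations across seeds can be sketched with the standard library. The per-seed scores below are hypothetical, and `summarize` is an illustrative helper; it uses the sample standard deviation (n - 1 denominator), which is the usual choice for a handful of seeds.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical per-seed accuracies (three seeds per benchmark).
seed_scores = {
    "MMLU": [51.8, 52.1, 52.4],
    "GSM8K": [27.9, 28.7, 29.5],
}
for task, scores in seed_scores.items():
    avg, sd = summarize(scores)
    print(f"{task}: {avg:.1f} ± {sd:.1f} (n={len(scores)} seeds)")
```

Stating the number of seeds next to each figure lets readers judge how much weight a small difference between models can bear.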
+## References
+- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) -- EleutherAI's evaluation framework (12,000+ stars)
+- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) -- Hugging Face community leaderboard
+- [MMLU paper](https://arxiv.org/abs/2009.03300) -- Hendrycks et al., 2021
+- [Holistic Evaluation of Language Models (HELM)](https://crfm.stanford.edu/helm/) -- Stanford CRFM
+- [Chatbot Arena](https://chat.lmsys.org/) -- Human preference-based evaluation (LMSYS)

package/skills/domains/ai-ml/annotated-dl-papers-guide/SKILL.md ADDED

@@ -0,0 +1,159 @@
+---
+name: annotated-dl-papers-guide
+description: "Annotated deep learning paper implementations with side-by-side notes"
+metadata:
+  openclaw:
+    emoji: "📝"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["deep-learning", "paper-implementation", "annotations", "transformer", "gan", "diffusion"]
+    source: "https://github.com/labmlai/annotated_deep_learning_paper_implementations"
+---
+# Annotated Deep Learning Papers Guide
+## Overview
+The annotated_deep_learning_paper_implementations project, maintained by labml.ai with over 66,000 GitHub stars, provides 60+ implementations of influential deep learning papers with detailed, line-by-line annotations. Each implementation is presented as a literate programming document where the code and explanations are interwoven, making it possible to read the paper and understand the implementation simultaneously.
+This project bridges the gap between reading a research paper and understanding its practical implementation. For academic researchers, this is an essential resource because many breakthrough papers omit crucial implementation details, and reproducing results from a paper description alone can take weeks. The annotated implementations cover transformers, GANs, diffusion models, reinforcement learning algorithms, optimizers, and many other core deep learning building blocks.
+All implementations are written in PyTorch and are designed to be self-contained, readable, and runnable. The project also provides a web interface at nn.labml.ai where you can browse implementations with syntax-highlighted code alongside formatted annotations.
+## Installation and Setup
+Install the labml packages to use the implementations and experiment tracking:
+```bash
+# Install the core library
+pip install labml labml-nn
+# Clone for direct access to all implementations
+git clone https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
+cd annotated_deep_learning_paper_implementations
+# Install in development mode
+pip install -e .
+```
+Requirements:
+- Python 3.8+
+- PyTorch >= 1.9
+- labml >= 0.5 (experiment tracking and configuration)
+- numpy, einops for tensor operations
+The `labml` library provides experiment tracking, configuration management, and training loop utilities that are used throughout the implementations.
+## Core Paper Categories
+### Transformers and Attention
+The project includes comprehensive implementations of the transformer family:
+- **Original Transformer** (Vaswani et al., 2017): Multi-head attention, positional encoding, encoder-decoder architecture
+- **GPT and GPT-2**: Autoregressive language modeling with causal attention
+- **BERT**: Masked language modeling and next sentence prediction
+- **Vision Transformer (ViT)**: Applying transformers to image classification
+- **Flash Attention**: Memory-efficient attention computation
+- **Rotary Position Embeddings (RoPE)**: Position encoding used in modern LLMs
+- **Mixture of Experts (MoE)**: Sparse expert routing for scaling models
+```python
+# Example: Multi-head attention from the transformer implementation
+from labml_nn.transformers.mha import MultiHeadAttention
+# The implementation includes detailed annotations explaining
+# each step of the attention computation
+mha = MultiHeadAttention(
+    heads=8,
+    d_model=512,
+    dropout_prob=0.1
+)
+```
+### Generative Models
+- **GAN** (Goodfellow et al., 2014): Original generative adversarial network
+- **DCGAN**: Deep convolutional GAN with architectural guidelines
+- **StyleGAN**: Style-based generator architecture
+- **Diffusion Models (DDPM)**: Denoising diffusion probabilistic models
+- **Stable Diffusion**: Latent diffusion with CLIP conditioning
+- **VAE**: Variational autoencoders with KL divergence
+### Optimization and Training
+- **Adam, AdamW**: Adaptive learning rate optimizers
+- **LAMB**: Large batch optimization for distributed training
+- **Noam learning rate schedule**: Warmup + inverse square root decay
+- **Gradient clipping**: Norm-based and value-based clipping
+- **Mixed precision training**: FP16/BF16 training techniques
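The Noam schedule listed above (linear warmup followed by inverse square root decay) comes from Vaswani et al., 2017: `lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)`. The sketch below is a plain-Python version written for illustration, not the labml_nn implementation; the `factor` multiplier is an added assumption for rescaling the peak.

```python
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000, factor: float = 1.0) -> float:
    """Noam learning rate: linear warmup, then inverse square root decay."""
    if step < 1:
        raise ValueError("step must start at 1")
    # During warmup the second term is smaller (linear growth);
    # afterwards the first term is smaller (step^-0.5 decay).
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

The two branches meet at `step == warmup`, which is where the learning rate peaks.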
+### Normalization and Regularization
+- **Batch Normalization**: Per-batch statistics normalization
+- **Layer Normalization**: Per-sample normalization for transformers
+- **RMSNorm**: Simplified normalization used in LLaMA
+- **Dropout and DropPath**: Stochastic regularization methods
+## Using Implementations in Research
+Each implementation can be used as a building block in your own research projects. The modular design allows you to swap components easily:
+```python
+import torch.nn as nn
+from labml_nn.transformers.mha import MultiHeadAttention
+# Import path as given in the source; torch.nn.RMSNorm (PyTorch >= 2.4) is an alternative
+from labml_nn.normalization.rmsnorm import RMSNorm
+# Build a custom pre-norm transformer block with RMSNorm instead of LayerNorm
+class CustomTransformerBlock(nn.Module):
+    def __init__(self, d_model, heads, d_ff):
+        super().__init__()
+        self.attention = MultiHeadAttention(heads, d_model)
+        self.norm1 = RMSNorm(d_model)
+        self.norm2 = RMSNorm(d_model)
+        self.feed_forward = nn.Sequential(
+            nn.Linear(d_model, d_ff),
+            nn.GELU(),
+            nn.Linear(d_ff, d_model)
+        )
+    def forward(self, x):
+        # Normalize once and reuse the result as query, key, and value (self-attention).
+        # labml_nn's MultiHeadAttention.forward takes keyword-only arguments.
+        h = self.norm1(x)
+        x = x + self.attention(query=h, key=h, value=h, mask=None)
+        x = x + self.feed_forward(self.norm2(x))
+        return x
+```
+The experiment tracking integration with labml makes it straightforward to log metrics, hyperparameters, and model checkpoints:
+```python
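+from labml import experiment, tracker
+# num_epochs, dataloader, and train_step are placeholders for your own training code
+# Create an experiment
+experiment.create(name="custom_transformer_ablation")
+# Start the experiment so tracked values are recorded
+with experiment.start():
+    # Track metrics during training
+    for epoch in range(num_epochs):
+        for batch in dataloader:
+            loss = train_step(batch)
+            tracker.save({"loss": loss, "epoch": epoch})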
+```
+## Research Workflow Integration
+This project fits naturally into an academic deep learning research workflow:
+1. **Literature review**: Read the annotated implementation alongside the original paper to build deep understanding
+2. **Baseline reproduction**: Use the provided implementation as a verified baseline for comparison experiments
+3. **Architecture modification**: Fork a specific implementation and modify components for your research hypothesis
+4. **Ablation studies**: Systematically disable or replace components to measure their contribution
+5. **Paper writing**: Reference the annotated implementation for accurate method descriptions
+The web interface at nn.labml.ai provides a searchable index of all implementations, organized by topic. Each page shows the paper citation, a brief summary, and the annotated code with toggleable explanations.
+## References
+- Repository: https://github.com/labmlai/annotated_deep_learning_paper_implementations
+- Web interface: https://nn.labml.ai/
+- labml experiment tracking: https://github.com/labmlai/labml
+- PyTorch documentation: https://pytorch.org/docs/stable/

package/skills/domains/ai-ml/anomaly-detection-papers-guide/SKILL.md ADDED

@@ -0,0 +1,167 @@
+---
+name: anomaly-detection-papers-guide
+description: "Industrial anomaly detection methods and benchmark papers"
+metadata:
+  openclaw:
+    emoji: "🔍"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["anomaly detection", "industrial inspection", "defect detection", "MVTec", "unsupervised AD", "visual inspection"]
+    source: "https://github.com/M-3LAB/awesome-industrial-anomaly-detection"
+---
+# Industrial Anomaly Detection Papers Guide
+## Overview
+Industrial anomaly detection uses machine learning to identify defects, faults, and anomalies in manufacturing and quality inspection. This curated collection covers reconstruction-based methods (autoencoders), memory-bank approaches (PatchCore), normalizing flows, knowledge distillation, and foundation-model-based detectors. It also covers benchmark datasets, evaluation metrics, and real-world deployment considerations.
+## Method Taxonomy
+```
+Anomaly Detection Methods
+├── Reconstruction-based
+│   ├── Autoencoder (AE, VAE)
+│   ├── GAN-based (AnoGAN, GANomaly)
+│   └── Diffusion-based (AnoDDPM)
+├── Embedding-based
+│   ├── Memory bank (PatchCore, PaDiM)
+│   ├── Knowledge distillation (STPM, RD4AD)
+│   └── Self-supervised (CutPaste, DRAEM)
+├── Normalizing Flows
+│   ├── FastFlow, CFLOW-AD, CS-Flow
+│   └── DifferNet
+├── Foundation Models
+│   ├── CLIP-based (WinCLIP, AnomalyCLIP)
+│   ├── SAM-based (GroundedSAM-AD)
+│   └── Vision-language (AnomalyGPT)
+└── 3D Anomaly Detection
+    ├── Point cloud methods
+    └── Multi-modal (RGB + 3D)
+```
+## Key Methods
+| Method | Year | Approach | MVTec AUROC |
+|--------|------|----------|-------------|
+| **PatchCore** | 2022 | Memory bank | 99.1% |
+| **PaDiM** | 2021 | Multivariate Gaussian | 97.9% |
+| **RD4AD** | 2022 | Knowledge distillation | 98.5% |
+| **FastFlow** | 2022 | Normalizing flow | 99.4% |
+| **SimpleNet** | 2023 | Feature adaptation | 99.6% |
+| **WinCLIP** | 2023 | CLIP zero-shot | 95.2% |
+| **AnomalyGPT** | 2024 | Vision-language | 96.3% |
+## Benchmark Datasets
+```python
+benchmarks = {
+    "MVTec AD": {
+        "categories": 15,
+        "images": 5354,
+        "type": "Product/texture defects",
+        "annotation": "Pixel-level masks",
+    },
+    "MVTec 3D-AD": {
+        "categories": 10,
+        "images": 4147,
+        "type": "3D point cloud + RGB",
+    },
+    "VisA": {
+        "categories": 12,
+        "images": 10821,
+        "type": "Complex structure anomalies",
+    },
+    "BTAD": {
+        "categories": 3,
+        "images": 2830,
+        "type": "Industrial body/surface",
+    },
+    "MPDD": {
+        "categories": 6,
+        "images": 1064,
+        "type": "Metal parts defects",
+    },
+}
+for name, info in benchmarks.items():
+    print(f"{name}: {info['categories']} categories, "
+          f"{info['images']} images — {info['type']}")
+```
+## Quick Implementation
+```python
+# PatchCore-style anomaly detection
+from anomalib.data import MVTec
+from anomalib.models import Patchcore
+from anomalib.engine import Engine
+# Setup dataset
+datamodule = MVTec(
+    root="./datasets/MVTec",
+    category="bottle",
+    image_size=(256, 256),
+)
+# Initialize model
+model = Patchcore(
+    backbone="wide_resnet50_2",
+    layers=["layer2", "layer3"],
+    coreset_sampling_ratio=0.1,
+)
+# Train and test
+engine = Engine()
+engine.fit(model=model, datamodule=datamodule)
+results = engine.test(model=model, datamodule=datamodule)
+print(f"Image AUROC: {results[0]['image_AUROC']:.3f}")
+print(f"Pixel AUROC: {results[0]['pixel_AUROC']:.3f}")
+```
+## Evaluation Metrics
+```python
+# Standard anomaly detection metrics
+# Placeholders: y_true_* are binary labels/masks and y_score_* are anomaly scores (NumPy arrays)
+from sklearn.metrics import roc_auc_score
+# Image-level: Is this image anomalous?
+image_auroc = roc_auc_score(y_true_image, y_score_image)
+# Pixel-level: Where is the anomaly?
+pixel_auroc = roc_auc_score(
+    y_true_pixel.flatten(), y_score_pixel.flatten()
+)
+# PRO metric: Per-Region Overlap
+# Better than pixel AUROC for small anomalies
+# Weights each connected anomaly region equally
+```
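The PRO comments above describe the idea without showing it. The sketch below computes per-region overlap at a single threshold in plain Python. The full PRO score integrates this quantity over thresholds up to a false-positive-rate limit, which is omitted here. The 4-connectivity choice, the function names, and the toy mask are assumptions made for illustration.

```python
from collections import deque


def connected_regions(mask):
    """Yield each 4-connected region of truthy cells as a list of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield region


def per_region_overlap(gt_mask, score_map, threshold):
    """Mean over ground-truth regions of the fraction of pixels scored >= threshold."""
    overlaps = [
        sum(1 for y, x in region if score_map[y][x] >= threshold) / len(region)
        for region in connected_regions(gt_mask)
    ]
    return sum(overlaps) / len(overlaps) if overlaps else float("nan")


# Toy example: one 8-pixel defect is fully detected, one 2-pixel defect is missed.
gt_mask = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
score_map = [
    [0.9, 0.9, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.9, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
]
print(per_region_overlap(gt_mask, score_map, threshold=0.5))
```

In the toy example, pixel-level recall is 8 of 10 anomalous pixels, while the per-region figure is lower because the missed small defect counts as much as the large one.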
+## Research Frontiers
+```markdown
+### Active Directions (2024-2025)
+1. **Zero/few-shot AD**: Detect anomalies with no or only a few normal training images
+2. **Multi-class unified**: One model for all product categories
+3. **Foundation model AD**: CLIP/SAM/LLM-based detection
+4. **Logical anomalies**: Structural/contextual defects
+5. **Continual learning**: Adapt to new defect types
+6. **3D anomaly detection**: Point cloud and multi-modal
+7. **Real-time deployment**: Edge device optimization
+```
+## Use Cases
+1. **Manufacturing QC**: Automated visual inspection pipelines
+2. **Research benchmarking**: Compare new methods on standard datasets
+3. **Survey writing**: Comprehensive method taxonomy and comparison
+4. **Course teaching**: Industrial AI and computer vision curricula
+5. **Defect analysis**: Understanding failure modes and patterns
+## References
+- [awesome-industrial-anomaly-detection](https://github.com/M-3LAB/awesome-industrial-anomaly-detection)
+- [Anomalib Library](https://github.com/openvinotoolkit/anomalib)
+- [MVTec AD Dataset](https://www.mvtec.com/company/research/datasets/mvtec-ad)