@wentorai/research-plugins 1.0.0 → 1.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +22 -22
- package/curated/analysis/README.md +82 -56
- package/curated/domains/README.md +225 -69
- package/curated/literature/README.md +115 -46
- package/curated/research/README.md +106 -58
- package/curated/tools/README.md +107 -87
- package/curated/writing/README.md +92 -45
- package/mcp-configs/academic-db/alphafold-mcp.json +20 -0
- package/mcp-configs/academic-db/brightspace-mcp.json +21 -0
- package/mcp-configs/academic-db/climatiq-mcp.json +20 -0
- package/mcp-configs/academic-db/gibs-mcp.json +20 -0
- package/mcp-configs/academic-db/gis-mcp-server.json +22 -0
- package/mcp-configs/academic-db/google-earth-engine-mcp.json +21 -0
- package/mcp-configs/academic-db/m4-clinical-mcp.json +21 -0
- package/mcp-configs/academic-db/medical-mcp.json +21 -0
- package/mcp-configs/academic-db/nexonco-mcp.json +20 -0
- package/mcp-configs/academic-db/omop-mcp.json +20 -0
- package/mcp-configs/academic-db/onekgpd-mcp.json +20 -0
- package/mcp-configs/academic-db/openedu-mcp.json +20 -0
- package/mcp-configs/academic-db/opengenes-mcp.json +20 -0
- package/mcp-configs/academic-db/openstax-mcp.json +21 -0
- package/mcp-configs/academic-db/openstreetmap-mcp.json +21 -0
- package/mcp-configs/academic-db/opentargets-mcp.json +21 -0
- package/mcp-configs/academic-db/pdb-mcp.json +21 -0
- package/mcp-configs/academic-db/smithsonian-mcp.json +20 -0
- package/mcp-configs/ai-platform/magi-researchers.json +21 -0
- package/mcp-configs/ai-platform/mcp-academic-researcher.json +22 -0
- package/mcp-configs/ai-platform/open-paper-machine.json +21 -0
- package/mcp-configs/ai-platform/paper-intelligence.json +21 -0
- package/mcp-configs/ai-platform/paper-reader.json +21 -0
- package/mcp-configs/ai-platform/paperdebugger.json +21 -0
- package/mcp-configs/browser/exa-mcp.json +20 -0
- package/mcp-configs/browser/mcp-searxng.json +21 -0
- package/mcp-configs/browser/mcp-webresearch.json +20 -0
- package/mcp-configs/cloud-docs/confluence-mcp.json +37 -0
- package/mcp-configs/cloud-docs/google-drive-mcp.json +35 -0
- package/mcp-configs/cloud-docs/notion-mcp.json +29 -0
- package/mcp-configs/communication/discord-mcp.json +29 -0
- package/mcp-configs/communication/discourse-mcp.json +21 -0
- package/mcp-configs/communication/slack-mcp.json +29 -0
- package/mcp-configs/communication/telegram-mcp.json +28 -0
- package/mcp-configs/data-platform/automl-stat-mcp.json +21 -0
- package/mcp-configs/data-platform/jefferson-stats-mcp.json +22 -0
- package/mcp-configs/data-platform/mcp-excel-server.json +21 -0
- package/mcp-configs/data-platform/mcp-stata.json +21 -0
- package/mcp-configs/data-platform/mcpstack-jupyter.json +21 -0
- package/mcp-configs/data-platform/ml-mcp.json +21 -0
- package/mcp-configs/data-platform/nasdaq-data-link-mcp.json +20 -0
- package/mcp-configs/data-platform/numpy-mcp.json +21 -0
- package/mcp-configs/database/neo4j-mcp.json +37 -0
- package/mcp-configs/database/postgres-mcp.json +28 -0
- package/mcp-configs/database/sqlite-mcp.json +29 -0
- package/mcp-configs/dev-platform/geogebra-mcp.json +21 -0
- package/mcp-configs/dev-platform/github-mcp.json +31 -0
- package/mcp-configs/dev-platform/gitlab-mcp.json +34 -0
- package/mcp-configs/dev-platform/latex-mcp-server.json +21 -0
- package/mcp-configs/dev-platform/manim-mcp.json +20 -0
- package/mcp-configs/dev-platform/mcp-echarts.json +20 -0
- package/mcp-configs/dev-platform/panel-viz-mcp.json +20 -0
- package/mcp-configs/dev-platform/paperbanana.json +20 -0
- package/mcp-configs/dev-platform/texflow-mcp.json +20 -0
- package/mcp-configs/dev-platform/texmcp.json +20 -0
- package/mcp-configs/dev-platform/typst-mcp.json +21 -0
- package/mcp-configs/dev-platform/vizro-mcp.json +20 -0
- package/mcp-configs/email/email-mcp.json +40 -0
- package/mcp-configs/email/gmail-mcp.json +37 -0
- package/mcp-configs/note-knowledge/local-faiss-mcp.json +21 -0
- package/mcp-configs/note-knowledge/mcp-memory-service.json +21 -0
- package/mcp-configs/note-knowledge/mcp-obsidian.json +23 -0
- package/mcp-configs/note-knowledge/mcp-ragdocs.json +20 -0
- package/mcp-configs/note-knowledge/mcp-summarizer.json +21 -0
- package/mcp-configs/note-knowledge/mediawiki-mcp.json +21 -0
- package/mcp-configs/note-knowledge/openzim-mcp.json +20 -0
- package/mcp-configs/note-knowledge/zettelkasten-mcp.json +21 -0
- package/mcp-configs/reference-mgr/academic-paper-mcp-http.json +20 -0
- package/mcp-configs/reference-mgr/academix.json +20 -0
- package/mcp-configs/reference-mgr/arxiv-research-mcp.json +21 -0
- package/mcp-configs/reference-mgr/google-scholar-abstract-mcp.json +19 -0
- package/mcp-configs/reference-mgr/google-scholar-mcp.json +20 -0
- package/mcp-configs/reference-mgr/mcp-paperswithcode.json +21 -0
- package/mcp-configs/reference-mgr/mcp-scholarly.json +20 -0
- package/mcp-configs/reference-mgr/mcp-simple-arxiv.json +20 -0
- package/mcp-configs/reference-mgr/mcp-simple-pubmed.json +20 -0
- package/mcp-configs/reference-mgr/mcp-zotero.json +21 -0
- package/mcp-configs/reference-mgr/mendeley-mcp.json +20 -0
- package/mcp-configs/reference-mgr/ncbi-mcp-server.json +22 -0
- package/mcp-configs/reference-mgr/onecite.json +21 -0
- package/mcp-configs/reference-mgr/paper-search-mcp.json +21 -0
- package/mcp-configs/reference-mgr/pubmed-search-mcp.json +21 -0
- package/mcp-configs/reference-mgr/scholar-mcp.json +21 -0
- package/mcp-configs/reference-mgr/scholar-multi-mcp.json +21 -0
- package/mcp-configs/reference-mgr/seerai.json +21 -0
- package/mcp-configs/reference-mgr/semantic-scholar-fastmcp.json +21 -0
- package/mcp-configs/reference-mgr/sourcelibrary.json +20 -0
- package/mcp-configs/registry.json +178 -149
- package/mcp-configs/repository/dataverse-mcp.json +33 -0
- package/mcp-configs/repository/huggingface-mcp.json +29 -0
- package/openclaw.plugin.json +2 -2
- package/package.json +2 -2
- package/skills/analysis/dataviz/algorithm-visualizer-guide/SKILL.md +259 -0
- package/skills/analysis/dataviz/bokeh-visualization-guide/SKILL.md +270 -0
- package/skills/analysis/dataviz/chart-image-generator/SKILL.md +229 -0
- package/skills/analysis/dataviz/citation-map-guide/SKILL.md +184 -0
- package/skills/analysis/dataviz/d3-visualization-guide/SKILL.md +281 -0
- package/skills/analysis/dataviz/data-visualization-principles/SKILL.md +171 -0
- package/skills/analysis/dataviz/echarts-visualization-guide/SKILL.md +250 -0
- package/skills/analysis/dataviz/metabase-analytics-guide/SKILL.md +242 -0
- package/skills/analysis/dataviz/plotly-interactive-guide/SKILL.md +266 -0
- package/skills/analysis/dataviz/redash-analytics-guide/SKILL.md +284 -0
- package/skills/analysis/econometrics/econml-causal-guide/SKILL.md +163 -0
- package/skills/analysis/econometrics/empirical-paper-analysis/SKILL.md +192 -0
- package/skills/analysis/econometrics/mostly-harmless-guide/SKILL.md +139 -0
- package/skills/analysis/econometrics/panel-data-analyst/SKILL.md +259 -0
- package/skills/analysis/econometrics/panel-data-regression-workflow/SKILL.md +267 -0
- package/skills/analysis/econometrics/python-causality-guide/SKILL.md +134 -0
- package/skills/analysis/econometrics/stata-accounting-guide/SKILL.md +269 -0
- package/skills/analysis/econometrics/stata-analyst-guide/SKILL.md +245 -0
- package/skills/analysis/econometrics/stata-reference-guide/SKILL.md +293 -0
- package/skills/analysis/statistics/data-anomaly-detection/SKILL.md +157 -0
- package/skills/analysis/statistics/general-statistics-guide/SKILL.md +226 -0
- package/skills/analysis/statistics/infiagent-benchmark-guide/SKILL.md +106 -0
- package/skills/analysis/statistics/ml-experiment-tracker/SKILL.md +212 -0
- package/skills/analysis/statistics/pywayne-statistics-guide/SKILL.md +192 -0
- package/skills/analysis/statistics/quantitative-methods-guide/SKILL.md +193 -0
- package/skills/analysis/statistics/senior-data-scientist-guide/SKILL.md +223 -0
- package/skills/analysis/wrangling/claude-data-analysis-guide/SKILL.md +100 -0
- package/skills/analysis/wrangling/csv-data-analyzer/SKILL.md +170 -0
- package/skills/analysis/wrangling/data-cleaning-pipeline/SKILL.md +266 -0
- package/skills/analysis/wrangling/data-cog-guide/SKILL.md +178 -0
- package/skills/analysis/wrangling/open-data-scientist-guide/SKILL.md +197 -0
- package/skills/analysis/wrangling/stata-data-cleaning/SKILL.md +276 -0
- package/skills/analysis/wrangling/streamline-analyst-guide/SKILL.md +119 -0
- package/skills/analysis/wrangling/survey-data-processing/SKILL.md +298 -0
- package/skills/domains/ai-ml/ai-agent-papers-guide/SKILL.md +146 -0
- package/skills/domains/ai-ml/ai-model-benchmarking/SKILL.md +209 -0
- package/skills/domains/ai-ml/annotated-dl-papers-guide/SKILL.md +159 -0
- package/skills/domains/ai-ml/anomaly-detection-papers-guide/SKILL.md +167 -0
- package/skills/domains/ai-ml/autonomous-agents-papers-guide/SKILL.md +178 -0
- package/skills/domains/ai-ml/dl-transformer-finetune/SKILL.md +239 -0
- package/skills/domains/ai-ml/domain-adaptation-papers-guide/SKILL.md +173 -0
- package/skills/domains/ai-ml/generative-ai-guide/SKILL.md +146 -0
- package/skills/domains/ai-ml/graph-learning-papers-guide/SKILL.md +125 -0
- package/skills/domains/ai-ml/huggingface-inference-guide/SKILL.md +196 -0
- package/skills/domains/ai-ml/keras-deep-learning/SKILL.md +210 -0
- package/skills/domains/ai-ml/kolmogorov-arnold-networks-guide/SKILL.md +185 -0
- package/skills/domains/ai-ml/llm-from-scratch-guide/SKILL.md +124 -0
- package/skills/domains/ai-ml/ml-pipeline-guide/SKILL.md +295 -0
- package/skills/domains/ai-ml/nlp-toolkit-guide/SKILL.md +247 -0
- package/skills/domains/ai-ml/npcpy-research-guide/SKILL.md +137 -0
- package/skills/domains/ai-ml/pytorch-guide/SKILL.md +281 -0
- package/skills/domains/ai-ml/pytorch-lightning-guide/SKILL.md +244 -0
- package/skills/domains/ai-ml/responsible-ai-guide/SKILL.md +126 -0
- package/skills/domains/ai-ml/tensorflow-guide/SKILL.md +241 -0
- package/skills/domains/ai-ml/vmas-simulator-guide/SKILL.md +129 -0
- package/skills/domains/biomedical/bioagents-guide/SKILL.md +308 -0
- package/skills/domains/biomedical/clawbio-guide/SKILL.md +167 -0
- package/skills/domains/biomedical/clinical-dialogue-agents-guide/SKILL.md +145 -0
- package/skills/domains/biomedical/ena-sequence-api/SKILL.md +175 -0
- package/skills/domains/biomedical/genomas-guide/SKILL.md +126 -0
- package/skills/domains/biomedical/genotex-benchmark-guide/SKILL.md +125 -0
- package/skills/domains/biomedical/med-researcher-guide/SKILL.md +161 -0
- package/skills/domains/biomedical/med-researcher-r1-guide/SKILL.md +146 -0
- package/skills/domains/biomedical/medgeclaw-guide/SKILL.md +345 -0
- package/skills/domains/biomedical/medical-imaging-guide/SKILL.md +305 -0
- package/skills/domains/biomedical/ncbi-blast-api/SKILL.md +195 -0
- package/skills/domains/biomedical/ncbi-datasets-api/SKILL.md +220 -0
- package/skills/domains/biomedical/quickgo-api/SKILL.md +181 -0
- package/skills/domains/business/architecture-design-guide/SKILL.md +279 -0
- package/skills/domains/business/innovation-management-guide/SKILL.md +257 -0
- package/skills/domains/business/operations-research-guide/SKILL.md +258 -0
- package/skills/domains/business/xpert-bi-guide/SKILL.md +84 -0
- package/skills/domains/chemistry/cactus-cheminformatics-guide/SKILL.md +89 -0
- package/skills/domains/chemistry/chemeagle-guide/SKILL.md +147 -0
- package/skills/domains/chemistry/chemgraph-agent-guide/SKILL.md +120 -0
- package/skills/domains/chemistry/molecular-dynamics-guide/SKILL.md +237 -0
- package/skills/domains/chemistry/pubchem-api-guide/SKILL.md +180 -0
- package/skills/domains/chemistry/spectroscopy-analysis-guide/SKILL.md +290 -0
- package/skills/domains/cs/ai-security-papers-guide/SKILL.md +103 -0
- package/skills/domains/cs/code-llm-papers-guide/SKILL.md +131 -0
- package/skills/domains/cs/distributed-systems-guide/SKILL.md +268 -0
- package/skills/domains/cs/formal-verification-guide/SKILL.md +298 -0
- package/skills/domains/cs/gaussian-splatting-papers-guide/SKILL.md +158 -0
- package/skills/domains/cs/llm-aiops-guide/SKILL.md +70 -0
- package/skills/domains/cs/software-heritage-api/SKILL.md +200 -0
- package/skills/domains/ecology/species-distribution-guide/SKILL.md +343 -0
- package/skills/domains/economics/imf-data-api-guide/SKILL.md +174 -0
- package/skills/domains/economics/nber-working-papers-api/SKILL.md +177 -0
- package/skills/domains/economics/post-labor-economics/SKILL.md +254 -0
- package/skills/domains/economics/pricing-psychology-guide/SKILL.md +273 -0
- package/skills/domains/economics/repec-economics-api/SKILL.md +188 -0
- package/skills/domains/economics/world-bank-data-guide/SKILL.md +179 -0
- package/skills/domains/education/academic-study-methods/SKILL.md +228 -0
- package/skills/domains/education/assessment-design-guide/SKILL.md +213 -0
- package/skills/domains/education/educational-research-methods/SKILL.md +179 -0
- package/skills/domains/education/edumcp-guide/SKILL.md +74 -0
- package/skills/domains/education/mooc-analytics-guide/SKILL.md +206 -0
- package/skills/domains/education/open-syllabus-api/SKILL.md +171 -0
- package/skills/domains/finance/akshare-finance-data/SKILL.md +207 -0
- package/skills/domains/finance/finsight-research-guide/SKILL.md +113 -0
- package/skills/domains/finance/options-analytics-agent-guide/SKILL.md +117 -0
- package/skills/domains/finance/portfolio-optimization-guide/SKILL.md +279 -0
- package/skills/domains/finance/risk-modeling-guide/SKILL.md +260 -0
- package/skills/domains/finance/stata-accounting-research/SKILL.md +372 -0
- package/skills/domains/geoscience/climate-modeling-guide/SKILL.md +215 -0
- package/skills/domains/geoscience/pangaea-data-api/SKILL.md +197 -0
- package/skills/domains/geoscience/satellite-remote-sensing/SKILL.md +193 -0
- package/skills/domains/geoscience/seismology-data-guide/SKILL.md +208 -0
- package/skills/domains/humanities/digital-humanities-methods/SKILL.md +232 -0
- package/skills/domains/humanities/ethical-philosophy-guide/SKILL.md +244 -0
- package/skills/domains/humanities/history-research-guide/SKILL.md +260 -0
- package/skills/domains/humanities/political-history-guide/SKILL.md +241 -0
- package/skills/domains/law/caselaw-access-api/SKILL.md +149 -0
- package/skills/domains/law/legal-agent-skills-guide/SKILL.md +132 -0
- package/skills/domains/law/legal-nlp-guide/SKILL.md +236 -0
- package/skills/domains/law/legal-research-methods/SKILL.md +190 -0
- package/skills/domains/law/opencontracts-guide/SKILL.md +168 -0
- package/skills/domains/law/patent-analysis-guide/SKILL.md +257 -0
- package/skills/domains/law/regulatory-compliance-guide/SKILL.md +267 -0
- package/skills/domains/math/lean-theorem-proving-guide/SKILL.md +140 -0
- package/skills/domains/math/symbolic-computation-guide/SKILL.md +263 -0
- package/skills/domains/math/topology-data-analysis/SKILL.md +305 -0
- package/skills/domains/pharma/clinical-trial-design-guide/SKILL.md +271 -0
- package/skills/domains/pharma/drug-target-interaction/SKILL.md +242 -0
- package/skills/domains/pharma/madd-drug-discovery-guide/SKILL.md +153 -0
- package/skills/domains/pharma/pharmacovigilance-guide/SKILL.md +216 -0
- package/skills/domains/physics/astrophysics-data-guide/SKILL.md +305 -0
- package/skills/domains/physics/particle-physics-guide/SKILL.md +287 -0
- package/skills/domains/social-science/ipums-microdata-api/SKILL.md +211 -0
- package/skills/domains/social-science/network-analysis-guide/SKILL.md +310 -0
- package/skills/domains/social-science/psychology-research-guide/SKILL.md +270 -0
- package/skills/domains/social-science/sociology-research-guide/SKILL.md +238 -0
- package/skills/domains/social-science/sociology-research-methods/SKILL.md +181 -0
- package/skills/literature/discovery/arxiv-paper-monitoring/SKILL.md +233 -0
- package/skills/literature/discovery/paper-recommendation-guide/SKILL.md +120 -0
- package/skills/literature/discovery/papers-we-love-guide/SKILL.md +169 -0
- package/skills/literature/discovery/semantic-paper-radar/SKILL.md +144 -0
- package/skills/literature/discovery/zotero-arxiv-daily-guide/SKILL.md +94 -0
- package/skills/literature/fulltext/bioc-pmc-api/SKILL.md +146 -0
- package/skills/literature/fulltext/core-api-guide/SKILL.md +144 -0
- package/skills/literature/fulltext/dataverse-api/SKILL.md +215 -0
- package/skills/literature/fulltext/hal-archive-api/SKILL.md +218 -0
- package/skills/literature/fulltext/institutional-repository-guide/SKILL.md +212 -0
- package/skills/literature/fulltext/open-access-mining-guide/SKILL.md +341 -0
- package/skills/literature/fulltext/osf-api/SKILL.md +212 -0
- package/skills/literature/fulltext/pmc-ftp-bulk-download/SKILL.md +182 -0
- package/skills/literature/fulltext/zotero-ai-butler-guide/SKILL.md +166 -0
- package/skills/literature/fulltext/zotero-scihub-guide/SKILL.md +168 -0
- package/skills/literature/metadata/academic-paper-summarizer/SKILL.md +101 -0
- package/skills/literature/metadata/bibliometrix-guide/SKILL.md +164 -0
- package/skills/literature/metadata/crossref-event-data-api/SKILL.md +183 -0
- package/skills/literature/metadata/doi-content-negotiation/SKILL.md +202 -0
- package/skills/literature/metadata/orkg-api/SKILL.md +153 -0
- package/skills/literature/metadata/plumx-metrics-api/SKILL.md +188 -0
- package/skills/literature/metadata/ror-organization-api/SKILL.md +208 -0
- package/skills/literature/metadata/sophosia-reference-guide/SKILL.md +110 -0
- package/skills/literature/metadata/viaf-authority-api/SKILL.md +209 -0
- package/skills/literature/metadata/wikidata-api-guide/SKILL.md +156 -0
- package/skills/literature/metadata/zoplicate-dedup-guide/SKILL.md +147 -0
- package/skills/literature/metadata/zotero-actions-tags-guide/SKILL.md +212 -0
- package/skills/literature/metadata/zotmoov-guide/SKILL.md +120 -0
- package/skills/literature/metadata/zutilo-guide/SKILL.md +140 -0
- package/skills/literature/search/arxiv-batch-reporting/SKILL.md +133 -0
- package/skills/literature/search/arxiv-cli-tools/SKILL.md +172 -0
- package/skills/literature/search/arxiv-osiris/SKILL.md +199 -0
- package/skills/literature/search/arxiv-paper-processor/SKILL.md +141 -0
- package/skills/literature/search/baidu-scholar-guide/SKILL.md +110 -0
- package/skills/literature/search/base-academic-search/SKILL.md +196 -0
- package/skills/literature/search/chatpaper-guide/SKILL.md +122 -0
- package/skills/literature/search/citeseerx-api/SKILL.md +183 -0
- package/skills/literature/search/deep-literature-search/SKILL.md +149 -0
- package/skills/literature/search/deepgit-search-guide/SKILL.md +147 -0
- package/skills/literature/search/eric-education-api/SKILL.md +199 -0
- package/skills/literature/search/findpapers-guide/SKILL.md +177 -0
- package/skills/literature/search/ieee-xplore-api/SKILL.md +177 -0
- package/skills/literature/search/lens-scholarly-api/SKILL.md +211 -0
- package/skills/literature/search/multi-database-literature-search/SKILL.md +198 -0
- package/skills/literature/search/open-library-api/SKILL.md +196 -0
- package/skills/literature/search/open-semantic-search-guide/SKILL.md +190 -0
- package/skills/literature/search/openaire-api/SKILL.md +141 -0
- package/skills/literature/search/paper-search-mcp-guide/SKILL.md +107 -0
- package/skills/literature/search/papers-chat-guide/SKILL.md +194 -0
- package/skills/literature/search/pasa-paper-search-guide/SKILL.md +138 -0
- package/skills/literature/search/plos-open-access-api/SKILL.md +203 -0
- package/skills/literature/search/scielo-api/SKILL.md +182 -0
- package/skills/literature/search/share-research-api/SKILL.md +129 -0
- package/skills/literature/search/worldcat-search-api/SKILL.md +224 -0
- package/skills/research/automation/ai-scientist-v2-guide/SKILL.md +284 -0
- package/skills/research/automation/aim-experiment-guide/SKILL.md +234 -0
- package/skills/research/automation/claude-academic-workflow-guide/SKILL.md +202 -0
- package/skills/research/automation/coexist-ai-guide/SKILL.md +149 -0
- package/skills/research/automation/datagen-research-guide/SKILL.md +131 -0
- package/skills/research/automation/foam-agent-guide/SKILL.md +203 -0
- package/skills/research/automation/kedro-pipeline-guide/SKILL.md +216 -0
- package/skills/research/automation/mle-agent-guide/SKILL.md +139 -0
- package/skills/research/automation/paper-to-agent-guide/SKILL.md +116 -0
- package/skills/research/automation/rd-agent-guide/SKILL.md +246 -0
- package/skills/research/automation/research-paper-orchestrator/SKILL.md +254 -0
- package/skills/research/deep-research/academic-deep-research/SKILL.md +190 -0
- package/skills/research/deep-research/auto-deep-research-guide/SKILL.md +141 -0
- package/skills/research/deep-research/cognitive-kernel-guide/SKILL.md +200 -0
- package/skills/research/deep-research/corvus-research-guide/SKILL.md +132 -0
- package/skills/research/deep-research/deep-research-pro/SKILL.md +213 -0
- package/skills/research/deep-research/deep-research-work/SKILL.md +204 -0
- package/skills/research/deep-research/deep-searcher-guide/SKILL.md +253 -0
- package/skills/research/deep-research/gpt-researcher-guide/SKILL.md +191 -0
- package/skills/research/deep-research/in-depth-research-guide/SKILL.md +205 -0
- package/skills/research/deep-research/khoj-research-guide/SKILL.md +200 -0
- package/skills/research/deep-research/kosmos-scientist-guide/SKILL.md +185 -0
- package/skills/research/deep-research/llm-scientific-discovery-guide/SKILL.md +178 -0
- package/skills/research/deep-research/local-deep-research-guide/SKILL.md +253 -0
- package/skills/research/deep-research/open-researcher-guide/SKILL.md +138 -0
- package/skills/research/deep-research/tongyi-deep-research-guide/SKILL.md +217 -0
- package/skills/research/funding/eu-horizon-guide/SKILL.md +244 -0
- package/skills/research/funding/grant-budget-guide/SKILL.md +284 -0
- package/skills/research/funding/nih-reporter-api-guide/SKILL.md +166 -0
- package/skills/research/funding/nsf-award-api-guide/SKILL.md +133 -0
- package/skills/research/methodology/academic-mentor-guide/SKILL.md +169 -0
- package/skills/research/methodology/claude-scientific-guide/SKILL.md +122 -0
- package/skills/research/methodology/deep-innovator-guide/SKILL.md +242 -0
- package/skills/research/methodology/osf-api-guide/SKILL.md +165 -0
- package/skills/research/methodology/parsifal-slr-guide/SKILL.md +154 -0
- package/skills/research/methodology/research-paper-kb/SKILL.md +263 -0
- package/skills/research/methodology/research-pipeline-units-guide/SKILL.md +169 -0
- package/skills/research/methodology/research-town-guide/SKILL.md +263 -0
- package/skills/research/methodology/slr-automation-guide/SKILL.md +235 -0
- package/skills/research/paper-review/automated-review-guide/SKILL.md +281 -0
- package/skills/research/paper-review/latte-review-guide/SKILL.md +175 -0
- package/skills/research/paper-review/paper-compare-guide/SKILL.md +238 -0
- package/skills/research/paper-review/paper-critique-framework/SKILL.md +181 -0
- package/skills/research/paper-review/paper-digest-guide/SKILL.md +240 -0
- package/skills/research/paper-review/paper-research-assistant/SKILL.md +231 -0
- package/skills/research/paper-review/research-quality-filter/SKILL.md +261 -0
- package/skills/research/paper-review/review-response-guide/SKILL.md +275 -0
- package/skills/tools/code-exec/contextplus-mcp-guide/SKILL.md +110 -0
- package/skills/tools/code-exec/google-colab-guide/SKILL.md +276 -0
- package/skills/tools/code-exec/kaggle-api-guide/SKILL.md +216 -0
- package/skills/tools/code-exec/overleaf-cli-guide/SKILL.md +279 -0
- package/skills/tools/diagram/clawphd-guide/SKILL.md +149 -0
- package/skills/tools/diagram/code-flow-visualizer/SKILL.md +197 -0
- package/skills/tools/diagram/excalidraw-diagram-guide/SKILL.md +170 -0
- package/skills/tools/diagram/json-data-visualizer/SKILL.md +270 -0
- package/skills/tools/diagram/kroki-diagram-api/SKILL.md +198 -0
- package/skills/tools/diagram/mermaid-architect-guide/SKILL.md +219 -0
- package/skills/tools/diagram/scientific-graphical-abstract/SKILL.md +201 -0
- package/skills/tools/diagram/tldraw-whiteboard-guide/SKILL.md +397 -0
- package/skills/tools/document/docsgpt-guide/SKILL.md +130 -0
- package/skills/tools/document/large-document-reader/SKILL.md +202 -0
- package/skills/tools/document/md2pdf-xelatex/SKILL.md +212 -0
- package/skills/tools/document/openpaper-guide/SKILL.md +232 -0
- package/skills/tools/document/paper-parse-guide/SKILL.md +243 -0
- package/skills/tools/document/weknora-guide/SKILL.md +216 -0
- package/skills/tools/document/zotero-addon-market-guide/SKILL.md +108 -0
- package/skills/tools/document/zotero-night-theme-guide/SKILL.md +142 -0
- package/skills/tools/document/zotero-style-guide/SKILL.md +217 -0
- package/skills/tools/knowledge-graph/citation-network-builder/SKILL.md +244 -0
- package/skills/tools/knowledge-graph/concept-map-generator/SKILL.md +284 -0
- package/skills/tools/knowledge-graph/graphiti-guide/SKILL.md +219 -0
- package/skills/tools/knowledge-graph/mimir-memory-guide/SKILL.md +135 -0
- package/skills/tools/knowledge-graph/notero-zotero-notion-guide/SKILL.md +187 -0
- package/skills/tools/knowledge-graph/open-webui-tools-guide/SKILL.md +156 -0
- package/skills/tools/knowledge-graph/openspg-guide/SKILL.md +210 -0
- package/skills/tools/knowledge-graph/paperpile-notion-guide/SKILL.md +84 -0
- package/skills/tools/knowledge-graph/zotero-markdb-connect-guide/SKILL.md +162 -0
- package/skills/tools/ocr-translate/latex-translation-guide/SKILL.md +176 -0
- package/skills/tools/ocr-translate/math-equation-renderer/SKILL.md +198 -0
- package/skills/tools/ocr-translate/pdf-math-translate-guide/SKILL.md +141 -0
- package/skills/tools/ocr-translate/zotero-pdf-translate-guide/SKILL.md +95 -0
- package/skills/tools/ocr-translate/zotero-pdf2zh-guide/SKILL.md +143 -0
- package/skills/tools/scraping/dataset-finder-guide/SKILL.md +253 -0
- package/skills/tools/scraping/easy-spider-guide/SKILL.md +250 -0
- package/skills/tools/scraping/google-scholar-scraper/SKILL.md +255 -0
- package/skills/tools/scraping/repository-harvesting-guide/SKILL.md +310 -0
- package/skills/writing/citation/academic-citation-manager/SKILL.md +314 -0
- package/skills/writing/citation/academic-citation-manager-guide/SKILL.md +182 -0
- package/skills/writing/citation/citation-assistant-skill/SKILL.md +192 -0
- package/skills/writing/citation/jabref-reference-guide/SKILL.md +127 -0
- package/skills/writing/citation/jasminum-zotero-guide/SKILL.md +103 -0
- package/skills/writing/citation/mendeley-api/SKILL.md +231 -0
- package/skills/writing/citation/obsidian-citation-guide/SKILL.md +164 -0
- package/skills/writing/citation/obsidian-zotero-guide/SKILL.md +137 -0
- package/skills/writing/citation/onecite-reference-guide/SKILL.md +168 -0
- package/skills/writing/citation/papersgpt-zotero-guide/SKILL.md +132 -0
- package/skills/writing/citation/papis-cli-guide/SKILL.md +213 -0
- package/skills/writing/citation/zotero-better-bibtex-guide/SKILL.md +107 -0
- package/skills/writing/citation/zotero-better-notes-guide/SKILL.md +121 -0
- package/skills/writing/citation/zotero-gpt-guide/SKILL.md +111 -0
- package/skills/writing/citation/zotero-mcp-guide/SKILL.md +164 -0
- package/skills/writing/citation/zotero-mdnotes-guide/SKILL.md +162 -0
- package/skills/writing/citation/zotero-reference-guide/SKILL.md +139 -0
- package/skills/writing/citation/zotero-scholar-guide/SKILL.md +294 -0
- package/skills/writing/citation/zotfile-attachment-guide/SKILL.md +140 -0
- package/skills/writing/composition/ml-paper-writing/SKILL.md +163 -0
- package/skills/writing/composition/opendraft-thesis-guide/SKILL.md +200 -0
- package/skills/writing/composition/paper-debugger-guide/SKILL.md +143 -0
- package/skills/writing/composition/paperforge-guide/SKILL.md +205 -0
- package/skills/writing/composition/research-paper-writer/SKILL.md +226 -0
- package/skills/writing/composition/scientific-writing-resources/SKILL.md +151 -0
- package/skills/writing/composition/scientific-writing-wrapper/SKILL.md +153 -0
- package/skills/writing/latex/academic-writing-latex/SKILL.md +285 -0
- package/skills/writing/latex/latex-drawing-collection/SKILL.md +154 -0
- package/skills/writing/latex/latex-templates-collection/SKILL.md +159 -0
- package/skills/writing/latex/md-to-pdf-academic/SKILL.md +230 -0
- package/skills/writing/latex/tex-render-guide/SKILL.md +243 -0
- package/skills/writing/polish/academic-tone-guide/SKILL.md +209 -0
- package/skills/writing/polish/chinese-text-humanizer/SKILL.md +140 -0
- package/skills/writing/polish/conciseness-editing-guide/SKILL.md +225 -0
- package/skills/writing/polish/paper-polish-guide/SKILL.md +160 -0
- package/skills/writing/templates/arxiv-preprint-template/SKILL.md +184 -0
- package/skills/writing/templates/elegant-paper-template/SKILL.md +141 -0
- package/skills/writing/templates/graphical-abstract-guide/SKILL.md +183 -0
- package/skills/writing/templates/novathesis-guide/SKILL.md +152 -0
- package/skills/writing/templates/scientific-article-pdf/SKILL.md +261 -0
- package/skills/writing/templates/sjtuthesis-guide/SKILL.md +197 -0
- package/skills/writing/templates/thuthesis-guide/SKILL.md +181 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +0 -207
|
@@ -0,0 +1,247 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: nlp-toolkit-guide
|
|
3
|
+
description: "NLP analysis with perplexity scoring, burstiness, and entropy metrics"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "💬"
|
|
7
|
+
category: "domains"
|
|
8
|
+
subcategory: "ai-ml"
|
|
9
|
+
keywords: ["NLP", "perplexity", "burstiness", "entropy", "tokenization", "text analysis"]
|
|
10
|
+
source: "https://github.com/huggingface/transformers"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# NLP Toolkit Guide
|
|
14
|
+
|
|
15
|
+
## Overview
|
|
16
|
+
|
|
17
|
+
Natural Language Processing research requires a diverse set of analytical tools beyond standard model training. Text quality assessment, AI-generated text detection, linguistic feature extraction, and corpus analysis all depend on well-understood metrics: perplexity, burstiness, entropy, and their variants.
|
|
18
|
+
|
|
19
|
+
This guide provides practical implementations of these core NLP metrics alongside patterns for tokenization, embedding analysis, and text feature engineering. The focus is on metrics used in active research areas -- AI text detection (perplexity + burstiness classifiers), information-theoretic analysis of corpora, and linguistic diversity measurement.
|
|
20
|
+
|
|
21
|
+
These tools are framework-agnostic where possible, but leverage Hugging Face Transformers for language model operations and standard Python scientific computing libraries for statistical analysis.
|
|
22
|
+
|
|
23
|
+
## Perplexity Scoring
|
|
24
|
+
|
|
25
|
+
Perplexity measures how well a language model predicts a text. Lower perplexity means the text is more predictable to the model -- a key signal in AI text detection, model evaluation, and domain adaptation.
|
|
26
|
+
|
|
27
|
+
```python
|
|
28
|
+
import torch
|
|
29
|
+
import numpy as np
|
|
30
|
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
31
|
+
|
|
32
|
+
def compute_perplexity(text: str, model_name: str = "gpt2") -> dict:
    """
    Compute token-level and text-level perplexity using a causal LM.

    Perplexity is model-relative: the same text scores differently under
    different models, so `model_name` is part of the result's meaning.

    Args:
        text: Input text; truncated to the model's first 1024 tokens.
        model_name: Hugging Face model id of a causal language model.

    Returns:
        dict with 'perplexity', 'log_likelihood', 'token_perplexities'
        (one entry per *predicted* token, i.e. num_tokens - 1 values),
        and 'num_tokens'.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = encodings.input_ids

    # Single forward pass under no_grad: the loss gives the mean NLL, and the
    # logits give per-token surprisal for finer-grained analysis.
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        neg_log_likelihood = outputs.loss.item()

        logits = outputs.logits[:, :-1, :]  # Shift for next-token prediction
        targets = input_ids[:, 1:]
        log_probs = torch.log_softmax(logits, dim=-1)
        token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        # squeeze(0) (not squeeze()) so a single predicted token still yields
        # a one-element list rather than a bare float from a 0-d tensor.
        token_perplexities = torch.exp(-token_log_probs).squeeze(0).tolist()

    perplexity = np.exp(neg_log_likelihood)

    return {
        "perplexity": perplexity,
        "log_likelihood": -neg_log_likelihood,
        "token_perplexities": token_perplexities,
        "num_tokens": input_ids.size(1),
    }
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
## Burstiness Analysis
|
|
69
|
+
|
|
70
|
+
Burstiness measures the tendency of words to appear in clusters rather than uniformly across a text. Human writing tends to be "burstier" -- once a topic is introduced, related terms cluster together, then disappear.
|
|
71
|
+
|
|
72
|
+
```python
|
|
73
|
+
from collections import Counter
|
|
74
|
+
import numpy as np
|
|
75
|
+
|
|
76
|
+
def compute_burstiness(text: str, min_freq: int = 2) -> dict:
    """
    Score how "bursty" each repeated word in *text* is.

    Burstiness B = (sigma - mu) / (sigma + mu), computed over the gaps
    between successive occurrences of a word (sigma = std dev, mu = mean).
    B ranges from -1 (perfectly periodic) to 1 (highly clustered); human
    writing typically shows B > 0.

    Args:
        text: Input text (lowercased, whitespace-tokenized).
        min_freq: Minimum occurrence count for a word to be scored.

    Returns:
        dict with 'average_burstiness', per-word 'word_burstiness', and
        'num_words_analyzed'.
    """
    tokens = text.lower().split()

    # Map each word to the list of token indices where it occurs.
    occurrences = {}
    for idx, token in enumerate(tokens):
        occurrences.setdefault(token, []).append(idx)

    per_word = {}
    for token, indices in occurrences.items():
        if len(indices) < min_freq:
            continue
        gaps = np.diff(indices)
        mean_gap = np.mean(gaps)
        std_gap = np.std(gaps)
        denom = std_gap + mean_gap
        per_word[token] = 0.0 if denom == 0 else (std_gap - mean_gap) / denom

    # Aggregate burstiness across all scored words.
    overall = np.mean(list(per_word.values())) if per_word else 0.0

    return {
        "average_burstiness": overall,
        "word_burstiness": per_word,
        "num_words_analyzed": len(per_word),
    }
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
## Entropy and Information-Theoretic Metrics
|
|
115
|
+
|
|
116
|
+
```python
|
|
117
|
+
from collections import Counter
|
|
118
|
+
import numpy as np
|
|
119
|
+
|
|
120
|
+
def compute_entropy(text: str, level: str = "word") -> dict:
    """
    Compute Shannon entropy of *text* at word or character level.

    Higher entropy indicates more diverse, less predictable text.
    AI-generated text often has lower entropy than human text.

    Args:
        text: Input text (lowercased before counting).
        level: "word" (whitespace tokens) or "character".

    Returns:
        dict with 'entropy' (bits), 'normalized_entropy' (entropy divided by
        the maximum for the observed vocabulary), 'vocabulary_size',
        'total_tokens', and 'type_token_ratio'. Empty input yields all-zero
        metrics instead of dividing by zero.

    Raises:
        ValueError: if level is not 'word' or 'character'.
    """
    if level == "word":
        tokens = text.lower().split()
    elif level == "character":
        tokens = list(text.lower())
    else:
        raise ValueError("level must be 'word' or 'character'")

    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # Empty input: avoid ZeroDivisionError and report zeroed metrics.
        return {
            "entropy": 0.0,
            "normalized_entropy": 0.0,
            "vocabulary_size": 0,
            "total_tokens": 0,
            "type_token_ratio": 0.0,
        }

    probabilities = np.array([c / total for c in counts.values()])

    # The epsilon guards log2(0); with every count > 0 it only perturbs
    # the result negligibly.
    entropy = -np.sum(probabilities * np.log2(probabilities + 1e-12))
    max_entropy = np.log2(len(counts)) if len(counts) > 1 else 1.0
    normalized_entropy = entropy / max_entropy

    return {
        "entropy": entropy,
        "normalized_entropy": normalized_entropy,
        "vocabulary_size": len(counts),
        "total_tokens": total,
        "type_token_ratio": len(counts) / total,
    }
|
|
149
|
+
|
|
150
|
+
def compute_conditional_entropy(text: str, n: int = 2) -> float:
    """
    Compute the conditional entropy H(X_n | X_1..X_{n-1}) in bits.

    Estimates, from n-gram counts, how uncertain the n-th word is given its
    (n-1)-word context. Lower values mean the text is locally more
    predictable.

    Args:
        text: Input text (lowercased, whitespace-tokenized).
        n: n-gram order (default bigrams).

    Returns:
        Conditional entropy in bits; 0.0 when the text has fewer than n words.
    """
    words = text.lower().split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    context_counts = Counter(ng[:-1] for ng in ngrams)
    ngram_counts = Counter(ngrams)

    # H = -sum over ngrams of p(ngram) * log2 p(last | context),
    # with p(last | context) = count(ngram) / count(context).
    h = 0.0
    total = len(ngrams)
    for ngram, count in ngram_counts.items():
        p_ngram = count / total
        p_conditional = count / context_counts[ngram[:-1]]
        h -= p_ngram * np.log2(p_conditional + 1e-12)

    return h
|
|
171
|
+
```
|
|
172
|
+
|
|
173
|
+
## AI Text Detection Pipeline
|
|
174
|
+
|
|
175
|
+
Combining perplexity, burstiness, and entropy into a detection pipeline:
|
|
176
|
+
|
|
177
|
+
```python
|
|
178
|
+
def analyze_text_authenticity(text: str) -> dict:
    """
    Multi-signal analysis for AI vs. human text classification.

    Combines perplexity, burstiness, and word/character entropy into a
    simple fraction-of-signals-fired score; no single metric is reliable
    on its own.
    """
    ppl = compute_perplexity(text)
    burst = compute_burstiness(text)
    word_ent = compute_entropy(text, level="word")
    char_ent = compute_entropy(text, level="character")

    # Heuristic thresholds from literature
    signals = {
        "low_perplexity": ppl["perplexity"] < 30,
        "low_burstiness": burst["average_burstiness"] < 0.1,
        "low_entropy": word_ent["normalized_entropy"] < 0.7,
        "uniform_token_ppl": np.std(ppl["token_perplexities"]) < 5,
    }

    # Fraction of heuristics that fired; crude but interpretable.
    ai_score = sum(signals.values()) / len(signals)

    return {
        "perplexity": ppl["perplexity"],
        "burstiness": burst["average_burstiness"],
        "word_entropy": word_ent["entropy"],
        "char_entropy": char_ent["entropy"],
        "type_token_ratio": word_ent["type_token_ratio"],
        "ai_likelihood_score": ai_score,
        "signals": signals,
    }
|
|
207
|
+
```
|
|
208
|
+
|
|
209
|
+
## Tokenization Patterns
|
|
210
|
+
|
|
211
|
+
```python
|
|
212
|
+
from transformers import AutoTokenizer
|
|
213
|
+
|
|
214
|
+
def compare_tokenizers(text: str, models: list = None) -> dict:
    """
    Compare tokenization of *text* across different models.

    Args:
        text: Text to tokenize.
        models: Hugging Face model ids; defaults to a small representative set.

    Returns:
        Mapping of model name to token count, a token sample, vocabulary size,
        and a characters-per-token compression ratio (0.0 when the text
        produces no tokens).
    """
    if models is None:
        models = ["gpt2", "bert-base-uncased", "facebook/opt-1.3b"]

    results = {}
    for model_name in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokens = tokenizer.tokenize(text)
        results[model_name] = {
            "num_tokens": len(tokens),
            "tokens": tokens[:50],  # First 50 for inspection
            "vocab_size": tokenizer.vocab_size,
            # Guard against empty tokenization (e.g. empty input text),
            # which would otherwise divide by zero.
            "compression_ratio": len(text) / len(tokens) if tokens else 0.0,
        }
    return results
|
|
230
|
+
```
|
|
231
|
+
|
|
232
|
+
## Best Practices
|
|
233
|
+
|
|
234
|
+
- **Always specify the model** when computing perplexity. Perplexity is model-relative, not absolute.
|
|
235
|
+
- **Normalize by text length** when comparing entropy across texts of different sizes.
|
|
236
|
+
- **Use sliding windows** for long documents to capture local variation in metrics.
|
|
237
|
+
- **Combine multiple signals** for AI text detection -- no single metric is reliable alone.
|
|
238
|
+
- **Report confidence intervals** by computing metrics on paragraph-level chunks, then aggregating.
|
|
239
|
+
- **Be aware of domain shift.** Perplexity thresholds trained on news text will not transfer to scientific papers.
|
|
240
|
+
|
|
241
|
+
## References
|
|
242
|
+
|
|
243
|
+
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/) -- Model hub and tokenizer library
|
|
244
|
+
- [DetectGPT](https://arxiv.org/abs/2301.11305) -- Perplexity-based AI text detection (Mitchell et al., 2023)
|
|
245
|
+
- [Burstiness and Memory in Text](https://doi.org/10.1103/PhysRevLett.114.078101) -- Altmann et al., 2015
|
|
246
|
+
- [NLTK documentation](https://www.nltk.org/) -- Classic NLP toolkit for feature engineering
|
|
247
|
+
- [spaCy documentation](https://spacy.io/) -- Industrial-strength NLP for production pipelines
|
|
@@ -0,0 +1,137 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: npcpy-research-guide
|
|
3
|
+
description: "All-in-one Python library for NLP, agents, and knowledge graphs"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "🎭"
|
|
7
|
+
category: "domains"
|
|
8
|
+
subcategory: "ai-ml"
|
|
9
|
+
keywords: ["npcpy", "NLP", "agents", "knowledge graph", "all-in-one", "Python library"]
|
|
10
|
+
source: "https://github.com/NPC-Worldwide/npcpy"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# npcpy Research Guide
|
|
14
|
+
|
|
15
|
+
## Overview
|
|
16
|
+
|
|
17
|
+
npcpy is an all-in-one Python library that combines NLP, agent orchestration, and knowledge graph capabilities in a single package. It provides tools for text processing, entity extraction, agent creation, graph-based reasoning, and research automation. Designed as a Swiss Army knife for AI researchers who need quick access to diverse NLP and agent capabilities without juggling many dependencies.
|
|
18
|
+
|
|
19
|
+
## Installation
|
|
20
|
+
|
|
21
|
+
```bash
|
|
22
|
+
pip install npcpy
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
## Core Modules
|
|
26
|
+
|
|
27
|
+
### NLP Processing
|
|
28
|
+
|
|
29
|
+
```python
|
|
30
|
+
from npcpy import NLP
|
|
31
|
+
|
|
32
|
+
nlp = NLP()
|
|
33
|
+
|
|
34
|
+
# Text processing pipeline
|
|
35
|
+
doc = nlp.process(
|
|
36
|
+
"Transformers have revolutionized NLP since Vaswani et al. "
|
|
37
|
+
"introduced the attention mechanism in 2017."
|
|
38
|
+
)
|
|
39
|
+
|
|
40
|
+
# Named entities
|
|
41
|
+
for entity in doc.entities:
|
|
42
|
+
print(f"[{entity.type}] {entity.text}")
|
|
43
|
+
# [METHOD] Transformers
|
|
44
|
+
# [PERSON] Vaswani
|
|
45
|
+
# [CONCEPT] attention mechanism
|
|
46
|
+
# [DATE] 2017
|
|
47
|
+
|
|
48
|
+
# Key phrases
|
|
49
|
+
print(doc.key_phrases)
|
|
50
|
+
# ["attention mechanism", "Transformers", "NLP"]
|
|
51
|
+
|
|
52
|
+
# Sentiment / stance
|
|
53
|
+
print(doc.sentiment) # positive
|
|
54
|
+
```
|
|
55
|
+
|
|
56
|
+
### Agent Creation
|
|
57
|
+
|
|
58
|
+
```python
|
|
59
|
+
from npcpy import Agent, Tool
|
|
60
|
+
|
|
61
|
+
# Create a research agent
|
|
62
|
+
agent = Agent(
|
|
63
|
+
name="research_assistant",
|
|
64
|
+
llm_provider="anthropic",
|
|
65
|
+
tools=[
|
|
66
|
+
Tool("web_search", description="Search the web"),
|
|
67
|
+
Tool("paper_search", description="Search academic papers"),
|
|
68
|
+
Tool("calculator", description="Math calculations"),
|
|
69
|
+
],
|
|
70
|
+
)
|
|
71
|
+
|
|
72
|
+
# Run a task
|
|
73
|
+
result = agent.run(
|
|
74
|
+
"Find the top 5 most cited papers on few-shot learning "
|
|
75
|
+
"from 2023 and summarize their approaches."
|
|
76
|
+
)
|
|
77
|
+
print(result.output)
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
### Knowledge Graphs
|
|
81
|
+
|
|
82
|
+
```python
|
|
83
|
+
from npcpy import KnowledgeGraph
|
|
84
|
+
|
|
85
|
+
kg = KnowledgeGraph()
|
|
86
|
+
|
|
87
|
+
# Extract knowledge from text
|
|
88
|
+
kg.extract_from_text(
|
|
89
|
+
"BERT uses masked language modeling for pre-training. "
|
|
90
|
+
"GPT uses autoregressive language modeling. "
|
|
91
|
+
"Both are based on the Transformer architecture."
|
|
92
|
+
)
|
|
93
|
+
|
|
94
|
+
# Query the graph
|
|
95
|
+
results = kg.query("What models use Transformer architecture?")
|
|
96
|
+
# ["BERT", "GPT"]
|
|
97
|
+
|
|
98
|
+
# Visualize
|
|
99
|
+
kg.visualize("knowledge_graph.html")
|
|
100
|
+
|
|
101
|
+
# Export
|
|
102
|
+
kg.export("kg.json")
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
## Research Workflows
|
|
106
|
+
|
|
107
|
+
```python
|
|
108
|
+
from npcpy import ResearchWorkflow
|
|
109
|
+
|
|
110
|
+
workflow = ResearchWorkflow(llm_provider="anthropic")
|
|
111
|
+
|
|
112
|
+
# Literature search + synthesis
|
|
113
|
+
report = workflow.literature_review(
|
|
114
|
+
topic="prompt engineering techniques",
|
|
115
|
+
num_papers=20,
|
|
116
|
+
synthesis_style="academic",
|
|
117
|
+
)
|
|
118
|
+
report.save("review.md")
|
|
119
|
+
|
|
120
|
+
# Paper analysis
|
|
121
|
+
analysis = workflow.analyze_paper("paper.pdf")
|
|
122
|
+
print(analysis.summary)
|
|
123
|
+
print(analysis.methodology)
|
|
124
|
+
print(analysis.key_findings)
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
## Use Cases
|
|
128
|
+
|
|
129
|
+
1. **Quick NLP**: Text processing without heavy setup
|
|
130
|
+
2. **Agent prototyping**: Rapid agent creation and testing
|
|
131
|
+
3. **Knowledge extraction**: Build KGs from research text
|
|
132
|
+
4. **Research automation**: Literature search and synthesis
|
|
133
|
+
5. **Teaching**: Demonstrate NLP/agent concepts
|
|
134
|
+
|
|
135
|
+
## References
|
|
136
|
+
|
|
137
|
+
- [npcpy GitHub](https://github.com/NPC-Worldwide/npcpy)
|
|
@@ -0,0 +1,281 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: pytorch-guide
|
|
3
|
+
description: "Avoid common PyTorch mistakes and apply robust training patterns"
|
|
4
|
+
metadata:
|
|
5
|
+
openclaw:
|
|
6
|
+
emoji: "🔥"
|
|
7
|
+
category: "domains"
|
|
8
|
+
subcategory: "ai-ml"
|
|
9
|
+
keywords: ["PyTorch", "deep learning", "training loop", "GPU", "debugging", "autograd"]
|
|
10
|
+
source: "https://github.com/pytorch/pytorch"
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# PyTorch Guide
|
|
14
|
+
|
|
15
|
+
## Overview
|
|
16
|
+
|
|
17
|
+
PyTorch is the dominant deep learning framework in academic research, used in the majority of papers at NeurIPS, ICML, and ICLR. Its eager execution model, Pythonic API, and seamless integration with the Python scientific stack make it the default choice for prototyping and publishing research code.
|
|
18
|
+
|
|
19
|
+
However, PyTorch's flexibility is a double-edged sword. Subtle bugs -- forgetting `model.eval()`, accumulating gradients across batches, incorrect device placement, memory leaks from detached tensors -- can silently corrupt results without raising errors. These issues are especially dangerous in research settings where ground truth is unknown.
|
|
20
|
+
|
|
21
|
+
This guide catalogs the most common PyTorch mistakes, provides battle-tested training patterns, and covers performance optimization techniques that every researcher should know. The patterns here are drawn from top-tier ML research codebases and the PyTorch team's own best practice recommendations.
|
|
22
|
+
|
|
23
|
+
## Common Mistakes and Fixes
|
|
24
|
+
|
|
25
|
+
### The Big Five Mistakes
|
|
26
|
+
|
|
27
|
+
```python
|
|
28
|
+
# MISTAKE 1: Forgetting model.eval() and torch.no_grad()
|
|
29
|
+
# This causes dropout and batch norm to behave incorrectly during evaluation
|
|
30
|
+
# and wastes memory by tracking gradients
|
|
31
|
+
|
|
32
|
+
# WRONG
|
|
33
|
+
def evaluate(model, dataloader):
|
|
34
|
+
total_correct = 0
|
|
35
|
+
for x, y in dataloader:
|
|
36
|
+
output = model(x) # Dropout still active! BN using batch stats!
|
|
37
|
+
total_correct += (output.argmax(1) == y).sum().item()
|
|
38
|
+
|
|
39
|
+
# RIGHT
|
|
40
|
+
@torch.no_grad()
def evaluate(model, dataloader):
    """Count correct top-1 predictions over *dataloader* without tracking gradients."""
    model.eval()
    correct = 0
    for inputs, labels in dataloader:
        preds = model(inputs).argmax(1)
        correct += (preds == labels).sum().item()
    model.train()  # Restore training mode
    return correct
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
```python
|
|
52
|
+
# MISTAKE 2: Not zeroing gradients (they accumulate by default!)
|
|
53
|
+
# WRONG - gradients from previous batch add to current batch
|
|
54
|
+
for x, y in dataloader:
|
|
55
|
+
loss = criterion(model(x), y)
|
|
56
|
+
loss.backward()
|
|
57
|
+
optimizer.step()
|
|
58
|
+
|
|
59
|
+
# RIGHT
|
|
60
|
+
for x, y in dataloader:
|
|
61
|
+
optimizer.zero_grad() # Clear previous gradients
|
|
62
|
+
loss = criterion(model(x), y)
|
|
63
|
+
loss.backward()
|
|
64
|
+
optimizer.step()
|
|
65
|
+
|
|
66
|
+
# BETTER (slightly faster, avoids memset)
|
|
67
|
+
for x, y in dataloader:
|
|
68
|
+
optimizer.zero_grad(set_to_none=True)
|
|
69
|
+
loss = criterion(model(x), y)
|
|
70
|
+
loss.backward()
|
|
71
|
+
optimizer.step()
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
```python
|
|
75
|
+
# MISTAKE 3: Memory leaks from tensor operations in metrics
|
|
76
|
+
# WRONG - keeps entire computation graph in memory
|
|
77
|
+
losses = []
|
|
78
|
+
for x, y in dataloader:
|
|
79
|
+
loss = criterion(model(x), y)
|
|
80
|
+
losses.append(loss) # Retains computation graph!
|
|
81
|
+
|
|
82
|
+
# RIGHT - detach from graph and move to CPU
|
|
83
|
+
losses = []
|
|
84
|
+
for x, y in dataloader:
|
|
85
|
+
loss = criterion(model(x), y)
|
|
86
|
+
losses.append(loss.item()) # .item() extracts Python scalar
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
```python
|
|
90
|
+
# MISTAKE 4: Incorrect device placement
|
|
91
|
+
# WRONG - model on GPU, data on CPU
|
|
92
|
+
model = model.cuda()
|
|
93
|
+
for x, y in dataloader:
|
|
94
|
+
output = model(x) # RuntimeError: tensors on different devices
|
|
95
|
+
|
|
96
|
+
# RIGHT
|
|
97
|
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
|
98
|
+
model = model.to(device)
|
|
99
|
+
for x, y in dataloader:
|
|
100
|
+
x, y = x.to(device), y.to(device)
|
|
101
|
+
output = model(x)
|
|
102
|
+
```
|
|
103
|
+
|
|
104
|
+
```python
|
|
105
|
+
# MISTAKE 5: Mutable default arguments in dataset transforms
|
|
106
|
+
# WRONG
|
|
107
|
+
class MyDataset(Dataset):
|
|
108
|
+
def __init__(self, data, transforms=[]): # Shared mutable list!
|
|
109
|
+
self.transforms = transforms
|
|
110
|
+
|
|
111
|
+
# RIGHT
|
|
112
|
+
class MyDataset(Dataset):
|
|
113
|
+
def __init__(self, data, transforms=None):
|
|
114
|
+
self.transforms = transforms or []
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
## Robust Training Template
|
|
118
|
+
|
|
119
|
+
```python
|
|
120
|
+
import torch
|
|
121
|
+
import torch.nn as nn
|
|
122
|
+
from torch.utils.data import DataLoader
|
|
123
|
+
from torch.cuda.amp import autocast, GradScaler
|
|
124
|
+
import time
|
|
125
|
+
|
|
126
|
+
def train(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    optimizer: torch.optim.Optimizer,
    scheduler,
    num_epochs: int,
    device: torch.device,
    use_amp: bool = True,
):
    """Production-quality training loop with mixed precision and checkpointing.

    Trains `model` for `num_epochs`, validates after every epoch, and saves a
    checkpoint to "best_checkpoint.pt" whenever validation loss improves.

    Args:
        model: Network to train; the caller must place it on `device` first.
        train_loader: Batches of (inputs, targets) used for optimization.
        val_loader: Batches of (inputs, targets) used for evaluation.
        optimizer: Configured optimizer over `model.parameters()`.
        scheduler: LR scheduler; `.step()` is called once per epoch, so an
            epoch-based scheduler is assumed — NOTE(review): confirm callers
            do not pass a per-iteration scheduler.
        num_epochs: Number of full passes over `train_loader`.
        device: Target device each batch is moved to.
        use_amp: Enable automatic mixed precision (autocast + loss scaling).
    """
    criterion = nn.CrossEntropyLoss()  # classification with integer targets (see argmax below)
    scaler = GradScaler(enabled=use_amp)  # no-op passthrough when use_amp is False
    best_val_loss = float("inf")

    for epoch in range(num_epochs):
        # --- Training ---
        model.train()
        train_loss = 0.0
        t0 = time.time()

        for batch_idx, (x, y) in enumerate(train_loader):
            # non_blocking overlaps host-to-device copies with compute
            # (effective when the DataLoader uses pinned memory)
            x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)  # cheaper than zeroing gradients in place

            with autocast(enabled=use_amp):
                output = model(x)
                loss = criterion(output, y)

            scaler.scale(loss).backward()
            # Unscale before clipping so max_norm applies to true gradient magnitudes.
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)  # skips the update if scaled grads contain inf/nan
            scaler.update()

            train_loss += loss.item()  # .item() detaches; no graph retained across batches

        scheduler.step()  # once per epoch (see docstring note on scheduler type)
        avg_train_loss = train_loss / len(train_loader)

        # --- Validation ---
        model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():  # no autograd graphs during evaluation
            for x, y in val_loader:
                x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
                with autocast(enabled=use_amp):
                    output = model(x)
                    loss = criterion(output, y)
                val_loss += loss.item()
                correct += (output.argmax(1) == y).sum().item()  # top-1 accuracy count
                total += y.size(0)

        avg_val_loss = val_loss / len(val_loader)
        val_acc = correct / total

        # --- Checkpoint ---
        # Keep only the best-so-far weights, keyed on validation loss.
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save({
                "epoch": epoch,
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "val_loss": avg_val_loss,
            }, "best_checkpoint.pt")

        elapsed = time.time() - t0
        print(f"Epoch {epoch+1}/{num_epochs} | "
              f"Train Loss: {avg_train_loss:.4f} | "
              f"Val Loss: {avg_val_loss:.4f} | "
              f"Val Acc: {val_acc:.4f} | "
              f"Time: {elapsed:.1f}s")
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
## Performance Optimization
|
|
205
|
+
|
|
206
|
+
| Technique | Speedup | Effort | When to Use |
|
|
207
|
+
|-----------|---------|--------|-------------|
|
|
208
|
+
| Mixed precision (AMP) | 1.5-3x | Low | Always on modern GPUs |
|
|
209
|
+
| `torch.compile()` | 1.2-2x | Low | PyTorch 2.0+, stable models |
|
|
210
|
+
| `pin_memory=True` in DataLoader | 1.1-1.3x | Trivial | Always with GPU training |
|
|
211
|
+
| `non_blocking=True` in `.to()` | 1.05-1.1x | Trivial | Always with pinned memory |
|
|
212
|
+
| Gradient accumulation | N/A | Low | When batch size limited by memory |
|
|
213
|
+
| `torch.backends.cudnn.benchmark = True` | 1.1-1.5x | Trivial | Fixed input sizes |
|
|
214
|
+
| Distributed Data Parallel | Near-linear | Medium | Multi-GPU training |
|
|
215
|
+
|
|
216
|
+
### GPU Memory Management
|
|
217
|
+
|
|
218
|
+
```python
|
|
219
|
+
# Check GPU memory usage
|
|
220
|
+
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
|
|
221
|
+
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
|
|
222
|
+
|
|
223
|
+
# Force garbage collection when debugging OOM
|
|
224
|
+
torch.cuda.empty_cache()
|
|
225
|
+
import gc; gc.collect()
|
|
226
|
+
|
|
227
|
+
# Gradient accumulation for effective large batch sizes
|
|
228
|
+
accumulation_steps = 4
|
|
229
|
+
for i, (x, y) in enumerate(dataloader):
|
|
230
|
+
loss = criterion(model(x.to(device)), y.to(device)) / accumulation_steps
|
|
231
|
+
loss.backward()
|
|
232
|
+
if (i + 1) % accumulation_steps == 0:
|
|
233
|
+
optimizer.step()
|
|
234
|
+
optimizer.zero_grad(set_to_none=True)
|
|
235
|
+
```
|
|
236
|
+
|
|
237
|
+
## Reproducibility Checklist
|
|
238
|
+
|
|
239
|
+
```python
|
|
240
|
+
import torch
|
|
241
|
+
import numpy as np
|
|
242
|
+
import random
|
|
243
|
+
|
|
244
|
+
def seed_everything(seed=42):
    """
    Seed every RNG a PyTorch training run touches and return a
    DataLoader ``worker_init_fn`` deriving per-worker seeds from *seed*.
    """
    # Seed the Python, NumPy, and (all-device) torch generators alike.
    for seeder in (random.seed, np.random.seed, torch.manual_seed, torch.cuda.manual_seed_all):
        seeder(seed)
    # Trade cuDNN autotuning speed for run-to-run determinism.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    def seed_worker(worker_id):
        """Re-seed NumPy and Python RNGs inside each DataLoader worker."""
        derived = seed + worker_id
        np.random.seed(derived)
        random.seed(derived)

    return seed_worker
|
|
257
|
+
|
|
258
|
+
seed_worker = seed_everything(42)
|
|
259
|
+
dataloader = DataLoader(
|
|
260
|
+
dataset, batch_size=32, shuffle=True,
|
|
261
|
+
worker_init_fn=seed_worker,
|
|
262
|
+
generator=torch.Generator().manual_seed(42),
|
|
263
|
+
)
|
|
264
|
+
```
|
|
265
|
+
|
|
266
|
+
## Best Practices
|
|
267
|
+
|
|
268
|
+
- **Always use `torch.no_grad()` for inference.** It reduces memory usage by ~50%.
|
|
269
|
+
- **Prefer `model.to(device)` over `.cuda()`.** It is device-agnostic and works on CPU, CUDA, and MPS.
|
|
270
|
+
- **Use `torch.compile(model)` on PyTorch 2.0+** for free speedups on stable architectures.
|
|
271
|
+
- **Profile before optimizing.** Use `torch.profiler` to find actual bottlenecks.
|
|
272
|
+
- **Pin your PyTorch version in `requirements.txt`.** Different versions can produce different numerical results.
|
|
273
|
+
- **Use `torchinfo` for model summary** instead of printing the model object.
|
|
274
|
+
|
|
275
|
+
## References
|
|
276
|
+
|
|
277
|
+
- [PyTorch documentation](https://pytorch.org/docs/stable/) -- Official API reference
|
|
278
|
+
- [PyTorch tutorials](https://pytorch.org/tutorials/) -- End-to-end examples from the PyTorch team
|
|
279
|
+
- [PyTorch best practices](https://pytorch.org/docs/stable/notes/cuda.html) -- CUDA semantics and best practices
|
|
280
|
+
- [Effective PyTorch](https://github.com/vahidk/EffectivePyTorch) -- Community best practices guide
|
|
281
|
+
- [PyTorch Lightning](https://lightning.ai/) -- High-level training framework built on PyTorch
|