@wentorai/research-plugins 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +22 -22
- package/curated/analysis/README.md +71 -56
- package/curated/domains/README.md +176 -67
- package/curated/literature/README.md +71 -47
- package/curated/research/README.md +91 -58
- package/curated/tools/README.md +88 -87
- package/curated/writing/README.md +80 -45
- package/mcp-configs/cloud-docs/confluence-mcp.json +37 -0
- package/mcp-configs/cloud-docs/google-drive-mcp.json +35 -0
- package/mcp-configs/cloud-docs/notion-mcp.json +29 -0
- package/mcp-configs/communication/discord-mcp.json +29 -0
- package/mcp-configs/communication/slack-mcp.json +29 -0
- package/mcp-configs/communication/telegram-mcp.json +28 -0
- package/mcp-configs/database/neo4j-mcp.json +37 -0
- package/mcp-configs/database/postgres-mcp.json +28 -0
- package/mcp-configs/database/sqlite-mcp.json +29 -0
- package/mcp-configs/dev-platform/github-mcp.json +31 -0
- package/mcp-configs/dev-platform/gitlab-mcp.json +34 -0
- package/mcp-configs/email/email-mcp.json +40 -0
- package/mcp-configs/email/gmail-mcp.json +37 -0
- package/mcp-configs/registry.json +178 -149
- package/mcp-configs/repository/dataverse-mcp.json +33 -0
- package/mcp-configs/repository/huggingface-mcp.json +29 -0
- package/openclaw.plugin.json +2 -2
- package/package.json +2 -2
- package/skills/analysis/dataviz/algorithm-visualizer-guide/SKILL.md +259 -0
- package/skills/analysis/dataviz/bokeh-visualization-guide/SKILL.md +270 -0
- package/skills/analysis/dataviz/chart-image-generator/SKILL.md +229 -0
- package/skills/analysis/dataviz/d3-visualization-guide/SKILL.md +281 -0
- package/skills/analysis/dataviz/echarts-visualization-guide/SKILL.md +250 -0
- package/skills/analysis/dataviz/metabase-analytics-guide/SKILL.md +242 -0
- package/skills/analysis/dataviz/plotly-interactive-guide/SKILL.md +266 -0
- package/skills/analysis/dataviz/redash-analytics-guide/SKILL.md +284 -0
- package/skills/analysis/econometrics/econml-causal-guide/SKILL.md +163 -0
- package/skills/analysis/econometrics/mostly-harmless-guide/SKILL.md +139 -0
- package/skills/analysis/econometrics/panel-data-analyst/SKILL.md +259 -0
- package/skills/analysis/econometrics/python-causality-guide/SKILL.md +134 -0
- package/skills/analysis/econometrics/stata-accounting-guide/SKILL.md +269 -0
- package/skills/analysis/econometrics/stata-analyst-guide/SKILL.md +245 -0
- package/skills/analysis/statistics/data-anomaly-detection/SKILL.md +157 -0
- package/skills/analysis/statistics/ml-experiment-tracker/SKILL.md +212 -0
- package/skills/analysis/statistics/pywayne-statistics-guide/SKILL.md +192 -0
- package/skills/analysis/statistics/quantitative-methods-guide/SKILL.md +193 -0
- package/skills/analysis/statistics/senior-data-scientist-guide/SKILL.md +223 -0
- package/skills/analysis/wrangling/csv-data-analyzer/SKILL.md +170 -0
- package/skills/analysis/wrangling/data-cleaning-pipeline/SKILL.md +266 -0
- package/skills/analysis/wrangling/data-cog-guide/SKILL.md +178 -0
- package/skills/analysis/wrangling/stata-data-cleaning/SKILL.md +276 -0
- package/skills/analysis/wrangling/survey-data-processing/SKILL.md +298 -0
- package/skills/domains/ai-ml/ai-model-benchmarking/SKILL.md +209 -0
- package/skills/domains/ai-ml/annotated-dl-papers-guide/SKILL.md +159 -0
- package/skills/domains/ai-ml/dl-transformer-finetune/SKILL.md +239 -0
- package/skills/domains/ai-ml/generative-ai-guide/SKILL.md +146 -0
- package/skills/domains/ai-ml/huggingface-inference-guide/SKILL.md +196 -0
- package/skills/domains/ai-ml/keras-deep-learning/SKILL.md +210 -0
- package/skills/domains/ai-ml/llm-from-scratch-guide/SKILL.md +124 -0
- package/skills/domains/ai-ml/ml-pipeline-guide/SKILL.md +295 -0
- package/skills/domains/ai-ml/nlp-toolkit-guide/SKILL.md +247 -0
- package/skills/domains/ai-ml/pytorch-guide/SKILL.md +281 -0
- package/skills/domains/ai-ml/pytorch-lightning-guide/SKILL.md +244 -0
- package/skills/domains/ai-ml/tensorflow-guide/SKILL.md +241 -0
- package/skills/domains/biomedical/bioagents-guide/SKILL.md +308 -0
- package/skills/domains/biomedical/medgeclaw-guide/SKILL.md +345 -0
- package/skills/domains/biomedical/medical-imaging-guide/SKILL.md +305 -0
- package/skills/domains/business/architecture-design-guide/SKILL.md +279 -0
- package/skills/domains/business/innovation-management-guide/SKILL.md +257 -0
- package/skills/domains/business/operations-research-guide/SKILL.md +258 -0
- package/skills/domains/chemistry/molecular-dynamics-guide/SKILL.md +237 -0
- package/skills/domains/chemistry/pubchem-api-guide/SKILL.md +180 -0
- package/skills/domains/chemistry/spectroscopy-analysis-guide/SKILL.md +290 -0
- package/skills/domains/cs/distributed-systems-guide/SKILL.md +268 -0
- package/skills/domains/cs/formal-verification-guide/SKILL.md +298 -0
- package/skills/domains/ecology/species-distribution-guide/SKILL.md +343 -0
- package/skills/domains/economics/imf-data-api-guide/SKILL.md +174 -0
- package/skills/domains/economics/post-labor-economics/SKILL.md +254 -0
- package/skills/domains/economics/pricing-psychology-guide/SKILL.md +273 -0
- package/skills/domains/economics/world-bank-data-guide/SKILL.md +179 -0
- package/skills/domains/education/assessment-design-guide/SKILL.md +213 -0
- package/skills/domains/education/educational-research-methods/SKILL.md +179 -0
- package/skills/domains/education/mooc-analytics-guide/SKILL.md +206 -0
- package/skills/domains/finance/portfolio-optimization-guide/SKILL.md +279 -0
- package/skills/domains/finance/risk-modeling-guide/SKILL.md +260 -0
- package/skills/domains/finance/stata-accounting-research/SKILL.md +372 -0
- package/skills/domains/geoscience/climate-modeling-guide/SKILL.md +215 -0
- package/skills/domains/geoscience/satellite-remote-sensing/SKILL.md +193 -0
- package/skills/domains/geoscience/seismology-data-guide/SKILL.md +208 -0
- package/skills/domains/humanities/ethical-philosophy-guide/SKILL.md +244 -0
- package/skills/domains/humanities/history-research-guide/SKILL.md +260 -0
- package/skills/domains/humanities/political-history-guide/SKILL.md +241 -0
- package/skills/domains/law/legal-nlp-guide/SKILL.md +236 -0
- package/skills/domains/law/patent-analysis-guide/SKILL.md +257 -0
- package/skills/domains/law/regulatory-compliance-guide/SKILL.md +267 -0
- package/skills/domains/math/symbolic-computation-guide/SKILL.md +263 -0
- package/skills/domains/math/topology-data-analysis/SKILL.md +305 -0
- package/skills/domains/pharma/clinical-trial-design-guide/SKILL.md +271 -0
- package/skills/domains/pharma/drug-target-interaction/SKILL.md +242 -0
- package/skills/domains/pharma/pharmacovigilance-guide/SKILL.md +216 -0
- package/skills/domains/physics/astrophysics-data-guide/SKILL.md +305 -0
- package/skills/domains/physics/particle-physics-guide/SKILL.md +287 -0
- package/skills/domains/social-science/network-analysis-guide/SKILL.md +310 -0
- package/skills/domains/social-science/psychology-research-guide/SKILL.md +270 -0
- package/skills/domains/social-science/sociology-research-guide/SKILL.md +238 -0
- package/skills/literature/discovery/paper-recommendation-guide/SKILL.md +120 -0
- package/skills/literature/discovery/semantic-paper-radar/SKILL.md +144 -0
- package/skills/literature/discovery/zotero-arxiv-daily-guide/SKILL.md +94 -0
- package/skills/literature/fulltext/core-api-guide/SKILL.md +144 -0
- package/skills/literature/fulltext/institutional-repository-guide/SKILL.md +212 -0
- package/skills/literature/fulltext/open-access-mining-guide/SKILL.md +341 -0
- package/skills/literature/metadata/academic-paper-summarizer/SKILL.md +101 -0
- package/skills/literature/metadata/wikidata-api-guide/SKILL.md +156 -0
- package/skills/literature/search/arxiv-batch-reporting/SKILL.md +133 -0
- package/skills/literature/search/arxiv-paper-processor/SKILL.md +141 -0
- package/skills/literature/search/baidu-scholar-guide/SKILL.md +110 -0
- package/skills/literature/search/chatpaper-guide/SKILL.md +122 -0
- package/skills/literature/search/deep-literature-search/SKILL.md +149 -0
- package/skills/literature/search/deepgit-search-guide/SKILL.md +147 -0
- package/skills/literature/search/pasa-paper-search-guide/SKILL.md +138 -0
- package/skills/research/automation/ai-scientist-v2-guide/SKILL.md +284 -0
- package/skills/research/automation/aim-experiment-guide/SKILL.md +234 -0
- package/skills/research/automation/datagen-research-guide/SKILL.md +131 -0
- package/skills/research/automation/kedro-pipeline-guide/SKILL.md +216 -0
- package/skills/research/automation/mle-agent-guide/SKILL.md +139 -0
- package/skills/research/automation/paper-to-agent-guide/SKILL.md +116 -0
- package/skills/research/automation/rd-agent-guide/SKILL.md +246 -0
- package/skills/research/automation/research-paper-orchestrator/SKILL.md +254 -0
- package/skills/research/deep-research/academic-deep-research/SKILL.md +190 -0
- package/skills/research/deep-research/auto-deep-research-guide/SKILL.md +141 -0
- package/skills/research/deep-research/deep-research-pro/SKILL.md +213 -0
- package/skills/research/deep-research/deep-research-work/SKILL.md +204 -0
- package/skills/research/deep-research/deep-searcher-guide/SKILL.md +253 -0
- package/skills/research/deep-research/gpt-researcher-guide/SKILL.md +191 -0
- package/skills/research/deep-research/khoj-research-guide/SKILL.md +200 -0
- package/skills/research/deep-research/local-deep-research-guide/SKILL.md +253 -0
- package/skills/research/deep-research/tongyi-deep-research-guide/SKILL.md +217 -0
- package/skills/research/funding/eu-horizon-guide/SKILL.md +244 -0
- package/skills/research/funding/grant-budget-guide/SKILL.md +284 -0
- package/skills/research/funding/nih-reporter-api-guide/SKILL.md +166 -0
- package/skills/research/funding/nsf-award-api-guide/SKILL.md +133 -0
- package/skills/research/methodology/academic-mentor-guide/SKILL.md +169 -0
- package/skills/research/methodology/claude-scientific-guide/SKILL.md +122 -0
- package/skills/research/methodology/deep-innovator-guide/SKILL.md +242 -0
- package/skills/research/methodology/osf-api-guide/SKILL.md +165 -0
- package/skills/research/methodology/research-paper-kb/SKILL.md +263 -0
- package/skills/research/methodology/research-town-guide/SKILL.md +263 -0
- package/skills/research/paper-review/automated-review-guide/SKILL.md +281 -0
- package/skills/research/paper-review/paper-compare-guide/SKILL.md +238 -0
- package/skills/research/paper-review/paper-digest-guide/SKILL.md +240 -0
- package/skills/research/paper-review/paper-research-assistant/SKILL.md +231 -0
- package/skills/research/paper-review/research-quality-filter/SKILL.md +261 -0
- package/skills/research/paper-review/review-response-guide/SKILL.md +275 -0
- package/skills/tools/code-exec/google-colab-guide/SKILL.md +276 -0
- package/skills/tools/code-exec/kaggle-api-guide/SKILL.md +216 -0
- package/skills/tools/code-exec/overleaf-cli-guide/SKILL.md +279 -0
- package/skills/tools/diagram/code-flow-visualizer/SKILL.md +197 -0
- package/skills/tools/diagram/excalidraw-diagram-guide/SKILL.md +170 -0
- package/skills/tools/diagram/json-data-visualizer/SKILL.md +270 -0
- package/skills/tools/diagram/mermaid-architect-guide/SKILL.md +219 -0
- package/skills/tools/diagram/tldraw-whiteboard-guide/SKILL.md +397 -0
- package/skills/tools/document/docsgpt-guide/SKILL.md +130 -0
- package/skills/tools/document/large-document-reader/SKILL.md +202 -0
- package/skills/tools/document/paper-parse-guide/SKILL.md +243 -0
- package/skills/tools/knowledge-graph/citation-network-builder/SKILL.md +244 -0
- package/skills/tools/knowledge-graph/concept-map-generator/SKILL.md +284 -0
- package/skills/tools/knowledge-graph/graphiti-guide/SKILL.md +219 -0
- package/skills/tools/ocr-translate/pdf-math-translate-guide/SKILL.md +141 -0
- package/skills/tools/ocr-translate/zotero-pdf-translate-guide/SKILL.md +95 -0
- package/skills/tools/ocr-translate/zotero-pdf2zh-guide/SKILL.md +143 -0
- package/skills/tools/scraping/dataset-finder-guide/SKILL.md +253 -0
- package/skills/tools/scraping/easy-spider-guide/SKILL.md +250 -0
- package/skills/tools/scraping/google-scholar-scraper/SKILL.md +255 -0
- package/skills/tools/scraping/repository-harvesting-guide/SKILL.md +310 -0
- package/skills/writing/citation/academic-citation-manager/SKILL.md +314 -0
- package/skills/writing/citation/jabref-reference-guide/SKILL.md +127 -0
- package/skills/writing/citation/jasminum-zotero-guide/SKILL.md +103 -0
- package/skills/writing/citation/obsidian-citation-guide/SKILL.md +164 -0
- package/skills/writing/citation/obsidian-zotero-guide/SKILL.md +137 -0
- package/skills/writing/citation/papersgpt-zotero-guide/SKILL.md +132 -0
- package/skills/writing/citation/papis-cli-guide/SKILL.md +213 -0
- package/skills/writing/citation/zotero-better-bibtex-guide/SKILL.md +107 -0
- package/skills/writing/citation/zotero-better-notes-guide/SKILL.md +121 -0
- package/skills/writing/citation/zotero-gpt-guide/SKILL.md +111 -0
- package/skills/writing/citation/zotero-mcp-guide/SKILL.md +164 -0
- package/skills/writing/citation/zotero-mdnotes-guide/SKILL.md +162 -0
- package/skills/writing/citation/zotero-reference-guide/SKILL.md +139 -0
- package/skills/writing/citation/zotero-scholar-guide/SKILL.md +294 -0
- package/skills/writing/citation/zotfile-attachment-guide/SKILL.md +140 -0
- package/skills/writing/composition/ml-paper-writing/SKILL.md +163 -0
- package/skills/writing/composition/paper-debugger-guide/SKILL.md +143 -0
- package/skills/writing/composition/scientific-writing-resources/SKILL.md +151 -0
- package/skills/writing/composition/scientific-writing-wrapper/SKILL.md +153 -0
- package/skills/writing/latex/latex-drawing-collection/SKILL.md +154 -0
- package/skills/writing/latex/latex-templates-collection/SKILL.md +159 -0
- package/skills/writing/latex/md-to-pdf-academic/SKILL.md +230 -0
- package/skills/writing/latex/tex-render-guide/SKILL.md +243 -0
- package/skills/writing/polish/academic-tone-guide/SKILL.md +209 -0
- package/skills/writing/polish/conciseness-editing-guide/SKILL.md +225 -0
- package/skills/writing/polish/paper-polish-guide/SKILL.md +160 -0
- package/skills/writing/templates/graphical-abstract-guide/SKILL.md +183 -0
- package/skills/writing/templates/novathesis-guide/SKILL.md +152 -0
- package/skills/writing/templates/scientific-article-pdf/SKILL.md +261 -0
- package/skills/writing/templates/sjtuthesis-guide/SKILL.md +197 -0
- package/skills/writing/templates/thuthesis-guide/SKILL.md +181 -0
- package/skills/literature/fulltext/repository-harvesting-guide/SKILL.md +0 -207

package/skills/domains/ai-ml/ml-pipeline-guide/SKILL.md
@@ -0,0 +1,295 @@
---
name: ml-pipeline-guide
description: "Build and deploy reproducible production ML pipelines for research"
metadata:
  openclaw:
    emoji: "🔧"
    category: "domains"
    subcategory: "ai-ml"
    keywords: ["MLOps", "pipeline", "deployment", "reproducibility", "feature engineering", "CI/CD"]
    source: "https://github.com/mlflow/mlflow"
---

# ML Pipeline Guide

## Overview

Machine learning research increasingly demands reproducible, end-to-end pipelines that go beyond a single training script. A research ML pipeline encompasses data ingestion, feature engineering, model training, evaluation, experiment tracking, and artifact management. Without a structured pipeline, research results become difficult to reproduce, ablation studies become error-prone, and collaborators cannot build on prior work.

This guide covers the practical tools and patterns for building ML pipelines in an academic research context. The focus is on reproducibility, experiment tracking, and the transition from notebook prototyping to structured experiments. The patterns use MLflow, DVC, and standard Python tooling -- chosen because they are open source, widely adopted in published research, and require minimal infrastructure.

Unlike industry MLOps guides that emphasize deployment at scale, this guide prioritizes the research workflow: running many experiments, tracking what changed between runs, and producing results that reviewers can verify.

## Pipeline Architecture

A research ML pipeline typically has five stages:

```
Data Ingestion → Feature Engineering → Training → Evaluation → Artifact Storage
      │                  │                │            │             │
      ├── raw data       ├── transforms   ├── model    ├── metrics   ├── models
      ├── splits         ├── features     ├── logs     ├── plots     ├── configs
      └── metadata       └── cache        └── ckpts    └── tables    └── reports
```

### Directory Structure for Reproducible Research

```
project/
├── configs/
│   ├── base.yaml             # Default hyperparameters
│   ├── experiment_001.yaml   # Experiment-specific overrides
│   └── sweep.yaml            # Hyperparameter search space
├── data/
│   ├── raw/                  # Immutable original data
│   ├── processed/            # Cleaned and transformed
│   └── splits/               # Train/val/test splits (versioned)
├── src/
│   ├── data/                 # Data loading and preprocessing
│   ├── features/             # Feature engineering
│   ├── models/               # Model definitions
│   ├── training/             # Training loops
│   └── evaluation/           # Metrics and visualization
├── experiments/              # MLflow/W&B experiment logs
├── notebooks/                # Exploratory analysis only
├── tests/                    # Unit tests for pipeline components
├── Makefile                  # Reproducible commands
├── requirements.txt          # Pinned dependencies
└── dvc.yaml                  # Data version control pipeline
```

## Experiment Tracking with MLflow

```python
import sys

import mlflow
import mlflow.pytorch
import torch

def run_experiment(config: dict):
    """Run a single experiment with full tracking."""
    mlflow.set_experiment(config["experiment_name"])

    with mlflow.start_run(run_name=config.get("run_name")):
        # Log configuration
        mlflow.log_params({
            "model": config["model_name"],
            "learning_rate": config["lr"],
            "batch_size": config["batch_size"],
            "epochs": config["epochs"],
            "optimizer": config["optimizer"],
            "seed": config["seed"],
        })

        # Log environment
        mlflow.log_param("python_version", sys.version)
        mlflow.log_param("torch_version", torch.__version__)
        mlflow.log_param("cuda_version", torch.version.cuda)

        # Training; build_model, train_one_epoch, evaluate, and the data
        # loaders are project-specific helpers
        model = build_model(config)
        optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
        for epoch in range(config["epochs"]):
            train_loss = train_one_epoch(model, train_loader, optimizer)
            val_loss, val_metrics = evaluate(model, val_loader)

            mlflow.log_metrics({
                "train_loss": train_loss,
                "val_loss": val_loss,
                **{f"val_{k}": v for k, v in val_metrics.items()},
            }, step=epoch)

        # Log final model
        mlflow.pytorch.log_model(model, "model")

        # Log artifacts (resolved config, plots)
        mlflow.log_dict(config, "config.yaml")
        save_evaluation_plots(model, test_loader, "plots/")
        mlflow.log_artifacts("plots/")

        return val_metrics
```

## Data Versioning with DVC

```yaml
# dvc.yaml -- Pipeline definition
stages:
  prepare_data:
    cmd: python src/data/prepare.py --config configs/base.yaml
    deps:
      - src/data/prepare.py
      - data/raw/
    outs:
      - data/processed/
    params:
      - configs/base.yaml:
          - data.split_ratio
          - data.random_seed

  extract_features:
    cmd: python src/features/extract.py --config configs/base.yaml
    deps:
      - src/features/extract.py
      - data/processed/
    outs:
      - data/features/
    params:
      - configs/base.yaml:
          - features

  train:
    cmd: python src/training/train.py --config configs/base.yaml
    deps:
      - src/training/train.py
      - src/models/
      - data/features/
    outs:
      - models/
    metrics:
      - metrics.json:
          cache: false
    plots:
      - plots/training_curve.csv:
          x: epoch
          y: loss
```

```bash
# Reproduce the full pipeline
dvc repro

# Compare experiments
dvc metrics diff

# Push data to remote storage
dvc push
```

## Configuration Management with Hydra

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="base", version_base=None)
def main(cfg: DictConfig):
    print(OmegaConf.to_yaml(cfg))

    # build_model and train are project-specific helpers
    model = build_model(
        name=cfg.model.name,
        hidden_dim=cfg.model.hidden_dim,
        num_layers=cfg.model.num_layers,
    )

    train(
        model=model,
        lr=cfg.training.lr,
        epochs=cfg.training.epochs,
        batch_size=cfg.training.batch_size,
    )

if __name__ == "__main__":
    main()

# Override from the command line:
#   python train.py training.lr=1e-4 model.hidden_dim=512
#   python train.py --multirun training.lr=1e-3,1e-4,1e-5
```

```yaml
# configs/base.yaml
model:
  name: resnet50
  hidden_dim: 256
  num_layers: 4

training:
  lr: 1e-3
  epochs: 100
  batch_size: 32
  optimizer: adamw
  weight_decay: 0.01

data:
  dataset: cifar10
  split_ratio: [0.8, 0.1, 0.1]
  random_seed: 42
  augmentation: true
```
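The `random_seed` in the config only helps if it is actually applied to every random number generator the pipeline touches. A minimal seeding helper, sketched as an assumption about how a project might wire this up (the `set_seed` name and the optional torch handling are illustrative, not part of this package):

```python
import os
import random

import numpy as np

def set_seed(seed: int) -> None:
    """Apply one seed to every RNG the pipeline touches; torch is optional."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch  # skipped in analysis-only environments without torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_seed(42)
first = np.random.rand(3)
set_seed(42)
assert np.allclose(first, np.random.rand(3))  # identical draws after re-seeding
```

Calling this once at the top of `main(cfg)` with `cfg.data.random_seed` keeps runs comparable across the sweep.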

## Feature Engineering Patterns

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
import joblib

def build_feature_pipeline(numeric_cols: list, categorical_cols: list) -> ColumnTransformer:
    """Build a reproducible feature engineering pipeline."""
    numeric_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])

    categorical_transformer = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ])

    preprocessor = ColumnTransformer([
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ])

    return preprocessor

# Save and load for reproducibility (X_train comes from the data-loading step)
preprocessor.fit(X_train)
joblib.dump(preprocessor, "artifacts/preprocessor.pkl")
# Later: preprocessor = joblib.load("artifacts/preprocessor.pkl")
```

## Makefile for Reproducibility

```makefile
.PHONY: setup data train evaluate all sweep clean

setup:
	pip install -r requirements.txt
	dvc pull

data:
	python src/data/prepare.py --config configs/base.yaml

train:
	python src/training/train.py --config configs/base.yaml

evaluate:
	python src/evaluation/evaluate.py --config configs/base.yaml

all: setup data train evaluate

sweep:
	python src/training/train.py --multirun \
		training.lr=1e-3,1e-4,1e-5 \
		model.hidden_dim=128,256,512

clean:
	rm -rf outputs/ multirun/ __pycache__/
```

## Best Practices

- **Never modify raw data.** All transformations should be scripted and reproducible.
- **Pin every dependency version**, including CUDA, cuDNN, and OS-level libraries.
- **Separate configuration from code.** Use YAML/JSON configs, not hardcoded values.
- **Track experiments from day one.** Retrofitting experiment tracking is painful.
- **Write tests for data preprocessing.** Shape mismatches and silent data corruption are common.
- **Use `Makefile` or `dvc repro`** so any collaborator can reproduce results with one command.
- **Version your data alongside your code** using DVC, Git-LFS, or cloud storage with manifests.
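The preprocessing-test bullet above is cheap to act on. A sketch of the kind of shape-and-sanity check meant, using a stand-in `standardize` step (hypothetical, not from this package) and plain numpy:

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Toy preprocessing step: zero mean, unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def test_standardize_preserves_shape_and_is_finite():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    Z = standardize(X)
    assert Z.shape == X.shape    # no rows or columns silently dropped
    assert np.isfinite(Z).all()  # no NaNs or infs introduced
    assert np.allclose(Z.mean(axis=0), 0.0, atol=1e-9)

test_standardize_preserves_shape_and_is_finite()
```

Tests like this live in `tests/` and run in CI before any training job, catching silent corruption before it costs a GPU-day.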

## References

- [MLflow documentation](https://mlflow.org/docs/latest/) -- Experiment tracking and model registry
- [DVC documentation](https://dvc.org/doc) -- Data version control for ML
- [Hydra documentation](https://hydra.cc/) -- Configuration management framework
- [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/) -- Project structure template
- [Made With ML](https://madewithml.com/) -- MLOps best practices for researchers

package/skills/domains/ai-ml/nlp-toolkit-guide/SKILL.md
@@ -0,0 +1,247 @@
---
name: nlp-toolkit-guide
description: "NLP analysis with perplexity scoring, burstiness, and entropy metrics"
metadata:
  openclaw:
    emoji: "💬"
    category: "domains"
    subcategory: "ai-ml"
    keywords: ["NLP", "perplexity", "burstiness", "entropy", "tokenization", "text analysis"]
    source: "https://github.com/huggingface/transformers"
---

# NLP Toolkit Guide

## Overview

Natural Language Processing research requires a diverse set of analytical tools beyond standard model training. Text quality assessment, AI-generated text detection, linguistic feature extraction, and corpus analysis all depend on well-understood metrics: perplexity, burstiness, entropy, and their variants.

This guide provides practical implementations of these core NLP metrics alongside patterns for tokenization, embedding analysis, and text feature engineering. The focus is on metrics used in active research areas -- AI text detection (perplexity + burstiness classifiers), information-theoretic analysis of corpora, and linguistic diversity measurement.

These tools are framework-agnostic where possible, but leverage Hugging Face Transformers for language model operations and standard Python scientific computing libraries for statistical analysis.

## Perplexity Scoring

Perplexity measures how well a language model predicts a text. Lower perplexity means the text is more predictable to the model -- a key signal in AI text detection, model evaluation, and domain adaptation.

```python
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(text: str, model_name: str = "gpt2") -> dict:
    """
    Compute token-level and text-level perplexity using a causal LM.

    Returns:
        dict with 'perplexity', 'log_likelihood', 'token_perplexities'
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = encodings.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        neg_log_likelihood = outputs.loss.item()

    # Token-level perplexities for analysis
    logits = outputs.logits[:, :-1, :]  # Shift for next-token prediction
    targets = input_ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    token_perplexities = torch.exp(-token_log_probs).squeeze().tolist()

    perplexity = np.exp(neg_log_likelihood)

    return {
        "perplexity": perplexity,
        "log_likelihood": -neg_log_likelihood,
        "token_perplexities": token_perplexities,
        "num_tokens": input_ids.size(1),
    }
```
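The text-level number returned above is just the exponentiated mean negative log-likelihood, equivalently the inverse geometric mean of the per-token probabilities. That identity can be checked on toy numbers without loading any model (the probabilities below are invented for illustration):

```python
import math

# Hypothetical per-token probabilities a model assigned to a 4-token text
token_probs = [0.5, 0.25, 0.1, 0.05]

neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(neg_log_likelihood)

# Same number via the inverse geometric mean of the probabilities
geo_mean = math.prod(token_probs) ** (1 / len(token_probs))
assert math.isclose(perplexity, 1 / geo_mean)

print(round(perplexity, 2))  # 6.32
```

Intuitively, a perplexity of 6.32 means the model was, on average, as uncertain as if it were choosing uniformly among ~6 tokens at each step.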

## Burstiness Analysis

Burstiness measures the tendency of words to appear in clusters rather than uniformly across a text. Human writing tends to be "burstier" -- once a topic is introduced, related terms cluster together, then disappear.

```python
import numpy as np

def compute_burstiness(text: str, min_freq: int = 2) -> dict:
    """
    Compute burstiness score for a text.

    Burstiness B = (sigma - mu) / (sigma + mu)
    where sigma and mu are the std dev and mean of inter-arrival times.
    B ranges from -1 (periodic) to 1 (bursty). Human text typically has B > 0.
    """
    words = text.lower().split()
    word_positions = {}
    for i, word in enumerate(words):
        word_positions.setdefault(word, []).append(i)

    burstiness_scores = {}
    for word, positions in word_positions.items():
        if len(positions) < min_freq:
            continue
        inter_arrivals = np.diff(positions)
        mu = np.mean(inter_arrivals)
        sigma = np.std(inter_arrivals)
        if mu + sigma == 0:
            burstiness_scores[word] = 0.0
        else:
            burstiness_scores[word] = (sigma - mu) / (sigma + mu)

    # Aggregate burstiness
    if burstiness_scores:
        avg_burstiness = np.mean(list(burstiness_scores.values()))
    else:
        avg_burstiness = 0.0

    return {
        "average_burstiness": avg_burstiness,
        "word_burstiness": burstiness_scores,
        "num_words_analyzed": len(burstiness_scores),
    }
```
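The stated bounds of B are easy to verify on synthetic position lists: perfectly periodic occurrences give B = -1, clustered occurrences give B > 0 (the positions here are invented for illustration):

```python
import numpy as np

def burstiness(positions: list) -> float:
    """B = (sigma - mu) / (sigma + mu) over inter-arrival gaps."""
    gaps = np.diff(positions)
    mu, sigma = np.mean(gaps), np.std(gaps)
    return float((sigma - mu) / (sigma + mu)) if (mu + sigma) else 0.0

periodic = [0, 5, 10, 15, 20]       # equal gaps -> sigma = 0
clustered = [0, 1, 2, 50, 51, 52]   # two tight bursts far apart

print(burstiness(periodic))   # -1.0
print(burstiness(clustered))  # positive (~0.29)
```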

## Entropy and Information-Theoretic Metrics

```python
from collections import Counter
import numpy as np

def compute_entropy(text: str, level: str = "word") -> dict:
    """
    Compute Shannon entropy at word or character level.

    Higher entropy indicates more diverse, less predictable text.
    AI-generated text often has lower entropy than human text.
    """
    if level == "word":
        tokens = text.lower().split()
    elif level == "character":
        tokens = list(text.lower())
    else:
        raise ValueError("level must be 'word' or 'character'")

    counts = Counter(tokens)
    total = sum(counts.values())
    probabilities = np.array([c / total for c in counts.values()])

    entropy = -np.sum(probabilities * np.log2(probabilities + 1e-12))
    max_entropy = np.log2(len(counts)) if len(counts) > 1 else 1.0
    normalized_entropy = entropy / max_entropy

    return {
        "entropy": entropy,
        "normalized_entropy": normalized_entropy,
        "vocabulary_size": len(counts),
        "total_tokens": total,
        "type_token_ratio": len(counts) / total,
    }

def compute_conditional_entropy(text: str, n: int = 2) -> float:
    """Compute conditional entropy H(X_n | X_{n-1}) for n-gram analysis."""
    words = text.lower().split()
    if len(words) < n:
        return 0.0

    ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
    contexts = [ng[:-1] for ng in ngrams]

    context_counts = Counter(contexts)
    ngram_counts = Counter(ngrams)

    h = 0.0
    total = len(ngrams)
    for ngram, count in ngram_counts.items():
        context = ngram[:-1]
        p_ngram = count / total
        # p(x_n | context) = count(ngram) / count(context)
        h -= p_ngram * np.log2(count / context_counts[context] + 1e-12)

    return h
```
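
A short sanity check of the word-level formula. The helper below reimplements just the entropy computation from `compute_entropy`: a sentence built from repeated words scores lower than one in which every word is unique, and a text with a single repeated word scores zero.

```python
from collections import Counter
import numpy as np

def word_entropy(text):
    """Word-level Shannon entropy in bits, same formula as compute_entropy."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    probs = np.array([c / total for c in counts.values()])
    return float(-np.sum(probs * np.log2(probs)))

repetitive = "the cat sat on the mat the cat sat"   # heavy word reuse
diverse = "quantum entanglement defies classical locality assumptions entirely"

print(word_entropy(repetitive))  # lower: repeated vocabulary
print(word_entropy(diverse))     # higher: all seven words unique, log2(7) bits
```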

## AI Text Detection Pipeline

Combining perplexity, burstiness, and entropy into a detection pipeline:

```python
import numpy as np

def analyze_text_authenticity(text: str) -> dict:
    """
    Multi-signal analysis for AI vs. human text classification.
    Uses perplexity, burstiness, and entropy as features.
    """
    perplexity_result = compute_perplexity(text)
    burstiness_result = compute_burstiness(text)
    entropy_result = compute_entropy(text, level="word")
    char_entropy = compute_entropy(text, level="character")

    # Heuristic thresholds from literature
    signals = {
        "low_perplexity": perplexity_result["perplexity"] < 30,
        "low_burstiness": burstiness_result["average_burstiness"] < 0.1,
        "low_entropy": entropy_result["normalized_entropy"] < 0.7,
        "uniform_token_ppl": np.std(perplexity_result["token_perplexities"]) < 5,
    }

    ai_score = sum(signals.values()) / len(signals)

    return {
        "perplexity": perplexity_result["perplexity"],
        "burstiness": burstiness_result["average_burstiness"],
        "word_entropy": entropy_result["entropy"],
        "char_entropy": char_entropy["entropy"],
        "type_token_ratio": entropy_result["type_token_ratio"],
        "ai_likelihood_score": ai_score,
        "signals": signals,
    }
```
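
The combination step above reduces to majority voting: each heuristic becomes a boolean vote, and the AI-likelihood score is the fraction of votes that fire. A minimal standalone sketch of just that step (thresholds here are illustrative, mirroring the heuristics above, not calibrated values):

```python
def vote(metrics: dict, thresholds: dict) -> tuple:
    """Return (score, signals): fraction of metrics below their cutoff, plus per-signal votes."""
    signals = {name: metrics[name] < cutoff for name, cutoff in thresholds.items()}
    return sum(signals.values()) / len(signals), signals

thresholds = {"perplexity": 30, "burstiness": 0.1, "normalized_entropy": 0.7}

# An AI-like profile: low perplexity, low burstiness, low entropy
score, signals = vote(
    {"perplexity": 22.0, "burstiness": 0.05, "normalized_entropy": 0.65},
    thresholds,
)
print(score)  # 1.0 -> every signal fires
```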

## Tokenization Patterns

```python
from transformers import AutoTokenizer

def compare_tokenizers(text: str, models: list = None) -> dict:
    """Compare tokenization across different models for research analysis."""
    if models is None:
        models = ["gpt2", "bert-base-uncased", "facebook/opt-1.3b"]

    results = {}
    for model_name in models:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        tokens = tokenizer.tokenize(text)
        results[model_name] = {
            "num_tokens": len(tokens),
            "tokens": tokens[:50],  # First 50 for inspection
            "vocab_size": tokenizer.vocab_size,
            # Guard against empty input producing a zero-division
            "compression_ratio": len(text) / max(len(tokens), 1),
        }
    return results
```

## Best Practices

- **Always specify the model** when computing perplexity. Perplexity is model-relative, not absolute.
- **Normalize by text length** when comparing entropy across texts of different sizes.
- **Use sliding windows** for long documents to capture local variation in metrics.
- **Combine multiple signals** for AI text detection -- no single metric is reliable alone.
- **Report confidence intervals** by computing metrics on paragraph-level chunks, then aggregating.
- **Be aware of domain shift.** Perplexity thresholds trained on news text will not transfer to scientific papers.
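
The sliding-window advice can be sketched as below. Window and step sizes are illustrative, and a simple type-token ratio stands in for whichever metric you aggregate per window; a whole-document score would hide the local variation the windows expose.

```python
def sliding_windows(words, window=100, step=50):
    """Yield overlapping word windows across a long document."""
    for start in range(0, max(len(words) - window + 1, 1), step):
        yield words[start:start + window]

def type_token_ratio(words):
    """Stand-in per-window metric: unique words / total words."""
    return len(set(words)) / len(words)

# Synthetic document: a repetitive first half, then a fully diverse second half
repetitive = "the cat sat on the mat " * 20
diverse = " ".join(f"word{i}" for i in range(120))
words = (repetitive + diverse).split()

scores = [type_token_ratio(w) for w in sliding_windows(words, window=40, step=20)]
print(min(scores), max(scores))  # low in the repetitive half, 1.0 in the diverse half
```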

## References

- [Hugging Face Transformers](https://huggingface.co/docs/transformers/) -- Model hub and tokenizer library
- [DetectGPT](https://arxiv.org/abs/2301.11305) -- Perplexity-based AI text detection (Mitchell et al., 2023)
- [Burstiness and Memory in Text](https://doi.org/10.1103/PhysRevLett.114.078101) -- Altmann et al., 2015
- [NLTK documentation](https://www.nltk.org/) -- Classic NLP toolkit for feature engineering
- [spaCy documentation](https://spacy.io/) -- Industrial-strength NLP for production pipelines