npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0


Files changed (415)

package/skills/domains/ai-ml/kolmogorov-arnold-networks-guide/SKILL.md ADDED

@@ -0,0 +1,185 @@
+---
+name: kolmogorov-arnold-networks-guide
+description: "Papers and tutorials on KAN learnable activation networks"
+metadata:
+  openclaw:
+    emoji: "📐"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["KAN", "Kolmogorov-Arnold", "learnable activations", "spline networks", "neural architecture", "interpretability"]
+    source: "https://github.com/mintisan/awesome-kan"
+---
+# Kolmogorov-Arnold Networks (KAN) Guide
+## Overview
+Kolmogorov-Arnold Networks (KANs) are a novel neural network architecture that places learnable activation functions on edges (weights) instead of fixed activations on nodes. Based on the Kolmogorov-Arnold representation theorem, KANs use B-spline functions as learnable edge activations, achieving better accuracy and interpretability than MLPs with fewer parameters in certain domains. This collection tracks the rapidly growing KAN literature.
+## Core Concept
+```
+Traditional MLP:
+  x → [fixed activation(linear transform)] → y
+  Activations on nodes, weights on edges
+KAN:
+  x → [learnable spline functions on edges] → sum → y
+  Each edge learns its own activation function (B-spline)
+Kolmogorov-Arnold Theorem:
+  f(x₁,...,xₙ) = Σᵢ Φᵢ(Σⱼ φᵢⱼ(xⱼ)),  i = 0..2n, j = 1..n
+  Any continuous multivariate function on a bounded domain =
+  composition of univariate functions and addition
+```
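As a concrete illustration of the statement above, the product of two numbers can be written using only univariate functions and addition, via the identity xy = ((x + y)² - (x - y)²) / 4. The sketch below is hand-built rather than learned, and all function names are illustrative assumptions, not part of pykan.

```python
# Hand-built Kolmogorov-Arnold-style form for f(x, y) = x * y.
# Every function takes a single number; the only multivariate operation is addition.

def phi_identity(t):
    """Inner univariate function: t -> t."""
    return t


def phi_negate(t):
    """Inner univariate function: t -> -t."""
    return -t


def outer_plus(u):
    """Outer univariate function: u -> u^2 / 4."""
    return u * u / 4.0


def outer_minus(u):
    """Outer univariate function: u -> -u^2 / 4."""
    return -u * u / 4.0


def product_via_univariate(x, y):
    branch_1 = outer_plus(phi_identity(x) + phi_identity(y))   # (x + y)^2 / 4
    branch_2 = outer_minus(phi_identity(x) + phi_negate(y))    # -(x - y)^2 / 4
    return branch_1 + branch_2
```

A KAN learns functions in these roles from data instead of having them written by hand.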
+## Key Papers
+```bibtex
+@article{liu2024kan,
+  title={KAN: Kolmogorov-Arnold Networks},
+  author={Liu, Ziming and Wang, Yixuan and Vaidya, Sachin and
+          Ruehle, Fabian and Halverson, James and
+          Solja{\v{c}}i{\'c}, Marin and Hou, Thomas Y. and
+          Tegmark, Max},
+  journal={arXiv:2404.19756},
+  year={2024}
+}
+```
+## Implementation
+```python
+# Using pykan (official implementation)
+# pip install pykan
+from kan import KAN
+import torch
+# Create a KAN model
+model = KAN(
+    width=[2, 5, 1],    # Input: 2, Hidden: 5, Output: 1
+    grid=5,               # Spline grid resolution
+    k=3,                  # Spline order (cubic)
+)
+# Training data
+x = torch.randn(1000, 2)
+y = torch.sin(x[:, 0]) + torch.cos(x[:, 1])
+y = y.unsqueeze(1)
+# Train
+dataset = {"train_input": x[:800], "train_label": y[:800],
+           "test_input": x[800:], "test_label": y[800:]}
+# pykan >= 0.2 names this method fit(); earlier releases exposed it as model.train()
+model.fit(dataset, steps=100, lr=0.01)
+# Visualize learned functions
+model.plot()
+# Prune and simplify
+model = model.prune()
+model.plot()
+```
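To make the `grid` and `k` arguments above more concrete, here is a minimal pure-Python sketch of a single spline edge function using the Cox-de Boor recursion. It assumes a uniform grid on [-1, 1] extended by `k` knots on each side, which yields `grid + k` basis functions per edge. The helper names are hypothetical, and pykan's actual edge function also includes a residual base term that is omitted here.

```python
def make_knots(lo, hi, grid, k):
    """Uniform knot vector on [lo, hi], extended by k knots on each side."""
    h = (hi - lo) / grid
    return [lo + (i - k) * h for i in range(grid + 2 * k + 1)]


def bspline_basis(x, knots, k):
    """Values of all degree-k B-spline basis functions at x (Cox-de Boor)."""
    # Degree 0: indicator of each half-open knot interval.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, k + 1):
        raised = []
        for i in range(len(knots) - d - 1):
            left = (x - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            right = (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1]
            raised.append(left + right)
        basis = raised
    return basis


def edge_activation(x, coeffs, knots, k):
    """Spline part of one edge function: a weighted sum of basis functions."""
    return sum(c * b for c, b in zip(coeffs, bspline_basis(x, knots, k)))


knots = make_knots(-1.0, 1.0, grid=5, k=3)
basis_at_x = bspline_basis(0.3, knots, k=3)
```

Training adjusts the entries of `coeffs` for every edge; refining the grid adds more basis functions and therefore more coefficients.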
+## KAN vs MLP Comparison
+```python
+# Comparison on function approximation
+from kan import KAN
+import torch.nn as nn
+# KAN: learnable activations on edges
+kan_model = KAN(width=[2, 5, 1], grid=5, k=3)
+# Parameters: ~150 (spline coefficients)
+# MLP: fixed activations on nodes
+class MLP(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(2, 50),
+            nn.ReLU(),
+            nn.Linear(50, 50),
+            nn.ReLU(),
+            nn.Linear(50, 1),
+        )
+    def forward(self, x):
+        return self.net(x)
+mlp_model = MLP()
+# Parameters: 2,751 (weights and biases)
+# KAN advantages:
+# - Fewer parameters for comparable accuracy on some low-dimensional function-fitting tasks
+# - Interpretable (visualize learned functions)
+# - Better for scientific discovery (symbolic regression)
+# - Grid refinement for progressive accuracy
+# MLP advantages:
+# - Faster training
+# - Better scaling to high dimensions
+# - More mature tooling and optimization
+```
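The parameter counts quoted in the comments above can be checked by hand. The sketch below counts weights and biases for the MLP, and only the spline coefficients for the KAN, assuming `grid + k` coefficients per edge. pykan stores additional per-edge parameters, so the total it reports is somewhat higher than this lower bound. Both helper names are hypothetical.

```python
def mlp_param_count(widths):
    """Weights plus biases for a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


def kan_spline_coeff_count(widths, grid, k):
    """Spline coefficients only: (grid + k) per edge."""
    edges = sum(n_in * n_out for n_in, n_out in zip(widths, widths[1:]))
    return edges * (grid + k)


mlp_total = mlp_param_count([2, 50, 50, 1])                  # 150 + 2550 + 51
kan_total = kan_spline_coeff_count([2, 5, 1], grid=5, k=3)   # 15 edges * 8
print(mlp_total, kan_total)  # prints "2751 120"
```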
+## Extensions and Variants
+| Variant | Innovation | Application |
+|---------|-----------|-------------|
+| **KAN 2.0** | MultKAN with multiplication nodes | Scientific discovery |
+| **Temporal KAN** | Time-series adaptation | Forecasting |
+| **ConvKAN** | KAN + convolutions | Image processing |
+| **GraphKAN** | KAN on graph structures | Graph learning |
+| **FourierKAN** | Fourier basis instead of splines | Periodic functions |
+| **WavKAN** | Wavelet-based activations | Signal processing |
+| **BSRBF-KAN** | B-spline + radial basis | Function approximation |
+## Scientific Applications
+```python
+# KAN for symbolic regression (discovering equations)
+from kan import KAN
+# Generate data from unknown equation: f(x,y) = x*exp(y)
+import torch
+x = torch.rand(1000, 2) * 2
+y = x[:, 0:1] * torch.exp(x[:, 1:2])
+dataset = {"train_input": x[:800], "train_label": y[:800],
+           "test_input": x[800:], "test_label": y[800:]}
+model = KAN(width=[2, 1, 1], grid=10, k=3)
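+model.fit(dataset, steps=200)  # pykan >= 0.2; earlier releases used model.train()
+# Symbolic fitting: snap each learned edge function to a symbolic primitive
+model.auto_symbolic()
+formula = model.symbolic_formula()[0][0]
+print(formula)
+# On a good run the formula is equivalent to x₁ * exp(x₂), for example written as
+# exp(log(x₁) + x₂) with fitted numeric coefficients; exact recovery is not guaranteed.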
+# KAN can discover symbolic expressions from data
+```
+## Research Landscape
+```markdown
+### Key Research Directions
+1. **Scaling** — Making KANs work at LLM scale
+2. **Efficiency** — Reducing spline computation overhead
+3. **Theory** — Understanding approximation guarantees
+4. **Architecture search** — Optimal KAN topologies
+5. **Hybrid models** — Combining KAN and MLP strengths
+6. **Domain applications** — Physics, chemistry, biology
+7. **Interpretability** — Extracting symbolic knowledge
+```
+## Use Cases
+1. **Scientific discovery**: Extract equations from experimental data
+2. **Function approximation**: High-accuracy low-parameter models
+3. **Interpretable ML**: Understand what the network learned
+4. **Physics-informed**: Embed physical constraints in activations
+5. **Education**: Teach alternative neural network architectures
+## References
+- [awesome-kan](https://github.com/mintisan/awesome-kan)
+- [KAN Paper](https://arxiv.org/abs/2404.19756)
+- [pykan Implementation](https://github.com/KindXiaoming/pykan)
+- [KAN 2.0 Paper](https://arxiv.org/abs/2408.10205)

package/skills/domains/ai-ml/llm-from-scratch-guide/SKILL.md ADDED

@@ -0,0 +1,124 @@
+---
+name: llm-from-scratch-guide
+description: "Build a ChatGPT-like LLM from scratch using PyTorch step by step"
+metadata:
+  openclaw:
+    emoji: "🧱"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["llm", "pytorch", "transformer", "gpt", "pretraining", "finetuning"]
+    source: "https://github.com/rasbt/LLMs-from-scratch"
+---
+# LLM From Scratch Guide
+## Overview
+LLMs-from-scratch is a comprehensive educational repository with over 87,000 stars on GitHub that teaches you how to build a ChatGPT-like large language model from the ground up using PyTorch. Created by Sebastian Raschka, a machine learning researcher and author, the project provides a complete pipeline covering data preparation, tokenization, attention mechanisms, pretraining, and instruction finetuning.
+Unlike tutorials that treat LLMs as black boxes, this project demystifies every component by walking through the full implementation. Each chapter corresponds to a Jupyter notebook with clear explanations, diagrams, and runnable code. The repository accompanies the book "Build a Large Language Model (From Scratch)" and serves as a standalone learning resource for researchers and engineers who want deep understanding of transformer-based language models.
+The project is particularly valuable for academic researchers who need to understand the internals of LLMs for their own research, whether that involves modifying architectures, running ablation studies, or developing domain-specific language models for scientific applications.
+## Installation and Setup
+Clone the repository and set up a Python environment with the required dependencies:
+```bash
+git clone https://github.com/rasbt/LLMs-from-scratch.git
+cd LLMs-from-scratch
+# Create a virtual environment
+python -m venv llm-env
+source llm-env/bin/activate
+# Install dependencies
+pip install -r requirements.txt
+```
+The project requires Python 3.10+ and PyTorch 2.0+. For GPU-accelerated training, ensure you have CUDA installed. The notebooks can also run on CPU for smaller model configurations, though training times will be significantly longer.
+Key dependencies include:
+- **PyTorch** >= 2.0 for model implementation and training
+- **tiktoken** for BPE tokenization compatible with OpenAI models
+- **matplotlib** for training visualization
+- **jupyter** for interactive notebook execution
+## Core Learning Pipeline
+The project is organized into sequential chapters that build on each other:
+### Chapter 1: Understanding Large Language Models
+Covers the conceptual foundations of LLMs, including the transformer architecture, the difference between encoder and decoder models, and how pretraining and finetuning work at a high level.
+### Chapter 2: Working with Text Data
+Implements text tokenization from scratch, including byte-pair encoding (BPE). You build a custom tokenizer and learn how text is converted to numerical representations:
+```python
+# Tokenization example from the project
+import tiktoken
+tokenizer = tiktoken.get_encoding("gpt2")
+text = "Large language models are fascinating."
+token_ids = tokenizer.encode(text)
+decoded = tokenizer.decode(token_ids)
+```
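The core idea behind BPE training is to repeatedly merge the most frequent adjacent pair of symbols. The toy sketch below applies two such merges to a character sequence. It is a simplified illustration with hypothetical helper names: real GPT-2 BPE operates on bytes, applies a pre-tokenization step, and uses a fixed learned merge table.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("low lower lowest")   # start from single characters
for _ in range(2):                  # two merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After two rounds the shared stem `low` has become a single symbol, which is how frequent substrings end up as vocabulary entries.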
+### Chapter 3: Coding Attention Mechanisms
+Implements self-attention, multi-head attention, and causal (masked) attention from scratch. This is the core computational primitive of transformers:
+```python
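+# Simplified multi-head attention (constructor only; the forward pass is omitted)
+import torch
+import torch.nn as nn
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.0):
+        super().__init__()
+        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
+        self.W_query = nn.Linear(d_in, d_out, bias=False)
+        self.W_key = nn.Linear(d_in, d_out, bias=False)
+        self.W_value = nn.Linear(d_in, d_out, bias=False)
+        self.out_proj = nn.Linear(d_out, d_out)  # combines the head outputs
+        self.dropout = nn.Dropout(dropout)
+        self.num_heads = num_heads
+        self.head_dim = d_out // num_heads  # per-head feature size
+        # Causal mask: ones above the diagonal mark future positions to hide
+        self.register_buffer(
+            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
+        )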
+```
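The step that makes attention causal is a row-wise softmax in which each position only sees itself and earlier positions. The sketch below shows that step on plain Python lists so it runs without PyTorch. It assumes the scores are already scaled query-key dot products, and the function names are hypothetical rather than taken from the repository.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores with future positions masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                        # position i sees 0..i only
        peak = max(visible)                           # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        future = [0.0] * (len(row) - i - 1)           # masked positions get weight 0
        weights.append([e / total for e in exps] + future)
    return weights


def attend(weights, values):
    """Weighted sum of value vectors for each position."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(w_row, values)) for d in range(dim)]
            for w_row in weights]


uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(uniform_scores)
context = attend(weights, [[1.0], [3.0], [5.0]])
```

With uniform scores, each position simply averages the values it is allowed to see, which makes the effect of the mask easy to inspect.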
+### Chapter 4: Implementing a GPT Model
+Assembles the full GPT architecture using the attention mechanism, layer normalization, feed-forward networks, and positional embeddings.
+### Chapter 5: Pretraining on Unlabeled Data
+Trains the GPT model on a text corpus using next-token prediction. Covers the training loop, loss computation, learning rate scheduling, and gradient clipping.
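Two of the training-loop ingredients named above, learning rate scheduling and gradient clipping, can be sketched in a few lines of plain Python. The schedule below is a common linear-warmup plus cosine-decay shape, and the clipping rule rescales gradients by their global L2 norm. Both are illustrative assumptions with hypothetical names, not necessarily the exact formulas used in the book.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Scale gradients down so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


schedule = [lr_at_step(s, total_steps=100, warmup_steps=10, peak_lr=1e-3)
            for s in range(101)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

In PyTorch the same roles are usually filled by a scheduler from `torch.optim.lr_scheduler` and by `torch.nn.utils.clip_grad_norm_`.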
+### Chapter 6: Finetuning for Text Classification
+Adapts the pretrained model for downstream classification tasks, demonstrating how to add a classification head and finetune on labeled data.
+### Chapter 7: Instruction Finetuning
+Converts the pretrained model into an instruction-following assistant using supervised finetuning on instruction-response pairs, similar to the supervised finetuning stage used for assistants such as ChatGPT.
+## Research Applications
+The repository supports several research scenarios:
+- **Architecture ablation studies**: Modify individual components (attention heads, layer count, embedding dimensions) and measure their impact on performance
+- **Domain-specific pretraining**: Use the pipeline to pretrain models on scientific corpora (biomedical literature, physics papers, chemical databases)
+- **Tokenizer research**: Experiment with different tokenization strategies for specialized vocabularies
+- **Efficient training methods**: Test techniques like gradient accumulation, mixed precision, and learning rate warmup
+- **Interpretability research**: Inspect attention patterns and intermediate representations at every layer
+For researchers working with limited compute, the project includes configurations for small models (124M parameters) that can be trained on a single GPU in reasonable time, making it practical for experimentation and prototyping.
+## Integration with Research Workflows
+Combine this project with other tools in your research stack:
+- Use **Weights & Biases** or **MLflow** for experiment tracking during pretraining runs
+- Export trained models to **Hugging Face Hub** for sharing and reproducibility
+- Integrate with **PyTorch Lightning** for distributed training across multiple GPUs
+- Apply **LoRA** or **QLoRA** adapters from the bonus chapters for parameter-efficient finetuning
+The bonus materials in the repository cover additional topics like DPO (Direct Preference Optimization), loading pretrained weights from Hugging Face, and converting models between different formats.
+## References
+- Repository: https://github.com/rasbt/LLMs-from-scratch
+- Book: "Build a Large Language Model (From Scratch)" by Sebastian Raschka (Manning, 2024)
+- Author's blog: https://sebastianraschka.com/blog/
+- PyTorch documentation: https://pytorch.org/docs/stable/

package/skills/domains/ai-ml/ml-pipeline-guide/SKILL.md ADDED

@@ -0,0 +1,295 @@
+---
+name: ml-pipeline-guide
+description: "Build and deploy reproducible production ML pipelines for research"
+metadata:
+  openclaw:
+    emoji: "🔧"
+    category: "domains"
+    subcategory: "ai-ml"
+    keywords: ["MLOps", "pipeline", "deployment", "reproducibility", "feature engineering", "CI/CD"]
+    source: "https://github.com/mlflow/mlflow"
+---
+# ML Pipeline Guide
+## Overview
+Machine learning research increasingly demands reproducible, end-to-end pipelines that go beyond a single training script. A research ML pipeline encompasses data ingestion, feature engineering, model training, evaluation, experiment tracking, and artifact management. Without a structured pipeline, research results become difficult to reproduce, ablation studies become error-prone, and collaborators cannot build on prior work.
+This guide covers the practical tools and patterns for building ML pipelines in an academic research context. The focus is on reproducibility, experiment tracking, and the transition from notebook prototyping to structured experiments. The patterns use MLflow, DVC, and standard Python tooling -- chosen because they are open source, widely adopted in published research, and require minimal infrastructure.
+Unlike industry MLOps guides that emphasize deployment at scale, this guide prioritizes the research workflow: running many experiments, tracking what changed between runs, and producing results that reviewers can verify.
+## Pipeline Architecture
+A research ML pipeline typically has five stages:
+```
+Data Ingestion → Feature Engineering → Training → Evaluation → Artifact Storage
+     │                  │                 │            │              │
+     ├── raw data       ├── transforms    ├── model    ├── metrics    ├── models
+     ├── splits         ├── features      ├── logs     ├── plots      ├── configs
+     └── metadata       └── cache         └── ckpts    └── tables     └── reports
+```
+### Directory Structure for Reproducible Research
+```
+project/
+├── configs/
+│   ├── base.yaml           # Default hyperparameters
+│   ├── experiment_001.yaml  # Experiment-specific overrides
+│   └── sweep.yaml          # Hyperparameter search space
+├── data/
+│   ├── raw/                # Immutable original data
+│   ├── processed/          # Cleaned and transformed
+│   └── splits/             # Train/val/test splits (versioned)
+├── src/
+│   ├── data/               # Data loading and preprocessing
+│   ├── features/           # Feature engineering
+│   ├── models/             # Model definitions
+│   ├── training/           # Training loops
+│   └── evaluation/         # Metrics and visualization
+├── experiments/            # MLflow/W&B experiment logs
+├── notebooks/              # Exploratory analysis only
+├── tests/                  # Unit tests for pipeline components
+├── Makefile                # Reproducible commands
+├── requirements.txt        # Pinned dependencies
+└── dvc.yaml                # Data version control pipeline
+```
+## Experiment Tracking with MLflow
+```python
+import sys
+import mlflow
+import mlflow.pytorch
+import torch
+# build_model, build_optimizer, train_one_epoch, evaluate, and save_evaluation_plots
+# are project-specific helpers (for example under src/); they are not MLflow functions.
+def run_experiment(config: dict, config_path: str, train_loader, val_loader, test_loader):
+    """Run a single experiment with full tracking."""
+    mlflow.set_experiment(config["experiment_name"])
+    with mlflow.start_run(run_name=config.get("run_name")):
+        # Log configuration
+        mlflow.log_params({
+            "model": config["model_name"],
+            "learning_rate": config["lr"],
+            "batch_size": config["batch_size"],
+            "epochs": config["epochs"],
+            "optimizer": config["optimizer"],
+            "seed": config["seed"],
+        })
+        # Log environment
+        mlflow.log_param("python_version", sys.version)
+        mlflow.log_param("torch_version", torch.__version__)
+        mlflow.log_param("cuda_version", torch.version.cuda)  # None on CPU-only builds
+        # Training
+        model = build_model(config)
+        optimizer = build_optimizer(model, config)
+        for epoch in range(config["epochs"]):
+            train_loss = train_one_epoch(model, train_loader, optimizer)
+            val_loss, val_metrics = evaluate(model, val_loader)
+            mlflow.log_metrics({
+                "train_loss": train_loss,
+                "val_loss": val_loss,
+                **{f"val_{k}": v for k, v in val_metrics.items()},
+            }, step=epoch)
+        # Log final model
+        mlflow.pytorch.log_model(model, "model")
+        # Log artifacts (plots, configs)
+        mlflow.log_artifact(config_path)
+        save_evaluation_plots(model, test_loader, "plots/")
+        mlflow.log_artifacts("plots/")
+        return val_metrics
+```
+## Data Versioning with DVC
+```yaml
+# dvc.yaml -- Pipeline definition
+stages:
+  prepare_data:
+    cmd: python src/data/prepare.py --config configs/base.yaml
+    deps:
+      - src/data/prepare.py
+      - data/raw/
+    outs:
+      - data/processed/
+    params:
+      - configs/base.yaml:
+          - data.split_ratio
+          - data.random_seed
+  extract_features:
+    cmd: python src/features/extract.py --config configs/base.yaml
+    deps:
+      - src/features/extract.py
+      - data/processed/
+    outs:
+      - data/features/
+    params:
+      - configs/base.yaml:
+          - features
+  train:
+    cmd: python src/training/train.py --config configs/base.yaml
+    deps:
+      - src/training/train.py
+      - src/models/
+      - data/features/
+    outs:
+      - models/
+    metrics:
+      - metrics.json:
+          cache: false
+    plots:
+      - plots/training_curve.csv:
+          x: epoch
+          y: loss
+```
+```bash
+# Reproduce the full pipeline
+dvc repro
+# Compare experiments
+dvc metrics diff
+# Push data to remote storage
+dvc push
+```
+## Configuration Management with Hydra
+```python
+import hydra
+from omegaconf import DictConfig, OmegaConf
+# build_model and train are project-specific helpers, not part of Hydra.
+# config_path is resolved relative to the file that defines main().
+@hydra.main(config_path="configs", config_name="base", version_base=None)
+def main(cfg: DictConfig):
+    print(OmegaConf.to_yaml(cfg))
+    model = build_model(
+        name=cfg.model.name,
+        hidden_dim=cfg.model.hidden_dim,
+        num_layers=cfg.model.num_layers,
+    )
+    train(
+        model=model,
+        lr=cfg.training.lr,
+        epochs=cfg.training.epochs,
+        batch_size=cfg.training.batch_size,
+    )
+if __name__ == "__main__":
+    main()
+# Override from command line:
+# python train.py training.lr=1e-4 model.hidden_dim=512
+# python train.py --multirun training.lr=1e-3,1e-4,1e-5
+```
+```yaml
+# configs/base.yaml
+model:
+  name: resnet50
+  hidden_dim: 256
+  num_layers: 4
+training:
+  lr: 1e-3
+  epochs: 100
+  batch_size: 32
+  optimizer: adamw
+  weight_decay: 0.01
+data:
+  dataset: cifar10
+  split_ratio: [0.8, 0.1, 0.1]
+  random_seed: 42
+  augmentation: true
+```
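To show what a dotted command-line override such as `training.lr=1e-4` does to a nested config, here is a minimal pure-Python sketch. It is a toy stand-in, not Hydra's or OmegaConf's implementation: it handles only `key.path=value` strings and converts values to `int` or `float` where possible. Both helper names are hypothetical.

```python
import copy


def parse_scalar(text):
    """Convert text to int or float when possible, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_override(config, override):
    """Return a copy of config with one dotted-key override applied."""
    key_path, raw_value = override.split("=", 1)
    *parents, leaf = key_path.split(".")
    updated = copy.deepcopy(config)       # leave the base config untouched
    node = updated
    for name in parents:
        node = node.setdefault(name, {})  # create missing sections on the way down
    node[leaf] = parse_scalar(raw_value)
    return updated


base = {"model": {"hidden_dim": 256}, "training": {"lr": 1e-3, "epochs": 100}}
run_cfg = apply_override(base, "training.lr=1e-4")
run_cfg = apply_override(run_cfg, "model.hidden_dim=512")
```

Keeping the base config immutable and deriving each run's config from it makes it straightforward to log exactly which values differ between runs.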
+## Feature Engineering Patterns
+```python
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler, OneHotEncoder
+from sklearn.compose import ColumnTransformer
+from sklearn.impute import SimpleImputer
+import joblib
+def build_feature_pipeline(numeric_cols: list, categorical_cols: list) -> ColumnTransformer:
+    """Build a reproducible feature engineering pipeline."""
+    numeric_transformer = Pipeline([
+        ("imputer", SimpleImputer(strategy="median")),
+        ("scaler", StandardScaler()),
+    ])
+    categorical_transformer = Pipeline([
+        ("imputer", SimpleImputer(strategy="most_frequent")),
+        # sparse_output requires scikit-learn >= 1.2
+        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
+    ])
+    preprocessor = ColumnTransformer([
+        ("num", numeric_transformer, numeric_cols),
+        ("cat", categorical_transformer, categorical_cols),
+    ])
+    return preprocessor
+# Save and load for reproducibility.
+# X_train is your training DataFrame; the column names below are placeholders.
+preprocessor = build_feature_pipeline(["age", "income"], ["country"])
+preprocessor.fit(X_train)
+joblib.dump(preprocessor, "artifacts/preprocessor.pkl")  # artifacts/ must already exist
+# Later: preprocessor = joblib.load("artifacts/preprocessor.pkl")
+```
+## Makefile for Reproducibility
+```makefile
+.PHONY: setup data train evaluate all sweep clean
+setup:
+	pip install -r requirements.txt
+	dvc pull
+data:
+	python src/data/prepare.py --config configs/base.yaml
+train:
+	python src/training/train.py --config configs/base.yaml
+evaluate:
+	python src/evaluation/evaluate.py --config configs/base.yaml
+all: setup data train evaluate
+sweep:
+	python src/training/train.py --multirun \
+		training.lr=1e-3,1e-4,1e-5 \
+		model.hidden_dim=128,256,512
+clean:
+	rm -rf outputs/ multirun/ __pycache__/
+```
+## Best Practices
+- **Never modify raw data.** All transformations should be scripted and reproducible.
+- **Pin every dependency version** including CUDA, cuDNN, and OS-level libraries.
+- **Separate configuration from code.** Use YAML/JSON configs, not hardcoded values.
+- **Track experiments from day one.** Retrofitting experiment tracking is painful.
+- **Write tests for data preprocessing.** Shape mismatches and silent data corruption are common.
+- **Use `Makefile` or `dvc repro`** so any collaborator can reproduce results with one command.
+- **Version your data alongside your code** using DVC, Git-LFS, or cloud storage with manifests.
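For the "cloud storage with manifests" option in the list above, a manifest can be as simple as a mapping from relative file path to content hash. The sketch below builds one with SHA-256 and reports which files differ between two snapshots. It is a minimal illustration with hypothetical helper names, and it reads each file fully into memory, so very large files would need chunked hashing.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(data_dir):
    """Map each file's relative path to the SHA-256 hex digest of its contents."""
    root = Path(data_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in files}


def changed_files(old, new):
    """Paths that were added, removed, or modified between two manifests."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Demo: detect a silent edit to a data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    data_file.write_text("a,b\n1,2\n")
    before = build_manifest(tmp)
    data_file.write_text("a,b\n1,3\n")
    after = build_manifest(tmp)

drift = changed_files(before, after)
```

Committing the manifest next to the code gives reviewers a way to confirm they are running against the same data, even when the data itself lives outside Git.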
+## References
+- [MLflow documentation](https://mlflow.org/docs/latest/) -- Experiment tracking and model registry
+- [DVC documentation](https://dvc.org/doc) -- Data version control for ML
+- [Hydra documentation](https://hydra.cc/) -- Configuration management framework
+- [Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/) -- Project structure template
+- [Made With ML](https://madewithml.com/) -- MLOps best practices for researchers