npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

@wentorai/research-plugins 1.0.0 → 1.2.0

Files changed (415)

package/skills/tools/code-exec/google-colab-guide/SKILL.md ADDED

@@ -0,0 +1,276 @@
+---
+name: google-colab-guide
+description: "Run and manage Google Colab notebooks for Python and ML research"
+metadata:
+  openclaw:
+    emoji: "gpu"
+    category: "tools"
+    subcategory: "code-exec"
+    keywords: ["Google Colab", "Jupyter", "GPU", "machine learning", "Python", "cloud computing"]
+    source: "https://colab.research.google.com"
+---
+# Google Colab Guide
+Run Python code, train machine learning models, and perform data analysis using Google Colab's free cloud-hosted Jupyter notebooks with GPU and TPU access. This skill covers setup, resource management, persistent storage, and best practices for reproducible research computing.
+## Overview
+Google Colab (Colaboratory) provides free access to GPU-accelerated Jupyter notebooks running on Google's cloud infrastructure. For academic researchers, Colab lowers the barrier of expensive hardware for machine learning experiments, large-scale data processing, and computationally intensive statistical analyses. The free tier typically offers NVIDIA T4 GPUs, subject to availability, and paid tiers (Colab Pro, Pro+) offer A100 GPUs and extended runtime.
+Colab notebooks run in ephemeral virtual machines that are recycled after inactivity or maximum runtime. This creates challenges for research: managing persistent data, saving checkpoints, reproducing results, and working with large datasets. This skill addresses these challenges with patterns commonly used in ML research.
+Colab integrates natively with Google Drive for storage, GitHub for version control, and supports the full Python scientific computing ecosystem (NumPy, pandas, scikit-learn, PyTorch, TensorFlow, JAX). Each notebook runs in an isolated environment with root access, allowing installation of any Linux package or Python library.
+## Getting Started
+### Runtime Configuration
+```python
+# Check current runtime type
+import subprocess
+result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
+print(result.stdout)  # Shows GPU info if GPU runtime is selected
+# Check available resources
+import psutil
+print(f"RAM: {psutil.virtual_memory().total / 1e9:.1f} GB")
+print(f"CPU cores: {psutil.cpu_count()}")
+print(f"Disk: {psutil.disk_usage('/').total / 1e9:.1f} GB")
+```
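
The runtime check above assumes `nvidia-smi` is installed; on a CPU or TPU runtime the binary may be missing, in which case `subprocess.run` raises `FileNotFoundError`. A minimal defensive sketch, using only the standard library (the helper name `describe_gpu` is hypothetical, not part of the skill):

```python
import shutil
import subprocess


def describe_gpu():
    """Return the nvidia-smi report as text, or None when no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # binary not on PATH, e.g. a CPU-only runtime
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code usually means the driver could not be reached.
    return result.stdout if result.returncode == 0 else None


gpu_report = describe_gpu()
print(gpu_report if gpu_report else "No GPU detected; check the runtime type.")
```

Returning `None` instead of raising lets later cells branch on GPU availability without a try/except around every call.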
+### Runtime Selection Guide
+| Runtime | GPU | RAM | Use Case |
+|---------|-----|-----|----------|
+| CPU | None | ~12 GB | Data cleaning, text processing, small models |
+| T4 GPU (free) | 16 GB VRAM | ~12 GB | Training medium models, inference |
+| A100 GPU (Pro) | 40 GB VRAM | ~50 GB | Large model training, LLM fine-tuning |
+| TPU v2 (free) | 8 cores | ~12 GB | JAX/TensorFlow distributed training |
+### Google Drive Mount
+```python
+from google.colab import drive
+drive.mount('/content/drive')
+# Access files in Drive
+import pandas as pd
+df = pd.read_csv('/content/drive/MyDrive/research/dataset.csv')
+```
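
`google.colab` only exists on Colab VMs, so the mount cell fails when the same notebook is run locally. One possible guard is sketched below; the helper `in_colab` and the local fallback directory `./research` are illustrative assumptions, not part of the skill.

```python
import importlib.util


def in_colab():
    """Best-effort check for the google.colab package."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ImportError:
        return False  # the parent 'google' package is not installed at all


if in_colab():
    from google.colab import drive

    drive.mount("/content/drive")
    data_root = "/content/drive/MyDrive/research"
else:
    data_root = "./research"  # hypothetical local fallback directory

print(f"Reading data from {data_root}")
```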
+## Data Management
+### Downloading Datasets
+```python
+# From URL
+!wget -q https://example.com/dataset.zip -O /content/dataset.zip
+!unzip -q /content/dataset.zip -d /content/data/
+# From Kaggle
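+!pip install -q kaggle
+# Upload kaggle.json (your API token) to /content first, then install it
+!mkdir -p ~/.kaggle
+!cp /content/kaggle.json ~/.kaggle/kaggle.json
+!chmod 600 ~/.kaggle/kaggle.json
+!kaggle datasets download -d user/dataset-name -p /content/data/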
+# From Hugging Face
+!pip install -q datasets
+from datasets import load_dataset
+dataset = load_dataset("scientific_papers", "arxiv")
+```
+### Persistent Storage Patterns
+Since Colab VMs are ephemeral, always save important outputs to Google Drive:
+```python
+from pathlib import Path
+import torch
+# Assumes Google Drive is already mounted at /content/drive
+DRIVE_BASE = Path("/content/drive/MyDrive/research/experiment_001")
+DRIVE_BASE.mkdir(parents=True, exist_ok=True)
+def save_checkpoint(model, optimizer, epoch, loss):
+    """Save training checkpoint to Google Drive."""
+    checkpoint = {
+        'epoch': epoch,
+        'model_state_dict': model.state_dict(),
+        'optimizer_state_dict': optimizer.state_dict(),
+        'loss': loss
+    }
+    path = DRIVE_BASE / f"checkpoint_epoch_{epoch}.pt"
+    torch.save(checkpoint, path)
+    print(f"Checkpoint saved to {path}")
+def save_results(df, name):
+    """Save results DataFrame to Drive."""
+    path = DRIVE_BASE / f"{name}.csv"
+    df.to_csv(path, index=False)
+    print(f"Results saved to {path}")
+```
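
Saving checkpoints is only half of the pattern: after a VM is recycled, a new session needs to find the most recent one. The sketch below assumes the `checkpoint_epoch_{N}.pt` naming used by `save_checkpoint` above; the helper `latest_checkpoint` is hypothetical, and the demo runs on a throwaway directory rather than Drive.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(directory):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    best = None
    for path in Path(directory).glob("checkpoint_epoch_*.pt"):
        match = CHECKPOINT_PATTERN.search(path.name)
        if match:
            epoch = int(match.group(1))  # compare as integers, not strings
            if best is None or epoch > best[0]:
                best = (epoch, path)
    return best


# Demo with empty placeholder files standing in for real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for n in (2, 5, 10):
        (Path(tmp) / f"checkpoint_epoch_{n}.pt").touch()
    demo_epoch, demo_path = latest_checkpoint(tmp)
    demo_name = demo_path.name

print(demo_epoch, demo_name)
```

In a real notebook, the returned path would typically be passed to `torch.load`, followed by `model.load_state_dict(...)` and `optimizer.load_state_dict(...)` with the keys stored by `save_checkpoint`.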
+## Machine Learning Workflows
+### PyTorch Training Loop
+```python
+!pip install -q torch torchvision
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader
+# Automatic device selection
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+print(f"Using device: {device}")
+# MyModel, train_loader, and num_epochs are placeholders for your own definitions
+model = MyModel().to(device)
+optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+criterion = nn.CrossEntropyLoss()
+for epoch in range(num_epochs):
+    model.train()
+    total_loss = 0
+    for batch in train_loader:
+        inputs, labels = batch[0].to(device), batch[1].to(device)
+        optimizer.zero_grad()
+        outputs = model(inputs)
+        loss = criterion(outputs, labels)
+        loss.backward()
+        optimizer.step()
+        total_loss += loss.item()
+    avg_loss = total_loss / len(train_loader)
+    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
+    # Save checkpoint every 5 epochs
+    if (epoch + 1) % 5 == 0:
+        save_checkpoint(model, optimizer, epoch + 1, avg_loss)
+```
+### Hugging Face Transformers
+```python
+!pip install -q transformers accelerate
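+from transformers import (
+    AutoTokenizer,
+    AutoModelForSequenceClassification,
+    Trainer,
+    TrainingArguments,
+)
+model_name = "allenai/scibert_scivocab_uncased"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(
+    model_name, num_labels=5
+)
+# Example settings only; tune them for your task
+training_args = TrainingArguments(
+    output_dir=str(DRIVE_BASE / "trainer_output"),
+    num_train_epochs=3,
+    per_device_train_batch_size=16,
+)
+# train_dataset and eval_dataset are tokenized datasets prepared beforehand
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset,
+    eval_dataset=eval_dataset,
+    tokenizer=tokenizer  # newer transformers releases rename this to processing_class
+)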
+trainer.train()
+# Save to Drive
+model.save_pretrained(str(DRIVE_BASE / "fine_tuned_scibert"))
+```
+## Environment Management
+### Installing Packages
+```python
+# Install specific versions for reproducibility
+!pip install -q transformers==4.40.0 datasets==2.18.0 evaluate==0.4.1
+# Install from GitHub
+!pip install -q git+https://github.com/huggingface/peft.git
+# Install system packages
+!apt-get -qq install -y graphviz texlive-latex-base
+```
+### Reproducibility Setup
+```python
+import random
+import numpy as np
+import torch
+def set_seed(seed=42):
+    """Set all random seeds for reproducibility."""
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
+set_seed(42)
+```
+### Requirements File
+```python
+# Generate requirements for reproducibility
+!pip freeze > /content/drive/MyDrive/research/requirements.txt
+# Restore environment in new session
+!pip install -q -r /content/drive/MyDrive/research/requirements.txt
+```
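
A full `pip freeze` can be noisy on Colab because the base image ships hundreds of preinstalled packages. As a complement, a small snapshot of the interpreter, platform, and the handful of packages an experiment actually depends on can be stored next to the results. A minimal sketch, where the function name and the package list are illustrative assumptions:

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Collect interpreter, OS, and selected package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this runtime
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


snapshot = environment_snapshot(["torch", "transformers", "numpy"])
print(json.dumps(snapshot, indent=2))
```

Writing the resulting JSON into the experiment folder on Drive keeps the environment record alongside the checkpoints it describes.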
+## Performance Optimization
+### Memory Management
+```python
+# Monitor GPU memory
+!nvidia-smi
+# Clear GPU cache
+torch.cuda.empty_cache()
+# Mixed precision can speed up training on Tensor Core GPUs such as the T4 and A100
+from torch.cuda.amp import autocast, GradScaler
+scaler = GradScaler()
+for batch in train_loader:
+    inputs, labels = batch[0].to(device), batch[1].to(device)
+    optimizer.zero_grad()
+    with autocast():
+        outputs = model(inputs)
+        loss = criterion(outputs, labels)
+    scaler.scale(loss).backward()
+    scaler.step(optimizer)
+    scaler.update()
+```
+### Preventing Disconnection
+Free-tier Colab sessions are commonly reported to disconnect after about 90 minutes of inactivity; Google does not publish fixed limits, and they can change. Strategies:
+1. Use `tqdm` progress bars to show activity
+2. Save checkpoints frequently to Google Drive
+3. Structure experiments to complete within single sessions
+4. Use Colab Pro for longer runtimes (up to 24 hours)
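
One way to combine strategies 2 and 3 is a time-budgeted loop that stops early and saves state before the session is expected to end. The sketch below is an illustration under assumed names (`run_with_budget`, `do_step`, `save_state`); the budget values are placeholders, not Colab guarantees.

```python
import time


def run_with_budget(steps, budget_seconds, reserve_seconds, do_step, save_state):
    """Run steps until the remaining budget drops below the reserve, then save."""
    deadline = time.monotonic() + budget_seconds
    completed = 0
    for step in steps:
        if deadline - time.monotonic() < reserve_seconds:
            break  # stop early so there is still time to save
        do_step(step)
        completed += 1
    save_state(completed)  # always runs, whether or not all steps finished
    return completed


saved = []
done = run_with_budget(
    range(3),
    budget_seconds=60,
    reserve_seconds=5,
    do_step=lambda step: None,  # stand-in for one training epoch
    save_state=saved.append,    # stand-in for writing a checkpoint
)
print(done, saved)
```

In practice `do_step` would train one epoch and `save_state` would write a checkpoint to Drive, so an interrupted run can resume from the recorded step.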
+## GitHub Integration
+```python
+# Clone a research repository
+!git clone https://github.com/user/research-repo.git /content/repo
+# Push results back
+%cd /content/repo
+!git config user.email "researcher@university.edu"
+!git config user.name "Researcher"
+!git add results/
+!git commit -m "Add experiment results from Colab"
+!git push
+```
+## References
+- Google Colab documentation: https://colab.research.google.com/notebooks/intro.ipynb
+- Colab resource limits FAQ: https://research.google.com/colaboratory/faq.html
+- PyTorch on Colab: https://pytorch.org/tutorials/beginner/colab
+- Hugging Face + Colab: https://huggingface.co/docs/transformers/notebooks

package/skills/tools/code-exec/kaggle-api-guide/SKILL.md ADDED

@@ -0,0 +1,216 @@
+---
+name: kaggle-api-guide
+description: "Download datasets, manage competitions and notebooks via Kaggle API"
+metadata:
+  openclaw:
+    emoji: "📈"
+    category: "tools"
+    subcategory: "code-exec"
+    keywords: ["kaggle", "datasets", "competitions", "notebooks", "data-science", "machine-learning"]
+    source: "https://www.kaggle.com/docs/api"
+---
+# Kaggle API Guide
+## Overview
+Kaggle is the world's largest data science and machine learning community, hosting thousands of datasets, competitions, and computational notebooks. The Kaggle API provides programmatic access to these resources, enabling researchers to download datasets, submit competition entries, manage kernels (notebooks), and explore the Kaggle ecosystem from the command line or scripts.
+For academic researchers, Kaggle is a valuable resource for accessing curated, well-documented datasets across diverse domains including healthcare, natural language processing, computer vision, economics, and social sciences. Many published research papers use Kaggle datasets as benchmarks, and the platform's competition infrastructure provides standardized evaluation frameworks for comparing methods.
+The Kaggle API is available as a Python CLI tool and library. It requires a free Kaggle account and API token for authentication. The API supports dataset search and download, competition data retrieval, kernel management, and model access.
+## Authentication
+A free Kaggle API token is required. Generate one from your Kaggle account settings at https://www.kaggle.com/settings.
+Download the `kaggle.json` credentials file and place it in the standard location:
+```bash
+# The kaggle.json file should be at ~/.kaggle/kaggle.json
+# It contains your username and key from your Kaggle account settings
+mkdir -p ~/.kaggle
+# Move your downloaded kaggle.json to ~/.kaggle/kaggle.json
+chmod 600 ~/.kaggle/kaggle.json
+```
+Alternatively, use environment variables:
+```bash
+export KAGGLE_USERNAME="your-username"
+export KAGGLE_KEY="your-api-key"
+```
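
Before calling the CLI from a script, it can help to confirm that one of the two credential sources described above is actually present. A minimal standard-library sketch; the helper `find_kaggle_credentials` is hypothetical and only checks the two locations named in this section (environment variables and `~/.kaggle/kaggle.json`):

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(env=None, config_dir=None):
    """Return 'env', 'file', or None depending on where credentials are found."""
    env = os.environ if env is None else env
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token = config_dir / "kaggle.json"
    if token.is_file():
        data = json.loads(token.read_text())
        if data.get("username") and data.get("key"):
            return "file"
    return None


# Demo with an explicit environment mapping instead of the real one.
source = find_kaggle_credentials(
    env={"KAGGLE_USERNAME": "demo", "KAGGLE_KEY": "demo"}
)
print(source)
```

Passing `env` and `config_dir` explicitly keeps the check testable without touching real credentials.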
+Install the CLI tool:
+```bash
+pip install kaggle
+```
+## Core Endpoints
+### Search Datasets
+Find datasets by keyword, file type, or license.
+```bash
+# Search for datasets
+kaggle datasets list -s "climate change" --sort-by votes
+# Search with specific criteria
+kaggle datasets list -s "medical imaging" --file-type csv --max-size 1000000
+```
+### Download a Dataset
+```bash
+# Download and unzip a dataset
+kaggle datasets download -d "heptapod/titanic" --unzip -p ./data/titanic/
+# Download a specific file from a dataset
+kaggle datasets download -d "yelp-dataset/yelp-dataset" -f "yelp_academic_dataset_review.json" -p ./data/
+```
+### List and Join Competitions
+```bash
+# List active competitions
+kaggle competitions list
+# Download competition data (must accept rules on kaggle.com first)
+kaggle competitions download -c "house-prices-advanced-regression-techniques" -p ./data/house-prices/
+```
+### Submit to a Competition
+```bash
+# Submit predictions
+kaggle competitions submit -c "house-prices-advanced-regression-techniques" \
+  -f ./submission.csv -m "Random forest baseline v1"
+# Check submission status
+kaggle competitions submissions -c "house-prices-advanced-regression-techniques"
+```
+### Manage Notebooks (Kernels)
+```bash
+# Search for notebooks
+kaggle kernels list -s "transformer nlp" --sort-by voteCount
+# Pull a notebook to local
+kaggle kernels pull "username/notebook-name" -p ./notebooks/
+# Push a notebook to Kaggle
+kaggle kernels push -p ./my-notebook/
+```
+### Python Example: Automated Dataset Discovery and Download
+```python
+import csv
+import io
+import os
+import subprocess
+def search_kaggle_datasets(query, sort_by="votes", max_results=10):
+    """Search Kaggle datasets and return structured results."""
+    cmd = [
+        "kaggle", "datasets", "list",
+        "-s", query,
+        "--sort-by", sort_by,
+        "--max-size", "50000000",
+        "--csv"
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True)
+    if result.returncode != 0:
+        print(f"Error: {result.stderr}")
+        return []
+    # csv.DictReader handles quoted fields, such as titles that contain commas
+    reader = csv.DictReader(io.StringIO(result.stdout))
+    return list(reader)[:max_results]
+def download_dataset(dataset_ref, output_dir="./data"):
+    """Download a Kaggle dataset by reference."""
+    os.makedirs(output_dir, exist_ok=True)
+    cmd = [
+        "kaggle", "datasets", "download",
+        "-d", dataset_ref,
+        "--unzip",
+        "-p", output_dir
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True)
+    if result.returncode == 0:
+        print(f"Downloaded {dataset_ref} to {output_dir}")
+    else:
+        print(f"Error: {result.stderr}")
+# Search for NLP benchmark datasets
+datasets = search_kaggle_datasets("nlp text classification benchmark")
+for ds in datasets[:5]:
+    print(f"  {ds.get('ref', 'N/A')}")
+    print(f"    Size: {ds.get('size', 'N/A')}")  # column names may differ across CLI versions
+    print(f"    Votes: {ds.get('voteCount', 'N/A')}")
+    print()
+```
+### Python Example: Using the Kaggle Python API Directly
+```python
+from kaggle.api.kaggle_api_extended import KaggleApi
+api = KaggleApi()
+api.authenticate()
+# Search datasets
+datasets = api.dataset_list(search="genomics", sort_by="updated")
+for ds in datasets[:5]:
+    print(f"{ds.ref}: {ds.title} ({ds.size})")
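+# Get dataset metadata (method and attribute names vary across kaggle package versions; verify against your installed release)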
+metadata = api.dataset_view("nih-chest-xrays/data")
+print(f"Title: {metadata.title}")
+print(f"Size: {metadata.totalBytes}")
+print(f"Description: {metadata.description[:200]}")
+# Download dataset files
+api.dataset_download_files(
+    "nih-chest-xrays/sample",
+    path="./data/chest-xrays/",
+    unzip=True
+)
+```
+## Common Research Patterns
+**Benchmark Dataset Access:** Download well-established datasets used in published research for reproducibility studies. Kaggle hosts canonical versions of many benchmark datasets referenced in ML papers.
+**Competition as Evaluation Framework:** Use Kaggle competitions as standardized evaluation environments with leaderboards and held-out test sets. Submit predictions from novel methods to compare against state-of-the-art approaches.
+**Data Exploration Notebooks:** Search for and pull community notebooks that explore datasets relevant to your research. These often contain valuable preprocessing code, exploratory analysis, and baseline models.
+**Collaborative Research Datasets:** Upload processed research datasets to Kaggle for sharing with collaborators and the broader community, enabling others to reproduce and extend your work.
+**Cross-Domain Transfer:** Search across Kaggle's diverse dataset collection to find datasets from adjacent domains that could be useful for transfer learning or cross-domain validation studies.
+## Rate Limits and Best Practices
+- **API rate limits:** Kaggle throttles API calls, and exact quotas are not published; space out requests and retry with backoff when calls are rejected
+- **Download limits:** Large datasets may take significant time and disk space; check sizes before downloading
+- **Competition rules:** Always accept competition rules on the Kaggle website before attempting to download competition data via API
+- **Kernel push format:** When pushing notebooks, include a `kernel-metadata.json` file specifying the kernel type, language, and datasets
+- **Authentication security:** Never commit `kaggle.json` to version control; use environment variables in CI/CD pipelines
+- **Dataset versioning:** Kaggle datasets support versions; specify version numbers for reproducibility in research
+- **Large files:** For datasets over 10GB, consider using the Kaggle CLI rather than the Python API for more reliable downloads
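
The kernel push requirement above can be illustrated by generating a `kernel-metadata.json` from Python. The field names below follow the kernel metadata format as described in the kaggle-api repository, to the best of my knowledge; treat them as an assumption and compare against the template produced by `kaggle kernels init` for your installed version. The helper name, user, slug, and file names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_kernel_metadata(folder, username, slug, title, code_file, datasets=()):
    """Write a kernel-metadata.json into folder and return its path."""
    metadata = {
        "id": f"{username}/{slug}",
        "title": title,
        "code_file": code_file,
        "language": "python",
        "kernel_type": "notebook",
        "is_private": True,
        "enable_gpu": False,
        "enable_internet": False,
        "dataset_sources": list(datasets),
        "competition_sources": [],
        "kernel_sources": [],
    }
    path = Path(folder) / "kernel-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


# Demo in a throwaway directory with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_kernel_metadata(
        tmp, "demo-user", "demo-notebook", "Demo Notebook",
        "analysis.ipynb", datasets=["heptapod/titanic"],
    )
    loaded = json.loads(meta_path.read_text())

print(loaded["id"], loaded["dataset_sources"])
```

With the metadata file and the notebook in the same folder, `kaggle kernels push -p <folder>` is the push command shown earlier in this guide.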
+## References
+- Kaggle API Documentation: https://www.kaggle.com/docs/api
+- Kaggle API GitHub Repository: https://github.com/Kaggle/kaggle-api
+- Kaggle Datasets: https://www.kaggle.com/datasets
+- Kaggle Competitions: https://www.kaggle.com/competitions
+- Kaggle Notebooks: https://www.kaggle.com/code