npm - @wentorai/research-plugins - Versions diffs - 1.0.0 → 1.2.0 - Mend

Files changed (415)

package/skills/analysis/statistics/infiagent-benchmark-guide/SKILL.md ADDED

@@ -0,0 +1,106 @@
+---
+name: infiagent-benchmark-guide
+description: "Agent benchmark for data analysis evaluation (ICML 2024)"
+metadata:
+  openclaw:
+    emoji: "🏆"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["InfiAgent", "benchmark", "data analysis", "agent evaluation", "ICML", "DABench"]
+    source: "https://github.com/InfiAgent/InfiAgent"
+---
+# InfiAgent Data Analysis Benchmark Guide
+## Overview
+InfiAgent (ICML 2024) is a benchmark for evaluating AI agents on data analysis tasks. It provides DABench, a standardized set of data analysis problems ranging from basic EDA to complex statistical modeling, each with ground-truth solutions and automated evaluation metrics. It measures agent capabilities in code generation, statistical reasoning, and visualization.
+## Benchmark Structure
+```
+DABench (Data Analysis Benchmark)
+├── Task Categories
+│   ├── Data Understanding (profiling, cleaning)
+│   ├── Exploratory Analysis (distributions, correlations)
+│   ├── Statistical Testing (hypothesis tests)
+│   ├── Visualization (appropriate chart selection)
+│   ├── Modeling (regression, classification)
+│   └── Interpretation (insights, conclusions)
+├── Difficulty Levels
+│   ├── Easy (single-step operations)
+│   ├── Medium (multi-step analysis)
+│   └── Hard (complex reasoning + code)
+└── Evaluation Metrics
+    ├── Code executability
+    ├── Answer correctness
+    ├── Visualization quality
+    └── Statistical validity
+```
+## Usage
+```python
+from infiagent import DABench
+bench = DABench()
+# List tasks
+for task in bench.tasks[:5]:
+    print(f"[{task.difficulty}] {task.id}: {task.description}")
+    print(f"  Dataset: {task.dataset}")
+    print(f"  Category: {task.category}")
+# Evaluate an agent
+# my_data_agent is a placeholder for your own agent callable; define it before running
+from infiagent import evaluate
+results = evaluate(
+    agent_fn=my_data_agent,
+    tasks="all",
+    timeout=120,
+)
+print(f"Executability: {results.exec_rate:.1%}")
+print(f"Correctness: {results.correct_rate:.1%}")
+print(f"Statistical validity: {results.stats_valid:.1%}")
+```
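The rates printed above can be read as simple aggregates over per-task outcomes. The following is a minimal sketch of that aggregation, assuming a hypothetical `TaskResult` record and `summarize` helper. Neither is part of the InfiAgent code base, and the rule that a task only counts as correct when its code also executed is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task outcome; not an InfiAgent type."""
    task_id: str
    executed: bool  # generated code ran without raising
    correct: bool   # final answer matched the ground truth


def summarize(results):
    """Aggregate per-task outcomes into rates over all tasks."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    executed = sum(r.executed for r in results)
    # Assumption: a task only counts as correct if its code also executed
    correct = sum(r.executed and r.correct for r in results)
    return {"exec_rate": executed / n, "correct_rate": correct / n}


# Made-up outcomes, for illustration only
demo = [
    TaskResult("t1", executed=True, correct=True),
    TaskResult("t2", executed=True, correct=False),
    TaskResult("t3", executed=False, correct=False),
    TaskResult("t4", executed=True, correct=True),
]
summary = summarize(demo)
print(f"Executability: {summary['exec_rate']:.1%}")
print(f"Correctness: {summary['correct_rate']:.1%}")
```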
+## Task Examples
+```python
+# Easy: "What is the mean and standard deviation of column X?"
+# Medium: "Is there a significant correlation between A and B?
+#          Control for confounders C and D."
+# Hard: "Build a predictive model for Y using all available
+#        features. Report cross-validated performance and
+#        identify the 3 most important features."
+```
+## Leaderboard Results
+```python
+# Selected results from DABench
+scores = {
+    "GPT-4 + Code": {"exec": 95, "correct": 67},
+    "Claude 3.5 Sonnet": {"exec": 93, "correct": 64},
+    "GPT-3.5 + Code": {"exec": 88, "correct": 45},
+    "CodeLlama-34B": {"exec": 72, "correct": 31},
+}
+print(f"{'Agent':<22} {'Exec%':>6} {'Correct%':>9}")
+for agent, s in scores.items():
+    print(f"{agent:<22} {s['exec']:>5}% {s['correct']:>8}%")
+```
+## Use Cases
+1. **Agent evaluation**: Standard benchmark for data analysis agents
+2. **Model comparison**: Compare LLMs on analytical tasks
+3. **Capability testing**: Assess statistical reasoning abilities
+4. **Research**: Study agent strengths and failure modes
+5. **Development**: Target specific weak areas for improvement
+## References
+- [InfiAgent GitHub](https://github.com/InfiAgent/InfiAgent)
+- [DABench Paper (ICML 2024)](https://arxiv.org/abs/2401.05507)

package/skills/analysis/statistics/ml-experiment-tracker/SKILL.md ADDED

@@ -0,0 +1,212 @@
+---
+name: ml-experiment-tracker
+description: "Plan reproducible ML experiment runs with parameters and metrics tracking"
+metadata:
+  openclaw:
+    emoji: "🧪"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["experiment tracking", "machine learning", "reproducibility", "hyperparameters", "MLflow", "model evaluation"]
+    source: "https://github.com/AcademicSkills/ml-experiment-tracker"
+---
+# ML Experiment Tracker
+A skill for planning, executing, and tracking machine learning experiments with full reproducibility. Covers experiment design, hyperparameter management, metric logging, model versioning, and comparison across runs to support rigorous ML research.
+## Overview
+Machine learning research involves running dozens or hundreds of experiments with varying architectures, hyperparameters, data splits, and preprocessing pipelines. Without systematic tracking, it becomes impossible to reproduce results, compare configurations, or identify which changes actually improved performance. This skill provides a structured methodology for experiment management that aligns with academic standards for reproducible ML research.
+The approach is framework-agnostic. The examples here use MLflow, and the same practices carry over to Weights & Biases or plain file-based logging. It emphasizes the practices needed for publications: complete hyperparameter documentation, statistical significance testing across runs, and artifact management for model checkpoints and evaluation outputs.
+## Experiment Design Framework
+### Defining an Experiment Plan
+Before writing any training code, document the experiment plan:
+```yaml
+# experiment_plan.yaml
+experiment:
+  name: "transformer-sentiment-analysis-v3"
+  hypothesis: "Adding relative positional encoding improves F1 on long reviews (>512 tokens)"
+  dataset:
+    name: "imdb-extended"
+    version: "2025.1"
+    splits: {train: 0.8, val: 0.1, test: 0.1}
+    stratify_by: "label"
+    random_seed: 42
+  baselines:
+    - name: "bert-base-uncased"
+      checkpoint: "bert-base-uncased"
+    - name: "roberta-base"
+      checkpoint: "roberta-base"
+  variables:
+    independent:
+      - positional_encoding: ["absolute", "relative", "rotary"]
+    controlled:
+      - learning_rate: 2e-5
+      - batch_size: 32
+      - max_epochs: 10
+      - early_stopping_patience: 3
+      - optimizer: "AdamW"
+      - weight_decay: 0.01
+  metrics:
+    primary: "f1_macro"
+    secondary: ["accuracy", "precision_macro", "recall_macro", "loss"]
+    report_at: ["best_val", "final"]
+  compute:
+    gpus: 1
+    estimated_time_per_run: "45min"
+    total_runs: 9  # 3 encodings x 3 seeds
+  seeds: [42, 123, 456]
+```
+### Factorial Design for Hyperparameter Studies
+```python
+from itertools import product
+def generate_experiment_grid(config: dict) -> list:
+    """
+    Generate all experiment configurations from a factorial design.
+    """
+    param_names = list(config.keys())
+    param_values = list(config.values())
+    runs = []
+    for combo in product(*param_values):
+        run_config = dict(zip(param_names, combo))
+        run_config['run_id'] = '_'.join(f"{k}={v}" for k, v in run_config.items())
+        runs.append(run_config)
+    return runs
+# Example: 3 learning rates x 2 batch sizes x 3 seeds = 18 runs
+grid = generate_experiment_grid({
+    'learning_rate': [1e-5, 2e-5, 5e-5],
+    'batch_size': [16, 32],
+    'seed': [42, 123, 456]
+})
+```
+## Experiment Logging with MLflow
+### Setup and Run Tracking
+```python
+import platform
+from datetime import datetime
+from importlib import metadata
+from typing import Optional
+import mlflow
+def start_tracked_experiment(experiment_name: str, run_config: dict) -> str:
+    """
+    Start an MLflow run and log its full configuration.
+    The run is left active so that log_epoch_metrics() and log_final_results()
+    attach to it; log_final_results() closes it.
+    """
+    mlflow.set_experiment(experiment_name)
+    run = mlflow.start_run(run_name=run_config.get('run_id'))
+    # Log all hyperparameters
+    mlflow.log_params(run_config)
+    # Log environment info for reproducibility, read from the running environment
+    mlflow.log_param("python_version", platform.python_version())
+    try:
+        mlflow.log_param("torch_version", metadata.version("torch"))
+    except metadata.PackageNotFoundError:
+        pass  # torch is not installed; skip this field
+    mlflow.log_param("timestamp", datetime.now().isoformat())
+    # Log the full config as a JSON artifact
+    mlflow.log_dict(run_config, "run_config.json")
+    return run.info.run_id
+def log_epoch_metrics(epoch: int, metrics: dict):
+    """Log metrics for a training epoch to the active run."""
+    for name, value in metrics.items():
+        mlflow.log_metric(name, value, step=epoch)
+def log_final_results(metrics: dict, model_path: Optional[str] = None):
+    """Log final metrics and optionally the model artifact, then close the run."""
+    for name, value in metrics.items():
+        mlflow.log_metric(f"final_{name}", value)
+    if model_path:
+        mlflow.log_artifact(model_path)
+    mlflow.end_run()
+```
+## Results Comparison and Statistical Testing
+### Comparing Runs Across Seeds
+```python
+from scipy import stats
+import numpy as np
+def compare_experiment_results(results: dict) -> dict:
+    """
+    Compare experiment configurations using statistical tests.
+    Args:
+        results: Dict mapping config_name -> list of metric values across seeds,
+        listed in the same seed order for every config, e.g.
+        {'relative_pe': [0.87, 0.86, 0.88], 'absolute_pe': [0.84, 0.82, 0.85]}
+    """
+    config_names = list(results.keys())
+    comparisons = {}
+    for i in range(len(config_names)):
+        for j in range(i + 1, len(config_names)):
+            name_a, name_b = config_names[i], config_names[j]
+            values_a, values_b = results[name_a], results[name_b]
+            # Paired t-test (valid only because both configs share the same seeds)
+            t_stat, p_value = stats.ttest_rel(values_a, values_b)
+            # Effect size for paired data (Cohen's d_z: mean difference / SD of differences)
+            diff = np.array(values_a) - np.array(values_b)
+            sd_diff = np.std(diff, ddof=1)
+            # Identical differences across seeds give SD = 0; report NaN instead of dividing by zero
+            cohens_d = float(np.mean(diff) / sd_diff) if sd_diff > 0 else float('nan')
+            comparisons[f"{name_a}_vs_{name_b}"] = {
+                'mean_a': float(np.mean(values_a)),
+                'mean_b': float(np.mean(values_b)),
+                'mean_diff': float(np.mean(diff)),
+                't_statistic': round(float(t_stat), 4),
+                'p_value': round(float(p_value), 4),
+                'significant': bool(p_value < 0.05),
+                'cohens_d': round(cohens_d, 3)
+            }
+    return comparisons
+```
+### Results Summary Table
+| Configuration | F1 (mean +/- std) | Accuracy | p-value vs. baseline |
+|--------------|-------------------|----------|---------------------|
+| Baseline (absolute PE) | 0.840 +/- 0.010 | 0.852 | -- |
+| Relative PE | 0.870 +/- 0.008 | 0.881 | 0.003 |
+| Rotary PE | 0.865 +/- 0.012 | 0.876 | 0.011 |
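The "mean +/- std" column in a table like this one is an aggregate over seeds. The following is a minimal sketch of that aggregation using only the standard library. The `summarize_runs` helper and the per-seed scores are hypothetical, chosen for illustration, and are not the data behind the table.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_config):
    """Return {config: (mean, sample standard deviation)} across seeds."""
    return {
        name: (mean(values), stdev(values))
        for name, values in scores_by_config.items()
    }


# Hypothetical per-seed F1 scores (3 seeds each), for illustration only
f1_scores = {
    "absolute_pe": [0.83, 0.84, 0.85],
    "relative_pe": [0.86, 0.87, 0.88],
}

summary = summarize_runs(f1_scores)
for name, (m, s) in summary.items():
    print(f"{name}: {m:.3f} +/- {s:.3f}")
```

`stdev` is the sample standard deviation (n - 1 in the denominator), which is the usual choice when a handful of seeds stands in for the full distribution of runs.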
+## Reproducibility Checklist
+Before submitting ML results for publication, verify:
+- [ ] Random seeds are fixed and reported for all stochastic operations
+- [ ] Dataset version and exact split indices are saved
+- [ ] All hyperparameters are logged (not just the "important" ones)
+- [ ] Software versions (framework, CUDA, key libraries) are documented
+- [ ] Results are averaged over at least 3 random seeds with standard deviations
+- [ ] Statistical significance tests are performed for key comparisons
+- [ ] Model checkpoints or training scripts are archived
+- [ ] Data preprocessing pipeline is fully specified and deterministic
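Two of the checklist items, fixed seeds and saved split indices, can be sketched with the standard library alone. The helper names `make_split_indices` and `run_manifest` are hypothetical. A real project would also seed its ML framework and record library and CUDA versions, which this sketch leaves out.

```python
import json
import platform
import random
import sys


def make_split_indices(n_items, seed, train=0.8, val=0.1):
    """Shuffle indices with a fixed seed and cut train/val/test splits."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = round(n_items * train)
    n_val = round(n_items * val)
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


def run_manifest(seed, splits):
    """Collect the facts a reader needs to reproduce the data split."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": sys.platform,
        "split_sizes": {name: len(idx) for name, idx in splits.items()},
        "splits": splits,
    }


splits = make_split_indices(n_items=100, seed=42)
manifest = run_manifest(42, splits)
manifest_json = json.dumps(manifest, indent=2)  # store this next to the results
```

Saving the indices themselves, and not only the seed, keeps the split recoverable even if the shuffling implementation changes between library versions.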
+## References
+- Bouthillier, X., et al. (2021). Accounting for Variance in Machine Learning Benchmarks. *MLSys 2021*.
+- Zaharia, M., et al. (2018). Accelerating the Machine Learning Lifecycle with MLflow. *IEEE Data Eng. Bull.*
+- Dodge, J., et al. (2019). Show Your Work: Improved Reporting of Experimental Results. *EMNLP 2019*.

package/skills/analysis/statistics/pywayne-statistics-guide/SKILL.md ADDED

@@ -0,0 +1,192 @@
+---
+name: pywayne-statistics-guide
+description: "37+ statistical testing methods for rigorous hypothesis testing"
+metadata:
+  openclaw:
+    emoji: "📐"
+    category: "analysis"
+    subcategory: "statistics"
+    keywords: ["hypothesis testing", "statistical tests", "p-value", "parametric tests", "nonparametric tests", "effect size", "multiple comparisons"]
+    source: "https://github.com/AcademicSkills/pywayne-statistics-guide"
+---
+# PyWayne Statistics Guide
+A comprehensive reference for 37+ statistical testing methods covering parametric, nonparametric, and resampling-based hypothesis tests. Provides decision trees for test selection, implementation in Python (scipy, statsmodels, pingouin), effect size calculation, and proper reporting standards for academic publications.
+## Overview
+Hypothesis testing remains the backbone of quantitative research across the sciences, social sciences, and engineering. However, selecting the appropriate test for a given research question, data structure, and assumption profile is a persistent challenge, especially for researchers outside statistics. This skill provides a structured decision framework that maps research questions to the correct statistical test, verifies assumptions, computes test statistics and effect sizes, and formats results for publication.
+All 37+ tests are organized by the type of comparison (one-sample, two-sample, k-sample, association, agreement) and whether parametric assumptions are met. Each test entry includes when to use it, assumptions to verify, the Python implementation, and the correct APA-style reporting format.
+## Test Selection Decision Tree
+### Step 1: Identify the Research Question Type
+| Question Type | Examples |
+|--------------|----------|
+| **One-sample** | Is this sample mean different from a known value? |
+| **Two-sample (independent)** | Do treatment and control groups differ? |
+| **Two-sample (paired)** | Do pre-test and post-test scores differ? |
+| **K-sample (independent)** | Do 3+ groups differ on an outcome? |
+| **K-sample (repeated)** | Do measurements differ across 3+ time points? |
+| **Association** | Is variable X related to variable Y? |
+| **Agreement** | Do two raters/methods agree? |
+### Step 2: Check Data Type and Assumptions
+```
+Is the outcome variable continuous?
+├── Yes → Are the data normally distributed?
+│   ├── Yes → Are variances equal (for group comparisons)?
+│   │   ├── Yes → Use PARAMETRIC test
+│   │   └── No → Use Welch's correction or nonparametric
+│   └── No → Use NONPARAMETRIC test
+└── No → Is it ordinal or nominal?
+    ├── Ordinal → Use rank-based NONPARAMETRIC test
+    └── Nominal → Use CHI-SQUARE or exact test
+```
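The decision tree above can be written down as a small function. This is a sketch only: `select_test_family` and its string labels are hypothetical names, and the `normal` and `equal_variance` flags are assumed to come from assumption checks run beforehand (for example Shapiro-Wilk and Levene tests).

```python
def select_test_family(outcome, normal=None, equal_variance=None):
    """Map the decision tree to a test-family label.

    outcome: "continuous", "ordinal", or "nominal"
    normal, equal_variance: results of earlier assumption checks (bool)
    """
    if outcome == "continuous":
        if not normal:
            return "nonparametric"
        if equal_variance:
            return "parametric"
        return "welch or nonparametric"
    if outcome == "ordinal":
        return "rank-based nonparametric"
    if outcome == "nominal":
        return "chi-square or exact"
    raise ValueError(f"unknown outcome type: {outcome!r}")


print(select_test_family("continuous", normal=True, equal_variance=False))
```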
+## Parametric Tests
+### Two-Sample Tests
+```python
+from scipy import stats
+import pingouin as pg
+import numpy as np
+def two_sample_comparison(group_a, group_b, paired=False):
+    """
+    Perform the appropriate two-sample test with assumption checks.
+    """
+    a = np.asarray(group_a, dtype=float)
+    b = np.asarray(group_b, dtype=float)
+    results = {}
+    if paired:
+        # Assumption: normality of the paired differences (not of each group)
+        diff = a - b
+        _, p_norm = stats.shapiro(diff)
+        if p_norm > 0.05:
+            # Paired t-test
+            t, p = stats.ttest_rel(a, b)
+            d = pg.compute_effsize(a, b, paired=True, eftype='cohen')
+            results = {'test': 'paired t-test', 't': t, 'p': p, 'cohens_d': d}
+        else:
+            # Wilcoxon signed-rank (zero differences are dropped by default)
+            w, p = stats.wilcoxon(a, b)
+            n = np.count_nonzero(diff)
+            # Magnitude of the matched-pairs rank-biserial correlation;
+            # w is the smaller of the two signed-rank sums
+            r = 1 - 2 * w / (n * (n + 1) / 2)
+            results = {'test': 'Wilcoxon signed-rank', 'W': w, 'p': p, 'rank_biserial': r}
+    else:
+        # Assumption: normality within each independent group
+        _, p_norm_a = stats.shapiro(a)
+        _, p_norm_b = stats.shapiro(b)
+        if p_norm_a > 0.05 and p_norm_b > 0.05:
+            # Check equal variances
+            _, p_levene = stats.levene(a, b)
+            if p_levene > 0.05:
+                t, p = stats.ttest_ind(a, b)
+                results = {'test': 'independent t-test', 't': t, 'p': p}
+            else:
+                t, p = stats.ttest_ind(a, b, equal_var=False)
+                results = {'test': "Welch's t-test", 't': t, 'p': p}
+            d = pg.compute_effsize(a, b, eftype='cohen')
+            results['cohens_d'] = d
+        else:
+            # Mann-Whitney U
+            u, p = stats.mannwhitneyu(a, b, alternative='two-sided')
+            results = {'test': 'Mann-Whitney U', 'U': u, 'p': p}
+    return results
+```
+### K-Sample Tests (ANOVA Family)
+| Test | Use Case | Assumptions |
+|------|----------|-------------|
+| One-way ANOVA | 3+ independent groups, continuous outcome | Normality, homoscedasticity |
+| Welch's ANOVA | 3+ groups, unequal variances | Normality |
+| Repeated measures ANOVA | 3+ related measurements | Normality, sphericity |
+| Two-way ANOVA | Two factors, continuous outcome | Normality, homoscedasticity |
+| ANCOVA | Group comparison controlling for covariate | Normality, homogeneity of slopes |
+| MANOVA | Multiple dependent variables | Multivariate normality |
+```python
+from scipy import stats
+def k_sample_test(groups: list, method: str = 'auto'):
+    """Run the appropriate k-sample comparison (only method='auto' is implemented)."""
+    # Check normality for all groups
+    all_normal = all(stats.shapiro(g)[1] > 0.05 for g in groups)
+    if all_normal:
+        # Check homogeneity of variance
+        _, p_levene = stats.levene(*groups)
+        if p_levene > 0.05:
+            f, p = stats.f_oneway(*groups)
+            return {'test': 'one-way ANOVA', 'F': f, 'p': p}
+        else:
+            # Welch's ANOVA is not computed here; run pg.welch_anova()
+            # on a long-format DataFrame to obtain F and p
+            return {'test': "Welch's ANOVA", 'note': 'Use pg.welch_anova()'}
+    else:
+        h, p = stats.kruskal(*groups)
+        return {'test': 'Kruskal-Wallis H', 'H': h, 'p': p}
+```
+## Nonparametric Tests Reference
+| Parametric Test | Nonparametric Alternative | When to Use |
+|----------------|--------------------------|-------------|
+| One-sample t-test | Wilcoxon signed-rank | Non-normal single sample |
+| Independent t-test | Mann-Whitney U | Non-normal, 2 independent groups |
+| Paired t-test | Wilcoxon signed-rank | Non-normal, paired data |
+| One-way ANOVA | Kruskal-Wallis H | Non-normal, 3+ groups |
+| Repeated measures ANOVA | Friedman test | Non-normal, 3+ related measures |
+| Pearson correlation | Spearman rho / Kendall tau | Monotonic but non-linear, or ordinal, association |
+## Multiple Comparisons Correction
+When performing multiple hypothesis tests, control either the family-wise error rate (FWER) or the false discovery rate (FDR):
+```python
+from statsmodels.stats.multitest import multipletests
+def correct_multiple_tests(p_values: list, method: str = 'fdr_bh') -> dict:
+    """
+    Apply multiple comparisons correction.
+    Methods:
+        'bonferroni': Conservative, controls FWER
+        'holm': Less conservative than Bonferroni, controls FWER
+        'fdr_bh': Benjamini-Hochberg, controls FDR (recommended default)
+        'fdr_by': Benjamini-Yekutieli, conservative FDR control
+    """
+    reject, corrected_p, _, _ = multipletests(p_values, method=method)
+    return {
+        'method': method,
+        'original_p': p_values,
+        'corrected_p': corrected_p.tolist(),
+        'reject': reject.tolist(),
+        'n_significant': int(reject.sum())
+    }
+```
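To make the `'fdr_bh'` option less of a black box, here is a hand-rolled sketch of the Benjamini-Hochberg step-up adjustment in plain Python. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions. For real analyses the `multipletests` call above remains the better choice.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return (adjusted_p, reject) using the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Illustrative p-values only
adjusted, reject = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
print(adjusted)
print(reject)
```

Each raw p-value is scaled by m / rank, and the running minimum ensures that a smaller raw p-value never ends up with a larger adjusted value than its neighbour.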
+## Effect Size Reference
+| Test | Effect Size | Small | Medium | Large |
+|------|------------|-------|--------|-------|
+| t-test | Cohen's d | 0.2 | 0.5 | 0.8 |
+| ANOVA | Eta-squared | 0.01 | 0.06 | 0.14 |
+| Correlation | r | 0.1 | 0.3 | 0.5 |
+| Chi-square | Cramér's V | 0.1 | 0.3 | 0.5 |
+| Mann-Whitney | Rank-biserial r | 0.1 | 0.3 | 0.5 |
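As a worked instance of the first table row, the sketch below computes Cohen's d for two independent groups with a pooled standard deviation and labels it with the 0.2 / 0.5 / 0.8 benchmarks. The helper names, the "negligible" label for values under 0.2, and the scores are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def interpret_d(d):
    """Label |d| with the conventional small/medium/large benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # assumption: label for values below the 'small' benchmark


# Hypothetical scores, for illustration only
treatment = [5.0, 6.0, 7.0, 8.0, 9.0]
control = [4.0, 5.0, 6.0, 7.0, 8.0]
d = cohens_d(treatment, control)
print(f"d = {d:.2f} ({interpret_d(d)})")
```

The benchmarks are conventions, not thresholds with intrinsic meaning, so the label is best reported alongside the numeric value.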
+## APA Reporting Examples
+- **t-test**: "An independent samples t-test revealed a significant difference, t(58) = 2.45, p = .017, d = 0.63."
+- **ANOVA**: "A one-way ANOVA showed a significant main effect of condition, F(2, 87) = 4.12, p = .020, η² = .09."
+- **Mann-Whitney**: "A Mann-Whitney U test indicated that scores were significantly higher in the treatment group, U = 245, p = .003, r = .42."
+- **Chi-square**: "A chi-square test of independence revealed a significant association, χ²(2, N = 150) = 8.34, p = .015, V = .24."
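The formatting pattern in these examples (no leading zero on p, three decimals, and a floor at "p < .001") can be sketched as a small helper. `format_p` and `format_t` are hypothetical names, and the sketch covers only the t-test case.

```python
def format_p(p):
    """Format a p-value APA style: no leading zero, three decimals, floor at < .001."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def format_t(t, df, p, d):
    """Format a t-test result in the style of the example above."""
    return f"t({df}) = {t:.2f}, {format_p(p)}, d = {d:.2f}"


print(format_t(2.45, 58, 0.017, 0.63))
print(format_p(0.0004))
```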
+## References
+- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Routledge.
+- Vallat, R. (2018). Pingouin: Statistics in Python. *JOSS*, 3(31), 1026.
+- Lakens, D. (2013). Calculating and Reporting Effect Sizes. *Frontiers in Psychology*, 4, 863.