npm - @wentorai/research-plugins - Versions diffs - 1.2.3 → 1.3.0 - Mend

@wentorai/research-plugins 1.2.3 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (142)

package/skills/analysis/statistics/general-statistics-guide/SKILL.md DELETED

@@ -1,226 +0,0 @@
----
-name: general-statistics-guide
-description: "Conceptual foundations of statistical inference for empirical research"
-metadata:
-  openclaw:
-    emoji: "📈"
-    category: "analysis"
-    subcategory: "statistics"
-    keywords: ["statistical inference", "hypothesis testing", "probability", "regression", "confidence intervals", "statistical thinking"]
-    source: "https://clawhub.com/ivangdavila/statistics"
----
-# Statistical Foundations for Empirical Research
-## Overview
-This guide builds statistical intuition from probability fundamentals through inferential methods to practical application in research. It is language-agnostic (not tied to R, Python, or Stata) and focuses on the concepts, assumptions, and interpretation of statistical methods commonly used in empirical papers. Use it as a reference when designing studies, choosing tests, or interpreting results.
-## Probability Foundations
-### Key Distributions
-| Distribution | When to Use | Parameters | Example |
-|-------------|-------------|-----------|---------|
-| **Normal** | Continuous, symmetric data; CLT applications | μ (mean), σ (std) | Height, test scores |
-| **Binomial** | Count of successes in n trials | n (trials), p (probability) | Survey yes/no responses |
-| **Poisson** | Count of rare events in fixed interval | λ (rate) | Paper citations per year |
-| **t-distribution** | Means when σ is estimated from the sample, especially for small n (< 30) | df (degrees of freedom) | Pilot study comparisons |
-| **Chi-squared** | Goodness of fit, contingency tables | df | Category frequency tests |
-| **F-distribution** | Ratio of variances, ANOVA | df₁, df₂ | Comparing model fits |
-### Central Limit Theorem
-The sample mean $\bar{X}$ of n independent observations approaches a normal distribution as n increases, regardless of the population distribution:
-```
-If X₁, X₂, ..., Xₙ are i.i.d. with mean μ and variance σ²:
-  √n(X̄ - μ) / σ → N(0, 1) as n → ∞
-Practical rule: n ≥ 30 is usually sufficient
-Exception: heavily skewed distributions may need n ≥ 100
-```
-This is why many inferential procedures (confidence intervals, t-tests, regression) remain approximately valid for sufficiently large samples even when the underlying data are not normally distributed.
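A short simulation can make the theorem concrete. The sketch below uses only the Python standard library and draws repeated samples from a right-skewed Exponential(1) population; the choice of population, the function name `standardized_means`, and the seed are illustrative assumptions rather than part of the guide.

```python
import random
import statistics

def standardized_means(n, reps=5000, seed=0):
    """Return sqrt(n) * (sample mean - mu) / sigma for `reps` samples of size n.

    Samples come from Exponential(1), which is right-skewed with mu = sigma = 1.
    """
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0
    out = []
    for _ in range(reps):
        xbar = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        out.append((n ** 0.5) * (xbar - mu) / sigma)
    return out

z = standardized_means(n=50)
# If the normal approximation is reasonable, the standardized means should
# have mean near 0 and standard deviation near 1.
print(round(statistics.fmean(z), 2), round(statistics.stdev(z), 2))
# Roughly 95% of them should fall inside +/-1.96.
coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
print(round(coverage, 3))
```

Rerunning with a smaller `n` (for example 5) should show a visibly worse approximation, which is the practical content of the sample-size rules of thumb above.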
-## Descriptive Statistics
-### Measures of Central Tendency
-| Measure | Formula | When to Use | Sensitive to Outliers? |
-|---------|---------|-------------|----------------------|
-| Mean | Σxᵢ / n | Symmetric distributions | Yes |
-| Median | Middle value when sorted | Skewed distributions, ordinal data | No |
-| Mode | Most frequent value | Categorical data, multimodal distributions | No |
-### Measures of Spread
-| Measure | Interpretation | When to Report |
-|---------|---------------|----------------|
-| Standard deviation (σ) | Typical (root-mean-square) distance from the mean | With the mean |
-| IQR (Q3 - Q1) | Spread of middle 50% | With the median |
-| Range (max - min) | Total spread | Rarely (sensitive to outliers) |
-| Coefficient of variation (σ/μ) | Relative spread | Comparing variability across scales |
-## Hypothesis Testing
-### The Testing Framework
-```
-1. State hypotheses:
-   H₀: null hypothesis (no effect, no difference)
-   H₁: alternative hypothesis (there is an effect)
-2. Choose significance level: α = 0.05 (conventional)
-3. Compute test statistic from data
-4. Compare to critical value or compute p-value
-5. Decision:
-   p < α → Reject H₀ (statistically significant)
-   p ≥ α → Fail to reject H₀ (not significant)
-```
-### Common Errors
-| | H₀ is True | H₀ is False |
-|---|---|---|
-| **Reject H₀** | Type I Error (α) | Correct (Power = 1 - β) |
-| **Fail to Reject H₀** | Correct | Type II Error (β) |
-**Practical interpretation**:
-- Type I (false positive): Claiming a drug works when it doesn't
-- Type II (false negative): Missing a real drug effect
-- Power: Probability of detecting a real effect (target ≥ 0.80)
-### Choosing the Right Test
-| Question | Data Type | Test | Assumptions |
-|----------|-----------|------|-------------|
-| Compare 2 means | Continuous, normal | Independent t-test | Equal variance (or Welch's) |
-| Compare 2 means (paired) | Continuous, normal | Paired t-test | Paired observations |
-| Compare 2 means (non-normal) | Continuous/ordinal | Mann-Whitney U | Independent samples |
-| Compare >2 means | Continuous, normal | One-way ANOVA | Equal variance, normality |
-| Compare >2 means (non-normal) | Ordinal | Kruskal-Wallis | Independent samples |
-| Association (categorical) | Categorical × Categorical | Chi-squared test | Expected count ≥ 5 |
-| Correlation | Continuous × Continuous | Pearson r | Linear relationship, bivariate normal |
-| Correlation (non-normal) | Ordinal or non-normal | Spearman ρ | Monotonic relationship |
-## Regression Analysis
-### Linear Regression
-```
-Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
-Interpretation:
-  β₁ = change in Y for a 1-unit increase in X₁, holding other X's constant
-  R² = proportion of variance in Y explained by the model
-  Adjusted R² = R² penalized for number of predictors
-```
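To make the quantities above concrete, here is a minimal sketch of a one-predictor least-squares fit that reports β₀, β₁, R², and adjusted R². It uses only the standard library; the function name `fit_simple_ols` and the data are hypothetical, and real analyses would normally rely on a statistics package.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r2, adj_r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope: change in y per 1-unit change in x
    b0 = my - b1 * mx                   # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot            # proportion of variance explained
    k = 1                               # number of predictors in this sketch
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return b0, b1, r2, adj_r2

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r2, adj_r2 = fit_simple_ols(x, y)
print(round(b0, 3), round(b1, 3), round(r2, 4), round(adj_r2, 4))
```

Adjusted R² is never larger than R², and the gap widens as predictors are added relative to the sample size.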
-**Key assumptions** (check before trusting results):
-1. **Linearity**: Y is a linear function of X's
-2. **Independence**: Observations are independent
-3. **Homoscedasticity**: Constant variance of residuals
-4. **Normality**: Residuals are approximately normal (for inference)
-5. **No multicollinearity**: X's are not highly correlated with each other
-**Diagnostic checks**:
-```
-Linearity:        Plot residuals vs. fitted values (no pattern)
-Homoscedasticity: Breusch-Pagan test or residual plot (no funnel shape)
-Normality:        Q-Q plot of residuals, Shapiro-Wilk test
-Multicollinearity: VIF (Variance Inflation Factor) — VIF > 10 is concerning
-Influential obs:  Cook's distance — D > 4/n warrants investigation
-```
-### Logistic Regression
-For binary outcomes (0/1):
-```
-log(p / (1-p)) = β₀ + β₁X₁ + β₂X₂ + ...
-Where p = P(Y = 1 | X)
-Interpretation:
-  exp(β₁) = odds ratio
-  exp(β₁) = 1.5 means "a 1-unit increase in X₁ multiplies the odds by 1.5"
-  Report: odds ratios with 95% CI
-```
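The odds-ratio interpretation can be sketched in a few lines. The example below converts a logit coefficient and its standard error into an odds ratio with a Wald-type 95% interval, then shows what multiplying the odds by 1.5 does to a probability. The coefficient, standard error, and function names are hypothetical values chosen for illustration.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logit coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def shift_probability(p, odds_ratio):
    """Probability obtained after multiplying the odds p / (1 - p) by `odds_ratio`."""
    odds = p / (1 - p) * odds_ratio
    return odds / (1 + odds)

# Hypothetical coefficient chosen so that exp(beta) = 1.5.
or_, lo, hi = odds_ratio_with_ci(beta=math.log(1.5), se=0.1)
print(round(or_, 3), round(lo, 3), round(hi, 3))

# Odds of 1.0 (p = 0.5) become odds of 1.5, i.e. p = 0.6.
print(round(shift_probability(0.5, 1.5), 3))
```

Note that an odds ratio of 1.5 does not multiply the probability by 1.5; the change in probability depends on the starting value of p.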
-## Confidence Intervals
-```
-Point estimate ± (critical value × standard error)
-For a mean: X̄ ± z*(σ/√n)  or  X̄ ± t*(s/√n)
-Interpretation (frequentist):
-  "If we repeated this study many times, 95% of the resulting intervals
-   would contain the true population parameter."
-NOT: "There is a 95% probability that the true value is in this interval."
-```
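The frequentist reading above can be checked by simulation. The sketch below builds the large-sample interval X̄ ± z*(s/√n) and counts how often it covers a known true mean across many simulated studies. The population parameters, seed, and function names are assumptions for illustration; for small samples the t critical value would be the appropriate choice, which the standard library does not provide.

```python
import random
from statistics import NormalDist, fmean, stdev

def mean_ci_z(data, confidence=0.95):
    """Large-sample interval: xbar +/- z* * (s / sqrt(n))."""
    n = len(data)
    xbar = fmean(data)
    se = stdev(data) / n ** 0.5
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return xbar - z_star * se, xbar + z_star * se

def coverage(true_mean=10.0, sd=2.0, n=100, reps=2000, seed=1):
    """Fraction of simulated studies whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        lo, hi = mean_ci_z(sample)
        hits += lo <= true_mean <= hi
    return hits / reps

print(round(coverage(), 3))  # expected to land close to 0.95
```

Each individual interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure.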
-## Effect Sizes
-p-values indicate how compatible the data are with no effect; effect sizes tell you HOW BIG the effect is.
-| Measure | Context | Small | Medium | Large |
-|---------|---------|-------|--------|-------|
-| Cohen's d | Mean difference | 0.2 | 0.5 | 0.8 |
-| Pearson r | Correlation | 0.1 | 0.3 | 0.5 |
-| η² (eta-squared) | ANOVA | 0.01 | 0.06 | 0.14 |
-| Odds ratio | Logistic regression | 1.5 | 2.5 | 4.3 |
-| R² | Regression | 0.02 | 0.13 | 0.26 |
-**Always report effect sizes alongside p-values** — a "significant" result with d = 0.05 is trivial in practice.
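As a worked illustration of the first row of the table, Cohen's d for two independent groups can be computed from the pooled standard deviation. This is a minimal standard-library sketch; the function name and the data are hypothetical.

```python
from statistics import fmean, variance

def cohens_d(a, b):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / pooled_var ** 0.5

# Hypothetical measurements, for illustration only.
treatment = [5.1, 6.0, 5.8, 6.4, 5.5]
control = [4.9, 5.2, 5.0, 5.6, 4.8]
d = cohens_d(treatment, control)
print(round(d, 2))
```

The sign of d follows the order of the arguments, so report which group was subtracted from which.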
-## Multiple Testing
-When testing multiple hypotheses simultaneously, the chance of at least one false positive increases:
-```
-With α = 0.05 and 20 independent tests:
-P(at least one false positive) = 1 - (1 - 0.05)^20 = 0.64
-Corrections:
-  Bonferroni:         α_adj = α / m  (conservative)
-  Benjamini-Hochberg: Controls false discovery rate (FDR) (less conservative)
-  Holm-Bonferroni:    Step-down procedure (more powerful than Bonferroni)
-```
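The family-wise error calculation and the three corrections listed above can be sketched with plain Python. The p-values are hypothetical, and the function names are illustrative; in practice a library routine such as `statsmodels.stats.multitest.multipletests` would typically be used instead.

```python
def fwer(alpha, m):
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step-down: compare sorted p-values to alpha / (m - rank), stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up FDR control: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * alpha / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

print(round(fwer(0.05, 20), 2))

# Hypothetical p-values from five tests.
p = [0.001, 0.012, 0.014, 0.03, 0.2]
print(bonferroni(p))
print(holm(p))
print(benjamini_hochberg(p))
```

On these p-values Bonferroni rejects one hypothesis, Holm three, and Benjamini-Hochberg four, which matches the ordering from most to least conservative.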
-## Sample Size and Power
-Before collecting data, determine the required sample size:
-```
-Inputs needed:
-  1. Desired power (typically 0.80)
-  2. Significance level (α = 0.05)
-  3. Expected effect size (from pilot study or literature)
-  4. Type of test (t-test, ANOVA, regression, etc.)
-Rule of thumb for two-sample t-test:
-  n per group ≈ 16 / d²  (for 80% power, α = 0.05)
-  d = 0.5 → n ≈ 64 per group
-  d = 0.2 → n ≈ 400 per group
-```
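The constant 16 in the rule of thumb comes from the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d², since 2(1.96 + 0.84)² ≈ 15.7. The sketch below compares the two; the function names are illustrative, and an exact t-based power calculation would give slightly larger values.

```python
import math
from statistics import NormalDist

def n_per_group_rule_of_thumb(d):
    """n per group from the 16 / d^2 shortcut (80% power, two-sided alpha = 0.05)."""
    return math.ceil(16 / d ** 2)

def n_per_group_normal_approx(d, alpha=0.05, power=0.80):
    """n per group from 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.5, 0.2):
    print(d, n_per_group_rule_of_thumb(d), n_per_group_normal_approx(d))
```

The shortcut rounds the constant up, so it is marginally conservative relative to the normal approximation.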
-## Common Pitfalls
-1. **p-hacking**: Trying many analyses until p < 0.05. Fix: pre-register analyses.
-2. **Absence of evidence ≠ evidence of absence**: p > 0.05 does not prove H₀. Consider equivalence tests.
-3. **Correlation ≠ causation**: Regression coefficients are causal only with proper identification strategy.
-4. **Simpson's paradox**: A trend in subgroups can reverse when combined. Always check stratified analyses.
-5. **Overfitting**: Too many predictors relative to sample size. Rule of thumb: n ≥ 10-20 per predictor.
-## References
-- Agresti, A. (2018). *Statistical Methods for the Social Sciences* (5th ed.). Pearson.
-- Wasserstein, R. L., & Lazar, N. A. (2016). "The ASA Statement on p-Values." *The American Statistician*, 70(2), 129-133.
-- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Routledge.

package/skills/analysis/statistics/infiagent-benchmark-guide/SKILL.md DELETED

@@ -1,106 +0,0 @@
----
-name: infiagent-benchmark-guide
-description: "Agent benchmark for data analysis evaluation (ICML 2024)"
-metadata:
-  openclaw:
-    emoji: "🏆"
-    category: "analysis"
-    subcategory: "statistics"
-    keywords: ["InfiAgent", "benchmark", "data analysis", "agent evaluation", "ICML", "DABench"]
-    source: "https://github.com/InfiAgent/InfiAgent"
----
-# InfiAgent Data Analysis Benchmark Guide
-## Overview
-InfiAgent (ICML 2024) is a benchmark for evaluating AI agents on data analysis tasks. It provides DABench, a standardized set of data analysis problems ranging from basic EDA to complex statistical modeling, each with ground-truth solutions and automated evaluation metrics. The benchmark measures agent capabilities in code generation, statistical reasoning, and visualization.
-## Benchmark Structure
-```
-DABench (Data Analysis Benchmark)
-├── Task Categories
-│   ├── Data Understanding (profiling, cleaning)
-│   ├── Exploratory Analysis (distributions, correlations)
-│   ├── Statistical Testing (hypothesis tests)
-│   ├── Visualization (appropriate chart selection)
-│   ├── Modeling (regression, classification)
-│   └── Interpretation (insights, conclusions)
-├── Difficulty Levels
-│   ├── Easy (single-step operations)
-│   ├── Medium (multi-step analysis)
-│   └── Hard (complex reasoning + code)
-└── Evaluation Metrics
-    ├── Code executability
-    ├── Answer correctness
-    ├── Visualization quality
-    └── Statistical validity
-```
-## Usage
-```python
-from infiagent import DABench
-bench = DABench()
-# List tasks
-for task in bench.tasks[:5]:
-    print(f"[{task.difficulty}] {task.id}: {task.description}")
-    print(f"  Dataset: {task.dataset}")
-    print(f"  Category: {task.category}")
-# Evaluate an agent
-from infiagent import evaluate
-results = evaluate(
-    agent_fn=my_data_agent,
-    tasks="all",
-    timeout=120,
-)
-print(f"Executability: {results.exec_rate:.1%}")
-print(f"Correctness: {results.correct_rate:.1%}")
-print(f"Statistical validity: {results.stats_valid:.1%}")
-```
-## Task Examples
-```python
-# Easy: "What is the mean and standard deviation of column X?"
-# Medium: "Is there a significant correlation between A and B?
-#          Control for confounders C and D."
-# Hard: "Build a predictive model for Y using all available
-#        features. Report cross-validated performance and
-#        identify the 3 most important features."
-```
-## Leaderboard Results
-```python
-# Selected results from DABench
-scores = {
-    "GPT-4 + Code": {"exec": 95, "correct": 67},
-    "Claude 3.5 Sonnet": {"exec": 93, "correct": 64},
-    "GPT-3.5 + Code": {"exec": 88, "correct": 45},
-    "CodeLlama-34B": {"exec": 72, "correct": 31},
-}
-print(f"{'Agent':<22} {'Exec%':>6} {'Correct%':>9}")
-for agent, s in scores.items():
-    print(f"{agent:<22} {s['exec']:>5}% {s['correct']:>8}%")
-```
-## Use Cases
-1. **Agent evaluation**: Standard benchmark for data analysis agents
-2. **Model comparison**: Compare LLMs on analytical tasks
-3. **Capability testing**: Assess statistical reasoning abilities
-4. **Research**: Study agent strengths and failure modes
-5. **Development**: Target specific weak areas for improvement
-## References
-- [InfiAgent GitHub](https://github.com/InfiAgent/InfiAgent)
-- [DABench Paper (ICML 2024)](https://arxiv.org/abs/2401.05507)

package/skills/analysis/statistics/pywayne-statistics-guide/SKILL.md DELETED

@@ -1,192 +0,0 @@
----
-name: pywayne-statistics-guide
-description: "37+ statistical testing methods for rigorous hypothesis testing"
-metadata:
-  openclaw:
-    emoji: "📐"
-    category: "analysis"
-    subcategory: "statistics"
-    keywords: ["hypothesis testing", "statistical tests", "p-value", "parametric tests", "nonparametric tests", "effect size", "multiple comparisons"]
-    source: "https://github.com/AcademicSkills/pywayne-statistics-guide"
----
-# PyWayne Statistics Guide
-A comprehensive reference for 37+ statistical testing methods covering parametric, nonparametric, and resampling-based hypothesis tests. Provides decision trees for test selection, implementation in Python (scipy, statsmodels, pingouin), effect size calculation, and proper reporting standards for academic publications.
-## Overview
-Hypothesis testing remains the backbone of quantitative research across the sciences, social sciences, and engineering. However, selecting the appropriate test for a given research question, data structure, and assumption profile is a persistent challenge, especially for researchers outside statistics. This skill provides a structured decision framework that maps research questions to the correct statistical test, verifies assumptions, computes test statistics and effect sizes, and formats results for publication.
-All 37+ tests are organized by the type of comparison (one-sample, two-sample, k-sample, association, agreement) and whether parametric assumptions are met. Each test entry includes when to use it, assumptions to verify, the Python implementation, and the correct APA-style reporting format.
-## Test Selection Decision Tree
-### Step 1: Identify the Research Question Type
-| Question Type | Examples |
-|--------------|----------|
-| **One-sample** | Is this sample mean different from a known value? |
-| **Two-sample (independent)** | Do treatment and control groups differ? |
-| **Two-sample (paired)** | Do pre-test and post-test scores differ? |
-| **K-sample (independent)** | Do 3+ groups differ on an outcome? |
-| **K-sample (repeated)** | Do measurements differ across 3+ time points? |
-| **Association** | Is variable X related to variable Y? |
-| **Agreement** | Do two raters/methods agree? |
-### Step 2: Check Data Type and Assumptions
-```
-Is the outcome variable continuous?
-├── Yes → Are the data normally distributed?
-│   ├── Yes → Are variances equal (for group comparisons)?
-│   │   ├── Yes → Use PARAMETRIC test
-│   │   └── No → Use Welch's correction or nonparametric
-│   └── No → Use NONPARAMETRIC test
-└── No → Is it ordinal or nominal?
-    ├── Ordinal → Use rank-based NONPARAMETRIC test
-    └── Nominal → Use CHI-SQUARE or exact test
-```
-## Parametric Tests
-### Two-Sample Tests
-```python
-from scipy import stats
-import pingouin as pg
-import numpy as np
-def two_sample_comparison(group_a, group_b, paired=False):
-    """
-    Perform the appropriate two-sample test with assumption checks.
-    """
-    results = {}
-    # Assumption: Normality
-    _, p_norm_a = stats.shapiro(group_a)
-    _, p_norm_b = stats.shapiro(group_b)
-    normal = p_norm_a > 0.05 and p_norm_b > 0.05
-    if paired:
-        if normal:
-            # Paired t-test
-            t, p = stats.ttest_rel(group_a, group_b)
-            d = pg.compute_effsize(group_a, group_b, paired=True, eftype='cohen')
-            results = {'test': 'paired t-test', 't': t, 'p': p, 'cohens_d': d}
-        else:
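-            # Wilcoxon signed-rank (the two-sided statistic is min(W+, W-))
-            w, p = stats.wilcoxon(group_a, group_b)
-            # Zero differences are dropped by default, so count only nonzero pairs
-            n_nonzero = np.count_nonzero(np.asarray(group_a) - np.asarray(group_b))
-            # Magnitude of the matched-pairs rank-biserial correlation
-            r = 1 - 2 * w / (n_nonzero * (n_nonzero + 1) / 2)
-            results = {'test': 'Wilcoxon signed-rank', 'W': w, 'p': p, 'rank_biserial': r}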
-    else:
-        if normal:
-            # Check equal variances
-            _, p_levene = stats.levene(group_a, group_b)
-            if p_levene > 0.05:
-                t, p = stats.ttest_ind(group_a, group_b)
-                results = {'test': 'independent t-test', 't': t, 'p': p}
-            else:
-                t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
-                results = {'test': "Welch's t-test", 't': t, 'p': p}
-            d = pg.compute_effsize(group_a, group_b, eftype='cohen')
-            results['cohens_d'] = d
-        else:
-            # Mann-Whitney U
-            u, p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
-            results = {'test': 'Mann-Whitney U', 'U': u, 'p': p}
-    return results
-```
-### K-Sample Tests (ANOVA Family)
-| Test | Use Case | Assumptions |
-|------|----------|-------------|
-| One-way ANOVA | 3+ independent groups, continuous outcome | Normality, homoscedasticity |
-| Welch's ANOVA | 3+ groups, unequal variances | Normality |
-| Repeated measures ANOVA | 3+ related measurements | Normality, sphericity |
-| Two-way ANOVA | Two factors, continuous outcome | Normality, homoscedasticity |
-| ANCOVA | Group comparison controlling for covariate | Normality, homogeneity of slopes |
-| MANOVA | Multiple dependent variables | Multivariate normality |
-```python
-from scipy import stats
-def k_sample_test(groups: list, method: str = 'auto'):
-    """Run the appropriate k-sample comparison."""
-    # Check normality for all groups
-    all_normal = all(stats.shapiro(g)[1] > 0.05 for g in groups)
-    if all_normal:
-        # Check homogeneity of variance
-        _, p_levene = stats.levene(*groups)
-        if p_levene > 0.05:
-            f, p = stats.f_oneway(*groups)
-            return {'test': 'one-way ANOVA', 'F': f, 'p': p}
-        else:
-            # Welch's ANOVA via pingouin
-            return {'test': "Welch's ANOVA", 'note': 'Use pg.welch_anova()'}
-    else:
-        h, p = stats.kruskal(*groups)
-        return {'test': 'Kruskal-Wallis H', 'H': h, 'p': p}
-```
-## Nonparametric Tests Reference
-| Parametric Test | Nonparametric Alternative | When to Use |
-|----------------|--------------------------|-------------|
-| One-sample t-test | Wilcoxon signed-rank | Non-normal single sample |
-| Independent t-test | Mann-Whitney U | Non-normal, 2 independent groups |
-| Paired t-test | Wilcoxon signed-rank | Non-normal, paired data |
-| One-way ANOVA | Kruskal-Wallis H | Non-normal, 3+ groups |
-| Repeated measures ANOVA | Friedman test | Non-normal, 3+ related measures |
-| Pearson correlation | Spearman rho / Kendall tau | Non-linear or ordinal association |
-## Multiple Comparisons Correction
-When performing multiple hypothesis tests, control either the family-wise error rate (FWER) or the false discovery rate (FDR):
-```python
-from statsmodels.stats.multitest import multipletests
-def correct_multiple_tests(p_values: list, method: str = 'fdr_bh') -> dict:
-    """
-    Apply multiple comparisons correction.
-    Methods:
-        'bonferroni': Conservative, controls FWER
-        'holm': Less conservative than Bonferroni, controls FWER
-        'fdr_bh': Benjamini-Hochberg, controls FDR (recommended default)
-        'fdr_by': Benjamini-Yekutieli, conservative FDR control
-    """
-    reject, corrected_p, _, _ = multipletests(p_values, method=method)
-    return {
-        'method': method,
-        'original_p': p_values,
-        'corrected_p': corrected_p.tolist(),
-        'reject': reject.tolist(),
-        'n_significant': int(reject.sum())
-    }
-```
-## Effect Size Reference
-| Test | Effect Size | Small | Medium | Large |
-|------|------------|-------|--------|-------|
-| t-test | Cohen's d | 0.2 | 0.5 | 0.8 |
-| ANOVA | Eta-squared | 0.01 | 0.06 | 0.14 |
-| Correlation | r | 0.1 | 0.3 | 0.5 |
-| Chi-square | Cramér's V | 0.1 | 0.3 | 0.5 |
-| Mann-Whitney | Rank-biserial r | 0.1 | 0.3 | 0.5 |
-## APA Reporting Examples
-- **t-test**: "An independent samples t-test revealed a significant difference, t(58) = 2.45, p = .017, d = 0.63."
-- **ANOVA**: "A one-way ANOVA showed a significant main effect of condition, F(2, 87) = 4.12, p = .020, η² = .09."
-- **Mann-Whitney**: "A Mann-Whitney U test indicated that scores were significantly higher in the treatment group, U = 245, p = .003, r = .42."
-- **Chi-square**: "A chi-square test of independence revealed a significant association, χ²(2, N = 150) = 8.34, p = .015, V = .24."
-## References
-- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Routledge.
-- Vallat, R. (2018). Pingouin: Statistics in Python. *JOSS*, 3(31), 1026.
-- Lakens, D. (2013). Calculating and Reporting Effect Sizes. *Frontiers in Psychology*, 4, 863.