npm - ecological-agent-skills - Versions diffs - 3.1.0 - Mend

ecological-agent-skills 3.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (217)

package/skills/biostatistics-workbench/SKILL.md ADDED

@@ -0,0 +1,140 @@
+---
+name: biostatistics-workbench
+description: "Runs frequentist statistical analyses including GLMs, GLMMs, model selection, and assumption diagnostics for ecological data. Use this skill when the user needs statistical tests, linear or mixed models, ANOVA, effect sizes, confidence intervals, AIC-based model selection, residual diagnostics, overdispersion checks, regression analysis, p-value interpretation, normality tests, or hypothesis testing on ecological datasets."
+skill_version: 1.0.0
+---
+# Skill: biostatistics-workbench
+**Domain:** Hypothesis testing · GLM/GLMM · Assumptions · Effect sizes · CIs
+**Phase:** 1 — Foundation
+**Used by:** assess-ecological-impact, analyze-community-structure, run-occupancy-analysis, assess-ecosystem-services
+---
+## Purpose
+Guides the agent through the selection, execution, and interpretation of statistical methods appropriate for ecological data. Covers classical tests, generalised linear models, mixed models, assumption diagnostics, effect size estimation, and model selection.
+---
+## When to Invoke
+- Choosing a statistical test for a specific research question and data structure
+- Fitting GLM or GLMM with ecological response variables
+- Checking distributional assumptions (normality, homoscedasticity, independence)
+- Reporting effect sizes and confidence intervals
+- Performing model selection (AIC, LRT, cross-validation)
+---
+## Inputs
+| Input | Format | Required |
+|-------|--------|----------|
+| Response variable(s) | Numeric or count vector | Yes |
+| Predictor variable(s) | Numeric, categorical, or both | Yes |
+| Random effects structure (if any) | Description or formula | Conditional |
+| Study design description | Text | Recommended |
+---
+## Outputs
+| Output | Description |
+|--------|-------------|
+| `model_summary.txt` | Full model output (coefficients, SE, z/t, p) |
+| `model_selection_table.csv` | AIC/BIC/ΔAIC comparison across candidate models |
+| `assumption_diagnostics/` | Residual plots, QQ plots, variance inflation factors |
+| `effect_sizes.csv` | Effect size estimates with 95% CIs |
+| `stats_report.md` | Plain-language interpretation of results |
+---
+## Steps
+### 1. Define the Research Question and Data Structure
+- Clarify the response variable type: continuous, count, binary, proportion, ordinal
+- Clarify the predictor types: fixed categorical, fixed continuous, random grouping
+- Identify the sampling design: independent, nested, repeated measures, spatial
+### 2. Select the Appropriate Method
+| Response | Distribution | Recommended model |
+|----------|-------------|------------------|
+| Continuous, normal | Gaussian | LM / LMM |
+| Continuous, non-normal | Log-normal, Gamma | GLM Gamma / LM on log-transformed response |
+| Count, no excess zeros | Poisson | GLM Poisson |
+| Count, overdispersed | Negative binomial | GLM NB |
+| Count, zero-inflated | ZIP / ZINB | Zero-inflated model |
+| Binary (0/1) | Binomial | GLM logistic |
+| Proportion (0–1) | Beta | Beta regression |
+| Ordinal | Ordered | Proportional odds model |
+### 3. Check Assumptions Before Fitting
+- Collinearity: compute VIF for all predictors (flag VIF > 5; critical > 10)
+- Sample size adequacy: events-per-variable rule (EPV ≥ 10 for logistic)
+- Independence: confirm no pseudoreplication; identify random effects structure
+### 4. Fit Model(s)
+- Fit the global model first
+- Fit a set of candidate models based on a priori hypotheses
+- Avoid purely data-driven stepwise selection; document candidate model rationale
+### 5. Check Assumptions After Fitting
+- Residual plots (Pearson, deviance, randomised quantile residuals for GLMs)
+- QQ plot of residuals
+- Residuals vs fitted values
+- Scale-location plot for heteroscedasticity
+- Cook's distance for influential observations
+### 6. Model Selection
+- Compute AIC/AICc/BIC for all candidate models
+- Report ΔAIC and Akaike weights
+- Use LRT for nested model comparison
+- Avoid selecting models based on p-value alone
+### 7. Report Effect Sizes and CIs
+- Report standardised coefficients (βstd) for comparability
+- Report 95% CI for all estimates (profile likelihood preferred over Wald for GLMs)
+- Report R²m and R²c for LMMs (marginal and conditional)
+### 8. Generate Outputs
+- Write model summary to `model_summary.txt`
+- Write model selection table to `model_selection_table.csv`
+- Save all diagnostic plots to `assumption_diagnostics/`
+- Write `stats_report.md` with plain-language interpretation
+---
+## Key Decisions to Document
+- Response variable distribution and link function
+- Random effects structure and rationale
+- Candidate model set and justification
+- Model selection criterion used
+- Effect size metric chosen
+---
+## Tools and Libraries
+**R:** `lme4`, `glmmTMB`, `MuMIn`, `DHARMa`, `emmeans`, `performance`, `effectsize`
+**Python:** `statsmodels`, `pymer4`, `pingouin`, `scipy.stats`
+---
+## Resources
+- `resources/test-selection-guide.md` — flowchart for test selection
+- `resources/glm-family-link-reference.md` — GLM family and link function guide
+- `resources/effect-size-reference.md` — which effect size to report per test
+- `examples/` — worked GLM and GLMM examples
+---
+## Notes
+- Never report p-values without effect sizes
+- Multiple comparisons: apply Bonferroni or FDR correction when testing many hypotheses
+- Overdispersion in Poisson models must always be checked (dispersion parameter)
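
Reviewer note (not part of the package diff): Step 6 of SKILL.md asks for ΔAIC and Akaike weights but does not show the arithmetic. The sketch below is one minimal way to compute them in plain Python. The function name `akaike_weights` and the AIC values are hypothetical and only illustrate the calculation; they are not taken from the package.

```python
import math


def akaike_weights(aic_by_model):
    """Return {model: (delta_aic, weight)} for a dict of AIC values.

    delta_i = AIC_i - min(AIC); weight_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2)
    """
    best = min(aic_by_model.values())
    deltas = {name: aic - best for name, aic in aic_by_model.items()}
    rel_lik = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(rel_lik.values())
    return {name: (deltas[name], rel_lik[name] / total) for name in aic_by_model}


# Hypothetical AIC values, for illustration only
table = akaike_weights(
    {"m1_poisson": 412.3, "m2_nbinom": 398.7, "m3_nbinom_full": 399.9}
)
for name, (delta, weight) in sorted(table.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: dAIC={delta:.2f} weight={weight:.3f}")
```

The weights sum to 1 across the candidate set, so they are only meaningful relative to the models actually compared.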

package/skills/biostatistics-workbench/examples/example-prompts.md ADDED

@@ -0,0 +1,39 @@
+# Example Invocation Prompts — biostatistics-workbench
+## GLM for Species Richness
+```
+Load skill: biostatistics-workbench
+Task: I have species richness counts (count variable, integers) at 80 plots
+across three vegetation types (factor: "forest", "savanna", "wetland").
+Additional predictors: elevation (m), precipitation (mm/year), distance_to_edge (m).
+Fit an appropriate GLM, check assumptions, select the best model by AICc,
+and report effect sizes with 95% CIs.
+Data file: data/processed/richness_data.csv
+```
+## BACI Mixed Model
+```
+Load skill: biostatistics-workbench
+Task: BACI analysis of bird abundance before and after road construction.
+Response: bird_abundance (count)
+Fixed effects: period (before/after), treatment (control/impact), period:treatment interaction
+Random effects: site (repeated measures)
+Data: data/baci_birds.csv
+Report: BACI interaction coefficient, 95% CI, p-value, Cohen's d.
+```
+## Assumption Checking
+```
+Load skill: biostatistics-workbench
+Task: I fitted a Poisson GLM (model object saved in models/glm_poisson.rds).
+Run a full assumption check using DHARMa:
+  - Uniformity test
+  - Dispersion test
+  - Zero-inflation test
+  - Outlier test
+Save all diagnostic plots to outputs/diagnostics/.
+If overdispersion detected, refit with negative binomial and compare AIC.
+```

package/skills/biostatistics-workbench/resources/effect-size-reference.md ADDED

@@ -0,0 +1,81 @@
+# Effect Size Reference for Ecological Statistics
+Effect sizes quantify the **magnitude** of an effect independently of sample size. Always report them alongside p-values.
+## Continuous Response (t-tests, ANOVA, LMM)
+| Measure | Formula | Interpretation | R function |
+|---------|---------|---------------|-----------|
+| Cohen's d | (μ₁ − μ₂) / SD_pooled | 0.2 = small, 0.5 = medium, 0.8 = large | `effectsize::cohens_d()` |
+| Hedges' g | Cohen's d × correction factor | Preferred for small n (corrects the upward bias in d) | `effectsize::hedges_g()` |
+| η² (eta-squared) | SS_effect / SS_total | % variance explained by factor | `effectsize::eta_squared()` |
+| ω² (omega-squared) | Bias-corrected η² | Preferred over η² for small n | `effectsize::omega_squared()` |
+| partial η² | SS_effect / (SS_effect + SS_residual) | For multiple predictors | `effectsize::eta_squared(partial=TRUE)` |
+| R² / R²_adj | Model variance explained | 0.01 = small, 0.09 = medium, 0.25 = large | `performance::r2()` |
+| R²_m, R²_c | Marginal / conditional R² for LMMs | R²_m = fixed only; R²_c = fixed + random | `performance::r2()` |
+## Binary Response (logistic regression, GLM binomial)
+| Measure | Interpretation | How to compute |
+|---------|---------------|----------------|
+| Odds ratio (OR) | exp(β); OR = 1 means no effect | `exp(coef(model))` |
+| OR 95% CI | Interval that excludes 1 indicates an effect at the 5% level | `exp(confint(model))` |
+| Risk ratio (RR) | More interpretable than OR when prevalence is high | Compute from marginal predictions |
+| Cohen's h | h = 2arcsin(√p₁) − 2arcsin(√p₂) | `effectsize::cohens_h()` |
+| Cramér's V | For chi-square tests; 0–1 | `effectsize::cramers_v()` |
+## Count Response (Poisson, negative binomial)
+| Measure | Interpretation |
+|---------|---------------|
+| Incidence rate ratio (IRR) | exp(β); multiplicative effect on count |
+| % change | (exp(β) − 1) × 100% |
+| McFadden's pseudo-R² | 1 − LL_model/LL_null; > 0.2 = good fit |
+## Non-Parametric Tests
+| Test | Effect size | Formula | Range |
+|------|------------|---------|-------|
+| Mann-Whitney U | Rank-biserial r | r = 1 − 2U/(n₁×n₂) | -1 to 1 |
+| Wilcoxon signed-rank | r | r = Z/√N | -1 to 1 |
+| Kruskal-Wallis | η²_H | η²_H = (H − k + 1) / (n − k) | 0 to 1 |
+| Spearman | ρ (rho) | — | -1 to 1 |
+## Multivariate (PERMANOVA)
+| Measure | Interpretation |
+|---------|---------------|
+| R² from adonis2 | Proportion of dissimilarity explained by the factor |
+| Partial R² | For models with multiple terms |
+**Note:** PERMANOVA R² tends to be modest even for ecologically strong effects; R² = 0.15–0.30 is typical and meaningful in community ecology.
+## Benchmarks Summary
+| Size | d | r | R² | OR |
+|------|---|---|----|----|
+| Negligible | < 0.2 | < 0.10 | < 0.01 | < 1.5 |
+| Small | 0.2–0.5 | 0.10–0.30 | 0.01–0.09 | 1.5–3.0 |
+| Medium | 0.5–0.8 | 0.30–0.50 | 0.09–0.25 | 3.0–6.0 |
+| Large | > 0.8 | > 0.50 | > 0.25 | > 6.0 |
+## R Package: effectsize
+```r
+library(effectsize)
+library(lme4)
+# From a t-test
+t_result <- t.test(group1, group2)
+cohens_d(group1, group2)
+# From a linear model
+m <- lm(richness ~ land_use + elevation, data = dat)
+eta_squared(m)          # η² for each term
+omega_squared(m)        # bias-corrected
+# From a GLM
+m_glm <- glm(presence ~ forest_cover, family = binomial, data = dat)
+exp(coef(m_glm))        # odds ratios
+exp(confint(m_glm))     # 95% CI for ORs
+```
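
Reviewer note (not part of the package diff): the reference gives Cohen's d as (μ₁ − μ₂) / SD_pooled and Hedges' g as d times a correction factor. A minimal Python sketch of both follows, assuming two independent samples and the common approximation J = 1 − 3/(4·df − 1) for the correction. The `effectsize` R package may use a different (exact) correction, so small numerical differences are possible. Function names and data are illustrative.

```python
import math
from statistics import mean, variance


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def hedges_g(x, y):
    """Cohen's d multiplied by the approximate small-sample correction J."""
    df = len(x) + len(y) - 2
    correction = 1 - 3 / (4 * df - 1)  # approximation, assumed here for simplicity
    return cohens_d(x, y) * correction


# Illustrative data only
group1 = [2, 4, 6]
group2 = [1, 3, 5]
print(f"d = {cohens_d(group1, group2):.3f}, g = {hedges_g(group1, group2):.3f}")
```

With samples this small the correction shrinks d noticeably, which is the situation where g is usually reported instead of d.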

package/skills/biostatistics-workbench/resources/glm-family-link-reference.md ADDED

@@ -0,0 +1,47 @@
+# GLM Family and Link Function Reference
+## Common Families and Links
+| Family | Default link | Canonical use | R syntax |
+|--------|-------------|---------------|---------|
+| `gaussian` | identity | Continuous, normal | `glm(y ~ x, family = gaussian)` |
+| `Gamma` | inverse | Positive continuous, right-skewed | `glm(y ~ x, family = Gamma(link="log"))` |
+| `inverse.gaussian` | 1/mu² | Positive continuous, extreme skew | `glm(y ~ x, family = inverse.gaussian)` |
+| `binomial` | logit | Binary (0/1), proportion (cbind) | `glm(y ~ x, family = binomial)` |
+| `quasibinomial` | logit | Binary, overdispersed | `glm(y ~ x, family = quasibinomial)` |
+| `poisson` | log | Count data | `glm(y ~ x, family = poisson)` |
+| `quasipoisson` | log | Count, overdispersed | `glm(y ~ x, family = quasipoisson)` |
+| `nbinom2` (glmmTMB) | log | Count, overdispersed (NB) | `glmmTMB(y ~ x, family = nbinom2)` |
+| `tweedie` | log | Zero-inflated continuous (precipitation) | `glmmTMB(y ~ x, family = tweedie)` |
+| `beta_family` | logit | Proportion (0,1) exclusive | `glmmTMB(y ~ x, family = beta_family)` |
+| `ordbeta` | logit | Proportion including 0 and 1 | `glmmTMB(y ~ x, family = ordbeta)` |
+| `truncated_poisson` | log | Count with no zeros | `glmmTMB(y ~ x, family = truncated_poisson)` |
+## Checking Overdispersion (Poisson)
+```r
+# After fitting a Poisson GLM:
+dispersion_ratio <- sum(residuals(model, type = "pearson")^2) / df.residual(model)
+# If dispersion_ratio is well above 1 (rule of thumb: > 1.5), switch to quasipoisson or negative binomial
+```
+## Checking Distributional Assumptions with DHARMa
+```r
+library(DHARMa)
+sim_res <- simulateResiduals(fittedModel = model, plot = TRUE)
+# Provides: QQ plot, residuals vs fitted, uniformity test
+testDispersion(sim_res)
+testZeroInflation(sim_res)
+testOutliers(sim_res)
+```
+## Link Function Interpretation
+| Link | Function | Interpretation of coefficient |
+|------|----------|-------------------------------|
+| identity | η = μ | 1-unit change in x → β change in y (additive) |
+| log | η = log(μ) | 1-unit change in x → exp(β) multiplicative change in y |
+| logit | η = log(μ/(1−μ)) | 1-unit change in x → exp(β) odds ratio |
+| inverse | η = 1/μ | Less common; log link usually preferred for Gamma |
+| sqrt | η = √μ | Intermediate between identity and log |
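
Reviewer note (not part of the package diff): the overdispersion check in this file divides the sum of squared Pearson residuals by the residual degrees of freedom. The Python sketch below spells out that calculation for a Poisson model, where the Pearson residual is (y − μ)/√μ. The function name, the data, and the intercept-only fitted values are assumptions made for illustration.

```python
def poisson_dispersion_ratio(observed, fitted, n_params):
    """Sum of squared Pearson residuals divided by residual degrees of freedom.

    For a Poisson model Var(y) = mu, so each squared Pearson residual is
    (y - mu)^2 / mu. Values near 1 are consistent with the Poisson assumption.
    """
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    df_resid = len(observed) - n_params
    if df_resid <= 0:
        raise ValueError("need more observations than estimated parameters")
    pearson_chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return pearson_chi2 / df_resid


# Illustrative counts; an intercept-only model fits the sample mean everywhere
counts = [0, 1, 2, 9, 0, 12]
mu_hat = sum(counts) / len(counts)
ratio = poisson_dispersion_ratio(counts, [mu_hat] * len(counts), n_params=1)
print(f"dispersion ratio = {ratio:.2f}")
```

A ratio this far above 1 would point toward the quasipoisson or negative binomial options listed in the family table.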

package/skills/biostatistics-workbench/resources/test-selection-guide.md ADDED

@@ -0,0 +1,93 @@
+# Statistical Test Selection Guide
+## Step 1 — What is your goal?
+```
+Compare groups → Step 2
+Assess relationship → Step 5
+Predict a response → Step 7
+Describe distribution → Step 9
+```
+## Step 2 — Comparing Groups: How many?
+```
+2 groups → Step 3
+3+ groups → Step 4
+```
+## Step 3 — Comparing 2 Groups
+| Data type | Independent? | Parametric? | Test |
+|-----------|-------------|-------------|------|
+| Continuous | Yes | Yes (normal, equal var) | t-test (Student's) |
+| Continuous | Yes | Yes (normal, unequal var) | t-test (Welch's) |
+| Continuous | Yes | No | Mann-Whitney U |
+| Continuous | No (paired) | Yes | Paired t-test |
+| Continuous | No (paired) | No | Wilcoxon signed-rank |
+| Binary proportion | Yes | — | Chi-square / Fisher's exact |
+| Count | Yes | — | Poisson test |
+## Step 4 — Comparing 3+ Groups
+| Data type | Design | Test | Post-hoc |
+|-----------|--------|------|---------|
+| Continuous, normal, equal var | 1 factor | One-way ANOVA | Tukey HSD |
+| Continuous, non-normal | 1 factor | Kruskal-Wallis | Dunn's |
+| Continuous | 2 factors | Two-way ANOVA | Tukey / emmeans |
+| Continuous, repeated measures | 1 factor | Repeated-measures ANOVA | emmeans |
+| Count data | Groups | GLM Poisson | emmeans on log scale |
+| Binary | Groups | GLM Binomial | emmeans |
+## Step 5 — Association / Correlation
+| Data type | Test | Coefficient |
+|-----------|------|-------------|
+| Continuous vs continuous, linear | Pearson | r |
+| Continuous vs continuous, monotonic but non-linear / ranks | Spearman | ρ |
+| Ordinal vs ordinal | Kendall | τ |
+| Binary vs binary | Chi-square | φ (phi) |
+| Continuous vs binary | Point-biserial | r_pb |
+## Step 6 — Multivariate Association
+| Goal | Method |
+|------|--------|
+| Community similarity | Bray-Curtis / Jaccard dissimilarity |
+| Community differences between groups | PERMANOVA |
+| Gradient ordination | NMDS, PCA, RDA |
+| Taxon association network | Co-occurrence analysis |
+## Step 7 — Predicting a Response
+| Response type | Distribution | Model |
+|--------------|-------------|-------|
+| Continuous, normal | Gaussian | LM / LMM |
+| Continuous, positive skew | Log-normal or Gamma | GLM Gamma |
+| Count, no excess zeros | Poisson | GLM Poisson |
+| Count, overdispersed | Negative binomial | GLM NB (glmmTMB) |
+| Count, zero-inflated | Zero-inflated Poisson/NB | glmmTMB |
+| Binary (presence/absence) | Binomial | GLM Logistic |
+| Proportion (0–1) | Beta | betareg |
+| Ordinal | Ordered logistic | polr / clm |
+| Multivariate community | — | RDA, CCA, mvabund |
+## Step 8 — Random Effects?
+Add random effects if:
+- Data are nested (plots within sites within regions)
+- Data are repeated measures on the same individual/site
+- Groups are a random sample of a larger population
+Use `lme4::lmer()` (Gaussian) or `lme4::glmer()` / `glmmTMB::glmmTMB()` (non-Gaussian).
+## Step 9 — Normality Tests
+| Test | Use | Package |
+|------|-----|---------|
+| Shapiro-Wilk | n < 50 (most powerful for small n) | base R `shapiro.test()` |
+| Kolmogorov-Smirnov | Large samples | base R `ks.test()` |
+| Lilliefors | Large samples, unknown μ/σ | `nortest::lillie.test()` |
+| Q-Q plot | Visual, any n | base R `qqnorm()` |
+**Note:** Normality tests are sensitive to n. For large datasets, trivial departures become significant. Always inspect Q-Q plots in addition to test p-values.
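
Reviewer note (not part of the package diff): the Step 3 table for comparing two groups can be read as a small decision procedure. The Python sketch below encodes only the rows of that table. The function name and argument names are hypothetical, and combinations the table does not list (for example paired binary data) raise an error rather than guess.

```python
def two_group_test(data_type, independent=True, normal=True, equal_var=True):
    """Return the test named in the Step 3 table for a two-group comparison."""
    if data_type in ("binary", "count"):
        if not independent:
            # The guide lists these data types only for independent groups
            raise ValueError(f"paired {data_type} data is not covered by the table")
        return "Chi-square / Fisher's exact" if data_type == "binary" else "Poisson test"
    if data_type != "continuous":
        raise ValueError(f"unknown data type: {data_type}")
    if independent:
        if not normal:
            return "Mann-Whitney U"
        return "t-test (Student's)" if equal_var else "t-test (Welch's)"
    return "Paired t-test" if normal else "Wilcoxon signed-rank"


print(two_group_test("continuous", independent=True, normal=True, equal_var=False))
print(two_group_test("continuous", independent=False, normal=False))
```

Whether a sample counts as "normal" is still a judgment call, which is why the guide pairs the normality tests in Step 9 with Q-Q plots.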

package/skills/biostatistics-workbench/scripts/glm_pipeline.R ADDED

@@ -0,0 +1,78 @@
+# ecological-agent-skills / Copyright (C) 2026 Francisco Diego Barros Barata
+# SPDX-License-Identifier: GPL-3.0-or-later
+# Fit candidate GLMs, check assumptions, and run model selection
+# Usage: source this script, then adapt the example section interactively
+# Note: no command-line arguments are parsed, so `Rscript glm_pipeline.R <args>` only loads the helpers
+# Requires: glmmTMB, DHARMa, MuMIn, emmeans, dplyr
+# ── Inline logger ─────────────────────────────────────────────────────────────
+SKILL_NAME <- "biostatistics-workbench"
+.log_ts  <- function() format(Sys.time(), "[%Y-%m-%d %H:%M:%S]")
+log_info <- function(...) message(.log_ts(), " [INFO]  ", sprintf(...))
+log_warn <- function(...) message(.log_ts(), " [WARN]  ", sprintf(...))
+log_error<- function(...) message(.log_ts(), " [ERROR] ", sprintf(...))
+log_step <- function(n, d) log_info("-- STEP %d: %s", n, d)
+log_decision <- function(v, val, why) log_info("DECISION | %s = %s | %s", v, val, why)
+dir.create("logs", recursive=TRUE, showWarnings=FALSE)
+suppressPackageStartupMessages({
+  library(glmmTMB)
+  library(DHARMa)
+  library(MuMIn)
+  library(emmeans)
+  library(dplyr)
+})
+# ── Helper: fit and summarise a GLM ───────────────────────────────────────
+fit_and_check <- function(formula, data, family, label, output_dir = "outputs") {
+  log_info("Fitting model: %s", label)
+  dir.create(file.path(output_dir, "diagnostics"), recursive = TRUE, showWarnings = FALSE)
+  tryCatch({
+    m <- glmmTMB(formula, data = data, family = family,
+                 control = glmmTMBControl(optimizer = optim, optArgs = list(method = "BFGS")))
+    log_info("Model %s converged. AIC = %.2f", label, AIC(m))
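+    # Pass the summary as a "%s" argument so a literal "%" in the output is not parsed by sprintf()
+    log_info("%s", paste(capture.output(summary(m)), collapse = "\n"))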
+    log_step(1, sprintf("DHARMa residual diagnostics for %s", label))
+    sim_res <- simulateResiduals(m, plot = FALSE, n = 500)
+    png(file.path(output_dir, "diagnostics", paste0(label, "_dharma.png")),
+        width = 1200, height = 600, res = 150)
+    plot(sim_res, title = label)  # plot.DHARMa takes the overall figure heading via `title`
+    dev.off()
+    log_info("DHARMa diagnostic plot saved for %s", label)
+    list(model = m, label = label, AIC = AIC(m))
+  }, error = function(e) {
+    log_error(
+      "fit_and_check failed [%s]: %s\nLikely cause: non-convergence or insufficient data for the chosen family\nCheck: formula, distribution family, and input data\nPrevious skill: data-cleaning",
+      label, conditionMessage(e)
+    )
+    stop(e)
+  })
+}
+# ── Example usage ─────────────────────────────────────────────────────────
+# Uncomment and adapt:
+#
+# dat <- read.csv("data/processed/richness_data.csv")
+#
+# candidates <- list(
+#   fit_and_check(richness ~ vegetation_type + elevation, dat, poisson(), "m1_poisson"),
+#   fit_and_check(richness ~ vegetation_type + elevation, dat, nbinom2(),  "m2_nbinom"),
+#   fit_and_check(richness ~ vegetation_type + elevation + precipitation, dat, nbinom2(), "m3_nbinom_full")
+# )
+#
+# # Model selection table
+# aic_table <- do.call(rbind, lapply(candidates, function(x) data.frame(
+#   model = x$label, AIC = x$AIC
+# ))) |> arrange(AIC) |> mutate(deltaAIC = AIC - min(AIC))
+# print(aic_table)
+# write.csv(aic_table, "outputs/model_selection_table.csv", row.names = FALSE)
+#
+# # Best model effects
+# best <- candidates[[which.min(sapply(candidates, function(x) x$AIC))]]$model
+# em <- emmeans(best, ~ vegetation_type)
+# print(em)
+log_info("glm_pipeline.R loaded. Adapt the example usage section for your data.")
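
Reviewer note (not part of the package diff): the example section of this script ranks candidates by AIC, while SKILL.md and the example prompts also mention AICc. The Python sketch below shows the usual small-sample correction, AICc = AIC + 2k(k + 1)/(n − k − 1), where k is the number of estimated parameters and n the number of observations. The function name and the numbers are hypothetical. In R, `MuMIn::AICc()` (already loaded by the script) is the more natural route.

```python
def aicc(aic, n_params, n_obs):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    denom = n_obs - n_params - 1
    if denom <= 0:
        raise ValueError("AICc is undefined when n_obs <= n_params + 1")
    return aic + (2 * n_params * (n_params + 1)) / denom


# Hypothetical values: AIC of 398.7 from a model with 5 parameters and 80 plots
print(f"AICc = {aicc(398.7, n_params=5, n_obs=80):.2f}")
```

The correction term vanishes as n grows, so AIC and AICc rankings tend to agree for large samples and may diverge for small ones.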