npm - ecological-agent-skills - Versions diffs - 3.1.0 - Mend

ecological-agent-skills 3.1.0

Files changed (217)

package/skills/reproducible-ecology-pipeline/SKILL.md ADDED

@@ -0,0 +1,139 @@
+---
+name: reproducible-ecology-pipeline
+description: "Ensures full reproducibility of ecological analyses through provenance tracking, decision logging, parameter manifests, and environment documentation. Use this skill when the user mentions reproducibility, audit trails, data provenance, decision logs, file manifests, session info, renv, targets, DVC, MLflow, parameter versioning, checksums, or pipeline documentation and project initialization."
+skill_version: 1.0.0
+---
+# Skill: reproducible-ecology-pipeline
+**Domain:** Provenance · Parameter logging · Decision audit · Checklist · Reporting
+**Phase:** 1 — Foundation
+**Used by:** All workflows
+---
+## Purpose
+Ensures that every quantitative ecology project generates an auditable, reproducible record: all parameters, software versions, data sources, analytical decisions, and intermediate outputs are logged so the study can be independently replicated.
+---
+## When to Invoke
+- At project initialisation (to set up the logging structure)
+- At the end of every analytical step (to record decisions and outputs)
+- Before generating a technical report or submitting results
+- When the user asks about reproducibility, provenance, or audit trails
+---
+## Inputs
+| Input | Format | Required |
+|-------|--------|----------|
+| Project directory with analytical outputs | Directory path | Yes |
+| Analysis scripts | R, Python, bash | Recommended |
+| Model parameter files | JSON, YAML, RData | Recommended |
+| QA reports from upstream skills | Markdown, CSV | Recommended |
+---
+## Outputs
+| Output | Description |
+|--------|-------------|
+| `reproducibility_checklist.md` | Completed checklist with pass/fail per criterion |
+| `parameter_manifest.yaml` | All parameters used across all steps |
+| `decision_log.md` | Chronological log of all analytical decisions |
+| `software_environment.txt` | Package versions (`sessionInfo()` / `pip freeze`) |
+| `data_provenance.md` | Source, version, access date for every dataset |
+| `file_manifest.md` | All input/output files with checksums |
+---
+## Steps
+### 1. Project Initialisation
+- Create the standard directory structure (see template)
+- Initialise a Git repository (or equivalent version control)
+- Set a fixed random seed for all stochastic operations; document the seed
+- Create `params.yaml` as the single source of truth for all parameters
+### 2. Data Provenance Logging
+- For each dataset: record institution, URL/DOI, access date, version/release, license
+- Compute MD5/SHA256 checksums for all raw input files
+- Store checksums in `file_manifest.md`
+### 3. Software Environment Capture
+- Run `sessionInfo()` (R) or `pip freeze` + `conda list` (Python) and save to `software_environment.txt`
+- Record OS, R/Python version, key package versions
+- Prefer `renv` (R) or `conda environment.yaml` (Python) for environment reproducibility
+### 4. Parameter Manifest
+- Centralise all analysis parameters in `params.yaml`:
+  - Random seed
+  - CRS / spatial resolution
+  - Train/test split ratios
+  - Model hyperparameters
+  - Thresholds (QA, significance, etc.)
+- No hard-coded parameters in scripts; all values referenced from `params.yaml`
+### 5. Decision Log
+- After each analytical step, append an entry to `decision_log.md`:
+  - Date/time
+  - Step name
+  - Decision made
+  - Rationale
+  - Output files generated
+### 6. Reproducibility Checklist
+Evaluate each criterion as PASS / FAIL / N/A:
+- [ ] Raw data preserved unchanged
+- [ ] All input data checksums recorded
+- [ ] Random seed(s) fixed and documented
+- [ ] All parameters stored in `params.yaml` (no hard-coded values)
+- [ ] Software environment captured
+- [ ] All analytical decisions logged
+- [ ] Scripts run end-to-end without manual intervention
+- [ ] Intermediate outputs versioned or checksummed
+- [ ] Final outputs match those in the report
+### 7. Pre-report Audit
+- Re-run the full pipeline from raw data; confirm outputs match
+- Cross-check all numbers in the report against the parameter manifest and output files
+- Confirm all figures can be regenerated from code
+---
+## Key Decisions to Document
+- Version control system and branching strategy
+- Random seed value(s)
+- Environment management tool (renv, conda, Docker)
+- Intermediate output storage strategy (local, cloud, Zenodo)
+---
+## Tools and Libraries
+**R:** `renv`, `targets`, `drake` (superseded by `targets`), `sessionInfo()`
+**Python:** `DVC`, `MLflow`, `conda`, `pip freeze`, `hashlib`
+**General:** Git, GitHub/GitLab, Zenodo, OSF
+---
+## Resources
+- `resources/reproducibility-checklist-template.md` — blank checklist
+- `resources/params-yaml-template.yaml` — standard parameter manifest template
+- `resources/directory-structure-template.md` — recommended project layout
+- `examples/` — example decision log and provenance record
+---
+## Notes
+- Reproducibility is not optional; it is a precondition for scientific validity
+- Prefer `targets` (R) or `DVC` (Python) for pipeline orchestration in complex studies
+- Archive raw data and final outputs to a persistent repository (Zenodo, OSF) before publication
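
Step 3 of the skill (software environment capture) could look roughly like this on the Python side. This is a minimal sketch and not part of the package: the function name `capture_environment` and the `NOT INSTALLED` marker are illustrative assumptions, and it records only the packages you list rather than the full `pip freeze` output.

```python
import platform
import sys
from importlib import metadata
from pathlib import Path


def capture_environment(packages, out_path="logs/software_environment.txt"):
    """Write OS, Python version, and selected package versions to a text file.

    `packages` is an iterable of distribution names chosen by the caller
    (hypothetical helper; not an API of the package shown in this diff).
    """
    lines = [
        f"os: {platform.platform()}",
        f"python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing the whole capture.
            version = "NOT INSTALLED"
        lines.append(f"{name}: {version}")

    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Listing key packages by name keeps the file short and readable; a lock file (`renv.lock` or a conda environment file) is still what makes the environment restorable.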

package/skills/reproducible-ecology-pipeline/examples/example-prompts.md ADDED

@@ -0,0 +1,35 @@
+# Example Invocation Prompts — reproducible-ecology-pipeline
+## Project Initialisation
+```
+Load skill: reproducible-ecology-pipeline
+Task: Initialise a reproducible project structure for a jaguar SDM study.
+Project name: "jaguar-sdm-amazon"
+Create: standard directory layout, params.yaml (pre-filled with SDM defaults),
+decision_log.md, data_provenance.md, and run `git init`.
+```
+## Pre-report Audit
+```
+Load skill: reproducible-ecology-pipeline
+Task: Run a pre-submission reproducibility audit.
+Project directory: /projects/cerrado-fire-risk/
+1. Complete the reproducibility checklist template.
+2. Verify all numbers in outputs/fire_risk_report_v3.md against outputs/*.csv.
+3. Capture current R session info.
+4. Generate file_manifest.md with SHA256 checksums for all files in outputs/.
+```
+## Decision Log Entry
+```
+Load skill: reproducible-ecology-pipeline
+Task: Add an entry to the decision log for today's predictor selection step.
+Decision: Removed bio7 (VIF = 11.2, highly collinear with bio4) from the predictor set.
+Kept: bio1, bio4, bio12, bio15, NDVI, slope (6 variables total, all VIF < 5).
+Rationale: bio4 retained over bio7 because it captures seasonality rather than range,
+which is more ecologically meaningful for jaguar thermoregulation.
+Append to: decision_log.md
+```
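
The decision log entry requested in the last prompt could be appended programmatically. The sketch below assumes the five-column table layout (Date, Step, Decision, Rationale, Output files) that `init_project.sh` writes into `logs/decision_log.md`; the helper names `append_decision` and `_cell` are hypothetical and not part of the package.

```python
from datetime import datetime
from pathlib import Path

# Header matching the table that init_project.sh creates.
HEADER = (
    "# Decision Log\n"
    "| Date | Step | Decision | Rationale | Output files |\n"
    "|------|------|----------|-----------|-------------|\n"
)


def _cell(text):
    # Pipes and newlines would break a Markdown table row, so neutralise them.
    return str(text).replace("|", "\\|").replace("\n", " ").strip()


def append_decision(log_path, step, decision, rationale, outputs=(), when=None):
    """Append one row to the decision log, creating the file if needed."""
    log = Path(log_path)
    log.parent.mkdir(parents=True, exist_ok=True)
    if not log.exists() or log.stat().st_size == 0:
        log.write_text(HEADER, encoding="utf-8")

    stamp = (when or datetime.now()).strftime("%Y-%m-%d %H:%M")
    cells = (stamp, step, decision, rationale, ", ".join(outputs))
    row = "| " + " | ".join(_cell(c) for c in cells) + " |\n"

    # Append-only writes keep the log chronological.
    with log.open("a", encoding="utf-8") as fh:
        fh.write(row)
    return row
```

Opening the file in append mode means earlier entries are never rewritten, which is the property an audit trail needs.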

package/skills/reproducible-ecology-pipeline/resources/directory-structure-template.md ADDED

@@ -0,0 +1,94 @@
+# Recommended Project Directory Structure
+```
+my-ecology-project/
+│
+├── README.md                    ← project overview, setup instructions
+├── params.yaml                  ← ALL parameters; source of truth
+├── .gitignore
+│
+├── data/
+│   ├── raw/                     ← NEVER modified; read-only after deposit
+│   │   ├── occurrences_raw.csv
+│   │   ├── predictors/          ← original rasters
+│   │   └── spatial/             ← original shapefiles
+│   ├── processed/               ← cleaned, validated, analysis-ready
+│   │   ├── data_clean.csv
+│   │   ├── points_with_env.csv
+│   │   └── predictors_stack.tif
+│   └── spatial/                 ← derived spatial layers
+│       ├── study_area.gpkg
+│       └── M_area.gpkg
+│
+├── scripts/                     ← analysis scripts (numbered by order)
+│   ├── 00_setup.R               ← load packages, set paths, source params
+│   ├── 01_data_cleaning.R
+│   ├── 02_geoprocessing.R
+│   ├── 03_modeling.R
+│   ├── 04_validation.R
+│   └── 05_figures.R
+│
+├── models/                      ← fitted model objects
+│   ├── maxnet_tuned.rds
+│   ├── brt_tuned.rds
+│   └── ensemble_weights.csv
+│
+├── outputs/
+│   ├── figures/                 ← all plots
+│   ├── tables/                  ← all CSV results
+│   ├── maps/                    ← all raster outputs
+│   └── reports/                 ← rendered reports
+│
+├── logs/
+│   ├── decision_log.md
+│   ├── data_provenance.md
+│   ├── software_environment.txt
+│   ├── file_manifest.md         ← checksums for all outputs
+│   └── reproducibility_checklist.md
+│
+└── reports/
+    ├── technical_report.md      ← or .Rmd / .qmd for literate programming
+    └── supplementary/
+```
+## Naming Conventions
+- Scripts: `NN_descriptive_name.R` (numbered for execution order)
+- Data files: `snake_case`, no spaces, include version or date if multiple versions
+- Rasters: `variable_source_resolution_date.tif` (e.g., `ndvi_modis_1km_2023.tif`)
+- Models: `algorithm_species_version.rds`
+- Outputs: `metric_context_date.csv`
+## Version Control Rules
+- Commit after each major step (cleaning, modeling, validation)
+- Commit message format: `step: brief description` (e.g., `modeling: add BRT with spatial CV`)
+- Tag the commit used for the submitted manuscript: `git tag -a v1.0 -m "manuscript submission"`
+- Never commit raw data files (add to .gitignore); archive separately at Zenodo/OSF
+## .gitignore Template
+```
+# Data (archive separately)
+data/raw/*
+!data/raw/.gitkeep
+# Large model objects
+models/*.rds
+models/*.pkl
+# Large rasters
+data/**/*.tif
+outputs/maps/*.tif
+# Environment files
+.Rhistory
+.RData
+__pycache__/
+*.pyc
+.ipynb_checkpoints/
+# OS files
+.DS_Store
+Thumbs.db
+```
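
The script naming convention above (`NN_descriptive_name.R`) is easy to check automatically. The following is a rough sketch only: the regular expression, the function name `misnamed_scripts`, and the decision to also accept `.py` and `.sh` extensions are assumptions, not rules stated by the template.

```python
import re

# Two digits, underscore, lowercase snake_case name, then an extension.
# Accepting .py and .sh alongside .R is an assumption for mixed-language projects.
SCRIPT_RE = re.compile(r"^\d{2}_[a-z0-9]+(?:_[a-z0-9]+)*\.(?:R|py|sh)$")


def misnamed_scripts(names):
    """Return the file names that do not follow NN_descriptive_name.<ext>."""
    return [n for n in names if not SCRIPT_RE.match(n)]
```

Calling `misnamed_scripts(p.name for p in Path("scripts").iterdir())` (with `pathlib.Path` imported) would list offending files in a project laid out as shown above.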

package/skills/reproducible-ecology-pipeline/resources/params-yaml-template.yaml ADDED

@@ -0,0 +1,84 @@
+# params.yaml — Standard Parameter Manifest Template
+# Copy this file to your project root and fill in all values before analysis.
+# Never hard-code parameter values in scripts; always reference this file.
+project:
+  name: "my-ecology-study"
+  version: "1.0.0"
+  created: "YYYY-MM-DD"
+  author: "Your Name"
+  repository: "https://github.com/yourrepo/yourproject"
+random_seeds:
+  global: 42
+  spatial_cv: 42
+  background_sampling: 42
+  bootstrap: 42
+data:
+  occurrence_source: "GBIF download YYYY-MM-DD"
+  occurrence_doi: "https://doi.org/10.15468/dl.XXXXX"
+  predictor_source: "WorldClim v2.1"
+  predictor_resolution: "2.5 arcmin"
+  predictor_version: "2.1"
+  study_area_source: "IBGE biome boundaries 2019"
+  baseline_period: ["2000-01-01", "2020-12-31"]
+spatial:
+  project_crs: "EPSG:4326"
+  analysis_crs: "EPSG:31982"
+  raster_resolution_m: 1000
+  resampling_method: "bilinear"
+  study_area_buffer_km: 100
+data_cleaning:
+  coordinate_uncertainty_max_m: 10000
+  spatial_thinning_distance_km: 10
+  duplicate_temporal_buffer_days: 7
+  missing_value_threshold: 0.20
+  taxonomy_backbone: "GBIF Backbone v2023"
+  coordinate_cleaner_flags:
+    - capitals
+    - centroids
+    - gbif
+    - zeros
+    - validity
+modeling:
+  algorithms:
+    - maxnet
+    - brt
+    - random_forest
+  cv_method: "spatial_block"
+  cv_folds: 5
+  cv_block_size_km: 300
+  background_n: 10000
+  background_method: "random_within_M"
+  collinearity_vif_threshold: 5
+  collinearity_r_threshold: 0.70
+  primary_metric: "TSS"
+  threshold_method: "MaxTSS"
+  ensemble_method: "weighted_mean_TSS"
+hyperparameters:
+  maxnet:
+    regularization_multiplier: [0.5, 1.0, 2.0, 3.0]
+    feature_classes: ["LQP", "LQPH"]
+  brt:
+    n_trees: [500, 1000, 2000]
+    learning_rate: [0.01, 0.001]
+    tree_complexity: [3, 5]
+    bag_fraction: 0.75
+  random_forest:
+    n_trees: 500
+    mtry: "auto"
+    min_node_size: 5
+software:
+  r_version: "4.X.X"
+  key_packages:
+    terra: "X.X.X"
+    sf: "X.X.X"
+    biomod2: "X.X.X"
+    blockCV: "X.X.X"
+    dismo: "X.X.X"
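
One way a script might consume the `random_seeds` section without hard-coding values is to build a separate generator per named seed. This is a sketch under assumptions: the `params` dict stands in for the result of parsing `params.yaml` with a YAML library of your choice, and `make_rngs` is a hypothetical helper.

```python
import random

# Stand-in for the parsed params.yaml; in a real project this mapping would
# come from a YAML parser rather than being written inline.
params = {
    "random_seeds": {
        "global": 42,
        "spatial_cv": 42,
        "background_sampling": 42,
        "bootstrap": 42,
    }
}


def make_rngs(params):
    """Build one independent generator per named seed in params."""
    return {
        name: random.Random(seed)
        for name, seed in params["random_seeds"].items()
    }


rngs = make_rngs(params)
# Drawing background points uses only its own stream, so adding or removing
# bootstrap draws elsewhere does not change this sample.
sample = rngs["background_sampling"].sample(range(100_000), k=5)
```

Separate generators keep each stochastic step reproducible on its own, even when the order or number of other steps changes.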

package/skills/reproducible-ecology-pipeline/resources/reproducibility-checklist-template.md ADDED

@@ -0,0 +1,66 @@
+# Reproducibility Checklist Template
+Project: ___________________________
+Date: ___________________________
+Analyst: ___________________________
+## 1. Data Management
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| Raw data preserved unchanged in `data/raw/` | PASS / FAIL | |
+| All raw files have MD5/SHA256 checksums recorded | PASS / FAIL | |
+| Data provenance (source, DOI, access date, license) documented | PASS / FAIL | |
+| No manual edits to raw files | PASS / FAIL | |
+## 2. Code and Parameters
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| All parameters in `params.yaml` (no hard-coded values in scripts) | PASS / FAIL | |
+| Random seeds fixed and documented in `params.yaml` | PASS / FAIL | |
+| Scripts run end-to-end without manual steps | PASS / FAIL | |
+| Code version-controlled (Git) | PASS / FAIL | |
+| Final commit tagged or release created | PASS / FAIL | |
+## 3. Environment
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| `sessionInfo()` or `pip freeze` output saved | PASS / FAIL | |
+| R version or Python version recorded | PASS / FAIL | |
+| Package versions locked (`renv.lock` or `environment.yaml`) | PASS / FAIL | |
+| OS and hardware noted | PASS / FAIL | |
+## 4. Decisions and Decision Log
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| `decision_log.md` updated after each major step | PASS / FAIL | |
+| Rationale documented for: QA thresholds, predictor selection, CV strategy, threshold method | PASS / FAIL | |
+| Deviations from planned protocol documented | PASS / FAIL | |
+## 5. Outputs
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| All output files in `outputs/` with meaningful names | PASS / FAIL | |
+| File manifest with checksums for all outputs | PASS / FAIL | |
+| Figures regenerable from code | PASS / FAIL | |
+| Numbers in report cross-checked against output files | PASS / FAIL | |
+## 6. Archival
+| Criterion | Status | Notes |
+|-----------|--------|-------|
+| Raw data archived at Zenodo / OSF / institutional repository | PASS / FAIL | |
+| Code archived at GitHub / GitLab with DOI | PASS / FAIL | |
+| Data availability statement included in report | PASS / FAIL | |
+## Overall Assessment
+- Total criteria: ___
+- PASS: ___
+- FAIL: ___
+- N/A: ___
+- **Reproducibility score:** ___/___
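
The overall assessment can be tallied from the per-criterion statuses. The template does not define the score's denominator, so this sketch assumes the score is PASS over applicable criteria (PASS plus FAIL, excluding N/A); `summarise` is a hypothetical helper name.

```python
from collections import Counter

VALID = {"PASS", "FAIL", "N/A"}


def summarise(statuses):
    """Tally checklist statuses and report PASS over applicable criteria."""
    counts = Counter(s.strip().upper() for s in statuses)
    unknown = set(counts) - VALID
    if unknown:
        raise ValueError(f"Unrecognised status values: {sorted(unknown)}")

    # Assumption: N/A criteria are excluded from the denominator.
    applicable = counts["PASS"] + counts["FAIL"]
    return {
        "total": sum(counts.values()),
        "pass": counts["PASS"],
        "fail": counts["FAIL"],
        "n/a": counts["N/A"],
        "score": f"{counts['PASS']}/{applicable}",
    }
```

Rejecting unknown values means a blank or mistyped cell is reported as an error rather than silently dropped from the tally.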

package/skills/reproducible-ecology-pipeline/scripts/generate_file_manifest.py ADDED

@@ -0,0 +1,110 @@
+#!/usr/bin/env python3
+# ecological-agent-skills / Copyright (C) 2026 Francisco Diego Barros Barata
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+generate_file_manifest.py
+Generate SHA256 checksums for all files in a directory and write a manifest.
+Usage: python generate_file_manifest.py <directory> [output_file]
+"""
+import logging
+import sys
+import hashlib
+from datetime import datetime
+from pathlib import Path
+SKILL_NAME = "reproducible-ecology-pipeline"
+_LOG_DIR   = Path("logs")
+_LOG_DIR.mkdir(parents=True, exist_ok=True)
+_log_file  = _LOG_DIR / f"skill_{SKILL_NAME}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
+logging.basicConfig(
+    level=logging.INFO,
+    format="[%(asctime)s] [%(levelname)s] [" + SKILL_NAME + "] %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+    handlers=[
+        logging.StreamHandler(sys.stdout),
+        logging.FileHandler(_log_file, encoding="utf-8"),
+    ],
+)
+logger = logging.getLogger(SKILL_NAME)
+def log_step(n: int, desc: str) -> None:
+    logger.info("-- STEP %d: %s", n, desc)
+def log_decision(var: str, val, why: str) -> None:
+    logger.info("DECISION | %s = %s | %s", var, val, why)
+def sha256(path: Path) -> str:
+    h = hashlib.sha256()
+    with open(path, "rb") as f:
+        for chunk in iter(lambda: f.read(65536), b""):
+            h.update(chunk)
+    return h.hexdigest()
+def main():
+    target_dir  = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("outputs")
+    output_file = Path(sys.argv[2]) if len(sys.argv) > 2 else Path("logs/file_manifest.md")
+    output_file.parent.mkdir(parents=True, exist_ok=True)
+    log_decision("target_dir", str(target_dir),
+                 "Directory to scan for files and compute checksums")
+    log_decision("output_file", str(output_file),
+                 "Markdown manifest output path")
+    if not target_dir.exists():
+        logger.error(
+            "Input not found: %s\n"
+            "  Probable cause: a previous step did not complete.\n"
+            "  Upstream skill that should have produced this input: geoprocessing-for-ecology",
+            target_dir
+        )
+        sys.exit(1)
+    try:
+        log_step(1, "Discovering files in target directory")
+        files = sorted(f for f in target_dir.rglob("*") if f.is_file())
+        logger.info("Files found: %d in %s", len(files), target_dir)
+        if len(files) == 0:
+            logger.warning(
+                "No files found in %s. The directory may be empty or outputs not yet produced.",
+                target_dir
+            )
+        log_step(2, "Computing SHA256 checksums and building manifest")
+        lines = [
+            "# File Manifest",
+            f"Directory: `{target_dir}`",
+            f"Files: {len(files)}",
+            "",
+            "| File | Size (KB) | SHA256 |",
+            "|------|----------|--------|",
+        ]
+        for f in files:
+            try:
+                size_kb = round(f.stat().st_size / 1024, 2)
+                checksum = sha256(f)
+                rel = f.relative_to(target_dir)
+                # Record the full digest; a truncated prefix cannot be used to verify integrity.
+                lines.append(f"| `{rel}` | {size_kb} | `{checksum}` |")
+                logger.info("  Checksummed: %s (%s KB)", rel, size_kb)
+            except OSError as e:  # PermissionError is a subclass of OSError
+                logger.warning("Could not process file %s: %s", f, e)
+        log_step(3, "Writing manifest file")
+        output_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
+        logger.info("Manifest written: %s (%d files)", output_file, len(files))
+    except FileNotFoundError as e:
+        logger.error(
+            "Input file not found: %s\n"
+            "  Expected output from: geoprocessing-for-ecology\n"
+            "  Check that previous step completed.",
+            e
+        )
+        raise
+    except Exception as e:
+        logger.error("Unexpected error in generate_file_manifest: %s", e)
+        raise
+if __name__ == "__main__":
+    main()
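
A manifest is only useful if it can be checked later, for example during the pre-report audit. The sketch below re-hashes files and compares them with the rows of the Markdown table written by `generate_file_manifest.py`. It is an illustration, not part of the package: `verify_manifest`, `sha256_of`, and the row regular expression are assumptions about the table layout, and digests followed by `...` are treated as prefixes, which is weaker evidence than a full match.

```python
import hashlib
import re
from pathlib import Path

# Matches rows such as: | `tables/auc.csv` | 1.2 | `ab12...` |
ROW_RE = re.compile(
    r"^\|\s*`(?P<path>[^`]+)`\s*\|[^|]*\|\s*`(?P<digest>[0-9a-f]+)(?:\.\.\.)?`\s*\|\s*$"
)


def sha256_of(path):
    """Hash a file in chunks so large rasters do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest_path, base_dir):
    """Return {relative_path: 'OK' | 'CHANGED' | 'MISSING'} per manifest row."""
    results = {}
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue  # title, header, separator, or blank line
        target = Path(base_dir) / m["path"]
        if not target.is_file():
            results[m["path"]] = "MISSING"
        elif sha256_of(target).startswith(m["digest"]):
            results[m["path"]] = "OK"
        else:
            results[m["path"]] = "CHANGED"
    return results
```

Any `CHANGED` or `MISSING` entry after re-running the pipeline from raw data indicates that the outputs no longer match what was recorded.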

package/skills/reproducible-ecology-pipeline/scripts/init_project.sh ADDED

@@ -0,0 +1,53 @@
+#!/bin/bash
+# init_project.sh
+# Initialise a reproducible ecology project structure
+# Usage: bash init_project.sh <project_name>
+PROJECT="${1:-my-ecology-project}"
+echo "Initialising project: $PROJECT"
+mkdir -p "$PROJECT"/{data/{raw,processed,spatial},models,outputs/{figures,tables,maps,reports},reports,scripts,logs}
+touch "$PROJECT/data/raw/.gitkeep"
+touch "$PROJECT/logs/decision_log.md"
+# Create params.yaml from template
+cp "$(dirname "$0")/../resources/params-yaml-template.yaml" "$PROJECT/params.yaml" 2>/dev/null || \
+  echo "# params.yaml — fill in values" > "$PROJECT/params.yaml"
+# Git init
+cd "$PROJECT" || exit 1
+git init -q
+# .gitignore
+cat > .gitignore << 'GITIGNORE'
+data/raw/*
+!data/raw/.gitkeep
+*.Rhistory
+.RData
+__pycache__/
+*.pyc
+.DS_Store
+*.log
+GITIGNORE
+# decision_log.md header
+cat > logs/decision_log.md << 'DLOG'
+# Decision Log
+| Date | Step | Decision | Rationale | Output files |
+|------|------|----------|-----------|-------------|
+DLOG
+# data_provenance.md
+cat > logs/data_provenance.md << 'PROV'
+# Data Provenance
+| Dataset | Source | Version | Access date | DOI/URL | License | Checksum |
+|---------|--------|---------|-------------|---------|---------|---------|
+PROV
+echo "Project structure created in: $PROJECT/"
+echo "Next steps:"
+echo "  1. Fill in params.yaml"
+echo "  2. Add raw data to $PROJECT/data/raw/"
+echo "  3. Record provenance in logs/data_provenance.md"