npm - ecological-agent-skills - Versions diffs - 3.1.0 - Mend

ecological-agent-skills 3.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (217)

package/skills/predictive-modeling-best-practices/resources/sampling-bias-correction.md ADDED

@@ -0,0 +1,267 @@
+# Sampling Bias Correction Guide
+Detecting and correcting geographic sampling bias in occurrence records before SDM fitting.
+---
+## 1. Why Sampling Bias Distorts SDMs
+Occurrence data from aggregated databases (GBIF, VertNet, iNaturalist) rarely represent
+true species distributions. They reflect **where people search**, not where species occur.
+### Sources of geographic bias
+| Bias source | Mechanism | Effect on SDM |
+|---|---|---|
+| **Road/trail proximity** | Collectors follow accessible paths | Roadsides over-represented in environmental space |
+| **Urban proximity** | Amateur naturalists concentrate near cities | Urban bioclimates over-represented |
+| **Research institutions** | Field stations generate intense local records | Hyper-local clusters in training data |
+| **National boundaries** | Data sharing policies differ by country | Abrupt density gradients at borders |
+| **Language barriers** | Non-English-speaking regions under-sampled in GBIF | Geographic gaps unrelated to species ecology |
+### Consequence for model fitting
+When occurrence records are biased toward roadsides, MaxEnt or BRT will learn that
+road-adjacent environments predict presence. Background points sampled randomly will
+under-represent those environments. The model will incorrectly associate road proximity
+with habitat suitability.
+**Key reference:** Phillips et al. 2009. Sample selection bias and presence-only distribution models.
+*Ecological Applications* 19: 181–197. DOI: [10.1890/07-2153.1](https://doi.org/10.1890/07-2153.1)
+---
+## 2. Detecting Sampling Bias
+### Spatial thinning (standard — already in `ecological-data-foundation`)
+Remove records closer than a specified distance to reduce spatial clustering:
+```r
+suppressPackageStartupMessages(library(spThin))
+thinned <- thin(loc.data = occ_df,
+                lat.col = "decimalLatitude",
+                long.col = "decimalLongitude",
+                spec.col = "species",
+                thin.par = 10,        # minimum distance in km
+                reps = 10,
+                locs.thinned.list.return = TRUE,
+                write.files = FALSE)
+```
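Reviewer note: to make the thinning rule concrete, here is a minimal Python sketch that is independent of the package. It assumes a simple greedy pass with haversine distances, which is a simplification and not the randomised procedure `spThin::thin()` runs over several repetitions. The names `haversine_km` and `greedy_thin` and the sample coordinates are illustrative only.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def greedy_thin(points, min_km):
    """Keep a point only if it is at least min_km from every point already kept."""
    kept = []
    for lon, lat in points:
        if all(haversine_km(lon, lat, klon, klat) >= min_km for klon, klat in kept):
            kept.append((lon, lat))
    return kept

# Hypothetical records: the first two are about 1 km apart, the third is one degree north
records = [(-60.00, -3.00), (-60.01, -3.00), (-60.00, -2.00)]
thinned = greedy_thin(records, min_km=10)
print(thinned)
```

A greedy pass depends on record order, which is one reason randomised repetitions (the `reps` argument above) are usually preferred in practice.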
+### Kernel density bias map
+Visualise sampling density to diagnose the pattern of bias:
+```r
+suppressPackageStartupMessages(library(MASS))
+suppressPackageStartupMessages(library(terra))
+# Estimate 2D kernel density of occurrences
+kde <- kde2d(occ_df$decimalLongitude, occ_df$decimalLatitude,
+             n = 200,
+             lims = c(range(occ_df$decimalLongitude) + c(-2, 2),
+                      range(occ_df$decimalLatitude)  + c(-2, 2)))
+# Convert to SpatRaster
+bias_rast <- rast(list(x = kde$x, y = kde$y, z = kde$z))
+plot(bias_rast, main = "Sampling density kernel (darker = more sampled)")
+```
+### Environmental distribution comparison
+Compare environmental space of occurrences vs. background to detect bias:
+```r
+# Extract env values at occurrence and background points
+occ_env <- extract(env_stack, occ_pts)
+bg_env  <- extract(env_stack, bg_pts)
+# Kolmogorov-Smirnov test per variable
+for (v in names(env_stack)) {
+  ks_result <- ks.test(occ_env[[v]], bg_env[[v]])
+  cat(v, ": D =", round(ks_result$statistic, 3),
+      "p =", round(ks_result$p.value, 4), "\n")
+}
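+# A large D with p < 0.05 means occurrences and background differ in that variable.
+# This is consistent with sampling bias but can also reflect genuine habitat
+# selection, so interpret it alongside the KDE map rather than on its own.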
+```
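Reviewer note: the D statistic reported by `ks.test` is the largest vertical gap between the two empirical cumulative distribution functions. The Python sketch below computes only that statistic (no p-value) with the standard library; the function name `ks_statistic` and the temperature values are hypothetical.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum gap is always attained at an observed value
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical mean annual temperature (deg C) at occurrences vs. background
occ_temp = [24.1, 24.5, 25.0, 25.2, 25.8, 26.0]
bg_temp = [18.0, 19.5, 21.0, 22.5, 24.0, 25.5, 27.0, 28.5]
print(round(ks_statistic(occ_temp, bg_temp), 3))
```

Here the occurrences occupy only the warm end of the background range, so D is large even though nothing in the numbers says whether the cause is collector effort or the species' own preference.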
+---
+## 3. Correction Methods
+### Method 1 — Target-Group Background
+**Concept:** Use occurrence records of all species from the same taxonomic group (e.g.,
+all mammals, all birds) as the background. This approximates the sampling effort:
+if a site was visited (evidenced by other species records), it should appear in the background.
+**When to use:**
+- When sampling effort tracks detectability (organised surveys, citizen science platforms)
+- When a large pool of co-occurring taxonomic relatives exists in GBIF
+**Limitations:**
+- Assumes all species in the group have similar detectability
+- Fails if the focal species was specifically targeted (e.g., camera trap targeted at jaguars)
+```r
+suppressPackageStartupMessages(library(rgbif))
+# Download all Carnivora records in Brazil as the target-group background
+tg_bg <- occ_search(
+  orderKey = 732,           # Carnivora taxon key
+  hasCoordinate = TRUE,
+  occurrenceStatus = "PRESENT",
+  country = "BR",
+  limit = 50000
+)$data
+# Remove focal species from background
+tg_bg <- tg_bg[tg_bg$species != "Panthera onca", ]
+bg_pts <- tg_bg[, c("decimalLongitude", "decimalLatitude")]
+bg_pts <- na.omit(bg_pts)
+```
+**Report in ODMAP field O4:** "Background sampled from target-group (Carnivora) GBIF records (n = X) to account for collector bias."
+---
+### Method 2 — Kernel Density Weighting
+**Concept:** Weight background points inversely proportional to sampling density.
+Areas that are densely sampled get low weight; areas rarely visited get high weight.
+**Formula:**
+```
+weight(bg_i) = 1 / KDE(bg_i)
+```
+where KDE is the kernel density estimate of occurrence records at the background point location.
+```r
+suppressPackageStartupMessages(library(MASS))
+suppressPackageStartupMessages(library(terra))
+# Compute KDE of occurrence points
+kde <- kde2d(occ_df$decimalLongitude, occ_df$decimalLatitude,
+             n = 200,
+             lims = c(range(occ_df$decimalLongitude) + c(-5, 5),
+                      range(occ_df$decimalLatitude)  + c(-5, 5)))
+# Interpolate KDE values at background point locations
+bias_rast <- rast(list(x = kde$x, y = kde$y, z = kde$z))
+bg_density <- extract(bias_rast, bg_pts)[[1]]
+# Invert density to get weights (add small constant to avoid division by zero)
+bg_weights <- 1 / (bg_density + 1e-6)
+bg_weights  <- bg_weights / sum(bg_weights, na.rm = TRUE)
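+# bg_weights must then be supplied to the model-fitting step. As far as their
+# documentation shows, neither maxnet::maxnet() nor ENMevaluate() exposes a
+# background-weight argument (`bg.weights` is not a documented maxnet parameter),
+# so check what your modelling package accepts before relying on this.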
+```
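Reviewer note: the weighting formula above can be sketched in plain Python as follows. This is a minimal illustration that assumes an isotropic Gaussian kernel with a single hand-picked bandwidth, which is not the same estimator as `MASS::kde2d`; the function names, the `eps` constant, and the coordinates are hypothetical.

```python
import math

def gaussian_kde_2d(x, y, points, bandwidth):
    """Isotropic Gaussian kernel density at (x, y) from a list of (x, y) points."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        sq_dist = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total

def inverse_density_weights(bg_points, occ_points, bandwidth, eps=1e-6):
    """weight(bg_i) = 1 / (KDE(bg_i) + eps), rescaled so the weights sum to 1."""
    raw = [1.0 / (gaussian_kde_2d(x, y, occ_points, bandwidth) + eps)
           for x, y in bg_points]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical coordinates in degrees: occurrences cluster near (0, 0)
occ = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
bg = [(0.05, 0.05), (3.0, 3.0)]
weights = inverse_density_weights(bg, occ, bandwidth=0.5)
print(weights)
```

Note that for background points far from any occurrence the density is effectively zero, so the small constant alone determines the weight. Its value is therefore a modelling choice worth reporting, not just a numerical safeguard.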
+---
+### Method 3 — Environmental Filtering (thin in environmental space)
+**Concept:** Instead of geographic thinning (by distance), thin occurrence records in
+environmental space to reduce over-representation of common bioclimates.
+**When to use:**
+- When geographic thinning removes records in distinct habitats that happen to be close
+- When most records cluster in one bioclimatic region
+```r
+suppressPackageStartupMessages(library(terra))
+# Extract env values at occurrences
+occ_env <- extract(env_stack, occ_pts, ID = FALSE)
+# Create a raster in environmental space (first two PCA axes)
+pca_res <- prcomp(na.omit(occ_env), scale. = TRUE)
+occ_pca <- predict(pca_res, occ_env)[, 1:2]
+# Grid sampling in environmental space (keep 1 record per cell)
+env_grid_size <- 0.5  # in PC1/PC2 units — adjust based on variance explained
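+occ_pca_df <- as.data.frame(occ_pca)
+occ_pca_df$row_id <- seq_len(nrow(occ_pca_df))          # index back into occ_df
+occ_pca_df <- occ_pca_df[complete.cases(occ_pca_df), ]  # drop records with missing predictors
+occ_pca_df$cell_id <- paste(
+  floor(occ_pca_df$PC1 / env_grid_size),
+  floor(occ_pca_df$PC2 / env_grid_size)
+)
+# Keep one record per environmental cell (random)
+set.seed(42)
+occ_thinned_idx <- occ_pca_df |>
+  dplyr::group_by(cell_id) |>
+  dplyr::slice_sample(n = 1) |>
+  dplyr::pull(row_id)   # `.I` is a data.table idiom and is not a column here
+occ_thinned <- occ_df[occ_thinned_idx, ]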
+message("Environmental thinning: ", nrow(occ_df), " → ", nrow(occ_thinned), " records")
+```
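Reviewer note: the core of environmental filtering is "one record per cell of a grid laid over environmental axes". The Python sketch below shows that step on its own, assuming the two axis scores (for example PC1 and PC2) have already been computed; `thin_in_env_space` and the scores are hypothetical.

```python
import math
import random

def thin_in_env_space(env_coords, cell_size, seed=42):
    """Return indices of one randomly chosen record per grid cell in a 2-D environmental space."""
    cells = {}
    for idx, (e1, e2) in enumerate(env_coords):
        key = (math.floor(e1 / cell_size), math.floor(e2 / cell_size))
        cells.setdefault(key, []).append(idx)
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    return sorted(rng.choice(members) for members in cells.values())

# Hypothetical scores on two environmental axes (e.g. PC1, PC2)
scores = [(0.10, 0.20), (0.15, 0.30), (0.40, 0.10), (1.20, -0.70), (1.30, -0.60)]
kept = thin_in_env_space(scores, cell_size=0.5)
print(kept)
```

The five records fall into two cells, so two indices are returned. Returning indices instead of coordinates makes it easy to subset the original occurrence table afterwards.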
+---
+### Method 4 — Checkerboard / Spatial Partitioning
+See `resources/spatial-cv-guide.md` for detailed implementation. Spatial partitioning
+(checkerboard or block CV) does not correct bias but ensures that validation data are
+geographically independent from training data, making model evaluation more realistic
+under spatially biased training.
+---
+## 4. Decision Table — Which Method to Use
+| Type of bias | Data available | Recommended method |
+|---|---|---|
+| Road/city proximity, known from KDE | Target-group records in GBIF | Target-group background |
+| General sampling density gradient | Any | Kernel density weighting |
+| Bioclimatic clustering (few env types over-sampled) | Environmental predictors | Environmental filtering |
+| All of the above | Any | Combine: KDE weighting + environmental filtering |
+| Bias unknown but suspicious | Any | Spatial thinning (1 record per grid cell) as minimum |
+---
+## 5. Reporting in ODMAP (Field O4)
+ODMAP field **O4 (Biases and sampling artefacts)** must state:
+1. Which bias was detected and how (KDE, KS test, visual inspection)
+2. Which correction method was applied
+3. Key parameters (target group taxon, KDE bandwidth, grid cell size)
+**Example ODMAP O4 text:**
+> "Sampling bias was detected by comparing the environmental distribution of occurrence
+> records vs. random background (KS test, p < 0.001 for bio1 and bio12). Bias was
+> corrected using target-group background (all Mammalia records from GBIF within the
+> study area, n = 12,450), excluding the focal species."
+---
+## 6. Common Pitfalls
+- **Spatial thinning alone is not bias correction:** it reduces spatial clustering but
+  does not address the underlying pattern of collector effort.
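+- **KDE bandwidth selection is subjective:** `MASS::kde2d` defaults to the normal-reference
+  rule of thumb (`MASS::bandwidth.nrd`), which is not cross-validated. Compare it with a
+  data-driven selector such as the plug-in `ks::Hpi`, and report the bandwidth used.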
+- **Target-group background can introduce new bias:** if all Carnivora records are
+  also biased toward roads, the background will still be biased. Always visualise.
+- **Environmental filtering removes rare habitats:** if a unique habitat is represented
+  by only a few records, environmental filtering will preferentially discard it. Check
+  that important biomes are still represented after filtering.
+---
+## 7. References
+| Citation | DOI |
+|---|---|
+| Phillips et al. 2009. Ecol. Apps. 19:181–197 | [10.1890/07-2153.1](https://doi.org/10.1890/07-2153.1) |
+| Fourcade et al. 2014. PLoS ONE 9:e97122 | [10.1371/journal.pone.0097122](https://doi.org/10.1371/journal.pone.0097122) |
+| Kramer-Schadt et al. 2013. Ecography 36:1044 | [10.1111/j.1600-0587.2013.00159.x](https://doi.org/10.1111/j.1600-0587.2013.00159.x) |
+| Warton & Shepherd 2010. Ann. Appl. Stat. | [10.1214/10-AOAS331](https://doi.org/10.1214/10-AOAS331) |

package/skills/predictive-modeling-best-practices/resources/spatial-cv-guide.md ADDED

@@ -0,0 +1,73 @@
+# Spatial Cross-Validation Guide
+## Why Spatial CV?
+Standard random k-fold CV assumes independence between train and test folds. Ecological and SDM data are almost always spatially autocorrelated — nearby sites share similar environments and species composition. Random splits leak spatial information across folds, producing optimistically biased performance estimates.
+**Rule:** If the response variable is spatially structured (which is almost always true for ecological data), use spatial CV.
+## Block CV (Checkerboard / Grid)
+Divides the study area into rectangular blocks. Points within the same block are assigned to the same fold.
+```r
+library(blockCV)
+# Auto-select block size based on spatial autocorrelation range
+sac <- cv_spatial_autocor(
+  x   = occ_sf,             # sf object with presence/background
+  column = "response",
+  plot = TRUE
+)
+blocks <- cv_spatial(
+  x          = occ_sf,
+  column     = "response",
+  k          = 5,
+  size       = sac$range,   # use autocorrelation range as block size
+  hexagon    = FALSE,
+  report     = TRUE,
+  plot       = TRUE
+)
+```
+## Buffered Leave-One-Out (spatial LOO)
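+For each test point, exclude all training points within a buffer distance. Computationally expensive but the most rigorous option for small datasets. ENMeval's built-in partitions do not include a buffered leave-one-out, so the call below shows its spatial block partitioning instead; for buffered LOO see `blockCV::cv_buffer()`.
+```r
+# Spatial partitioning in ENMeval (block or checkerboard, not buffered LOO):
+library(ENMeval)
+e <- ENMevaluate(
+  occs      = occ_coords,
+  envs      = predictor_stack,
+  bg        = bg_coords,
+  algorithm = "maxnet",
+  tune.args = list(fc = c("L", "LQ"), rm = 1:3),  # example tuning grid; adjust to your study
+  partitions = "block"      # or "checkerboard1", "checkerboard2"
+)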
+```
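Reviewer note: since the R block above does not itself perform buffered leave-one-out, here is a minimal Python sketch of the splitting rule described in the text. It assumes lon/lat coordinates in degrees and haversine distances; `buffered_loo_splits` and the site coordinates are hypothetical, and this is not how any particular package implements it.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def buffered_loo_splits(points, buffer_km):
    """Yield (test_index, train_indices), dropping training points inside the buffer."""
    for i, (lon_i, lat_i) in enumerate(points):
        train = [j for j, (lon_j, lat_j) in enumerate(points)
                 if j != i and haversine_km(lon_i, lat_i, lon_j, lat_j) > buffer_km]
        yield i, train

# Hypothetical sites: 0 and 1 are about 11 km apart, 2 is more than 200 km from both
sites = [(-50.0, -10.0), (-50.1, -10.0), (-52.0, -10.0)]
splits = list(buffered_loo_splits(sites, buffer_km=50))
for test_idx, train_idx in splits:
    print(test_idx, train_idx)
```

With a 50 km buffer, sites 0 and 1 never train on each other. The pairwise distance loop is quadratic in the number of points, which is where the computational cost mentioned above comes from.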
+## Recommended Block Size
+The block size should be at least as large as the spatial autocorrelation range of the response variable.
+| Spatial resolution | Suggested starting block size |
+|--------------------|-------------------------------|
+| 1 km | 50–100 km |
+| 5 km | 100–200 km |
+| 10 km | 200–400 km |
+| 25 km (continental) | 500–1000 km |
+Always run `cv_spatial_autocor()` to confirm the empirical range for your specific dataset.
+## Number of Folds
+- **k = 4 or 5:** Standard. Balances bias and variance.
+- **k = 10:** For larger datasets (> 500 occurrences).
+- **Leave-one-out (LOO):** For very small datasets (< 30 occurrences).
+## Checklist Before Running CV
+- [ ] Confirmed spatial autocorrelation in response variable
+- [ ] Block size ≥ autocorrelation range
+- [ ] Each fold has presence AND background points
+- [ ] No fold is empty or severely imbalanced (< 10 presences per fold)
+- [ ] Same CV folds used across all candidate algorithms for fair comparison
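Reviewer note: two of the checklist items (each fold has presence and background points, and at least 10 presences per fold) are easy to verify automatically. A minimal Python sketch follows, assuming fold labels and a 0/1 presence column are available as plain lists; `check_folds` and the example data are hypothetical.

```python
from collections import Counter

def check_folds(folds, presence, min_presences=10):
    """Return a list of human-readable problems with a fold assignment."""
    n_pres, n_bg = Counter(), Counter()
    for fold, is_presence in zip(folds, presence):
        if is_presence:
            n_pres[fold] += 1
        else:
            n_bg[fold] += 1
    problems = []
    for fold in sorted(set(folds)):
        if n_pres[fold] == 0 or n_bg[fold] == 0:
            problems.append(f"fold {fold}: needs both presence and background points")
        elif n_pres[fold] < min_presences:
            problems.append(f"fold {fold}: only {n_pres[fold]} presences (< {min_presences})")
    return problems

# Hypothetical assignment: fold 1 is fine, fold 2 has few presences, fold 3 has none
folds = [1] * 30 + [2] * 30 + [3] * 20
presence = [1] * 12 + [0] * 18 + [1] * 4 + [0] * 26 + [0] * 20
problems = check_folds(folds, presence)
for problem in problems:
    print(problem)
```

An empty list means both checks pass. The autocorrelation and block-size items still need the `cv_spatial_autocor()` output and cannot be checked from fold labels alone.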

package/skills/predictive-modeling-best-practices/scripts/__pycache__/spatial_cv.cpython-311.pyc ADDED

Binary file

package/skills/predictive-modeling-best-practices/scripts/collinearity_check.R ADDED

@@ -0,0 +1,112 @@
+# ecological-agent-skills / Copyright (C) 2026 Francisco Diego Barros Barata
+# SPDX-License-Identifier: GPL-3.0-or-later
+# Assess and reduce predictor collinearity
+# Usage: Rscript collinearity_check.R <env_matrix_csv> <output_dir> [vif_threshold]
+# (the |r| > 0.7 correlation cut-off is fixed in the script, not a command-line argument)
+# Requires: usdm, dplyr
+# ── Inline logger ─────────────────────────────────────────────────────────────
+SKILL_NAME <- "predictive-modeling-best-practices"
+.log_ts  <- function() format(Sys.time(), "[%Y-%m-%d %H:%M:%S]")
+log_info <- function(...) message(.log_ts(), " [INFO]  ", sprintf(...))
+log_warn <- function(...) message(.log_ts(), " [WARN]  ", sprintf(...))
+log_error<- function(...) message(.log_ts(), " [ERROR] ", sprintf(...))
+log_step <- function(n, d) log_info("-- STEP %d: %s", n, d)
+log_decision <- function(v, val, why) log_info("DECISION | %s = %s | %s", v, val, why)
+dir.create("logs", recursive=TRUE, showWarnings=FALSE)
+suppressPackageStartupMessages({
+  library(usdm)
+  library(dplyr)
+})
+args          <- commandArgs(trailingOnly = TRUE)
+env_file      <- ifelse(length(args) >= 1, args[1], "data/processed/env_matrix.csv")
+output_dir    <- ifelse(length(args) >= 2, args[2], "outputs")
+vif_threshold <- ifelse(length(args) >= 3, as.numeric(args[3]), 5)
+# ── Input precondition checks ─────────────────────────────────────────────────
+if (!file.exists(env_file)) {
+  log_error("Input not found: %s\nLikely cause: the previous step did not complete.\nCheck: outputs of the previous skill.\nPrevious skill: species-distribution-modeling", env_file)
+  stop("Missing input: ", env_file)
+}
+log_decision("vif_threshold", vif_threshold, "VIF threshold for stepwise predictor exclusion; standard ecological threshold is 5 or 10")
+dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
+# ── Load ───────────────────────────────────────────────────────────────────
+log_step(1, "Load environmental predictor matrix")
+tryCatch({
+  log_info("Loading: %s", env_file)
+  env <- read.csv(env_file) |> na.omit()
+  log_info("Variables: %d | Rows: %d", ncol(env), nrow(env))
+  if (nrow(env) < 30) {
+    log_warn("Low row count (%d). Correlation estimates can be unstable with n < 30.", nrow(env))
+  }
+  if (ncol(env) < 2) {
+    log_error("Only %d variable found. Collinearity analysis requires at least 2 predictors.\nLikely cause: wrong CSV or no numeric predictors.\nCheck: format of the env_matrix_csv file.\nPrevious skill: species-distribution-modeling", ncol(env))
+    stop("At least 2 predictor columns required for collinearity analysis.")
+  }
+}, error = function(e) {
+  log_error("Failure in load_env_matrix: %s\nLikely cause: CSV file missing, malformed, or without numeric columns.\nCheck: path and format of the predictor CSV.\nPrevious skill: species-distribution-modeling", conditionMessage(e))
+  stop(e)
+})
+# ── Pairwise correlation ───────────────────────────────────────────────────
+log_step(2, "Compute pairwise Pearson correlations")
+tryCatch({
+  cor_mat <- cor(env, method = "pearson")
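+  # Use the upper triangle so each pair appears once and perfectly
+  # correlated pairs (|r| = 1) are not discarded together with the diagonal
+  high_cor <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
+  high_cor_pairs <- data.frame(
+    var1 = rownames(cor_mat)[high_cor[, 1]],
+    var2 = colnames(cor_mat)[high_cor[, 2]],
+    r    = cor_mat[high_cor]
+  ) |> arrange(desc(abs(r)))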
+  log_info("Highly correlated pairs (|r| > 0.7): %d pairs found.", nrow(high_cor_pairs))
+  if (nrow(high_cor_pairs) > 0) {
+    log_warn("%d highly correlated predictor pairs (|r| > 0.70) detected. Collinearity reduction is needed.", nrow(high_cor_pairs))
+    log_info("Highly correlated pairs:\n%s",
+             paste(capture.output(print(high_cor_pairs)), collapse = "\n"))
+  }
+}, error = function(e) {
+  log_error("Failure in pairwise_correlation: %s\nLikely cause: non-numeric columns or remaining NA values.\nCheck: CSV data types and the result of na.omit.\nPrevious skill: species-distribution-modeling", conditionMessage(e))
+  stop(e)
+})
+# ── VIF stepwise reduction ─────────────────────────────────────────────────
+log_step(3, "VIF stepwise predictor reduction")
+tryCatch({
+  log_info("Running VIF stepwise reduction (threshold: %g)...", vif_threshold)
+  vif_result <- vifstep(env, th = vif_threshold)
+  log_info("Variables retained after VIF reduction:\n%s",
+           paste(capture.output(print(vif_result)), collapse = "\n"))
+  selected <- vif_result@results$Variables
+  log_info("Final selected predictors (%d): %s", length(selected), paste(selected, collapse = ", "))
+  log_decision("selected_predictors", paste(selected, collapse = ", "),
+               paste0("VIF stepwise retained these predictors below threshold ", vif_threshold))
+  n_removed <- ncol(env) - length(selected)
+  if (n_removed > 0) {
+    log_warn("%d predictors removed for VIF > %g. Review whether ecologically important variables were excluded.", n_removed, vif_threshold)
+  }
+}, error = function(e) {
+  log_error("Failure in vif_stepwise_reduction: %s\nLikely cause: singular matrix, constant predictors, or a failure in the usdm package.\nCheck: variance of each predictor and the usdm installation.\nPrevious skill: species-distribution-modeling", conditionMessage(e))
+  stop(e)
+})
+# ── Outputs ────────────────────────────────────────────────────────────────
+log_step(4, "Write collinearity outputs")
+tryCatch({
+  write.csv(high_cor_pairs, file.path(output_dir, "high_correlation_pairs.csv"), row.names = FALSE)
+  write.csv(vif_result@results, file.path(output_dir, "vif_results.csv"), row.names = FALSE)
+  writeLines(selected, file.path(output_dir, "selected_predictors.txt"))
+  log_info("Outputs written to: %s", output_dir)
+}, error = function(e) {
+  log_error("Failure in write_outputs: %s\nLikely cause: missing write permissions or output directory.\nCheck: output_dir and file system permissions.\nPrevious skill: species-distribution-modeling", conditionMessage(e))
+  stop(e)
+})

package/skills/predictive-modeling-best-practices/scripts/spatial_cv.py ADDED

@@ -0,0 +1,182 @@
+#!/usr/bin/env python3
+# ecological-agent-skills / Copyright (C) 2026 Francisco Diego Barros Barata
+# SPDX-License-Identifier: GPL-3.0-or-later
+"""
+spatial_cv.py
+Spatial block cross-validation for ecological models.
+Usage: python spatial_cv.py <points_with_env_csv> <output_dir> [n_folds] [block_size_km]
+Requires: pandas, numpy, matplotlib
+"""
+import logging
+import sys
+from datetime import datetime
+from pathlib import Path
+SKILL_NAME = "predictive-modeling-best-practices"
+_LOG_DIR   = Path("logs")
+_LOG_DIR.mkdir(parents=True, exist_ok=True)
+_log_file  = _LOG_DIR / f"skill_{SKILL_NAME}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
+logging.basicConfig(
+    level=logging.INFO,
+    format="[%(asctime)s] [%(levelname)s] [" + SKILL_NAME + "] %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S",
+    handlers=[
+        logging.StreamHandler(sys.stdout),
+        logging.FileHandler(_log_file, encoding="utf-8"),
+    ],
+)
+logger = logging.getLogger(SKILL_NAME)
+def log_step(n: int, desc: str) -> None:
+    logger.info("-- STEP %d: %s", n, desc)
+def log_decision(var: str, val, why: str) -> None:
+    logger.info("DECISION | %s = %s | %s", var, val, why)
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+def assign_spatial_blocks(df, lon_col, lat_col, block_size_deg, n_folds):
+    """Assign spatial blocks by grid, then allocate to CV folds."""
+    lon_blocks = np.floor(df[lon_col] / block_size_deg).astype(int)
+    lat_blocks = np.floor(df[lat_col] / block_size_deg).astype(int)
+    block_ids  = (lon_blocks.astype(str) + "_" + lat_blocks.astype(str))
+    unique_blocks = block_ids.unique()
+    np.random.shuffle(unique_blocks)
+    fold_map = {blk: (i % n_folds) + 1 for i, blk in enumerate(unique_blocks)}
+    return block_ids.map(fold_map)
+def collinearity_report(df: pd.DataFrame, predictors: list, r_thresh=0.7) -> pd.DataFrame:
+    cor = df[predictors].corr(method="spearman").abs()
+    pairs = []
+    for i in range(len(predictors)):
+        for j in range(i+1, len(predictors)):
+            r = cor.iloc[i, j]
+            if r > r_thresh:
+                pairs.append({"var1": predictors[i], "var2": predictors[j], "spearman_r": round(r, 4)})
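+    # Name the columns explicitly so sorting also works when no pair exceeds the threshold
+    return pd.DataFrame(pairs, columns=["var1", "var2", "spearman_r"]).sort_values("spearman_r", ascending=False)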
+def main():
+    data_file     = sys.argv[1] if len(sys.argv) > 1 else "data/processed/points_with_env.csv"
+    output_dir    = Path(sys.argv[2]) if len(sys.argv) > 2 else Path("outputs/cv")
+    n_folds       = int(sys.argv[3]) if len(sys.argv) > 3 else 5
+    block_size_km = float(sys.argv[4]) if len(sys.argv) > 4 else 300.0
+    output_dir.mkdir(parents=True, exist_ok=True)
+    log_decision("data_file", data_file,
+                 "Input CSV of occurrence points with extracted environmental predictors")
+    log_decision("n_folds", n_folds,
+                 "Number of spatial CV folds for model evaluation")
+    log_decision("block_size_km", block_size_km,
+                 "Spatial block size in km; should exceed spatial autocorrelation range")
+    log_decision("output_dir", str(output_dir), "Directory for CV fold assignments and plots")
+    if not Path(data_file).exists():
+        logger.error(
+            "Input not found: %s\n"
+            "  Likely cause: the previous step did not complete.\n"
+            "  Previous skill that should have produced this input: geoprocessing-for-ecology",
+            data_file
+        )
+        sys.exit(1)
+    try:
+        log_step(1, "Loading point data with environmental predictors")
+        dat = pd.read_csv(data_file)
+        logger.info("Loaded %d records with %d columns", len(dat), len(dat.columns))
+        lon_col = next((c for c in dat.columns if "lon" in c.lower()), None)
+        lat_col = next((c for c in dat.columns if "lat" in c.lower()), None)
+        if not lon_col or not lat_col:
+            raise ValueError(
+                "Cannot find lon/lat columns. Name them decimalLongitude/decimalLatitude."
+            )
+        logger.info("Coordinate columns identified: lon='%s', lat='%s'", lon_col, lat_col)
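+        n_missing_coords = dat[[lon_col, lat_col]].isna().any(axis=1).sum()
+        if n_missing_coords > 0:
+            # The integer cast in assign_spatial_blocks fails on NaN, so drop these rows first
+            logger.warning(
+                "%d records have missing coordinates and will be dropped before fold assignment.",
+                n_missing_coords
+            )
+            dat = dat.dropna(subset=[lon_col, lat_col]).reset_index(drop=True)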
+        log_step(2, "Assigning spatial blocks and CV fold labels")
+        block_size_deg = block_size_km / 111.0  # approx degrees
+        log_decision("block_size_deg", round(block_size_deg, 4),
+                     "Converted from km using 1 degree ~ 111 km (approximate)")
+        np.random.seed(42)
+        log_decision("random_seed", 42, "Fixed seed for reproducible fold assignment")
+        dat["cv_fold"] = assign_spatial_blocks(dat, lon_col, lat_col, block_size_deg, n_folds)
+        log_step(3, "Summarising fold composition")
+        # Fold summary
+        fold_summary = dat.groupby("cv_fold").agg(n=("cv_fold","count"))
+        if "pa" in dat.columns or "presence" in dat.columns:
+            resp_col = "pa" if "pa" in dat.columns else "presence"
+            fold_summary["n_presence"] = dat.groupby("cv_fold")[resp_col].sum().values
+        # Check fold balance
+        fold_counts = fold_summary["n"].values
+        min_fold = fold_counts.min()
+        max_fold = fold_counts.max()
+        if max_fold > 3 * min_fold:
+            logger.warning(
+                "Fold sizes are highly imbalanced (min=%d, max=%d). "
+                "Consider adjusting block_size_km.",
+                min_fold, max_fold
+            )
+        logger.info("CV Fold summary (block_size ~%s km):\n%s",
+                    block_size_km, fold_summary.to_string())
+        dat.to_csv(output_dir / "data_with_cv_folds.csv", index=False)
+        log_step(4, "Running collinearity screening on predictors")
+        # Collinearity
+        skip_cols = {lon_col, lat_col, "pa", "presence", "cv_fold", "QA_status", "species"}
+        predictors = [c for c in dat.select_dtypes(include=np.number).columns if c not in skip_cols]
+        if predictors:
+            cor_pairs = collinearity_report(dat, predictors)
+            cor_pairs.to_csv(output_dir / "high_correlation_pairs.csv", index=False)
+            logger.info("Highly correlated pairs (|r| > 0.7): %d", len(cor_pairs))
+            if len(cor_pairs) > 0:
+                logger.warning(
+                    "%d predictor pairs exceed |r| = 0.7 Spearman correlation threshold. "
+                    "Consider removing redundant variables before modelling.",
+                    len(cor_pairs)
+                )
+                logger.info("%s", cor_pairs.to_string(index=False))
+            else:
+                logger.info("No highly correlated pairs found.")
+        else:
+            logger.warning("No numeric predictor columns found for collinearity screening.")
+        log_step(5, "Generating spatial CV fold map plot")
+        # Spatial plot
+        fig, ax = plt.subplots(figsize=(8, 6))
+        scatter = ax.scatter(dat[lon_col], dat[lat_col], c=dat["cv_fold"],
+                             cmap="Set1", s=15, alpha=0.7)
+        plt.colorbar(scatter, ax=ax, label="CV Fold")
+        ax.set_xlabel("Longitude"); ax.set_ylabel("Latitude")
+        ax.set_title(f"Spatial CV — {n_folds} folds, block ~{block_size_km} km")
+        plt.tight_layout()
+        plt.savefig(output_dir / "cv_fold_map.png", dpi=150)
+        plt.close()
+        logger.info("Outputs written to: %s", output_dir)
+    except FileNotFoundError as e:
+        logger.error(
+            "Input file not found: %s\n"
+            "  Expected output from: geoprocessing-for-ecology\n"
+            "  Check that previous step completed.",
+            e
+        )
+        raise
+    except Exception as e:
+        logger.error("Unexpected error in spatial CV: %s", e)
+        raise
+if __name__ == "__main__":
+    main()
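Reviewer note: the script converts kilometres to degrees with a single factor (1 degree ~ 111 km). That holds for latitude, but a fixed span in degrees of longitude covers fewer kilometres away from the equator, so blocks become narrower east-west than `block_size_km` suggests. The Python sketch below shows one possible latitude-aware conversion; `block_size_degrees` is a hypothetical helper, not part of the package, and it keeps the script's 111 km approximation.

```python
import math

KM_PER_DEG_LAT = 111.0  # same approximation the script uses

def block_size_degrees(block_km, latitude_deg):
    """Return (lon_deg, lat_deg) spans that cover roughly block_km at a given latitude."""
    lat_deg = block_km / KM_PER_DEG_LAT
    # One degree of longitude shrinks by cos(latitude)
    lon_deg = block_km / (KM_PER_DEG_LAT * math.cos(math.radians(latitude_deg)))
    return lon_deg, lat_deg

for lat in (0, 30, 60):
    lon_deg, lat_deg = block_size_degrees(300.0, lat)
    print(f"lat {lat:>2}: {lon_deg:.2f} deg lon x {lat_deg:.2f} deg lat")
```

For study areas spanning many degrees of latitude, projecting the points to an equal-area coordinate system before gridding is likely a cleaner option than any per-latitude correction.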