openforis-whisp 3.0.0a6__tar.gz → 3.0.0a7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (20)
  1. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/PKG-INFO +7 -4
  2. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/README.md +3 -2
  3. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/pyproject.toml +1 -1
  4. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/__init__.py +3 -1
  5. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/advanced_stats.py +130 -256
  6. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/data_checks.py +191 -144
  7. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/datasets.py +3 -2
  8. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/stats.py +0 -9
  9. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/LICENSE +0 -0
  10. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/data_conversion.py +0 -0
  11. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/logger.py +0 -0
  12. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/parameters/__init__.py +0 -0
  13. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/parameters/config_runtime.py +0 -0
  14. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/parameters/lookup_context_and_metadata.csv +0 -0
  15. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/parameters/lookup_gaul1_admin.py +0 -0
  16. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/parameters/lookup_gee_datasets.csv +0 -0
  17. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/pd_schemas.py +0 -0
  18. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/reformat.py +0 -0
  19. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/risk.py +0 -0
  20. {openforis_whisp-3.0.0a6 → openforis_whisp-3.0.0a7}/src/openforis_whisp/utils.py +0 -0
@@ -1,8 +1,9 @@
- Metadata-Version: 2.3
+ Metadata-Version: 2.4
  Name: openforis-whisp
- Version: 3.0.0a6
+ Version: 3.0.0a7
  Summary: Whisp (What is in that plot) is an open-source solution which helps to produce relevant forest monitoring information and support compliance with deforestation-related regulations.
  License: MIT
+ License-File: LICENSE
  Keywords: whisp,geospatial,data-processing
  Author: Andy Arnell
  Author-email: andrew.arnell@fao.org
@@ -16,6 +17,7 @@ Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
  Classifier: Programming Language :: Python :: 3.13
+ Classifier: Programming Language :: Python :: 3.14
  Classifier: Topic :: Software Development :: Libraries :: Python Modules
  Requires-Dist: country_converter (>=0.7,<2.0.0)
  Requires-Dist: earthengine-api
@@ -69,11 +71,11 @@ Description-Content-Type: text/markdown
  ***Whisp*** can currently be used directly or implemented in your own code through three different pathways:


- 1. The Whisp App with its simple interface can be used [right here](https://whisp.openforis.org/) or called from other software by [API](https://whisp.openforis.org/documentation/api-guide). The Whisp App currently supports the processing of up to 1000 geometries per job. The original JS & Python code behind the Whisp App and API can be found [here](https://github.com/forestdatapartnership/whisp-app).
+ 1. The Whisp App with its simple interface can be used [right here](https://whisp.openforis.org/) or called from other software by [API](https://whisp.openforis.org/documentation/api-guide). The Whisp App currently supports the processing of up to 3,000 geometries per job. The original JS & Python code behind the Whisp App and API can be found [here](https://github.com/forestdatapartnership/whisp-app).

  2. [Whisp in Earthmap](https://whisp.earthmap.org/?aoi=WHISP&boundary=plot1&layers=%7B%22CocoaETH%22%3A%7B%22opacity%22%3A1%7D%2C%22JRCForestMask%22%3A%7B%22opacity%22%3A1%7D%2C%22planet_rgb%22%3A%7B%22opacity%22%3A1%2C%22date%22%3A%222020-12%22%7D%7D&map=%7B%22center%22%3A%7B%22lat%22%3A7%2C%22lng%22%3A4%7D%2C%22zoom%22%3A3%2C%22mapType%22%3A%22satellite%22%7D&statisticsOpen=true) supports the visualization of geometries on actual maps with the possibility to toggle different relevant map products around tree cover, commodities and deforestation. It is practical for demonstration purposes and spot checks of single geometries but not recommended for larger datasets.

- 3. Datasets of any size, especially when holding more than 1000 geometries, can be analyzed with Whisp through the [python package on pip](https://pypi.org/project/openforis-whisp/). See example [Colab Notebook](https://github.com/forestdatapartnership/whisp/blob/main/notebooks/Colab_whisp_geojson_to_csv.ipynb) for implementation with a geojson input. For the detailed procedure please go to the section [Whisp notebooks](#whisp_notebooks).
+ 3. Datasets of any size, especially when holding more than 3,000 geometries, can be analyzed with Whisp through the [python package on pip](https://pypi.org/project/openforis-whisp/). See example [Colab Notebook](https://github.com/forestdatapartnership/whisp/blob/main/notebooks/Colab_whisp_geojson_to_csv.ipynb) for implementation with a geojson input. For the detailed procedure please go to the section [Whisp notebooks](#whisp_notebooks).


  ## Whisp datasets <a name="whisp_datasets"></a>
@@ -365,3 +367,4 @@ Please read the [contributing guidelines](contributing_guidelines.md) for good p
  Users can report violations directly to us by emailing the address listed in the "Contact Us" section of the website:
  https://openforis.org/solutions/whisp/

+
@@ -32,11 +32,11 @@
  ***Whisp*** can currently be used directly or implemented in your own code through three different pathways:


- 1. The Whisp App with its simple interface can be used [right here](https://whisp.openforis.org/) or called from other software by [API](https://whisp.openforis.org/documentation/api-guide). The Whisp App currently supports the processing of up to 1000 geometries per job. The original JS & Python code behind the Whisp App and API can be found [here](https://github.com/forestdatapartnership/whisp-app).
+ 1. The Whisp App with its simple interface can be used [right here](https://whisp.openforis.org/) or called from other software by [API](https://whisp.openforis.org/documentation/api-guide). The Whisp App currently supports the processing of up to 3,000 geometries per job. The original JS & Python code behind the Whisp App and API can be found [here](https://github.com/forestdatapartnership/whisp-app).

  2. [Whisp in Earthmap](https://whisp.earthmap.org/?aoi=WHISP&boundary=plot1&layers=%7B%22CocoaETH%22%3A%7B%22opacity%22%3A1%7D%2C%22JRCForestMask%22%3A%7B%22opacity%22%3A1%7D%2C%22planet_rgb%22%3A%7B%22opacity%22%3A1%2C%22date%22%3A%222020-12%22%7D%7D&map=%7B%22center%22%3A%7B%22lat%22%3A7%2C%22lng%22%3A4%7D%2C%22zoom%22%3A3%2C%22mapType%22%3A%22satellite%22%7D&statisticsOpen=true) supports the visualization of geometries on actual maps with the possibility to toggle different relevant map products around tree cover, commodities and deforestation. It is practical for demonstration purposes and spot checks of single geometries but not recommended for larger datasets.

- 3. Datasets of any size, especially when holding more than 1000 geometries, can be analyzed with Whisp through the [python package on pip](https://pypi.org/project/openforis-whisp/). See example [Colab Notebook](https://github.com/forestdatapartnership/whisp/blob/main/notebooks/Colab_whisp_geojson_to_csv.ipynb) for implementation with a geojson input. For the detailed procedure please go to the section [Whisp notebooks](#whisp_notebooks).
+ 3. Datasets of any size, especially when holding more than 3,000 geometries, can be analyzed with Whisp through the [python package on pip](https://pypi.org/project/openforis-whisp/). See example [Colab Notebook](https://github.com/forestdatapartnership/whisp/blob/main/notebooks/Colab_whisp_geojson_to_csv.ipynb) for implementation with a geojson input. For the detailed procedure please go to the section [Whisp notebooks](#whisp_notebooks).


  ## Whisp datasets <a name="whisp_datasets"></a>
@@ -327,3 +327,4 @@ Please read the [contributing guidelines](contributing_guidelines.md) for good p
  **Reporting**
  Users can report violations directly to us by emailing the address listed in the "Contact Us" section of the website:
  https://openforis.org/solutions/whisp/
+
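For pathway 3, a minimal sketch of driving the pip package from Python (assuming an authenticated Earth Engine session; "plots.geojson" is a hypothetical input file, and the full procedure is in the Colab notebook linked above):

    import ee
    from openforis_whisp.stats import whisp_formatted_stats_geojson_to_df

    ee.Initialize()  # Earth Engine credentials must already be configured

    # Main entry point shown in this release's stats.py; all other
    # parameters keep their defaults here.
    df = whisp_formatted_stats_geojson_to_df("plots.geojson")
    df.to_csv("whisp_stats.csv", index=False)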
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"

  [tool.poetry]
  name = "openforis-whisp"
- version = "3.0.0a6"
+ version = "3.0.0a7"
  description = "Whisp (What is in that plot) is an open-source solution which helps to produce relevant forest monitoring information and support compliance with deforestation-related regulations."
  repository = "https://github.com/forestdatapartnership/whisp"
  authors = ["Andy Arnell <andrew.arnell@fao.org>"]
@@ -101,6 +101,8 @@ from openforis_whisp.utils import (

  from openforis_whisp.data_checks import (
      analyze_geojson,
-     validate_geojson_constraints,
+     check_geojson_limits,
+     screen_geojson,  # Backward compatibility alias
      suggest_processing_mode,
+     validate_geojson_constraints,  # Backward compatibility alias
  )
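The rename is non-breaking: check_geojson_limits is the new canonical name, and both old names stay importable as aliases bound to the same function. A minimal sketch of what downstream imports can rely on:

    from openforis_whisp import (
        check_geojson_limits,          # new canonical name
        screen_geojson,                # backward-compatibility alias
        validate_geojson_constraints,  # backward-compatibility alias
    )

    # All three names resolve to the same callable.
    assert screen_geojson is check_geojson_limits
    assert validate_geojson_constraints is check_geojson_limits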
@@ -510,253 +510,102 @@ def join_admin_codes(
      return df


- class ProgressTracker:
-     """
-     Track batch processing progress with time estimation.
+ def _format_time(seconds: float) -> str:
+     """Format seconds as human-readable string."""
+     if seconds < 60:
+         return f"{seconds:.0f}s"
+     elif seconds < 3600:
+         mins = seconds / 60
+         return f"{mins:.1f}m"
+     else:
+         hours = seconds / 3600
+         return f"{hours:.1f}h"

-     Shows progress at adaptive milestones (more frequent for small datasets,
-     less frequent for large datasets) with estimated time remaining based on
-     processing speed. Includes time-based heartbeat to prevent long silences.
-     """

-     def __init__(
-         self,
-         total: int,
-         logger: logging.Logger = None,
-         heartbeat_interval: int = 180,
-         status_file: str = None,
-     ):
-         """
-         Initialize progress tracker.
-
-         Parameters
-         ----------
-         total : int
-             Total number of items to process
-         logger : logging.Logger, optional
-             Logger for output
-         heartbeat_interval : int, optional
-             Seconds between heartbeat messages (default: 180 = 3 minutes)
-         status_file : str, optional
-             Path to JSON status file for API/web app consumption.
-             Checkpoints auto-save to same directory as status_file.
-         """
-         self.total = total
-         self.completed = 0
-         self.lock = threading.Lock()
-         self.logger = logger or logging.getLogger("whisp")
-         self.heartbeat_interval = heartbeat_interval
-
-         # Handle status_file: if directory passed, auto-generate filename
-         if status_file:
-             import os
-
-             if os.path.isdir(status_file):
-                 self.status_file = os.path.join(
-                     status_file, "whisp_processing_status.json"
-                 )
-             else:
-                 # Validate that parent directory exists
-                 parent_dir = os.path.dirname(status_file)
-                 if parent_dir and not os.path.isdir(parent_dir):
-                     self.logger.warning(
-                         f"Status file directory does not exist: {parent_dir}"
-                     )
-                     self.status_file = None
-                 else:
-                     self.status_file = status_file
-         else:
-             self.status_file = None
-
-         # Adaptive milestones based on dataset size
-         # Small datasets (< 50): show every 25% (not too spammy)
-         # Medium (50-500): show every 20%
-         # Large (500-1000): show every 10%
-         # Very large (1000+): show every 5% (cleaner for long jobs)
-         if total < 50:
-             self.milestones = {25, 50, 75, 100}
-         elif total < 500:
-             self.milestones = {20, 40, 60, 80, 100}
-         elif total < 1000:
-             self.milestones = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
-         else:
-             self.milestones = {
-                 5,
-                 10,
-                 15,
-                 20,
-                 25,
-                 30,
-                 35,
-                 40,
-                 45,
-                 50,
-                 55,
-                 60,
-                 65,
-                 70,
-                 75,
-                 80,
-                 85,
-                 90,
-                 95,
-                 100,
-             }
+ def _get_progress_milestones(total_features: int) -> set:
+     """
+     Get progress milestones based on dataset size.

-         self.shown_milestones = set()
-         self.start_time = time.time()
-         self.last_update_time = self.start_time
-         self.heartbeat_stop = threading.Event()
-         self.heartbeat_thread = None
+     Parameters
+     ----------
+     total_features : int
+         Total number of features being processed

-     def _write_status_file(self, status: str = "processing") -> None:
-         """Write current progress to JSON status file using atomic write."""
-         if not self.status_file:
-             return
+     Returns
+     -------
+     set
+         Set of percentage milestones to show
+     """
+     # Set milestones based on feature count
+     if total_features < 250:
+         return set(range(20, 101, 20))  # Every 20%: {20, 40, 60, 80, 100}
+     elif total_features < 1000:
+         return set(range(10, 101, 10))  # Every 10%
+     elif total_features < 10000:
+         return set(range(5, 101, 5))  # Every 5%
+     elif total_features < 50000:
+         return set(range(2, 101, 2))  # Every 2%
+     else:
+         return set(range(1, 101))  # Every 1%

-         try:
-             import json
-             import os
-
-             elapsed = time.time() - self.start_time
-             percent = (self.completed / self.total * 100) if self.total > 0 else 0
-             rate = self.completed / elapsed if elapsed > 0 else 0
-             eta = (
-                 (self.total - self.completed) / rate * 1.15
-                 if rate > 0 and percent >= 5
-                 else None
-             )

-             # Write to temp file then atomic rename to prevent partial reads
-             from datetime import datetime
-
-             temp_file = self.status_file + ".tmp"
-             with open(temp_file, "w") as f:
-                 json.dump(
-                     {
-                         "status": status,
-                         "progress": f"{self.completed}/{self.total}",
-                         "percent": round(percent, 1),
-                         "elapsed_sec": round(elapsed),
-                         "eta_sec": round(eta) if eta else None,
-                         "updated_at": datetime.now().isoformat(),
-                     },
-                     f,
-                 )
-             os.replace(temp_file, self.status_file)
-         except Exception:
-             pass
-
-     def start_heartbeat(self) -> None:
-         """Start background heartbeat thread for time-based progress updates."""
-         if self.heartbeat_thread is None or not self.heartbeat_thread.is_alive():
-             self.heartbeat_stop.clear()
-             self.heartbeat_thread = threading.Thread(
-                 target=self._heartbeat_loop, daemon=True
-             )
-             self.heartbeat_thread.start()
-             # Write initial status
-             self._write_status_file(status="processing")
-
-     def _heartbeat_loop(self) -> None:
-         """Background loop that logs progress at time intervals."""
-         while not self.heartbeat_stop.wait(self.heartbeat_interval):
-             with self.lock:
-                 # Only log if we haven't shown a milestone recently
-                 time_since_update = time.time() - self.last_update_time
-                 if (
-                     time_since_update >= self.heartbeat_interval
-                     and self.completed < self.total
-                 ):
-                     elapsed = time.time() - self.start_time
-                     percent = int((self.completed / self.total) * 100)
-                     elapsed_str = self._format_time(elapsed)
-                     self.logger.info(
-                         f"[Processing] {self.completed:,}/{self.total:,} batches ({percent}%) | "
-                         f"Elapsed: {elapsed_str}"
-                     )
-                     self.last_update_time = time.time()
-
-     def update(self, n: int = 1) -> None:
-         """
-         Update progress count.
-
-         Parameters
-         ----------
-         n : int
-             Number of items completed
-         """
-         with self.lock:
-             self.completed += n
-             percent = int((self.completed / self.total) * 100)
-
-             # Show milestone messages (5%, 10%, 15%... for large datasets)
-             for milestone in sorted(self.milestones):
-                 if percent >= milestone and milestone not in self.shown_milestones:
-                     self.shown_milestones.add(milestone)
-
-                     # Calculate time metrics
-                     elapsed = time.time() - self.start_time
-                     rate = self.completed / elapsed if elapsed > 0 else 0
-                     remaining_items = self.total - self.completed
-
-                     # Calculate ETA with padding for overhead (loading, joins, etc.)
-                     # Don't show ETA until we have some samples (at least 5% complete)
-                     if rate > 0 and self.completed >= max(5, self.total * 0.05):
-                         eta_seconds = (
-                             remaining_items / rate
-                         ) * 1.15  # Add 15% padding for overhead
-                     else:
-                         eta_seconds = 0
-
-                     # Format time strings
-                     eta_str = (
-                         self._format_time(eta_seconds)
-                         if eta_seconds > 0
-                         else "calculating..."
-                     )
-                     elapsed_str = self._format_time(elapsed)
+ def _log_progress(
+     completed: int,
+     total: int,
+     milestones: set,
+     shown_milestones: set,
+     start_time: float,
+     logger: logging.Logger,
+ ) -> None:
+     """
+     Log progress at milestone percentages.

-                     # Build progress message
-                     msg = f"Progress: {self.completed:,}/{self.total:,} batches ({percent}%)"
-                     if percent < 100:
-                         msg += f" | Elapsed: {elapsed_str} | ETA: {eta_str}"
-                     else:
-                         msg += f" | Total time: {elapsed_str}"
-
-                     self.logger.info(msg)
-                     self.last_update_time = time.time()
-
-                     # Update status file for API consumption
-                     self._write_status_file()
-
-     @staticmethod
-     def _format_time(seconds: float) -> str:
-         """Format seconds as human-readable string."""
-         if seconds < 60:
-             return f"{seconds:.0f}s"
-         elif seconds < 3600:
-             mins = seconds / 60
-             return f"{mins:.1f}m"
-         else:
-             hours = seconds / 3600
-             return f"{hours:.1f}h"
+     Parameters
+     ----------
+     completed : int
+         Number of batches completed
+     total : int
+         Total number of batches
+     milestones : set
+         Set of percentage milestones to show
+     shown_milestones : set
+         Set of milestones already shown (modified in place)
+     start_time : float
+         Start time from time.time()
+     logger : logging.Logger
+         Logger for output
+     """
+     percent = int((completed / total) * 100)
+
+     # Check for new milestones reached
+     for milestone in sorted(milestones):
+         if percent >= milestone and milestone not in shown_milestones:
+             shown_milestones.add(milestone)
+
+             # Calculate time metrics
+             elapsed = time.time() - start_time
+             rate = completed / elapsed if elapsed > 0 else 0
+             remaining_items = total - completed
+
+             # Calculate ETA with padding for overhead (loading, joins, etc.)
+             # Don't show ETA until we have some samples (at least 5% complete)
+             if rate > 0 and completed >= max(5, total * 0.05):
+                 eta_seconds = (remaining_items / rate) * 1.15  # Add 15% padding
+             else:
+                 eta_seconds = 0

-     def finish(self, output_file: str = None) -> None:
-         """Stop heartbeat and log completion."""
-         # Stop heartbeat thread
-         self.heartbeat_stop.set()
-         if self.heartbeat_thread and self.heartbeat_thread.is_alive():
-             self.heartbeat_thread.join(timeout=1)
+             # Format time strings
+             eta_str = _format_time(eta_seconds) if eta_seconds > 0 else "calculating..."
+             elapsed_str = _format_time(elapsed)

-         with self.lock:
-             total_time = time.time() - self.start_time
-             time_str = self._format_time(total_time)
-             msg = f"Processing complete: {self.completed:,}/{self.total:,} batches in {time_str}"
-             self.logger.info(msg)
+             # Build progress message
+             msg = f"Progress: {completed:,}/{total:,} batches ({percent}%)"
+             if percent < 100:
+                 msg += f" | Elapsed: {elapsed_str} | ETA: {eta_str}"
+             else:
+                 msg += f" | Total time: {elapsed_str}"

-         # Write final status
-         self._write_status_file(status="completed")
+             logger.info(msg)


  # ============================================================================
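The class-based tracker above is replaced by three module-level helpers. A minimal sketch of how they compose (they are private helpers in openforis_whisp.advanced_stats, imported here purely for illustration):

    import logging
    import time

    from openforis_whisp.advanced_stats import (
        _format_time,
        _get_progress_milestones,
        _log_progress,
    )

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("whisp")

    total = 120                                   # batches to process
    milestones = _get_progress_milestones(total)  # <250 -> {20, 40, 60, 80, 100}
    shown = set()                                 # mutated in place by _log_progress
    start = time.time()

    for done in range(1, total + 1):
        time.sleep(0.01)  # stand-in for one processed batch
        _log_progress(done, total, milestones, shown, start, logger)

    logger.info("total: %s", _format_time(time.time() - start))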
@@ -1218,7 +1067,6 @@ def whisp_stats_geojson_to_df_concurrent(
      logger: logging.Logger = None,
      # Format parameters (auto-detect from config if not provided)
      decimal_places: int = None,
-     status_file: str = None,
  ) -> pd.DataFrame:
      """
      Process GeoJSON concurrently to compute Whisp statistics with automatic formatting.
@@ -1359,11 +1207,12 @@
      # Setup semaphore for EE concurrency control
      ee_semaphore = threading.BoundedSemaphore(max_concurrent)

-     # Progress tracker with heartbeat for long-running jobs
-     progress = ProgressTracker(
-         len(batches), logger=logger, heartbeat_interval=180, status_file=status_file
-     )
-     progress.start_heartbeat()
+     # Progress tracking setup
+     progress_lock = threading.Lock()
+     completed_batches = 0
+     milestones = _get_progress_milestones(len(gdf_for_ee))
+     shown_milestones = set()
+     start_time = time.time()

      results = []

@@ -1477,7 +1326,18 @@
                          suffixes=("_ee", "_client"),
                      )
                      results.append(merged)
-                     progress.update()
+
+                     # Update progress
+                     with progress_lock:
+                         completed_batches += 1
+                         _log_progress(
+                             completed_batches,
+                             len(batches),
+                             milestones,
+                             shown_milestones,
+                             start_time,
+                             logger,
+                         )

                  except Exception as e:
                      # Batch failed - fail fast with clear guidance
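A standalone sketch of the lock-guarded counter pattern used above. One caveat worth noting: because completed_batches is rebound with +=, the enclosing function in advanced_stats needs it declared nonlocal (or held on a mutable object); this illustration uses a module-level variable instead:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    progress_lock = threading.Lock()
    completed = 0
    TOTAL = 40

    def process_batch(i: int) -> None:
        global completed
        # ... per-batch work (e.g., an Earth Engine request) would go here ...
        with progress_lock:  # serialize the counter update and any logging
            completed += 1

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(process_batch, range(TOTAL)))

    assert completed == TOTAL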
@@ -1492,15 +1352,18 @@
                      batch_errors.append((batch_idx, original_batch, error_msg))
      except (KeyboardInterrupt, SystemExit) as interrupt:
          logger.warning("Processing interrupted by user")
-         # Update status file with interrupted state
-         progress._write_status_file(status="interrupted")
          raise interrupt
      finally:
          # Restore logger levels
          fiona_logger.setLevel(old_fiona_level)
          pyogrio_logger.setLevel(old_pyogrio_level)

-     progress.finish()
+     # Log completion
+     total_time = time.time() - start_time
+     time_str = _format_time(total_time)
+     logger.info(
+         f"Processing complete: {completed_batches:,}/{len(batches):,} batches in {time_str}"
+     )

      # If we have batch errors after retry attempts, fail the entire process
      if batch_errors:
@@ -1577,7 +1440,9 @@

          # Retry batch processing with validated image
          results = []
-         progress = ProgressTracker(len(batches), logger=logger)
+         retry_completed = 0
+         retry_shown = set()
+         retry_start = time.time()

          # Suppress fiona logging during batch processing (threads create new loggers)
          fiona_logger = logging.getLogger("fiona")
@@ -1609,13 +1474,28 @@
                          suffixes=("", "_client"),
                      )
                      results.append(merged)
-                     progress.update()
+
+                     # Update retry progress
+                     with progress_lock:
+                         retry_completed += 1
+                         _log_progress(
+                             retry_completed,
+                             len(batches),
+                             milestones,
+                             retry_shown,
+                             retry_start,
+                             logger,
+                         )
                  except Exception as e:
                      logger.error(
                          f"Batch processing error (retry): {str(e)[:100]}"
                      )

-         progress.finish()
+         # Log retry completion
+         retry_time = time.time() - retry_start
+         logger.info(
+             f"Retry complete: {retry_completed:,}/{len(batches):,} batches in {_format_time(retry_time)}"
+         )
      finally:
          # Restore logger levels
          fiona_logger.setLevel(old_fiona_level)
@@ -2138,7 +2018,6 @@ def whisp_formatted_stats_geojson_to_df_concurrent(
      water_flag_threshold: float = 0.5,
      sort_column: str = "plotId",
      geometry_audit_trail: bool = False,
-     status_file: str = None,
  ) -> pd.DataFrame:
      """
      Process GeoJSON concurrently with automatic formatting and validation.
@@ -2231,7 +2110,6 @@
          max_retries=max_retries,
          add_metadata_server=add_metadata_server,
          logger=logger,
-         status_file=status_file,
      )

      # Step 2: Format the output
@@ -2347,7 +2225,6 @@ def whisp_formatted_stats_geojson_to_df_sequential(
      water_flag_threshold: float = 0.5,
      sort_column: str = "plotId",
      geometry_audit_trail: bool = False,
-     status_file: str = None,
  ) -> pd.DataFrame:
      """
      Process GeoJSON sequentially with automatic formatting and validation.
@@ -2552,7 +2429,6 @@ def whisp_formatted_stats_geojson_to_df_fast(
      water_flag_threshold: float = 0.5,
      sort_column: str = "plotId",
      geometry_audit_trail: bool = False,
-     status_file: str = None,
  ) -> pd.DataFrame:
      """
      Process GeoJSON to Whisp statistics with optimized fast processing.
@@ -2654,7 +2530,6 @@
              water_flag_threshold=water_flag_threshold,
              sort_column=sort_column,
              geometry_audit_trail=geometry_audit_trail,
-             status_file=status_file,
          )
      else:  # sequential
          logger.debug("Routing to sequential processing...")
@@ -2672,5 +2547,4 @@
              water_flag_threshold=water_flag_threshold,
              sort_column=sort_column,
              geometry_audit_trail=geometry_audit_trail,
-             status_file=status_file,
          )
@@ -1,8 +1,9 @@
  """
  Data validation and constraint checking functions for WHISP.

- Provides validation functions to check GeoJSON data against defined limits
+ Provides validation functions to check GeoJSON data against user defined limits
  and thresholds, raising informative errors when constraints are violated.
+ Note: Defaults in each function are not necessarily enforced.
  """

  import json
@@ -13,26 +14,6 @@ from shapely.geometry import Polygon as ShapelyPolygon, shape as shapely_shape
  # (estimation preferred here as allows efficient processing speed and limits overhead of checking file)


- def _convert_projected_area_to_ha(area_sq_units: float, crs: str = None) -> float:
-     """
-     Convert area from projected CRS units to hectares.
-
-     Most projected CRS use meters as units, so:
-     - area_sq_units is in square meters
-     - 1 hectare = 10,000 m²
-
-     Args:
-         area_sq_units: Area in square units of the projection (typically square meters)
-         crs: CRS string for reference (e.g., 'EPSG:3857'). Used for validation.
-
-     Returns:
-         Area in hectares
-     """
-     # Standard conversion: 1 hectare = 10,000 m²
-     # Most projected CRS use meters, so this works universally
-     return area_sq_units / 10000
-
-
  def _estimate_area_from_bounds(coords, area_conversion_factor: float) -> float:
      """
      Estimate area from bounding box when actual area calculation fails.
@@ -75,6 +56,8 @@ def analyze_geojson(
      metrics=[
          "count",
          "geometry_types",
+         "crs",
+         "file_size_mb",
          "min_area_ha",
          "mean_area_ha",
          "median_area_ha",
@@ -107,6 +90,8 @@
          Which metrics to return. Available metrics:
          - 'count': number of polygons
          - 'geometry_types': dict of geometry type counts (e.g., {'Polygon': 95, 'MultiPolygon': 5})
+         - 'crs': coordinate reference system (e.g., 'EPSG:4326') - only available when geojson_data is a file path
+         - 'file_size_mb': file size in megabytes (only available when geojson_data is a file path)
          - 'min_area_ha', 'mean_area_ha', 'median_area_ha', 'max_area_ha': area statistics (hectares) (accurate only at equator)
          - 'area_percentiles': dict with p25, p50 (median), p75, p90 area values (accurate only at equator)
          - 'min_vertices', 'mean_vertices', 'median_vertices', 'max_vertices': vertex count statistics
@@ -123,6 +108,8 @@
      dict with requested metrics:
          - 'count': number of polygons
          - 'geometry_types': {'Polygon': int, 'MultiPolygon': int, ...}
+         - 'crs': coordinate reference system string (e.g., 'EPSG:4326', only when geojson_data is a file path)
+         - 'file_size_mb': file size in megabytes (float, only when geojson_data is a file path)
          - 'min_area_ha': minimum area among all polygons in hectares
          - 'mean_area_ha': mean area per polygon in hectares (calculated from coordinates)
          - 'median_area_ha': median area among all polygons in hectares
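A minimal sketch of requesting the two new metrics alongside existing ones, using only metric names documented above ("plots.geojson" is a hypothetical file path; 'crs' and 'file_size_mb' are only returned for file-path inputs):

    from openforis_whisp import analyze_geojson

    info = analyze_geojson(
        "plots.geojson",
        metrics=["count", "crs", "file_size_mb", "mean_area_ha", "max_vertices"],
    )
    print(info["count"], info.get("crs"), info.get("file_size_mb"))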
@@ -134,8 +121,28 @@
          - 'max_vertices': maximum number of vertices among all polygons
          - 'vertex_percentiles': {'p25': int, 'p50': int, 'p75': int, 'p90': int}
      """
+     # Handle None metrics (use all default metrics)
+     if metrics is None:
+         metrics = [
+             "count",
+             "geometry_types",
+             "crs",
+             "file_size_mb",
+             "min_area_ha",
+             "mean_area_ha",
+             "median_area_ha",
+             "max_area_ha",
+             "area_percentiles",
+             "min_vertices",
+             "mean_vertices",
+             "median_vertices",
+             "max_vertices",
+             "vertex_percentiles",
+         ]
+
      results = {}
      crs_warning = None
+     detected_crs = None
      file_path = None

      try:
@@ -145,6 +152,35 @@
          if not file_path.exists():
              raise FileNotFoundError(f"GeoJSON file not found: {file_path}")

+         # Quick CRS detection BEFORE loading full file (if requested)
+         if "crs" in metrics:
+             try:
+                 # Use fiona which only reads file metadata (fast, doesn't load features)
+                 import fiona
+
+                 with fiona.open(file_path) as src:
+                     if src.crs:
+                         # Convert fiona CRS dict to EPSG string
+                         crs_dict = src.crs
+                         if "init" in crs_dict:
+                             # Old format: {'init': 'epsg:4326'}
+                             detected_crs = (
+                                 crs_dict["init"].upper().replace("EPSG:", "EPSG:")
+                             )
+                         elif isinstance(crs_dict, dict) and crs_dict:
+                             # Try to extract EPSG from dict (json already imported at top)
+                             detected_crs = json.dumps(crs_dict)
+                     else:
+                         # No CRS means WGS84 by GeoJSON spec
+                         detected_crs = "EPSG:4326"
+
+                 # Check if CRS is WGS84
+                 if detected_crs and detected_crs != "EPSG:4326":
+                     crs_warning = f"⚠️ CRS is {detected_crs}, not EPSG:4326. Area metrics will be inaccurate. Data will be auto-reprojected during processing."
+             except Exception as e:
+                 # If fiona fails, assume WGS84 (GeoJSON default)
+                 detected_crs = "EPSG:4326"
+
          # Try UTF-8 first (most common), then fall back to auto-detection
          try:
              with open(file_path, "r", encoding="utf-8") as f:
@@ -166,26 +202,29 @@
                  with open(file_path, "r", encoding="latin-1") as f:
                      geojson_data = json.load(f)

-         # Detect CRS from file if available
-         try:
-             import geopandas as gpd
-
-             gdf = gpd.read_file(file_path)
-             if gdf.crs and gdf.crs != "EPSG:4326":
-                 crs_warning = f"⚠️ CRS is {gdf.crs}, not EPSG:4326. Area metrics will be inaccurate. Data will be auto-reprojected during processing."
-         except Exception:
-             pass  # If we can't detect CRS, continue without warning
-
      features = geojson_data.get("features", [])

-     # Add CRS warning to results if detected
-     if crs_warning:
-         results["crs_warning"] = crs_warning
-         print(crs_warning)
+     # Add file size if requested and available
+     if "file_size_mb" in metrics and file_path is not None:
+         size_bytes = file_path.stat().st_size
+         results["file_size_mb"] = round(size_bytes / (1024 * 1024), 2)
+
+     # Add CRS info if requested and detected
+     if "crs" in metrics and detected_crs:
+         results["crs"] = detected_crs
+         # Add warning if not WGS84
+         if crs_warning:
+             results["crs_warning"] = crs_warning
+             print(crs_warning)

      if "count" in metrics:
          results["count"] = len(features)

+     # Initialize tracking variables (used in quality logging later)
+     bbox_fallback_count = 0
+     geometry_skip_count = 0
+     polygon_type_stats = {}
+
      # Single sweep through features - compute all area/vertex metrics at once
      if any(
          m in metrics
@@ -208,11 +247,6 @@
          geometry_type_counts = {}
          valid_polygons = 0

-         # Tracking for fallback geometries
-         bbox_fallback_count = 0  # Geometries that used bounding box estimate
-         geometry_skip_count = 0  # Geometries completely skipped
-         polygon_type_stats = {}  # Track stats by geometry type
-
          # Detect CRS to determine area conversion factor
          area_conversion_factor = 1232100  # Default: WGS84 (degrees to ha)
          detected_crs = None
@@ -489,6 +523,7 @@ def _check_metric_constraints(
      max_max_area_ha=None,
      max_mean_vertices=None,
      max_max_vertices=10_000,
+     max_file_size_mb=None,
  ):
      """
      Check if computed metrics violate any constraints.
@@ -499,7 +534,7 @@
      -----------
      metrics : dict
          Dictionary of computed metrics with keys: count, mean_area_ha, max_area_ha,
-         mean_vertices, max_vertices
+         mean_vertices, max_vertices, file_size_mb (optional)
      max_polygon_count : int
          Maximum allowed number of polygons
      max_mean_area_ha : float
@@ -510,6 +545,8 @@
          Maximum allowed mean vertices per polygon
      max_max_vertices : int, optional
          Maximum allowed vertices per polygon
+     max_file_size_mb : float, optional
+         Maximum allowed file size in megabytes

      Returns:
      --------
@@ -523,6 +560,7 @@
      max_area = metrics["max_area_ha"]
      mean_vertices = metrics["mean_vertices"]
      max_vertices_value = metrics["max_vertices"]
+     file_size_mb = metrics.get("file_size_mb")

      if polygon_count > max_polygon_count:
          violations.append(
@@ -549,41 +587,63 @@
              f"Max vertices ({max_vertices_value:,}) exceeds limit ({max_max_vertices:,})"
          )

+     if (
+         max_file_size_mb is not None
+         and file_size_mb is not None
+         and file_size_mb > max_file_size_mb
+     ):
+         violations.append(
+             f"File size ({file_size_mb:.2f} MB) exceeds limit ({max_file_size_mb:.2f} MB)"
+         )
+
      return violations


- def validate_geojson_constraints(
-     geojson_data: Path | str | dict,
+ def check_geojson_limits(
+     geojson_data: Path | str | dict = None,
+     analysis_results: dict = None,
      max_polygon_count=250_000,
-     max_mean_area_ha=10_000,
-     max_max_area_ha=None,
-     max_mean_vertices=None,
-     max_max_vertices=10_000,
+     max_mean_area_ha=50_000,
+     max_max_area_ha=50_000,
+     max_mean_vertices=50_000,
+     max_max_vertices=50_000,
+     max_file_size_mb=None,
+     allowed_crs=["EPSG:4326"],
      verbose=True,
  ):
      """
-     Validate GeoJSON data against defined constraints.
+     Check GeoJSON data against defined limits for processing readiness.

      Raises ValueError if any metrics exceed the specified limits.
      Uses analyze_geojson to compute metrics efficiently in a single sweep.

      Parameters:
      -----------
-     geojson_data : Path | str | dict
+     geojson_data : Path | str | dict, optional
          GeoJSON FeatureCollection to validate. Can be:
          - dict: GeoJSON FeatureCollection dictionary
          - str: Path to GeoJSON file as string
          - Path: pathlib.Path to GeoJSON file
+         Note: Cannot be used together with analysis_results
+     analysis_results : dict, optional
+         Pre-computed results from analyze_geojson(). Must contain keys:
+         'count', 'mean_area_ha', 'max_area_ha', 'mean_vertices', 'max_vertices'
+         Note: Cannot be used together with geojson_data
      max_polygon_count : int, optional
          Maximum allowed number of polygons (default: 250,000)
      max_mean_area_ha : float, optional
-         Maximum allowed mean area per polygon in hectares (default: 10,000)
+         Maximum allowed mean area per polygon in hectares (default: 50,000)
      max_max_area_ha : float, optional
-         Maximum allowed maximum area per polygon in hectares (default: None, no limit)
+         Maximum allowed maximum area per polygon in hectares (default: 50,000)
      max_mean_vertices : float, optional
-         Maximum allowed mean vertices per polygon (default: None, no limit)
+         Maximum allowed mean vertices per polygon (default: 50,000)
      max_max_vertices : int, optional
-         Maximum allowed vertices per polygon (default: 10,000)
+         Maximum allowed vertices per polygon (default: 50,000)
+     max_file_size_mb : float, optional
+         Maximum allowed file size in megabytes (default: None, no limit)
+     allowed_crs : list, optional
+         List of allowed coordinate reference systems (default: ["EPSG:4326"])
+         Set to None to skip CRS validation
      verbose : bool
          Print validation results (default: True)
@@ -603,22 +663,25 @@
      Raises:
      -------
      ValueError
-         If any constraint is violated
+         If any constraint is violated, or if both geojson_data and analysis_results are provided,
+         or if neither is provided
      """
-     from openforis_whisp.data_conversion import convert_geojson_to_ee
-     from shapely.geometry import Polygon as ShapelyPolygon
+     # Validate input parameters
+     if geojson_data is not None and analysis_results is not None:
+         raise ValueError(
+             "Cannot provide both 'geojson_data' and 'analysis_results'. "
+             "Please provide only one input source."
+         )

-     # Load GeoJSON from file if path provided
-     if isinstance(geojson_data, (str, Path)):
-         file_path = Path(geojson_data)
-         if not file_path.exists():
-             raise FileNotFoundError(f"GeoJSON file not found: {file_path}")
-         with open(file_path, "r") as f:
-             geojson_data = json.load(f)
+     if geojson_data is None and analysis_results is None:
+         raise ValueError(
+             "Must provide either 'geojson_data' or 'analysis_results'. "
+             "Both cannot be None."
+         )

      if verbose:
          print("\n" + "=" * 80)
-         print("GEOJSON CONSTRAINT VALIDATION")
+         print("GEOJSON LIMITS CHECK")
          print("=" * 80)
          print("\nConstraint Limits:")
          print(f" - Max polygon count: {max_polygon_count:,}")
@@ -629,90 +692,47 @@
          print(f" - Max mean vertices: {max_mean_vertices:,}")
          if max_max_vertices is not None:
              print(f" - Max vertices per polygon: {max_max_vertices:,}")
-
-     # Collect all metrics we need to compute
-     metrics_to_compute = [
-         "count",
-         "mean_area_ha",
-         "max_area_ha",
-         "mean_vertices",
-         "max_vertices",
-     ]
-
-     # Import analyze_geojson (will be available after function is defined elsewhere)
-     # For now, we'll compute it here efficiently in a single sweep
-     features = geojson_data.get("features", [])
-
-     # Single sweep computation
-     total_area = 0
-     total_vertices = 0
-     max_area = 0
-     max_vertices_value = 0
-     valid_polygons = 0
-
-     for feature in features:
-         try:
-             coords = feature["geometry"]["coordinates"]
-             geom_type = feature["geometry"]["type"]
-
-             if geom_type == "Polygon":
-                 # Count vertices
-                 feature_vertices = 0
-                 for ring in coords:
-                     feature_vertices += len(ring)
-                 total_vertices += feature_vertices
-                 max_vertices_value = max(max_vertices_value, feature_vertices)
-
-                 # Calculate area
-                 try:
-                     poly = ShapelyPolygon(coords[0])
-                     area_ha = abs(poly.area) * 1232100
-                     total_area += area_ha
-                     max_area = max(max_area, area_ha)
-                 except:
-                     pass
-                 valid_polygons += 1
-
-             elif geom_type == "MultiPolygon":
-                 # Count vertices
-                 feature_vertices = 0
-                 for polygon in coords:
-                     for ring in polygon:
-                         feature_vertices += len(ring)
-                 total_vertices += feature_vertices
-                 max_vertices_value = max(max_vertices_value, feature_vertices)
-
-                 # Calculate area
-                 try:
-                     for polygon in coords:
-                         poly = ShapelyPolygon(polygon[0])
-                         area_ha = abs(poly.area) * 1232100
-                         total_area += area_ha
-                         max_area = max(max_area, area_ha)
-                 except:
-                     pass
-                 valid_polygons += 1
-
-         except:
-             continue
-
-     # Compute means
-     polygon_count = len(features)
-     mean_area = total_area / valid_polygons if valid_polygons > 0 else 0
-     mean_vertices = total_vertices / valid_polygons if valid_polygons > 0 else 0
-
+         if max_file_size_mb is not None:
+             print(f" - Max file size (MB): {max_file_size_mb:.2f}")
+
+     # Get metrics either from analysis_results or by analyzing geojson_data
+     if analysis_results is not None:
+         # Use pre-computed analysis results
+         metrics = analysis_results
+     else:
+         # Use analyze_geojson to compute all required metrics in a single sweep
+         metrics_to_compute = [
+             "count",
+             "file_size_mb",
+             "mean_area_ha",
+             "max_area_ha",
+             "mean_vertices",
+             "max_vertices",
+         ]
+         # Add CRS if validation is requested
+         if allowed_crs is not None:
+             metrics_to_compute.append("crs")
+         metrics = analyze_geojson(geojson_data, metrics=metrics_to_compute)
+
+     # Build results dict with required keys
      results = {
-         "count": polygon_count,
-         "mean_area_ha": round(mean_area, 2),
-         "max_area_ha": round(max_area, 2),
-         "mean_vertices": round(mean_vertices, 2),
-         "max_vertices": max_vertices_value,
+         "count": metrics.get("count", 0),
+         "file_size_mb": metrics.get("file_size_mb"),
+         "mean_area_ha": metrics.get("mean_area_ha", 0),
+         "max_area_ha": metrics.get("max_area_ha", 0),
+         "mean_vertices": metrics.get("mean_vertices", 0),
+         "max_vertices": metrics.get("max_vertices", 0),
+         "crs": metrics.get("crs"),
          "valid": True,
      }

      if verbose:
          print("\nComputed Metrics:")
          print(f" - Polygon count: {results['count']:,}")
+         if results.get("file_size_mb") is not None:
+             print(f" - File size (MB): {results['file_size_mb']:,.2f}")
+         if results.get("crs") is not None:
+             print(f" - CRS: {results['crs']}")
          print(f" - Mean area (ha): {results['mean_area_ha']:,}")
          print(f" - Max area (ha): {results['max_area_ha']:,}")
          print(f" - Mean vertices: {results['mean_vertices']:,}")
@@ -726,34 +746,48 @@
          max_max_area_ha=max_max_area_ha,
          max_mean_vertices=max_mean_vertices,
          max_max_vertices=max_max_vertices,
+         max_file_size_mb=max_file_size_mb,
      )

+     # Check CRS if validation is requested
+     if allowed_crs is not None and results.get("crs"):
+         if results["crs"] not in allowed_crs:
+             violations.append(
+                 f"CRS '{results['crs']}' is not in allowed list: {allowed_crs}"
+             )
+
      # Report results
      if verbose:
          print("\n" + "=" * 80)
          if violations:
-             print("VALIDATION FAILED")
+             print("LIMITS CHECK FAILED")
              print("=" * 80)
              for violation in violations:
                  print(f"\n{violation}")
              results["valid"] = False
          else:
-             print("VALIDATION PASSED")
+             print("LIMITS CHECK PASSED")
              print("=" * 80)
              print("\nAll metrics within acceptable limits")

      # Raise error with detailed message if any constraint violated
      if violations:
-         error_message = "Constraint validation failed:\n" + "\n".join(violations)
+         error_message = "GeoJSON limits check failed:\n" + "\n".join(violations)
          raise ValueError(error_message)

      return results


+ # Backward compatibility aliases
+ screen_geojson = check_geojson_limits
+ validate_geojson_constraints = check_geojson_limits
+
+
  def suggest_processing_mode(
      feature_count,
      mean_area_ha=None,
      mean_vertices=None,
+     file_size_mb=None,
      feature_type="polygon",
      verbose=True,
  ):
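A minimal sketch of the two mutually exclusive ways to call check_geojson_limits after this change ("plots.geojson" is a hypothetical file):

    from openforis_whisp import analyze_geojson, check_geojson_limits

    # Option 1: analyze and check in one call.
    check_geojson_limits("plots.geojson", max_file_size_mb=10, verbose=False)

    # Option 2: reuse pre-computed metrics via the new analysis_results
    # parameter, avoiding a second pass over the file.
    info = analyze_geojson(
        "plots.geojson",
        metrics=["count", "file_size_mb", "mean_area_ha", "max_area_ha",
                 "mean_vertices", "max_vertices", "crs"],
    )
    check_geojson_limits(analysis_results=info, max_max_vertices=50_000)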
@@ -762,6 +796,9 @@

      Decision thresholds from comprehensive benchmark data (Nov 2025):

+     FILE SIZE:
+     - Files >= 10 MB: recommend sequential mode (avoids payload size limits)
+
      POINTS:
      - Break-even: 750-1000 features
      - Sequential faster: < 750 features
@@ -785,6 +822,8 @@
          Mean area per polygon in hectares (required for polygons, ignored for points)
      mean_vertices : float, optional
          Mean number of vertices per polygon (influences decision for complex geometries)
+     file_size_mb : float, optional
+         File size in megabytes (if >= 10 MB, recommends sequential mode)
      feature_type : str
          'polygon', 'multipolygon', or 'point' (default: 'polygon')
      verbose : bool
@@ -795,6 +834,14 @@
      str: 'concurrent' or 'sequential'
      """

+     # File size check: large files should use sequential mode
+     if file_size_mb is not None and file_size_mb >= 10:
+         if verbose:
+             print(f"\nMETHOD RECOMMENDATION (File Size Constraint)")
+             print(f" File size: {file_size_mb:.2f} MB (>= 10 MB threshold)")
+             print(f" Method: SEQUENTIAL (avoids payload size limits)")
+         return "sequential"
+
      # Points: simple threshold-based decision
      if feature_type == "point":
          breakeven = 750
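A minimal sketch of the file-size rule in isolation, using only parameters documented above; a 12 MB file trips the >= 10 MB threshold and returns 'sequential' regardless of feature count:

    from openforis_whisp import suggest_processing_mode

    mode = suggest_processing_mode(
        feature_count=5_000,
        mean_area_ha=25.0,
        file_size_mb=12.0,
        feature_type="polygon",
        verbose=False,
    )
    assert mode == "sequential"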
@@ -61,8 +61,9 @@ def g_esa_worldcover_trees_prep():

  # EUFO_2020
  def g_jrc_gfc_2020_prep():
-     jrc_gfc2020_raw = ee.ImageCollection("JRC/GFC2020/V2")
-     return jrc_gfc2020_raw.mosaic().rename("EUFO_2020").selfMask()
+     # JRC GFC2020 V3 is a single Image with band 'Map'
+     jrc_gfc2020 = ee.Image("JRC/GFC2020/V3").select("Map")
+     return jrc_gfc2020.rename("EUFO_2020").selfMask()


  # GFC_TC_2020
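A minimal sketch of what the V2 -> V3 switch means in Earth Engine terms (assuming an initialized earthengine-api session): V2 was an ImageCollection that had to be mosaicked, while V3 ships as a single Image whose band is named 'Map':

    import ee

    ee.Initialize()

    # Old (V2): a collection, so a mosaic was required before renaming.
    v2 = ee.ImageCollection("JRC/GFC2020/V2").mosaic()

    # New (V3): a single image; select the 'Map' band directly.
    v3 = ee.Image("JRC/GFC2020/V3").select("Map").rename("EUFO_2020").selfMask()

    print(v3.bandNames().getInfo())  # ['EUFO_2020']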
@@ -165,7 +165,6 @@ def whisp_formatted_stats_geojson_to_df(
      batch_size: int = 10,
      max_concurrent: int = 20,
      geometry_audit_trail: bool = False,
-     status_file: str = None,
  ) -> pd.DataFrame:
      """
      Main entry point for converting GeoJSON to Whisp statistics.
@@ -219,13 +218,6 @@

          Processing metadata stored in df.attrs['processing_metadata'].
          These columns enable full transparency for geometry modifications during processing.
-     status_file : str, optional
-         Path to JSON status file or directory for real-time progress tracking.
-         If a directory is provided, creates 'whisp_processing_status.json' in that directory.
-         Updates every 3 minutes and at progress milestones (5%, 10%, etc.).
-         Format: {"status": "processing", "progress": "450/1000", "percent": 45.0,
-         "elapsed_sec": 120, "eta_sec": 145, "updated_at": "2025-11-13T14:23:45"}
-         Most useful for large concurrent jobs. Works in both concurrent and sequential modes.

      Returns
      -------
@@ -315,7 +307,6 @@
              batch_size=batch_size,
              max_concurrent=max_concurrent,
              geometry_audit_trail=geometry_audit_trail,
-             status_file=status_file,
          )
      else:
          raise ValueError(
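With status_file removed, progress for the main entry point is surfaced through Python logging rather than a polled JSON file. A minimal sketch (hedged: the milestone messages come from the package's internal logger, so propagation to the root handler is assumed; "plots.geojson" is a hypothetical input and an authenticated Earth Engine session is required):

    import logging
    import ee
    from openforis_whisp.stats import whisp_formatted_stats_geojson_to_df

    ee.Initialize()
    logging.basicConfig(level=logging.INFO)  # surfaces "Progress: ..." milestones

    # Passing status_file= now raises TypeError in 3.0.0a7.
    df = whisp_formatted_stats_geojson_to_df("plots.geojson", batch_size=10)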