water-column-sonar-processing 0.0.13.tar.gz → 24.1.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/.pre-commit-config.yaml +10 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/PKG-INFO +24 -21
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/README.md +20 -18
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/pyproject.toml +9 -3
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/requirements.txt +4 -2
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/requirements_dev.txt +3 -1
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/s3fs_manager.py +1 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/cruise/create_empty_zarr_store.py +4 -5
- water_column_sonar_processing-24.1.1/water_column_sonar_processing/cruise/datatree_manager.py +24 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/cruise/resample_regrid.py +15 -20
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/geometry/__init__.py +2 -1
- water_column_sonar_processing-24.1.1/water_column_sonar_processing/geometry/elevation_manager.py +112 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/index/index_manager.py +92 -7
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/model/zarr_manager.py +14 -9
- water_column_sonar_processing-24.1.1/water_column_sonar_processing/processing/__init__.py +5 -0
- water_column_sonar_processing-24.1.1/water_column_sonar_processing/processing/batch_downloader.py +132 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/processing/raw_to_zarr.py +0 -2
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/utility/constants.py +3 -2
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing.egg-info/PKG-INFO +24 -21
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing.egg-info/SOURCES.txt +3 -2
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing.egg-info/requires.txt +2 -1
- water_column_sonar_processing-0.0.13/water_column_sonar_processing/cruise/experiment_datatree.py +0 -13
- water_column_sonar_processing-0.0.13/water_column_sonar_processing/processing/__init__.py +0 -4
- water_column_sonar_processing-0.0.13/water_column_sonar_processing/processing/cruise_sampler.py +0 -342
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/.env-test +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/.github/workflows/test_action.yaml +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/.gitignore +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/.python-version +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/LICENSE +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/open-science-data-federation/ml/autoencoder_example.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/open-science-data-federation/osdf_examples/foo.ipynb +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/open-science-data-federation/osdf_examples/sonar_ai.ipynb +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/pytest.ini +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/setup.cfg +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/tests/conftest.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/tests/test_process.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/tests/test_resources/index/calibrated_cruises.csv +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/tests/test_resources/raw_to_zarr/D20070724-T042400.bot +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/tests/test_resources/raw_to_zarr/D20070724-T042400.idx +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/tests/test_resources/raw_to_zarr/D20070724-T042400.raw +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/__init__.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/__init__.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/dynamodb_manager.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/s3_manager.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/sns_manager.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/sqs_manager.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/cruise/__init__.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/geometry/geometry_manager.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/geometry/geometry_simplification.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/geometry/pmtile_generation.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/index/__init__.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/model/__init__.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/process.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/utility/__init__.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/utility/cleaner.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/utility/pipeline_status.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/utility/timestamp.py +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing.egg-info/dependency_links.txt +0 -0
- {water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing.egg-info/top_level.txt +0 -0
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/.pre-commit-config.yaml
RENAMED

@@ -1,4 +1,5 @@
 repos:
+### Security Scan for AWS Secrets ###
   - repo: local
     hooks:
       - id: trufflehog
@@ -34,3 +35,12 @@ repos:
 #  - id: isort
 #    name: isort (python)
 #    args: ["--profile", "black", "--filter-files"]
+
+### Static Security Scan ###
+# To run manually you can do: "bandit -c pyproject.toml -r ."
+  - repo: https://github.com/PyCQA/bandit
+    rev: '1.8.0'
+    hooks:
+      - id: bandit
+        args: ["-c", "pyproject.toml"]
+        additional_dependencies: [ "bandit[toml]" ]

{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/PKG-INFO
RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.2
 Name: water_column_sonar_processing
-Version: 0.0.13
+Version: 24.1.1
 Summary: A processing tool for water column sonar data.
 Author-email: Rudy Klucik <rudy.klucik@noaa.gov>
 Project-URL: Homepage, https://github.com/CI-CMG/water-column-sonar-processing
@@ -8,7 +8,7 @@ Project-URL: Issues, https://github.com/CI-CMG/water-column-sonar-processing/iss
 Classifier: Programming Language :: Python :: 3
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Operating System :: OS Independent
-Requires-Python: >=3.10
+Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: aiobotocore==2.15.2
@@ -26,26 +26,19 @@ Requires-Dist: pandas==2.2.3
 Requires-Dist: pyarrow==18.1.0
 Requires-Dist: python-dotenv==1.0.1
 Requires-Dist: requests==2.32.3
-Requires-Dist: s3fs==
+Requires-Dist: s3fs==2024.2.0
 Requires-Dist: scipy==1.14.1
 Requires-Dist: setuptools
 Requires-Dist: shapely==2.0.3
 Requires-Dist: typing-extensions==4.10.0
 Requires-Dist: xarray==2024.10.0
+Requires-Dist: xbatcher==0.4.0
 Requires-Dist: zarr==2.18.3
 
 # Water Column Sonar Processing
 Processing tool for converting L0 data to L1 and L2 as well as generating geospatial information
 
-[badge image]
-
-[badge image]
-
-[badge image]
-
-[badge image]
-
-[badge image] [badge image]
+[badge image] [badge image] [badge image] [badge image] [badge image] [badge image]
 
 # Setting up the Python Environment
 > Python 3.10.12
@@ -103,12 +96,6 @@ or
 Following this tutorial:
 https://packaging.python.org/en/latest/tutorials/packaging-projects/
 
-# To Publish To PROD
-```commandline
-python -m build
-python -m twine upload --repository pypi dist/*
-```
-
 # Pre Commit Hook
 see here for installation: https://pre-commit.com/
 https://dev.to/rafaelherik/using-trufflehog-and-pre-commit-hook-to-prevent-secret-exposure-edo
@@ -133,13 +120,29 @@ https://colab.research.google.com/drive/1KiLMueXiz9WVB9o4RuzYeGjNZ6PsZU7a#scroll
 # Tag a Release
 Step 1 --> increment the semantic version in the zarr_manager.py "metadata" & the "pyproject.toml"
 ```commandline
-git tag "
+git tag -a v24.01.01 -m "Releasing version v24.01.01"
 ```
 
 ```commandline
 git push origin --tags
 ```
 
+# To Publish To PROD
+```commandline
+python -m build
+python -m twine upload --repository pypi dist/*
+```
+
 # TODO:
 add https://pypi.org/project/setuptools-scm/
 for extracting the version
+
+# Security scanning
+> bandit -r water_column_sonar_processing/
+
+# Data Debugging
+Experimental Plotting in Xarray (hvPlot):
+https://colab.research.google.com/drive/18vrI9LAip4xRGEX6EvnuVFp35RAiVYwU#scrollTo=q9_j9p2yXsLV
+
+HB0707 Cruise zoomable:
+https://hb0707.s3.us-east-1.amazonaws.com/index.html

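A note on the two version strings in this diff: pyproject.toml (below) writes `version = "24.01.01"`, while PKG-INFO reports `Version: 24.1.1`. That is PEP 440 normalization (leading zeros in release segments are dropped), not a mismatch. A minimal sketch for checking the installed version:

```python
# Requires the package to be pip-installed; PEP 440 normalizes
# pyproject.toml's "24.01.01" to the "24.1.1" reported here.
from importlib.metadata import version

print(version("water_column_sonar_processing"))  # -> "24.1.1"
```
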
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/README.md
RENAMED

@@ -1,15 +1,7 @@
 # Water Column Sonar Processing
 Processing tool for converting L0 data to L1 and L2 as well as generating geospatial information
 
-[badge image]
-
-[badge image]
-
-[badge image]
-
-[badge image]
-
-[badge image] [badge image]
+[badge image] [badge image] [badge image] [badge image] [badge image] [badge image]
 
 # Setting up the Python Environment
 > Python 3.10.12
@@ -67,12 +59,6 @@ or
 Following this tutorial:
 https://packaging.python.org/en/latest/tutorials/packaging-projects/
 
-# To Publish To PROD
-```commandline
-python -m build
-python -m twine upload --repository pypi dist/*
-```
-
 # Pre Commit Hook
 see here for installation: https://pre-commit.com/
 https://dev.to/rafaelherik/using-trufflehog-and-pre-commit-hook-to-prevent-secret-exposure-edo
@@ -97,13 +83,29 @@ https://colab.research.google.com/drive/1KiLMueXiz9WVB9o4RuzYeGjNZ6PsZU7a#scroll
 # Tag a Release
 Step 1 --> increment the semantic version in the zarr_manager.py "metadata" & the "pyproject.toml"
 ```commandline
-git tag "
+git tag -a v24.01.01 -m "Releasing version v24.01.01"
 ```
 
 ```commandline
 git push origin --tags
 ```
 
+# To Publish To PROD
+```commandline
+python -m build
+python -m twine upload --repository pypi dist/*
+```
+
 # TODO:
 add https://pypi.org/project/setuptools-scm/
-for extracting the version
+for extracting the version
+
+# Security scanning
+> bandit -r water_column_sonar_processing/
+
+# Data Debugging
+Experimental Plotting in Xarray (hvPlot):
+https://colab.research.google.com/drive/18vrI9LAip4xRGEX6EvnuVFp35RAiVYwU#scrollTo=q9_j9p2yXsLV
+
+HB0707 Cruise zoomable:
+https://hb0707.s3.us-east-1.amazonaws.com/index.html

{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/pyproject.toml
RENAMED

@@ -8,13 +8,14 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "water_column_sonar_processing"
-version = "0.0.13"
+version = "24.01.01"
 authors = [
   { name="Rudy Klucik", email="rudy.klucik@noaa.gov" },
 ]
 description = "A processing tool for water column sonar data."
 readme = "README.md"
-requires-python = ">=3.10"
+#requires-python = ">=3.10"
+requires-python = ">=3.8"
 classifiers = [
     "Programming Language :: Python :: 3",
     "License :: OSI Approved :: MIT License",
@@ -34,4 +35,9 @@ optional-dependencies = {dev = { file = ["requirements_dev.txt"] }}
 #fallback_version = "unknown"
 #local_scheme = "node-and-date"
 #write_to = "_water_column_sonar_processing_version.py"
-#write_to_template = 'version = "{version}"'
+#write_to_template = 'version = "{version}"'
+
+[tool.bandit]
+exclude_dirs = ["tests"]
+[tool.pre-commit-hooks.bandit]
+exclude = ["*/tests/*"]

{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/requirements.txt
RENAMED

@@ -19,12 +19,14 @@ pyarrow==18.1.0
 python-dotenv==1.0.1
 requests==2.32.3
 #s3fs==2024.3.1
-#s3fs==2024.
-s3fs==
+#s3fs==2024.3.0 # does not work
+s3fs==2024.2.0 # works ...something between 2024.2 and 2024.3 creates the problem
 scipy==1.14.1
 #setuptools==75.6.0
 setuptools
 shapely==2.0.3
 typing-extensions==4.10.0
 xarray==2024.10.0
+# xbatcher[tensorflow]
+xbatcher==0.4.0
 zarr==2.18.3

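The new `xbatcher==0.4.0` pin supports slicing xarray data into fixed-size batches (the commented `# xbatcher[tensorflow]` hints at ML training use). A minimal sketch of the `BatchGenerator` API, with illustrative dimension names and sizes that are not taken from this package:

```python
import numpy as np
import xarray as xr
import xbatcher

# Toy Sv-like dataset (dimension names are illustrative)
ds = xr.Dataset(
    {"Sv": (("depth", "ping_time"), np.random.rand(100, 1000).astype("float32"))}
)

# Yield batches of 100 depth samples x 64 pings
bgen = xbatcher.BatchGenerator(ds, input_dims={"depth": 100, "ping_time": 64})
for batch in bgen:
    print(batch["Sv"].shape)  # (100, 64)
    break
```
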
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/aws/s3fs_manager.py
RENAMED

@@ -16,6 +16,7 @@ class S3FSManager:
         # self.output_bucket_name = os.environ.get("OUTPUT_BUCKET_NAME")
         self.s3_region = os.environ.get("AWS_REGION", default="us-east-1")
         self.s3fs = s3fs.S3FileSystem(
+            asynchronous=False,
             endpoint_url=endpoint_url,
             key=os.environ.get("OUTPUT_BUCKET_ACCESS_KEY"),
             secret=os.environ.get("OUTPUT_BUCKET_SECRET_ACCESS_KEY"),

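`asynchronous=False` is already the s3fs default; making it explicit pins the filesystem to the blocking API, where methods like `ls` and `get` are called directly with no asyncio event loop (with `asynchronous=True` the caller must await the async methods inside a running loop). A minimal sketch; the bucket name is hypothetical:

```python
import s3fs

# Blocking filesystem: callable from plain synchronous code.
# anon=True is for public-read buckets (example bucket below is hypothetical).
fs = s3fs.S3FileSystem(asynchronous=False, anon=True)
print(fs.ls("noaa-wcsd-pds"))
```
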
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/cruise/create_empty_zarr_store.py
RENAMED

@@ -1,4 +1,5 @@
 import os
+import tempfile
 
 import numcodecs
 import numpy as np
@@ -11,7 +12,6 @@ from water_column_sonar_processing.utility import Cleaner
 numcodecs.blosc.use_threads = False
 numcodecs.blosc.set_nthreads(1)
 
-# TEMPDIR = "/tmp"
 # TODO: when ready switch to version 3 of model spec
 # ZARR_V3_EXPERIMENTAL_API = 1
 # creates the latlon data: foo = ep.consolidate.add_location(ds_Sv, echodata)
@@ -61,7 +61,6 @@ class CreateEmptyZarrStore:
     # TODO: move to common place
 
     #######################################################
-    # @classmethod
     def create_cruise_level_zarr_store(
         self,
         output_bucket_name: str,
@@ -69,8 +68,8 @@ class CreateEmptyZarrStore:
         cruise_name: str,
         sensor_name: str,
         table_name: str,
-        tempdir: str,
     ) -> None:
+        tempdir = tempfile.TemporaryDirectory()
         try:
             # HB0806 - 123, HB0903 - 220
             dynamo_db_manager = DynamoDBManager()
@@ -146,7 +145,7 @@ class CreateEmptyZarrStore:
             print(f"new_height: {new_height}")
 
             zarr_manager.create_zarr_store(
-                path=tempdir,
+                path=tempdir.name,  # TODO: need to use .name or problem
                 ship_name=ship_name,
                 cruise_name=cruise_name,
                 sensor_name=sensor_name,
@@ -159,7 +158,7 @@ class CreateEmptyZarrStore:
             #################################################################
             self.upload_zarr_store_to_s3(
                 output_bucket_name=output_bucket_name,
-                local_directory=tempdir,
+                local_directory=tempdir.name,  # TODO: need to use .name or problem
                 object_prefix=zarr_prefix,
                 cruise_name=cruise_name,
             )

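The `tempdir: str` parameter is replaced by a `tempfile.TemporaryDirectory()` created inside the method, and the TODO comments flag the easy mistake: the object itself is not a path; the path string lives on `.name`. A standard-library-only sketch of the distinction:

```python
import tempfile

tmp = tempfile.TemporaryDirectory()
print(type(tmp))  # tempfile.TemporaryDirectory -- not a path string
print(tmp.name)   # e.g. '/tmp/tmpab12cd34' -- what zarr/S3 upload code needs
tmp.cleanup()     # removes the directory and its contents

# Equivalent, with automatic cleanup:
with tempfile.TemporaryDirectory() as path:  # `path` is already the string
    print(path)
```
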
water_column_sonar_processing-24.1.1/water_column_sonar_processing/cruise/datatree_manager.py
ADDED

@@ -0,0 +1,24 @@
+### https://xarray-datatree.readthedocs.io/en/latest/data-structures.html
+import numpy as np
+from datatree import DataTree
+import xarray as xr
+
+class DatatreeManager:
+    #######################################################
+    def __init__(
+        self,
+    ):
+        self.dtype = "float32"
+
+    #################################################################
+    def create_datatree(
+        self,
+        input_ds,
+    ) -> None:
+        ds1 = xr.Dataset({"foo": "orange"})
+        dt = DataTree(name="root", data=ds1)  # create root node
+        ds2 = xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})
+        return dt
+
+
+

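As written, datatree_manager.py is a stub: `create_datatree` ignores `input_ds`, never attaches `ds2`, and returns `dt` despite the `-> None` annotation. For orientation, a minimal sketch of the node API from the xarray-datatree docs linked at the top of the file; the child node shown here is an assumption about intended use, not code from this package:

```python
import xarray as xr
from datatree import DataTree

root_ds = xr.Dataset({"foo": "orange"})
dt = DataTree(name="root", data=root_ds)  # root node

child_ds = xr.Dataset({"bar": 0}, coords={"y": ("y", [0, 1, 2])})
DataTree(name="child", parent=dt, data=child_ds)  # attach a child under root

print(dt)  # renders the tree: root -> child
```
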
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/cruise/resample_regrid.py
RENAMED

@@ -281,12 +281,7 @@ class ResampleRegrid:
             print(f"start_ping_time_index: {start_ping_time_index}, end_ping_time_index: {end_ping_time_index}")
             #########################################################################
             # write Sv values to cruise-level-model-store
-            for channel in range(
-                len(input_xr.channel.values)
-            ):  # does not like being written in one fell swoop :(
-                output_zarr_store.Sv[
-                    :, start_ping_time_index:end_ping_time_index, channel
-                ] = regrid_resample[:, :, channel]
+            output_zarr_store.Sv[:, start_ping_time_index:end_ping_time_index, :] = regrid_resample.values
 
             #########################################################################
             # [5] write subset of latitude/longitude
@@ -300,27 +295,27 @@ class ResampleRegrid:
             #########################################################################
             # TODO: add the "detected_seafloor_depth/" to the
             # L2 cruise dataarrays
-            # TODO: make bottom optional
+            # TODO: make bottom optional
             # TODO: Only checking the first channel for now. Need to average across all channels
             # in the future. See https://github.com/CI-CMG/water-column-sonar-processing/issues/11
-
-
-
-
-
-
-
-
-
-
-
+            if 'detected_seafloor_depth' in input_xr.variables:
+                print('Found detected_seafloor_depth, adding data to output store.')
+                detected_seafloor_depth = input_xr.detected_seafloor_depth.values
+                detected_seafloor_depth[detected_seafloor_depth == 0.] = np.nan
+                # TODO: problem here: Processing file: D20070711-T210709.
+                detected_seafloor_depths = np.nanmean(detected_seafloor_depth, 0)  # RuntimeWarning: Mean of empty slice
+                detected_seafloor_depths[detected_seafloor_depths == 0.] = np.nan
+                print(f"min depth measured: {np.nanmin(detected_seafloor_depths)}")
+                print(f"max depth measured: {np.nanmax(detected_seafloor_depths)}")
+                #available_indices = np.argwhere(np.isnan(geospatial['latitude'].values))
+                output_zarr_store.bottom[
+                    start_ping_time_index:end_ping_time_index
+                ] = detected_seafloor_depths
             #########################################################################
             #########################################################################
         except Exception as err:
             print(f"Problem interpolating the data: {err}")
             raise err
-        # else:
-        #     pass
         finally:
             print("Done interpolating data.")
             # TODO: read across times and verify data was written?

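The first hunk collapses the per-channel write loop into a single slab assignment (the deleted comment suggests the loop was a workaround, plausibly related to the s3fs pin in requirements.txt). A toy sketch of the two patterns against an in-memory zarr array, with illustrative shapes:

```python
import numpy as np
import zarr

# depth x ping_time x channel, chunked one channel at a time
sv = zarr.zeros((10, 1000, 4), chunks=(10, 100, 1), dtype="float32")
block = np.random.rand(10, 20, 4).astype("float32")

# old pattern: one channel per write
for channel in range(block.shape[2]):
    sv[:, 40:60, channel] = block[:, :, channel]

# new pattern: one write across all channels
sv[:, 40:60, :] = block
```
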
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/geometry/__init__.py
RENAMED

@@ -1,5 +1,6 @@
+from .elevation_manager import ElevationManager
 from .geometry_manager import GeometryManager
 from .geometry_simplification import GeometrySimplification
 from .pmtile_generation import PMTileGeneration
 
-__all__ = ["GeometryManager", "GeometrySimplification", "PMTileGeneration"]
+__all__ = ["ElevationManager", "GeometryManager", "GeometrySimplification", "PMTileGeneration"]

water_column_sonar_processing-24.1.1/water_column_sonar_processing/geometry/elevation_manager.py
ADDED

@@ -0,0 +1,112 @@
+"""
+https://gis.ngdc.noaa.gov/arcgis/rest/services/DEM_mosaics/DEM_global_mosaic/ImageServer/identify?geometry=-31.70235%2C13.03332&geometryType=esriGeometryPoint&returnGeometry=false&returnCatalogItems=false&f=json
+
+https://gis.ngdc.noaa.gov/arcgis/rest/services/DEM_mosaics/DEM_global_mosaic/ImageServer/
+identify?
+geometry=-31.70235%2C13.03332
+&geometryType=esriGeometryPoint
+&returnGeometry=false
+&returnCatalogItems=false
+&f=json
+{"objectId":0,"name":"Pixel","value":"-5733","location":{"x":-31.702349999999999,"y":13.03332,"spatialReference":{"wkid":4326,"latestWkid":4326}},"properties":null,"catalogItems":null,"catalogItemVisibilities":[]}
+-5733
+
+(base) rudy:deleteME rudy$ curl https://api.opentopodata.org/v1/gebco2020?locations=13.03332,-31.70235
+{
+    "results": [
+        {
+            "dataset": "gebco2020",
+            "elevation": -5729.0,
+            "location": {
+                "lat": 13.03332,
+                "lng": -31.70235
+            }
+        }
+    ],
+    "status": "OK"
+}
+"""
+import json
+import time
+
+import requests
+from collections.abc import Generator
+
+def chunked(
+    ll: list,
+    n: int
+) -> Generator:
+    # Yields successively n-sized chunks from ll.
+    for i in range(0, len(ll), n):
+        yield ll[i : i + n]
+
+
+class ElevationManager:
+    #######################################################
+    def __init__(
+        self,
+    ):
+        self.DECIMAL_PRECISION = 5  # precision for GPS coordinates
+        self.TIMOUT_SECONDS = 10
+
+    #######################################################
+    def get_arcgis_elevation(
+        self,
+        lngs: list,
+        lats: list,
+        chunk_size: int=500,  # I think this is the api limit
+    ) -> int:
+        # Reference: https://developers.arcgis.com/rest/services-reference/enterprise/map-to-image/
+        # Info: https://www.arcgis.com/home/item.html?id=c876e3c96a8642ab8557646a3b4fa0ff
+        ### 'https://gis.ngdc.noaa.gov/arcgis/rest/services/DEM_mosaics/DEM_global_mosaic/ImageServer/identify?geometry={"points":[[-31.70235,13.03332],[-32.70235,14.03332]]}&geometryType=esriGeometryMultipoint&returnGeometry=false&returnCatalogItems=false&f=json'
+        if len(lngs) != len(lats):
+            raise ValueError("lngs and lats must have same length")
+
+        geometryType = "esriGeometryMultipoint"  # TODO: allow single point?
+
+        depths = []
+
+        list_of_points = [list(elem) for elem in list(zip(lngs, lats))]
+        for chunk in chunked(list_of_points, chunk_size):
+            time.sleep(0.1)
+            # order: (lng, lat)
+            geometry = f'{{"points":{str(chunk)}}}'
+            url=f'https://gis.ngdc.noaa.gov/arcgis/rest/services/DEM_mosaics/DEM_global_mosaic/ImageServer/identify?geometry={geometry}&geometryType={geometryType}&returnGeometry=false&returnCatalogItems=false&f=json'
+            result = requests.get(url, timeout=self.TIMOUT_SECONDS)
+            res = json.loads(result.content.decode('utf8'))
+            if 'results' in res:
+                for element in res['results']:
+                    depths.append(float(element['value']))
+            elif 'value' in res:
+                depths.append(float(res['value']))
+
+        return depths
+
+    # def get_gebco_bathymetry_elevation(self) -> int:
+    #     # Documentation: https://www.opentopodata.org/datasets/gebco2020/
+    #     latitude = 13.03332
+    #     longitude = -31.70235
+    #     dataset = "gebco2020"
+    #     url = f"https://api.opentopodata.org/v1/{dataset}?locations={latitude},{longitude}"
+    #     pass
+
+    # def get_elevation(
+    #     self,
+    #     df,
+    #     lat_column,
+    #     lon_column,
+    # ) -> int:
+    #     """Query service using lat, lon. add the elevation values as a new column."""
+    #     url = r'https://epqs.nationalmap.gov/v1/json?'
+    #     elevations = []
+    #     for lat, lon in zip(df[lat_column], df[lon_column]):
+    #         # define rest query params
+    #         params = {
+    #             'output': 'json',
+    #             'x': lon,
+    #             'y': lat,
+    #             'units': 'Meters'
+    #         }
+    #         result = requests.get((url + urllib.parse.urlencode(params)))
+    #         elevations.append(result.json()['value'])
+    #     return elevations

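A hypothetical call to the new `ElevationManager`, using the coordinates from the module docstring (requires network access to the NOAA NCEI ArcGIS service). Note the `-> int` annotation is inaccurate; the method builds and returns a list of floats:

```python
from water_column_sonar_processing.geometry import ElevationManager

em = ElevationManager()
# Longitude/latitude order matches the ArcGIS multipoint geometry
depths = em.get_arcgis_elevation(lngs=[-31.70235], lats=[13.03332])
print(depths)  # e.g. [-5733.0], per the sample response in the docstring
```
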
{water_column_sonar_processing-0.0.13 → water_column_sonar_processing-24.1.1}/water_column_sonar_processing/index/index_manager.py
RENAMED

@@ -7,13 +7,20 @@ from concurrent.futures import as_completed
 from water_column_sonar_processing.aws import S3Manager
 
 
+MAX_POOL_CONNECTIONS = 64
+MAX_CONCURRENCY = 64
+MAX_WORKERS = 64
+GB = 1024**3
+
+
 class IndexManager:
+    # TODO: index into dynamodb instead of csv files
 
     def __init__(self, input_bucket_name, calibration_bucket, calibration_key):
         self.input_bucket_name = input_bucket_name
         self.calibration_bucket = calibration_bucket
         self.calibration_key = calibration_key
-        self.s3_manager = S3Manager()
+        self.s3_manager = S3Manager()  # TODO: make anonymous?
 
     #################################################################
     def list_ships(
@@ -50,6 +57,9 @@ class IndexManager:
         self,
         cruise_prefixes,
     ):
+        """
+        This returns a list of ek60 prefixed cruises.
+        """
         cruise_sensors = []  # includes all sensor types
         for cruise_prefix in cruise_prefixes:
             page_iterator = self.s3_manager.paginator.paginate(
@@ -67,9 +77,12 @@ class IndexManager:
         cruise_name,
         sensor_name,
     ):
+        # Gets all raw files for a cruise under the given prefix
         prefix = f"data/raw/{ship_name}/{cruise_name}/{sensor_name}/"  # Note no forward slash at beginning
         page_iterator = self.s3_manager.paginator.paginate(
-            Bucket=self.input_bucket_name,
+            Bucket=self.input_bucket_name,
+            Prefix=prefix,
+            Delimiter="/"
         )
         all_files = []
         for page in page_iterator:
@@ -77,6 +90,57 @@ class IndexManager:
             all_files.extend([i["Key"] for i in page["Contents"]])
         return [i for i in all_files if i.endswith(".raw")]
 
+    def get_first_raw_file(
+        self,
+        ship_name,
+        cruise_name,
+        sensor_name,
+    ):
+        # Same as above but only needs to get the first raw file
+        # because we are only interested in the first datagram of one file
+        prefix = f"data/raw/{ship_name}/{cruise_name}/{sensor_name}/"  # Note no forward slash at beginning
+        # page_iterator = self.s3_manager.paginator.paginate(
+        #     Bucket=self.input_bucket_name,
+        #     Prefix=prefix,
+        #     Delimiter="/",
+        #     PaginationConfig={ 'MaxItems': 5 }
+        # )  # TODO: this can create a problem if there is a non raw file returned first
+        ### filter with JMESPath expressions ###
+        page_iterator = self.s3_manager.paginator.paginate(
+            Bucket=self.input_bucket_name,
+            Prefix=prefix,
+            Delimiter="/",
+        )
+        # page_iterator = page_iterator.search("Contents[?Size < `2200`][]")
+        page_iterator = page_iterator.search(expression="Contents[?contains(Key, '.raw')] ")
+        for res in page_iterator:
+            if "Key" in res:
+                return res["Key"]
+        # else raise exception?
+
+        # DSJ0604-D20060406-T050022.bot 2kB == 2152 'Size'
+
+    def get_files_under_size(
+        self,
+        ship_name,
+        cruise_name,
+        sensor_name,
+    ):
+        # THIS isn't used, just playing with JMES paths spec
+        prefix = f"data/raw/{ship_name}/{cruise_name}/{sensor_name}/"
+        ### filter with JMESPath expressions ###
+        page_iterator = self.s3_manager.paginator.paginate(
+            Bucket=self.input_bucket_name,
+            Prefix=prefix,
+            Delimiter="/",
+        )
+        page_iterator = page_iterator.search("Contents[?Size < `2200`][]")
+        all_files = []
+        for page in page_iterator:
+            if "Contents" in page.keys():
+                all_files.extend([i["Key"] for i in page["Contents"]])
+        return [i for i in all_files if i.endswith(".raw")]
+
     #################################################################
     def get_raw_files_csv(
         self,
@@ -102,6 +166,29 @@ class IndexManager:
         df.to_csv(f"{ship_name}_{cruise_name}.csv", index=False, header=False, sep=" ")
         print("done")
 
+    def get_raw_files_list(
+        self,
+        ship_name,
+        cruise_name,
+        sensor_name,
+    ):
+        # gets all raw files in cruise and returns a list of dicts
+        raw_files = self.get_raw_files(
+            ship_name=ship_name,
+            cruise_name=cruise_name,
+            sensor_name=sensor_name
+        )
+        files_list = [
+            {
+                "ship_name": ship_name,
+                "cruise_name": cruise_name,
+                "sensor_name": sensor_name,
+                "file_name": os.path.basename(raw_file),
+            }
+            for raw_file in raw_files
+        ]
+        return files_list
+
     #################################################################
     def get_subset_ek60_prefix(  # TODO: is this used?
         self,
@@ -169,16 +256,14 @@ class IndexManager:
         return first_datagram
 
     #################################################################
-    def get_subset_datagrams(
+    def get_subset_datagrams(  # TODO: is this getting used
         self,
         df: pd.DataFrame
     ) -> list:
         print("getting subset of datagrams")
-        select_keys = list(
-            df[["KEY", "CRUISE"]].drop_duplicates(subset="CRUISE")["KEY"].values
-        )
+        select_keys = df[["KEY", "CRUISE"]].drop_duplicates(subset="CRUISE")["KEY"].values.tolist()
         all_datagrams = []
-        with ThreadPoolExecutor(max_workers=
+        with ThreadPoolExecutor(max_workers=MAX_POOL_CONNECTIONS) as executor:
             futures = [
                 executor.submit(self.scan_datagram, select_key)
                 for select_key in select_keys