pyseter-0.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pyseter-0.1.0/LICENSE +21 -0
- pyseter-0.1.0/PKG-INFO +144 -0
- pyseter-0.1.0/README.md +115 -0
- pyseter-0.1.0/pyproject.toml +41 -0
- pyseter-0.1.0/pyseter/__init__.py +12 -0
- pyseter-0.1.0/pyseter/extract.py +419 -0
- pyseter-0.1.0/pyseter/grade.py +34 -0
- pyseter-0.1.0/pyseter/sort.py +282 -0
- pyseter-0.1.0/pyseter.egg-info/PKG-INFO +144 -0
- pyseter-0.1.0/pyseter.egg-info/SOURCES.txt +12 -0
- pyseter-0.1.0/pyseter.egg-info/dependency_links.txt +1 -0
- pyseter-0.1.0/pyseter.egg-info/requires.txt +7 -0
- pyseter-0.1.0/pyseter.egg-info/top_level.txt +1 -0
- pyseter-0.1.0/setup.cfg +4 -0
pyseter-0.1.0/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Philip Patton

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
pyseter-0.1.0/PKG-INFO
ADDED
@@ -0,0 +1,144 @@
Metadata-Version: 2.4
Name: pyseter
Version: 0.1.0
Summary: Sort images by an automatically generated ID before photo-ID
Author-email: "Philip T. Patton" <philtpatton@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/philpatton/pyseter
Project-URL: Repository, https://github.com/philpatton/pyseter
Project-URL: Bug Tracker, https://github.com/philpatton/pyseter/issues
Keywords: individual identification,pytorch,marine mammal
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scikit-learn>=1.1.0
Requires-Dist: networkx>=2.6.0
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: timm>=0.6.0
Dynamic: license-file
(The long description that follows in this file is the package README, identical to pyseter-0.1.0/README.md below.)
pyseter-0.1.0/README.md
ADDED
@@ -0,0 +1,115 @@
# Pyseter

A Python package that sorts images by an automatically generated ID before photo-identification.

## Installation

### New to Python?

While most biologists use R, we chose to release Pyseter as a Python package because it relies heavily on PyTorch, a deep learning library. If you're new to Python, please follow these steps to get started with Python and conda.

#### Step 1: Install conda

Conda is an important tool for managing packages in Python. Unlike Python, R (for the most part) handles packages for you behind the scenes. Python requires a more hands-on approach.

- Download and install [Miniforge](https://conda-forge.org/download/) (a form of conda)

After installing, you can verify your installation by opening the **command line interface** (CLI), which will depend on your operating system. Are you on Windows? Open the "Miniforge Prompt" in your Start menu. Are you on Mac? Open the Terminal application. Then, type the following command into the CLI and hit return.

```bash
conda --version
```

You should see something like `conda 25.5.1`. Of course, Anaconda, Miniconda, Mamba, or any other form of conda will work too.

#### Step 2: Create a new environment

Next, you'll create an environment for the package to live in. Environments are walled-off areas where we can install packages. This allows you to have multiple versions of the same package installed on your machine, which can help prevent conflicts.

Enter the following two commands into the CLI:

```bash
conda create -n pyseter_env
conda activate pyseter_env
```

Here, I name (hence the `-n`) the environment `pyseter_env`, but you can call it anything you like!

Now your environment is ready to go! Try installing your first package, pip. Pip is another way of installing Python packages, and will be helpful for installing PyTorch and pyseter (see below). To do so, enter the following command into the CLI.

```bash
conda install pip -y
```

#### Step 3: Install PyTorch

Installing PyTorch will allow users to extract features from images, i.e., identify individuals in images. This will be fast for users with an NVIDIA GPU or a Mac with Apple Silicon and at least 16 GB of memory. **For all other users, extracting features from images will be extremely slow.**

PyTorch installation can be a little finicky. I recommend following [these instructions](https://pytorch.org/get-started/locally/). Below is an example for Windows users. If you haven't already, activate your environment before installing.

```bash
conda activate pyseter_env
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
```

PyTorch is pretty big (over a gigabyte), so this may take a few minutes.

#### Step 4: Install pyseter

Now, install pyseter. If you haven't already, activate your environment before installing.

```bash
conda activate pyseter_env
pip3 install pyseter
```

Now you're ready to go! You can verify your pyseter installation by opening Python in the CLI (assuming your environment is still activated).

```bash
python
```

Then, run the following Python commands.

```python
import pyseter
pyseter.verify_pytorch()
quit()
```

If successful, you should see a message like this.

```
✓ PyTorch 2.7.0 detected
✓ Apple Silicon (MPS) GPU available
```

#### Step 5: AnyDorsal weights

Pyseter relies on the [AnyDorsal algorithm](https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.14167) to extract features from images. Please download the weights and place them anywhere you like. You'll reference the file location later when using the `FeatureExtractor`, as sketched below.
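Once you have the weights, feature extraction looks roughly like the following sketch. The paths and batch size here are placeholders, not real files; point `model_path` at wherever you saved the checkpoint, and adjust `batch_size` to your GPU memory.

```python
from pyseter.extract import FeatureExtractor

# paths and batch_size are illustrative placeholders
extractor = FeatureExtractor(batch_size=8, model_path='anydorsal_weights.pth')

# returns a dictionary mapping each filename to its feature vector
features = extractor.extract('path/to/images')
```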
## Jupyter

There are several different ways to interact with Python. The most common way for data analysts is through a *Jupyter Notebook*, which is similar to an RMarkdown document or a Quarto document.

Just to make things confusing, there are several ways to open Jupyter Notebooks. Personally, I think the easiest way is through [VS Code](https://code.visualstudio.com/download). VS Code is an IDE (like RStudio) for editing code of all languages, and it has great support for [Jupyter notebooks](https://code.visualstudio.com/docs/datascience/jupyter-notebooks). Alternatively, [Positron](https://positron.posit.co) is a VS-Code-based editor developed by Posit, the team behind RStudio.

You can also try [Jupyter Lab](https://docs.jupyter.org/en/latest/). To do so, [install Jupyter](https://jupyter.org/install) via the command line (see below). I also recommend installing the [ipykernel](https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernels-for-different-environments), which helps you select the right conda environment in Jupyter Lab.

```bash
conda activate pyseter_env
conda install jupyter ipykernel -y
python -m ipykernel install --user --name pyseter --display-name "Python (pyseter)"
```

Note that you only need to activate `pyseter_env` when you open a new command line (i.e., terminal or Miniforge Prompt). Then you can open Jupyter Lab with the following command:

```bash
jupyter lab
```

## Getting Started

To get started with pyseter, please check out the "General Overview" [notebook](https://github.com/philpatton/pyseter/blob/main/examples/general-overview.ipynb) in the examples folder of this repository! A minimal sketch of the workflow follows.
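In case it helps to see the shape of the full pipeline before opening the notebook, here is a minimal sketch built from the package's own functions (`FeatureExtractor`, `NetworkCluster`, and `sort_images`). The paths, the match threshold, and the single hard-coded encounter label are placeholder assumptions; in practice the encounter labels come from the `encounter_info.csv` written by `prep_images`, and the notebook remains the authoritative walkthrough.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

from pyseter.extract import FeatureExtractor
from pyseter.sort import NetworkCluster, sort_images

# extract one feature vector per image (placeholder paths)
extractor = FeatureExtractor(batch_size=8, model_path='anydorsal_weights.pth')
feature_dict = extractor.extract('all_images')
images = list(feature_dict.keys())
features = np.array(list(feature_dict.values()))

# propose identities by clustering the pairwise cosine similarities
similarity = cosine_similarity(features)
results = NetworkCluster(match_threshold=0.6).cluster_images(similarity)

# copy images into one folder per proposed ID; 'encounter-01' is a stand-in
# for the real encounter labels saved by prep_images()
id_df = pd.DataFrame({'image': images,
                      'proposed_id': results.cluster_idx,
                      'encounter': 'encounter-01'})
sort_images(id_df, all_image_dir='all_images', output_dir='sorted')
```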
pyseter-0.1.0/pyproject.toml
ADDED
@@ -0,0 +1,41 @@
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "pyseter"
version = "0.1.0"
description = "Sort images by an automatically generated ID before photo-ID"
readme = "README.md"
authors = [{name = "Philip T. Patton", email = "philtpatton@gmail.com"}]
license = {text = "MIT"}
classifiers = [
    "Development Status :: 3 - Alpha",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
]
keywords = ["individual identification", "pytorch", "marine mammal"]
requires-python = ">=3.8"
dependencies = [
    "numpy>=1.21.0",
    "pandas>=1.3.0",
    "scikit-learn>=1.1.0",
    "networkx>=2.6.0",
    "matplotlib>=3.5.0",
    "tqdm>=4.60.0",
    "timm>=0.6.0"
]

[project.urls]
Homepage = "https://github.com/philpatton/pyseter"
# Documentation = "https://pyseter.readthedocs.io"
Repository = "https://github.com/philpatton/pyseter"
"Bug Tracker" = "https://github.com/philpatton/pyseter/issues"

[tool.setuptools.packages.find]
where = ["."]
include = ["pyseter*"]
pyseter-0.1.0/pyseter/__init__.py
ADDED
@@ -0,0 +1,12 @@
"""
Pyseter

Sort images by an automatically generated ID before photo-identification.
"""

__version__ = "0.1.0"

# Import main functions/classes for easy access
from pyseter.extract import verify_pytorch, get_best_device

# Define what gets imported with "from pyseter import *"
__all__ = ["verify_pytorch", "get_best_device"]
pyseter-0.1.0/pyseter/extract.py
ADDED
@@ -0,0 +1,419 @@
"""Extract features from images with AnyDorsal.

This module rebuilds the AnyDorsal model (a timm EfficientNet backbone
with a normalized linear head) from a checkpoint and extracts one feature
vector per image in a directory.

Typical usage example:

    extractor = FeatureExtractor(batch_size=8, model_path='weights.pth')
    features = extractor.extract('path/to/images')
"""
from typing import Optional, Dict
import os

from sklearn.preprocessing import normalize
from torch import nn
from torch.amp import autocast  # pyright: ignore[reportPrivateImportUsage]
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms import v2
from torchvision.io import decode_image
from tqdm import tqdm

import numpy as np
import pandas as pd
import timm
import torch
import torch.nn.functional as F
import torch.utils.checkpoint as cp

# def main():

#     # Print the name of the device
#     if torch.cuda.is_available():
#         print('... with cuda device:', torch.cuda.get_device_name())
#         mem = torch.cuda.get_device_properties(0).total_memory
#         print('... which has', int(mem / 1024**3), 'GB of memory')

#     # sometimes I want to up the batch size depending on which GPU I get
#     user_input = input('Continue? (y/n): ')
#     if user_input.lower() != 'y':
#         print('Exiting...')
#         return

#     # Load the configuration file
#     with open('config.yaml', 'r') as f:
#         config = yaml.load(f, Loader=yaml.SafeLoader)

#     # create an output directory
#     root = config['image_root']
#     out_dir = os.path.join(root, 'features')
#     os.makedirs(out_dir, exist_ok=True)

#     # extract the features
#     sample_size = config['sample_size']
#     for sample in range(sample_size):
#         print('Sample:', sample)
#         features = extract(config)

#         # define the output path
#         if config['stochastic']:
#             out_path = os.path.join(out_dir, f'features_{sample:>03}.npy')
#         else:
#             out_path = os.path.join(out_dir, 'features.npy')

#         # save the features
#         np.save(out_path, features)

#     # might as well save the features in a single array
#     if config['stochastic']:
#         stochastic_features = [f for f in os.listdir(out_dir) if 'sample' in f]
#         files = [k for k in features.keys()]
#         features_array = load_all_features(stochastic_features, files)
#         out_path = os.path.join(out_dir, 'stochastic_feature_array.npy')
#         np.save(out_path, features_array)

def verify_pytorch() -> None:
    """Verify PyTorch installation and show device options."""
    try:
        import torch
        print(f"✓ PyTorch {torch.__version__} detected")

        # Check all device options
        if torch.cuda.is_available():
            print(f"✓ CUDA GPU available: {torch.cuda.get_device_name(0)}")

        if torch.backends.mps.is_available():
            print("✓ Apple Silicon (MPS) GPU available")

        if not torch.cuda.is_available() and not torch.backends.mps.is_available():
            print("! No GPU acceleration available. Expect slow feature extraction.")

        return None

    except ImportError:
        print("✗ PyTorch not found!")
        print("See homepage for PyTorch installation instructions")

        return None

def get_best_device() -> str:
    """Select torch device based on expected performance."""
    if torch.cuda.is_available():
        device = "cuda"
        device_name = torch.cuda.get_device_name(0)
    elif torch.backends.mps.is_available():
        device = "mps"
        device_name = "Apple Silicon GPU"
    else:
        device = "cpu"
        device_name = "CPU"

    print(f"Using device: {device} ({device_name})")
    return device

class FeatureExtractor:
    """Extract AnyDorsal features from a directory of images."""

    def __init__(self, batch_size: int, model_path: str,
                 device: Optional[str]=None,
                 stochastic: bool=False,
                 bbox_csv: Optional[str]=None):
        self.batch_size = batch_size
        self.model_path = model_path
        self.stochastic = stochastic
        self.bbox_csv = bbox_csv
        if device is None:
            self.device = get_best_device()
        else:
            self.device = device

    def extract(self, image_dir: str) -> Dict:
        """Return a dict mapping each image filename to its feature vector."""
        print('Loading model...')
        model = self.get_model()

        if not self.stochastic:
            model.eval()

        test_dataloader = get_test_data(image_dir, self.batch_size, self.bbox_csv)

        print('Extracting features...')
        features = self.extract_features(test_dataloader, model)

        return features

    def get_model(self):
        """Build the model from the checkpoint."""
        # Create the backbone
        backbone = EfficientNetCustomBackbone(
            model_name='tf_efficientnet_l2_ns',
            drop_path_rate=0.2,
            with_cp=False
        )

        # Create the complete model
        model = ImageClassifier(
            backbone=backbone,
            in_channels=5504,   # Feature dimension for efficientnet_l2_ns
            num_classes=15587   # Number of classes
        )

        # Load the checkpoint
        print('Loading model from:', self.model_path)
        checkpoint = torch.load(self.model_path, map_location=self.device)

        # Process state dict
        state_dict = checkpoint["state_dict"]
        if 'head.compute_loss.margins' in state_dict:
            _ = state_dict.pop('head.compute_loss.margins')

        # Adapt the state dict keys if needed; every backbone key maps onto
        # the same prefix in our structure, so a single replace suffices
        new_state_dict = {}
        for k, v in state_dict.items():
            if k.startswith('backbone.timm_model.'):
                new_k = k.replace('backbone.timm_model.', 'backbone.base_model.')
            else:
                new_k = k
            new_state_dict[new_k] = v

        # Load state dict with some flexibility for missing keys
        missing_keys, unexpected_keys = model.load_state_dict(new_state_dict, strict=False)

        if missing_keys:
            print(f"Warning: Missing keys when loading pretrained weights: {missing_keys}")
        if unexpected_keys:
            print(f"Warning: Unexpected keys when loading pretrained weights: {unexpected_keys}")

        # # Convert DropPath to inference mode if stochastic is enabled
        # if stochastic:
        #     model = convert_droppath_to_inference(model)

        # Move model to device
        model.to(self.device)

        return model

    def extract_features(self, dataloader, model) -> dict:
        """Extract features from images using the model."""
        file_list = []
        feature_list = []

        with torch.no_grad():
            for file, image in tqdm(dataloader):
                image = image.to(self.device)
                with autocast(self.device):  # Big speed up
                    feature_vector = model(image, return_loss=False)

                file_list.append(file)
                feature_list.append(feature_vector.cpu().numpy())

        # Handle case with only one batch
        if isinstance(file_list[0], (list, tuple)):
            files = np.concatenate(file_list)
        else:
            files = np.array(file_list)

        feats = np.vstack(feature_list)

        # Create dictionary mapping filenames to features
        feature_dict = dict(zip(files, feats))

        return feature_dict

class EfficientNetCustomBackbone(nn.Module):
    """Custom EfficientNet backbone that mimics the MMCLS implementation."""
    def __init__(self, model_name='tf_efficientnet_l2_ns', drop_path_rate=0.2, with_cp=False):
        super().__init__()
        # Create the base model
        self.base_model = timm.create_model(
            model_name,
            pretrained=False,
            drop_path_rate=drop_path_rate,
        )
        self.with_cp = with_cp

        # Remove the classifier and pooling
        self.base_model.classifier = nn.Identity()
        self.base_model.global_pool = nn.Identity()

    def forward(self, x):
        # Replicate the forward pass of the MMCLS TimmEfficientNet class;
        # this extracts features before the pooling layer

        # Apply stem
        x = self.base_model.conv_stem(x)
        x = self.base_model.bn1(x)
        # x = self.base_model.act1(x)

        # Process through all blocks
        for block_idx, blocks in enumerate(self.base_model.blocks):
            for idx, block in enumerate(blocks):
                if self.with_cp and x.requires_grad:  # pyright: ignore[reportOptionalMemberAccess]
                    x = cp.checkpoint(block, x)
                else:
                    x = block(x)

        # Final convolution and activation
        x = self.base_model.conv_head(x)
        x = self.base_model.bn2(x)
        # x = self.base_model.act2(x)

        return x

class NormLinear(nn.Linear):
    """Linear layer with optional feature and weight normalization."""
    def __init__(self, in_features, out_features, bias=False, feature_norm=True,
                 weight_norm=True):
        super().__init__(in_features, out_features, bias=bias)
        self.weight_norm = weight_norm
        self.feature_norm = feature_norm

    def forward(self, data):
        if self.feature_norm:
            data = F.normalize(data)
        if self.weight_norm:
            weight = F.normalize(self.weight)
        else:
            weight = self.weight
        return F.linear(data, weight, self.bias)


class ImageClassifier(nn.Module):
    """Complete model that mimics the MMCLS classifier structure."""
    def __init__(self, backbone, in_channels=5504, num_classes=15587):
        super().__init__()
        self.backbone = backbone

        # Global pooling layer
        self.neck = nn.AdaptiveAvgPool2d((1, 1))

        # Classification head
        self.head = NormLinear(
            in_features=in_channels,
            out_features=num_classes,
            bias=False,
            feature_norm=True,
            weight_norm=True
        )

    def forward(self, x, return_loss=True):
        # Extract features from backbone
        x = self.backbone(x)

        # Apply global pooling and flatten
        x = self.neck(x)
        x = torch.flatten(x, 1)

        # Return features if not computing loss
        if not return_loss:
            return F.normalize(x)

        # Compute logits for training (not used in feature extraction)
        logits = self.head(x)
        return logits

def load_bounding_boxes(csv_path):
    '''Load bounding boxes from a CSV file.'''
    df = pd.read_csv(csv_path)
    bboxes = {}
    for _, row in df.iterrows():
        # filename = row['image_name']
        # bbox_columns = ['bbox_x', 'bbox_y', 'bbox_width', 'bbox_height']
        # xmin, ymin, w, h = row[bbox_columns].values
        # xmax, ymax = xmin + w, ymin + h
        filename = row['filename']
        xmin, ymin, xmax, ymax = row['xmin'], row['ymin'], row['xmax'], row['ymax']
        bboxes[filename] = (xmin, ymin, xmax, ymax)
    return bboxes

class DorsalImageDataset(Dataset):
    """Dataset for dorsal fin images with optional bounding boxes."""
    def __init__(self, image_dir, transform=None, bbox_csv=None):
        self.image_dir = image_dir
        self.transform = transform
        self.images = list_images(image_dir)
        self.bboxes = load_bounding_boxes(bbox_csv) if bbox_csv else None

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        filename = self.images[idx]
        full_path = os.path.join(self.image_dir, filename)
        image = decode_image(full_path)
        if self.bboxes and filename in self.bboxes:
            bbox = self.bboxes[filename]
            # Crop image to bounding box
            image = image[:, bbox[1]:bbox[3], bbox[0]:bbox[2]]
        if self.transform:
            image = self.transform(image)
        return filename, image

def get_test_data(directory, batch_size, bbox_csv=None):
    """Get the dataloader from a directory of images."""
    # RGB normalization values
    bgr_mean = np.array([123.675, 116.28, 103.53]) / 255
    bgr_std = np.array([58.395, 57.12, 57.375]) / 255

    rgb_mean = bgr_mean[[2, 1, 0]]
    rgb_std = bgr_std[[2, 1, 0]]

    image_size = (768, 768)

    data_transforms = v2.Compose([
        v2.Resize(image_size),
        v2.ToDtype(torch.float32, scale=True),
        v2.Normalize(rgb_mean.tolist(), rgb_std.tolist()),
    ])

    test_data = DorsalImageDataset(directory, transform=data_transforms,
                                   bbox_csv=bbox_csv)

    dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=False,
                            num_workers=2, pin_memory=True)

    return dataloader

def list_images(image_dir):
    """List all images in a directory."""
    images = []
    formats = ('png', 'jpg', 'jpeg')

    for file in os.listdir(image_dir):
        if file.lower().endswith(formats):
            images.append(file)

    return images

def load_and_process_features(path, image_list, l2=True):
    """Load features for a single file and process them into array format."""
    features = np.load(path, allow_pickle=True).item()
    feature_array = np.array([np.array(features[image]) for image in image_list])

    if l2:
        feature_array = normalize(feature_array, axis=0)

    return feature_array

def load_all_features(feature_paths, image_list):
    """Load and process all features."""
    feature_list = []
    print('Loading features...')
    for path in tqdm(feature_paths):
        feature = load_and_process_features(path, image_list)
        feature_list.append(feature)
    return np.stack(feature_list, axis=1)
pyseter-0.1.0/pyseter/grade.py
ADDED
@@ -0,0 +1,34 @@
'''Grade images by distinctiveness.'''

from warnings import warn

from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import numpy as np

from pyseter.sort import NetworkCluster

def rate_distinctiveness(features: np.ndarray, match_threshold: float=0.6) -> np.ndarray:
    '''Grade images by their distinctiveness.'''

    # watch out!
    warn('Distinctiveness grades are experimental and should be verified.')

    # we use single-linkage clustering to find the unrecognizable identity (UI);
    # NetworkCluster expects a pairwise similarity matrix, so build one first
    similarity = cosine_similarity(features)
    nc = NetworkCluster(match_threshold=match_threshold)
    results = nc.cluster_images(similarity, message=False)

    # we assume that the largest cluster is the UI
    cluster_ids = np.array(results.cluster_idx)
    labs, count = np.unique(cluster_ids, return_counts=True)
    ui_index = labs[np.argmax(count)]
    print(f'Unrecognizable identity cluster consists of {np.max(count)} images.')

    # average the features for all images in the largest cluster to get the UI embedding
    ui_feature_array = features[cluster_ids == ui_index]
    ui_norm = normalize(ui_feature_array)
    ui_center = ui_norm.mean(axis=0)

    # compute the embedding recognizability score
    ers = [cosine(f, ui_center) for f in features]
    return np.array(ers)
pyseter-0.1.0/pyseter/sort.py
ADDED
@@ -0,0 +1,282 @@
"""Sort images by an automatically generated ID before photo-identification"""

from collections import Counter
from pathlib import Path
from typing import Tuple, List, Optional
import os
import shutil

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_distances
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd

def prep_images(image_dir: str, all_image_dir: str) -> None:
    """Copy all images to a temporary directory and save encounter information"""
    images, encounters = process_images(image_dir, all_image_dir)
    working_dir = Path(image_dir).parent.absolute().as_posix()
    save_encounter_info(working_dir, encounters, images)

def process_images(image_root: str, all_image_dir: str) -> Tuple[List[str], List[str]]:
    """Copy all images to a temporary directory and return encounter information"""
    image_list = []
    encounter_list = []

    # the temporary directory lies in the image root
    os.makedirs(all_image_dir, exist_ok=True)

    # loop over all the files in the image root
    i = 0
    for path, dirs, files in os.walk(image_root, topdown=True):

        # only look at images, skipping the tmp dir and images that have already been sorted
        dirs[:] = [d for d in dirs if d != os.path.basename(all_image_dir)]
        dirs[:] = [d for d in dirs if 'cluster' not in d]
        for file in files:
            if not file.lower().endswith('.jpg'):
                continue

            image_list.append(file)

            # get string identifiers for the encounter
            full_path = os.path.join(path, file)
            p = Path(full_path)
            encounter = p.parts[-2]
            encounter_list.append(encounter)

            # finally, copy all of the images to the tmp dir
            shutil.copy(full_path, all_image_dir)
            i += 1

    print(f'Copied {i} images to:', all_image_dir)

    return image_list, encounter_list

def save_encounter_info(output_dir: str, encounters: List[str], images: List[str]) -> None:
    """Save a CSV pairing each image with its encounter."""
    encounter_df = pd.DataFrame(dict(encounter=encounters, image=images))
    encounter_path = os.path.join(output_dir, 'encounter_info.csv')
    encounter_df.to_csv(encounter_path, index=False)
    print('Saved encounter information to:', encounter_path)

def load_features(image_root: str) -> Tuple[np.ndarray, np.ndarray]:
    """Load features from disk."""

    # the features are stored in a dictionary with image names as keys
    feature_path = os.path.join(image_root, 'features', 'features.npy')
    feature_dict = np.load(feature_path, allow_pickle=True).item()

    # unpack the dictionary into arrays
    image_names = np.array(list(feature_dict.keys()))
    feature_array = np.array(list(feature_dict.values()))

    return image_names, feature_array

class HierarchicalCluster:
    """Cluster image features with agglomerative clustering on cosine distance."""

    def __init__(self, match_threshold: float=0.5) -> None:

        if (match_threshold > 1.0) or (match_threshold < 0.0):
            raise ValueError('Match threshold must lie between 0 and 1')
        self.match_threshold = match_threshold

    def cluster_images(self, features: np.ndarray) -> np.ndarray:

        # convert similarity threshold to distance
        distance_threshold = 1 - self.match_threshold

        # cluster using complete linkage
        hac_results = AgglomerativeClustering(
            n_clusters=None,
            distance_threshold=distance_threshold,
            linkage='complete',
            metric='cosine'
        ).fit(features)

        # report results
        cluster_labels = hac_results.labels_
        return cluster_labels

class ClusterResults:
    """Container for the outputs of NetworkCluster."""
    def __init__(self, cluster_labels):
        self.cluster_labels = cluster_labels
        self.cluster_idx = [None]
        self.filenames = None
        self.cluster_count = len(set(cluster_labels))
        self.cluster_sizes = Counter(cluster_labels).values()
        self.false_positive_df = None  # type: Optional[pd.DataFrame]
        self.graph = nx.Graph()  # Initialize with empty graph instead of None
        self.bad_clusters = []
        self.bad_cluster_idx = []

    def plot_suspicious(self):
        graph = self.graph
        # Get connected components from the graph
        if graph is None or graph.number_of_nodes() == 0:
            print("No graph data available to plot suspicious connections.")
            return
        connected_components = [graph.subgraph(c) for c
                                in nx.connected_components(self.graph)]

        subplot_count = len(self.bad_clusters)
        n_col = 5
        n_row = int(np.ceil(subplot_count / n_col))
        width = 1.5
        height = 1.5

        fig, axes = plt.subplots(n_row, n_col, tight_layout=True,
                                 figsize=(n_col * width, n_row * height))
        flat = axes.flatten()

        for i, idx in enumerate(self.bad_cluster_idx):

            ax = flat[i]

            # remove self loops
            G = connected_components[idx].copy()
            G.remove_edges_from(nx.selfloop_edges(G))

            layout = nx.spring_layout(G)
            nx.draw_networkx_edges(G, pos=layout, ax=ax, edge_color='C7',
                                   alpha=0.3)
            # color each node based on the louvain communities
            community = nx.community.louvain_communities(G)  # pyright: ignore[reportAttributeAccessIssue]
            color_map = {}
            for comm_idx, comm in enumerate(community):
                for node in comm:
                    color_map[node] = comm_idx
            node_colors = [color_map[node] for node in G.nodes]
            nx.draw_networkx_nodes(G, layout, node_size=20, edgecolors='k',
                                   node_color=node_colors, cmap='tab10', ax=ax)  # pyright: ignore[reportArgumentType]

            label = self.bad_clusters[i]
            ax.set_title(label, fontsize=10, loc='center')

        # delete unused axes
        for idx in range(subplot_count, len(flat)):
            fig.delaxes(flat[idx])

        s = 'Matches between images\nSingle links between clusters are suspicious'
        fig.suptitle(s, fontsize=12)

        plt.tight_layout()
        plt.show()

def format_ids(ids: np.ndarray) -> List:
    return [f'ID-{i:04d}' for i in ids]

class NetworkCluster:
    """Cluster images by linking pairs whose similarity exceeds a threshold."""

    def __init__(self, match_threshold: float=0.5) -> None:

        if (match_threshold > 1.0) or (match_threshold < 0.0):
            raise ValueError('Match threshold must lie between 0 and 1')
        self.match_threshold = match_threshold

    def cluster_images(self, similarity: np.ndarray, message: bool=True) -> ClusterResults:

        MODULARITY_THRESHOLD = 0.3

        matches = (similarity > self.match_threshold)
        # matches = np.where(distance < distance_threshold, distance, 0)

        # Get connected components from the graph
        G = nx.from_numpy_array(matches)
        connected_components = (G.subgraph(c) for c in nx.connected_components(G))

        # Create a mapping from node index to cluster index
        file_count, _ = similarity.shape
        cluster_labels = np.empty(file_count, dtype=object)
        cluster_indices = np.empty(file_count, dtype=int)

        # Assign clusters to the cluster_labels array
        df_list = []
        bad_clusters = []
        bad_cluster_idx = []
        for cluster_idx, subgraph in enumerate(connected_components):

            cluster_label = f'ID_{cluster_idx:04d}'
            for node in subgraph:
                cluster_labels[node] = cluster_label
                cluster_indices[node] = cluster_idx

            # modularity is the warning sign for a bad cluster
            community = nx.community.louvain_communities(subgraph)  # type: ignore
            modularity = nx.community.quality.modularity(subgraph, community)  # pyright: ignore[reportAttributeAccessIssue]

            if modularity > MODULARITY_THRESHOLD:
                bad_clusters.append(cluster_label)
                bad_cluster_idx.append(cluster_idx)

                for community_idx, comm in enumerate(community):
                    for node in comm:
                        row = pd.DataFrame({
                            'cluster_id': [cluster_label],
                            'modularity': modularity,
                            # 'filename': fnames[node],
                            'community': community_idx
                        })
                        df_list.append(row)

        if bad_clusters and message:
            w = f'Following clusters may contain false positives:\n{bad_clusters}'
            print(w)

        # guard against the no-bad-cluster case, where df_list is empty
        df = pd.concat(df_list, ignore_index=True) if df_list else pd.DataFrame()

        results = ClusterResults(cluster_labels)
        # results.filenames = fnames
        results.graph = G
        results.false_positive_df = df
        results.bad_clusters = bad_clusters
        results.bad_cluster_idx = bad_cluster_idx
        results.cluster_idx = format_ids(cluster_indices)

        return results

def report_cluster_results(cluster_labs: np.ndarray) -> None:
    """Print a quick summary of the clustering results."""
    label, count = np.unique(cluster_labs, return_counts=True)
    print(f'Found {len(label)} clusters.')
    print(f'Largest cluster has {np.max(count)} images.')

def sort_images(id_df, all_image_dir: str, output_dir: str) -> None:
    """Sort images into folders based on cluster and encounter."""

    # check that the input directory is a valid directory
    if not os.path.isdir(all_image_dir):
        raise ValueError(f'input_dir {all_image_dir} is not a valid directory')

    # check that the names are valid
    required_column_names = ['image', 'proposed_id', 'encounter']
    names_correct = all([i in id_df.columns for i in required_column_names])
    if not names_correct:
        raise ValueError("id_df must contain the column names 'image', 'proposed_id', 'encounter'")

    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)

    grouped = id_df.groupby(['proposed_id', 'encounter'])
    i = 0
    j = 0
    for (clust_id, enc_id), mini_df in grouped:

        i += 1
        cluster_dir = os.path.join(output_dir, clust_id)
        os.makedirs(cluster_dir, exist_ok=True)

        encounter_dir = os.path.join(cluster_dir, enc_id)
        os.makedirs(encounter_dir, exist_ok=True)

        for img in mini_df['image']:
            j += 1
            old_path = os.path.join(all_image_dir, img)
            shutil.copy(old_path, encounter_dir)

    print(f'Sorted {j} images into {i} folders.')
pyseter-0.1.0/pyseter.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,144 @@
(Identical to pyseter-0.1.0/PKG-INFO above.)
pyseter-0.1.0/pyseter.egg-info/SOURCES.txt
ADDED
@@ -0,0 +1,12 @@
LICENSE
README.md
pyproject.toml
pyseter/__init__.py
pyseter/extract.py
pyseter/grade.py
pyseter/sort.py
pyseter.egg-info/PKG-INFO
pyseter.egg-info/SOURCES.txt
pyseter.egg-info/dependency_links.txt
pyseter.egg-info/requires.txt
pyseter.egg-info/top_level.txt
pyseter-0.1.0/pyseter.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@

pyseter-0.1.0/pyseter.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
pyseter
pyseter-0.1.0/setup.cfg
ADDED