bplusplus 1.2.3.tar.gz → 1.2.4.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

This version of bplusplus has been flagged by the registry as a potentially problematic release.

@@ -0,0 +1,207 @@
+Metadata-Version: 2.3
+Name: bplusplus
+Version: 1.2.4
+Summary: A simple method to create AI models for biodiversity, with collect and prepare pipeline
+License: MIT
+Author: Titus Venverloo
+Author-email: tvenver@mit.edu
+Requires-Python: >=3.10,<4.0
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Requires-Dist: numpy (==1.26.4)
+Requires-Dist: pandas (==2.1.4)
+Requires-Dist: pillow (==11.3.0)
+Requires-Dist: prettytable (==3.7.0)
+Requires-Dist: pygbif (==0.6.5)
+Requires-Dist: pyyaml (==6.0.1)
+Requires-Dist: requests (==2.25.1)
+Requires-Dist: scikit-learn (==1.7.1)
+Requires-Dist: tabulate (==0.9.0)
+Requires-Dist: tqdm (==4.66.4)
+Requires-Dist: ultralytics (==8.3.173)
+Requires-Dist: validators (==0.33.0)
+Description-Content-Type: text/markdown
+
+# B++ repository
+
+[![DOI](https://zenodo.org/badge/765250194.svg)](https://zenodo.org/badge/latestdoi/765250194)
+[![PyPi version](https://img.shields.io/pypi/v/bplusplus.svg)](https://pypi.org/project/bplusplus/)
+[![Python versions](https://img.shields.io/pypi/pyversions/bplusplus.svg)](https://pypi.org/project/bplusplus/)
+[![License](https://img.shields.io/pypi/l/bplusplus.svg)](https://pypi.org/project/bplusplus/)
+[![Downloads](https://static.pepy.tech/badge/bplusplus)](https://pepy.tech/project/bplusplus)
+[![Downloads](https://static.pepy.tech/badge/bplusplus/month)](https://pepy.tech/project/bplusplus)
+[![Downloads](https://static.pepy.tech/badge/bplusplus/week)](https://pepy.tech/project/bplusplus)
+
+This project provides a complete, end-to-end pipeline for building a custom insect classification system. The framework is designed to be **domain-agnostic**, allowing you to train a powerful detection and classification model for **any insect species** by simply providing a list of names.
+
+Using the `Bplusplus` library, this pipeline automates the entire machine learning workflow, from data collection to video inference.
+
+## Key Features
+
+- **Automated Data Collection**: Downloads hundreds of images for any species from the GBIF database.
+- **Intelligent Data Preparation**: Uses a pre-trained model to automatically find, crop, and resize insects from raw images, ensuring high-quality training data.
+- **Hierarchical Classification**: Trains a model to identify insects at three taxonomic levels: **family, genus, and species**.
+- **Video Inference & Tracking**: Processes video files to detect, classify, and track individual insects over time, providing aggregated predictions.
+
+## Pipeline Overview
+
+The process is broken down into six main steps, all detailed in the `full_pipeline.ipynb` notebook:
+
+1. **Collect Data**: Select your target species and fetch raw insect images from the web.
+2. **Prepare Data**: Filter, clean, and prepare images for training.
+3. **Train Model**: Train the hierarchical classification model.
+4. **Download Weights**: Fetch pre-trained weights for the detection model.
+5. **Test Model**: Evaluate the performance of the trained model.
+6. **Run Inference**: Run the full pipeline on a video file for real-world application.
+
+## How to Use
+
+### Prerequisites
+
+- Python 3.10+
+
+### Setup
+
+1. **Create and activate a virtual environment:**
+   ```bash
+   python3 -m venv venv
+   source venv/bin/activate
+   ```
+
+2. **Install the required packages:**
+   ```bash
+   pip install bplusplus
+   ```
+
+### Running the Pipeline
+
+The pipeline can be run step-by-step using the functions from the `bplusplus` library. While the `full_pipeline.ipynb` notebook provides a complete, executable workflow, the core functions are described below.
+
+#### Step 1: Collect Data
+Download images for your target species from the GBIF database. You'll need to provide a list of scientific names.
+
+```python
+import bplusplus
+from pathlib import Path
+
+# Define species and directories
+names = ["Vespa crabro", "Vespula vulgaris", "Dolichovespula media"]
+GBIF_DATA_DIR = Path("./GBIF_data")
+
+# Define search parameters
+search = {"scientificName": names}
+
+# Run collection
+bplusplus.collect(
+    group_by_key=bplusplus.Group.scientificName,
+    search_parameters=search,
+    images_per_group=200,  # Recommended to download more than needed
+    output_directory=GBIF_DATA_DIR,
+    num_threads=5
+)
+```
+
+#### Step 2: Prepare Data
+Process the raw images to extract, crop, and resize insects. This step uses a pre-trained model to ensure only high-quality images are used for training.
+
+```python
+PREPARED_DATA_DIR = Path("./prepared_data")
+
+bplusplus.prepare(
+    input_directory=GBIF_DATA_DIR,
+    output_directory=PREPARED_DATA_DIR,
+    img_size=640  # Target image size for training
+)
+```
+
+#### Step 3: Train Model
+Train the hierarchical classification model on your prepared data. The model learns to identify family, genus, and species.
+
+```python
+TRAINED_MODEL_DIR = Path("./trained_model")
+
+bplusplus.train(
+    batch_size=4,
+    epochs=30,
+    patience=3,
+    img_size=640,
+    data_dir=PREPARED_DATA_DIR,
+    output_dir=TRAINED_MODEL_DIR,
+    species_list=names
+    # num_workers=0  # Optional: force single-process loading (most stable)
+)
+```
+
+**Note:** The `num_workers` parameter controls DataLoader multiprocessing and defaults to 4. Set it to 0 to disable multiprocessing entirely (the most stable option); higher values can speed up data loading.
+
+#### Step 4: Download Detection Weights
+The inference pipeline uses a separate, pre-trained YOLO model for initial insect detection. You need to download its weights manually.
+
+You can download the weights file from [this link](https://github.com/Tvenver/Bplusplus/releases/download/v1.2.3/v11small-generic.pt).
+
+Place it in the `trained_model` directory and ensure it is named `yolo_weights.pt`.
+
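The manual download above can also be scripted. A minimal standard-library sketch (the helper name `download_weights` is ours, not part of the package; the URL is the release link above):

```python
from pathlib import Path
from urllib.request import urlretrieve

WEIGHTS_URL = "https://github.com/Tvenver/Bplusplus/releases/download/v1.2.3/v11small-generic.pt"

def download_weights(url: str, dest: Path) -> Path:
    """Fetch the detection weights to dest, skipping the download if the file already exists."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urlretrieve(url, dest)
    return dest

# download_weights(WEIGHTS_URL, Path("./trained_model/yolo_weights.pt"))
```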
+#### Step 5: Run Inference on Video
+Process a video file to detect, classify, and track insects. The final output is an annotated video and a CSV file with aggregated results for each tracked insect.
+
+```python
+VIDEO_INPUT_PATH = Path("my_video.mp4")
+VIDEO_OUTPUT_PATH = Path("my_video_annotated.mp4")
+HIERARCHICAL_MODEL_PATH = TRAINED_MODEL_DIR / "best_multitask.pt"
+YOLO_WEIGHTS_PATH = TRAINED_MODEL_DIR / "yolo_weights.pt"
+
+bplusplus.inference(
+    species_list=names,
+    yolo_model_path=YOLO_WEIGHTS_PATH,
+    hierarchical_model_path=HIERARCHICAL_MODEL_PATH,
+    confidence_threshold=0.35,
+    video_path=VIDEO_INPUT_PATH,
+    output_path=VIDEO_OUTPUT_PATH,
+    tracker_max_frames=60,
+    fps=15  # Optional: set processing FPS
+)
+```
+
+### Customization
+
+To train the model on your own set of insect species, you only need to change the `names` list in **Step 1**. The pipeline will automatically handle the rest.
+
+```python
+# To use your own species, change the names in this list
+names = [
+    "Vespa crabro",
+    "Vespula vulgaris",
+    "Dolichovespula media",
+    # Add your species here
+]
+```
181
+
182
+ #### Handling an "Unknown" Class
183
+ To train a model that can recognize an "unknown" class for insects that don't belong to your target species, add `"unknown"` to your `species_list`. You must also provide a corresponding `unknown` folder containing images of various other insects in your data directories (e.g., `prepared_data/train/unknown`).
184
+
185
+ ```python
186
+ # Example with an unknown class
187
+ names_with_unknown = [
188
+ "Vespa crabro",
189
+ "Vespula vulgaris",
190
+ "unknown"
191
+ ]
192
+ ```
+
+## Directory Structure
+
+The pipeline will create the following directories to store artifacts:
+
+- `GBIF_data/`: Stores the raw images downloaded from GBIF.
+- `prepared_data/`: Contains the cleaned, cropped, and resized images ready for training.
+- `trained_model/`: Saves the trained model weights (`best_multitask.pt`) and pre-trained detection weights.
+
+## Citation
+
+All information in this repository is available under the MIT license, provided credit is given to the authors.
+
+**Venverloo, T., Duarte, F., B++: Towards Real-Time Monitoring of Insect Species. MIT Senseable City Laboratory, AMS Institute.**
+
@@ -1,27 +1,25 @@
 [tool.poetry]
 name = "bplusplus"
-version = "1.2.3"
+version = "1.2.4"
 description = "A simple method to create AI models for biodiversity, with collect and prepare pipeline"
 authors = ["Titus Venverloo <tvenver@mit.edu>", "Deniz Aydemir <deniz@aydemir.us>", "Orlando Closs <orlandocloss@pm.me>", "Ase Hatveit <aase@mit.edu>"]
 license = "MIT"
 readme = "README.md"

 [tool.poetry.dependencies]
-python = "^3.9.0"
+python = "^3.10"
 requests = "2.25.1"
 pandas = "2.1.4"
-ultralytics = ">=8.3.0"
+ultralytics = "8.3.173"
 pyyaml = "6.0.1"
 tqdm = "4.66.4"
 prettytable = "3.7.0"
-torch = "^2.5.0"
-torchvision = "*"
-pillow = "*"
-numpy = "*"
-scikit-learn = "*"
-pygbif = "^0.6.4"
-validators = "^0.33.0"
-tabulate = "^0.9.0"
+pillow = "11.3.0"
+numpy = "1.26.4"
+scikit-learn = "1.7.1"
+pygbif = "0.6.5"
+validators = "0.33.0"
+tabulate = "0.9.0"

 [tool.poetry.group.dev.dependencies]
 jupyter = "^1.0.0"
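Note that `torch` and `torchvision` disappear from the dependency list above, and most remaining pins become exact. A small standard-library sketch for checking an environment against those pins (the helper `installed_version` is illustrative, not part of the package):

```python
from importlib import metadata

def installed_version(dist):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None

# Compare a few of the exact pins from the pyproject diff above.
for dist, pinned in {"pandas": "2.1.4", "numpy": "1.26.4"}.items():
    print(dist, "installed:", installed_version(dist), "pinned:", pinned)
```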
@@ -0,0 +1,15 @@
+try:
+    import torch
+    import torchvision
+except ImportError:
+    raise ImportError(
+        "PyTorch and Torchvision are not installed. "
+        "Please install them before using bplusplus by following the instructions "
+        "on the official PyTorch website: https://pytorch.org/get-started/locally/"
+    )
+
+from .collect import Group, collect
+from .prepare import prepare
+from .train import train
+from .test import test
+from .inference import inference
@@ -9,6 +9,8 @@ from datetime import datetime
 from pathlib import Path
 from .tracker import InsectTracker
 import torch
+import torchvision.transforms as T
+from torchvision.models.detection import fasterrcnn_resnet50_fpn
 from ultralytics import YOLO
 from torchvision import transforms
 from PIL import Image
@@ -19,6 +21,16 @@ import logging
 from collections import defaultdict
 import uuid

+# Add this check for backwards compatibility
+if hasattr(torch.serialization, 'add_safe_globals'):
+    torch.serialization.add_safe_globals([
+        'torch.LongTensor',
+        'torch.cuda.LongTensor',
+        'torch.FloatStorage',
+        'torch.cuda.FloatStorage',
+    ])
+
 # Set up logging
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
@@ -36,12 +48,15 @@ def get_taxonomy(species_list):
     species_to_genus = {}
     genus_to_family = {}

-    logger.info(f"Building taxonomy from GBIF for {len(species_list)} species")
+    species_list_for_gbif = [s for s in species_list if s.lower() != 'unknown']
+    has_unknown = len(species_list_for_gbif) != len(species_list)
+
+    logger.info(f"Building taxonomy from GBIF for {len(species_list_for_gbif)} species")

     print(f"\n{'Species':<30} {'Family':<20} {'Genus':<20} {'Status'}")
     print("-" * 80)

-    for species_name in species_list:
+    for species_name in species_list_for_gbif:
         url = f"https://api.gbif.org/v1/species/match?name={species_name}&verbose=true"
         try:
             response = requests.get(url)
@@ -72,6 +87,21 @@ def get_taxonomy(species_list):
         except Exception as e:
             print(f"{species_name:<30} {'Error':<20} {'Error':<20} FAILED")
             logger.error(f"Error retrieving data for '{species_name}': {str(e)}")

+    if has_unknown:
+        unknown_family = "Unknown"
+        unknown_genus = "Unknown"
+        unknown_species = "unknown"
+
+        if unknown_family not in taxonomy[1]:
+            taxonomy[1].append(unknown_family)
+
+        taxonomy[2][unknown_genus] = unknown_family
+        taxonomy[3][unknown_species] = unknown_genus
+        species_to_genus[unknown_species] = unknown_genus
+        genus_to_family[unknown_genus] = unknown_family
+
+        print(f"{unknown_species:<30} {unknown_family:<20} {unknown_genus:<20} {'OK'}")

     taxonomy[1] = sorted(list(set(taxonomy[1])))
     print("-" * 80)
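The effect of the `has_unknown` branch can be shown standalone. A toy example using the same three-level taxonomy shape (the sample family and genus are illustrative, not taken from the package):

```python
# Toy taxonomy in the package's shape: level 1 is a list of families,
# levels 2 and 3 map child taxon -> parent taxon.
taxonomy = {
    1: ["Vespidae"],
    2: {"Vespula": "Vespidae"},
    3: {"Vespula vulgaris": "Vespula"},
}

# Mirrors the branch above: graft an "unknown" pseudo-taxon into every
# level so it receives its own class index at each taxonomic level.
if "Unknown" not in taxonomy[1]:
    taxonomy[1].append("Unknown")
taxonomy[2]["Unknown"] = "Unknown"
taxonomy[3]["unknown"] = "Unknown"

print(taxonomy[1])  # → ['Vespidae', 'Unknown']
```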
@@ -85,18 +115,26 @@
     logger.info(f"Taxonomy built: {len(taxonomy[1])} families, {len(taxonomy[2])} genera, {len(taxonomy[3])} species")
     return taxonomy, species_to_genus, genus_to_family

-def create_mappings(taxonomy):
+def create_mappings(taxonomy, species_list=None):
     """Create index mappings from taxonomy"""
     level_to_idx = {}
     idx_to_level = {}

     for level, labels in taxonomy.items():
         if isinstance(labels, list):
+            # Level 1: Family (already sorted)
             level_to_idx[level] = {label: idx for idx, label in enumerate(labels)}
             idx_to_level[level] = {idx: label for idx, label in enumerate(labels)}
-        else:  # Dictionary
-            level_to_idx[level] = {label: idx for idx, label in enumerate(labels.keys())}
-            idx_to_level[level] = {idx: label for idx, label in enumerate(labels.keys())}
+        else:  # Dictionary for levels 2 and 3
+            if level == 3 and species_list is not None:
+                # For species, the order is determined by species_list
+                sorted_keys = species_list
+            else:
+                # For genus, sort alphabetically
+                sorted_keys = sorted(labels.keys())
+
+            level_to_idx[level] = {label: idx for idx, label in enumerate(sorted_keys)}
+            idx_to_level[level] = {idx: label for idx, label in enumerate(sorted_keys)}

     return level_to_idx, idx_to_level

@@ -321,9 +359,9 @@ class VideoInferenceProcessor:

         # Build taxonomy from species list
         self.taxonomy, self.species_to_genus, self.genus_to_family = get_taxonomy(species_list)
-        self.level_to_idx, self.idx_to_level = create_mappings(self.taxonomy)
-        self.family_list = self.taxonomy[1]
-        self.genus_list = list(self.taxonomy[2].keys())
+        self.level_to_idx, self.idx_to_level = create_mappings(self.taxonomy, species_list)
+        self.family_list = sorted(self.taxonomy[1])
+        self.genus_list = sorted(list(self.taxonomy[2].keys()))

         # Load models
         print(f"Loading YOLO model from {yolo_model_path}")
@@ -863,7 +901,7 @@ def main():
     species_list = [
         "Coccinella septempunctata", "Apis mellifera", "Bombus lapidarius", "Bombus terrestris",
         "Eupeodes corollae", "Episyrphus balteatus", "Aglais urticae", "Vespula vulgaris",
-        "Eristalis tenax"
+        "Eristalis tenax", "unknown"
     ]

     # Paths (replace with your actual paths)
@@ -174,17 +174,18 @@ def _prepare_model_and_clean_images(temp_dir_path: Path):
         print(" ✓ Model weights already exist")

     # Add all required classes to safe globals
-    serialization.add_safe_globals([
-        DetectionModel, Sequential, Conv, Conv2d, BatchNorm2d,
-        SiLU, ReLU, LeakyReLU, MaxPool2d, Linear, Dropout, Upsample,
-        Module, ModuleList, ModuleDict,
-        Bottleneck, C2f, SPPF, Detect, Concat, DFL,
-        # Add torch internal classes
-        torch.nn.parameter.Parameter,
-        torch.Tensor,
-        torch._utils._rebuild_tensor_v2,
-        torch._utils._rebuild_parameter
-    ])
+    if hasattr(serialization, 'add_safe_globals'):
+        serialization.add_safe_globals([
+            DetectionModel, Sequential, Conv, Conv2d, BatchNorm2d,
+            SiLU, ReLU, LeakyReLU, MaxPool2d, Linear, Dropout, Upsample,
+            Module, ModuleList, ModuleDict,
+            Bottleneck, C2f, SPPF, Detect, Concat, DFL,
+            # Add torch internal classes
+            torch.nn.parameter.Parameter,
+            torch.Tensor,
+            torch._utils._rebuild_tensor_v2,
+            torch._utils._rebuild_parameter
+        ])

     return weights_path

@@ -74,6 +74,16 @@ def setup_gpu():
         logger.warning("Falling back to CPU")
         return torch.device("cpu")

+# Add this check for backwards compatibility
+if hasattr(torch.serialization, 'add_safe_globals'):
+    torch.serialization.add_safe_globals([
+        'torch.LongTensor',
+        'torch.cuda.LongTensor',
+        'torch.FloatStorage',
+        'torch.cuda.FloatStorage',
+    ])
+
 class HierarchicalInsectClassifier(nn.Module):
     def __init__(self, num_classes_per_level):
         """
@@ -14,18 +14,28 @@ import logging
 from tqdm import tqdm
 import sys

-def train(batch_size=4, epochs=30, patience=3, img_size=640, data_dir='input', output_dir='./output', species_list=None):
+def train(batch_size=4, epochs=30, patience=3, img_size=640, data_dir='input', output_dir='./output', species_list=None, num_workers=4):
     """
     Main function to run the entire training pipeline.
     Sets up datasets, model, training process and handles errors.
+
+    Args:
+        batch_size (int): Number of samples per batch. Default: 4
+        epochs (int): Maximum number of training epochs. Default: 30
+        patience (int): Early stopping patience (epochs without improvement). Default: 3
+        img_size (int): Target image size for training. Default: 640
+        data_dir (str): Directory containing train/valid subdirectories. Default: 'input'
+        output_dir (str): Directory to save trained model and logs. Default: './output'
+        species_list (list): List of species names for training. Required.
+        num_workers (int): Number of DataLoader worker processes.
+            Set to 0 to disable multiprocessing (most stable). Default: 4
     """
     global logger, device

     logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
     logger = logging.getLogger(__name__)

-    logger.info(f"Hyperparameters - Batch size: {batch_size}, Epochs: {epochs}, Patience: {patience}, Image size: {img_size}, Data directory: {data_dir}, Output directory: {output_dir}")
-
+    logger.info(f"Hyperparameters - Batch size: {batch_size}, Epochs: {epochs}, Patience: {patience}, Image size: {img_size}, Data directory: {data_dir}, Output directory: {output_dir}, Num workers: {num_workers}")

     device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

@@ -52,7 +62,7 @@ def train(batch_size=4, epochs=30, patience=3, img_size=640, data_dir='input', o

     taxonomy = get_taxonomy(species_list)

-    level_to_idx, parent_child_relationship = create_mappings(taxonomy)
+    level_to_idx, parent_child_relationship = create_mappings(taxonomy, species_list)

     num_classes_per_level = [len(taxonomy[level]) if isinstance(taxonomy[level], list)
                              else len(taxonomy[level].keys()) for level in sorted(taxonomy.keys())]
@@ -75,14 +85,14 @@ def train(batch_size=4, epochs=30, patience=3, img_size=640, data_dir='input', o
75
85
  train_dataset,
76
86
  batch_size=batch_size,
77
87
  shuffle=True,
78
- num_workers=4
88
+ num_workers=num_workers
79
89
  )
80
90
 
81
91
  val_loader = DataLoader(
82
92
  val_dataset,
83
93
  batch_size=batch_size,
84
94
  shuffle=False,
85
- num_workers=4
95
+ num_workers=num_workers
86
96
  )
87
97
 
88
98
  try:
@@ -150,14 +160,17 @@ def get_taxonomy(species_list):
     species_to_genus = {}
     genus_to_family = {}

-    logger.info(f"Building taxonomy from GBIF for {len(species_list)} species")
+    species_list_for_gbif = [s for s in species_list if s.lower() != 'unknown']
+    has_unknown = len(species_list_for_gbif) != len(species_list)
+
+    logger.info(f"Building taxonomy from GBIF for {len(species_list_for_gbif)} species")

     print("\nTaxonomy Results:")
     print("-" * 80)
     print(f"{'Species':<30} {'Family':<20} {'Genus':<20} {'Status'}")
     print("-" * 80)

-    for species_name in species_list:
+    for species_name in species_list_for_gbif:
         url = f"https://api.gbif.org/v1/species/match?name={species_name}&verbose=true"
         try:
             response = requests.get(url)
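For reference, the GBIF match endpoint queried in this loop returns a flat JSON record. A standalone sketch of pulling out the fields the taxonomy builder relies on (the sample values are illustrative, not a real API response; the helper `extract_taxonomy` is ours):

```python
# Shape of a response from
# https://api.gbif.org/v1/species/match?name=<species>&verbose=true
sample_response = {
    "scientificName": "Vespa crabro Linnaeus, 1758",
    "matchType": "EXACT",
    "family": "Vespidae",
    "genus": "Vespa",
}

def extract_taxonomy(record):
    """Pull the (family, genus) pair used to populate the taxonomy levels."""
    return record.get("family"), record.get("genus")

print(extract_taxonomy(sample_response))  # → ('Vespidae', 'Vespa')
```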
@@ -199,6 +212,19 @@ def get_taxonomy(species_list):
             print(f"{species_name:<30} {'Error':<20} {'Error':<20} FAILED")
             print(f"Error: {error_msg}")
             sys.exit(1)  # Stop the script

+    if has_unknown:
+        unknown_family = "Unknown"
+        unknown_genus = "Unknown"
+        unknown_species = "unknown"
+
+        if unknown_family not in taxonomy[1]:
+            taxonomy[1].append(unknown_family)
+
+        taxonomy[2][unknown_genus] = unknown_family
+        taxonomy[3][unknown_species] = unknown_genus
+
+        print(f"{unknown_species:<30} {unknown_family:<20} {unknown_genus:<20} {'OK'}")

     taxonomy[1] = sorted(list(set(taxonomy[1])))
     print("-" * 80)
@@ -212,7 +238,7 @@ def get_taxonomy(species_list):
         print(f" {i}: {family}")

     print("\nGenus indices:")
-    for i, genus in enumerate(taxonomy[2].keys()):
+    for i, genus in enumerate(sorted(taxonomy[2].keys())):
         print(f" {i}: {genus}")

     print("\nSpecies indices:")
@@ -244,7 +270,7 @@ def get_species_from_directory(train_dir):
     logger.info(f"Found {len(species_list)} species in {train_dir}")
     return species_list

-def create_mappings(taxonomy):
+def create_mappings(taxonomy, species_list=None):
     """
     Creates mapping dictionaries from taxonomy data.
     Returns level-to-index mapping and parent-child relationships between taxonomic levels.
@@ -254,9 +280,17 @@ def create_mappings(taxonomy):

     for level, labels in taxonomy.items():
         if isinstance(labels, list):
+            # Level 1: Family (already sorted)
             level_to_idx[level] = {label: idx for idx, label in enumerate(labels)}
-        else:
-            level_to_idx[level] = {label: idx for idx, label in enumerate(labels.keys())}
+        else:  # dict for levels 2 and 3
+            if level == 3 and species_list is not None:
+                # For species, the order is determined by species_list
+                level_to_idx[level] = {label: idx for idx, label in enumerate(species_list)}
+            else:
+                # For genus (and as a fallback for species), sort alphabetically
+                sorted_keys = sorted(labels.keys())
+                level_to_idx[level] = {label: idx for idx, label in enumerate(sorted_keys)}
+
             for child, parent in labels.items():
                 if (level, parent) not in parent_child_relationship:
                     parent_child_relationship[(level, parent)] = []
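Why this reordering matters: the classifier's output indices must line up between training and inference, so the label-to-index maps have to be built deterministically. A self-contained sketch of the ordering rules introduced above, using a toy taxonomy (the helper name `make_level_to_idx` is ours, a simplification of `create_mappings`):

```python
def make_level_to_idx(taxonomy, species_list=None):
    """Simplified mirror of create_mappings' ordering rules."""
    level_to_idx = {}
    for level, labels in taxonomy.items():
        if isinstance(labels, list):
            keys = labels                 # families: list order (pre-sorted)
        elif level == 3 and species_list is not None:
            keys = species_list           # species: order fixed by the caller
        else:
            keys = sorted(labels.keys())  # genera: alphabetical
        level_to_idx[level] = {label: i for i, label in enumerate(keys)}
    return level_to_idx

taxonomy = {
    1: ["Apidae", "Vespidae"],
    2: {"Vespula": "Vespidae", "Apis": "Apidae"},
    3: {"Vespula vulgaris": "Vespula", "Apis mellifera": "Apis"},
}
mapping = make_level_to_idx(taxonomy, ["Vespula vulgaris", "Apis mellifera"])
print(mapping[3])  # → {'Vespula vulgaris': 0, 'Apis mellifera': 1}
```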
@@ -670,7 +704,7 @@ if __name__ == '__main__':
670
704
  species_list = [
671
705
  "Coccinella septempunctata", "Apis mellifera", "Bombus lapidarius", "Bombus terrestris",
672
706
  "Eupeodes corollae", "Episyrphus balteatus", "Aglais urticae", "Vespula vulgaris",
673
- "Eristalis tenax"
707
+ "Eristalis tenax", "unknown"
674
708
  ]
675
- train_multitask(species_list=species_list, epochs=2)
709
+ train(species_list=species_list, epochs=2)
676
710
 
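The final hunk appends an `"unknown"` catch-all to `species_list` and calls the renamed `train` entry point. Since species indices follow `species_list` order (per the `create_mappings` change above), the catch-all lands at the last class index. A quick pure-Python check, with no bplusplus dependency:

```python
species_list = [
    "Coccinella septempunctata", "Apis mellifera", "Bombus lapidarius",
    "Bombus terrestris", "Eupeodes corollae", "Episyrphus balteatus",
    "Aglais urticae", "Vespula vulgaris", "Eristalis tenax",
    "unknown",  # catch-all class for insects outside the target list
]
species_to_idx = {name: i for i, name in enumerate(species_list)}
assert species_to_idx["unknown"] == len(species_list) - 1 == 9
```

Whether `train` treats `"unknown"` specially, or simply as a tenth class, is not shown in this diff.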
bplusplus-1.2.3/PKG-INFO DELETED
@@ -1,101 +0,0 @@
1
- Metadata-Version: 2.3
2
- Name: bplusplus
3
- Version: 1.2.3
4
- Summary: A simple method to create AI models for biodiversity, with collect and prepare pipeline
5
- License: MIT
6
- Author: Titus Venverloo
7
- Author-email: tvenver@mit.edu
8
- Requires-Python: >=3.9.0,<4.0.0
9
- Classifier: License :: OSI Approved :: MIT License
10
- Classifier: Programming Language :: Python :: 3
11
- Classifier: Programming Language :: Python :: 3.9
12
- Classifier: Programming Language :: Python :: 3.10
13
- Classifier: Programming Language :: Python :: 3.11
14
- Classifier: Programming Language :: Python :: 3.12
15
- Classifier: Programming Language :: Python :: 3.13
16
- Requires-Dist: numpy
17
- Requires-Dist: pandas (==2.1.4)
18
- Requires-Dist: pillow
19
- Requires-Dist: prettytable (==3.7.0)
20
- Requires-Dist: pygbif (>=0.6.4,<0.7.0)
21
- Requires-Dist: pyyaml (==6.0.1)
22
- Requires-Dist: requests (==2.25.1)
23
- Requires-Dist: scikit-learn
24
- Requires-Dist: tabulate (>=0.9.0,<0.10.0)
25
- Requires-Dist: torch (>=2.5.0,<3.0.0)
26
- Requires-Dist: torchvision
27
- Requires-Dist: tqdm (==4.66.4)
28
- Requires-Dist: ultralytics (>=8.3.0)
29
- Requires-Dist: validators (>=0.33.0,<0.34.0)
30
- Description-Content-Type: text/markdown
31
-
32
- # Domain-Agnostic Insect Classification Pipeline
33
-
34
- This project provides a complete, end-to-end pipeline for building a custom insect classification system. The framework is designed to be **domain-agnostic**, allowing you to train a powerful detection and classification model for **any insect species** by simply providing a list of names.
35
-
36
- Using the `Bplusplus` library, this pipeline automates the entire machine learning workflow, from data collection to video inference.
37
-
38
- ## Key Features
39
-
40
- - **Automated Data Collection**: Downloads hundreds of images for any species from the GBIF database.
41
- - **Intelligent Data Preparation**: Uses a pre-trained model to automatically find, crop, and resize insects from raw images, ensuring high-quality training data.
42
- - **Hierarchical Classification**: Trains a model to identify insects at three taxonomic levels: **family, genus, and species**.
43
- - **Video Inference & Tracking**: Processes video files to detect, classify, and track individual insects over time, providing aggregated predictions.
44
- ## Pipeline Overview
45
-
46
- The process is broken down into six main steps, all detailed in the `full_pipeline.ipynb` notebook:
47
-
48
- 1. **Collect Data**: Select your target species and fetch raw insect images from the web.
49
- 2. **Prepare Data**: Filter, clean, and prepare images for training.
50
- 3. **Train Model**: Train the hierarchical classification model.
51
- 4. **Download Weights**: Fetch pre-trained weights for the detection model.
52
- 5. **Test Model**: Evaluate the performance of the trained model.
53
- 6. **Run Inference**: Run the full pipeline on a video file for real-world application.
54
-
55
- ## How to Use
56
-
57
- ### Prerequisites
58
-
59
- - Python 3.8+
60
- - `venv` for creating a virtual environment (recommended)
61
-
62
- ### Setup
63
-
64
- 1. **Create and activate a virtual environment:**
65
- ```bash
66
- python3 -m venv venv
67
- source venv/bin/activate
68
- ```
69
-
70
- 2. **Install the required packages:**
71
- ```bash
72
- pip install bplusplus
73
- ```
74
-
75
- ### Running the Pipeline
76
-
77
- The entire workflow is contained within **`full_pipeline.ipynb`**. Open it with a Jupyter Notebook or JupyterLab environment and run the cells sequentially to execute the full pipeline.
78
-
79
- ### Customization
80
-
81
- To train the model on different insect species, simply modify the `names` list in **Step 1** of the notebook:
82
-
83
- ```python
84
- # a/full_pipeline.ipynb
85
-
86
- # To use your own species, change the names in this list
87
- names = [
88
- "Vespa crabro", "Vespula vulgaris", "Dolichovespula media"
89
- ]
90
- ```
91
-
92
- The pipeline will automatically handle the rest, from data collection to training, for your new set of species.
93
-
94
- ## Directory Structure
95
-
96
- The pipeline will create the following directories to store artifacts:
97
-
98
- - `GBIF_data/`: Stores the raw images downloaded from GBIF.
99
- - `prepared_data/`: Contains the cleaned, cropped, and resized images ready for training.
100
- - `trained_model/`: Saves the trained model weights (`best_multitask.pt`) and pre-trained detection weights.
101
-
bplusplus-1.2.3/README.md DELETED
@@ -1,69 +0,0 @@
1
- # Domain-Agnostic Insect Classification Pipeline
2
-
3
- This project provides a complete, end-to-end pipeline for building a custom insect classification system. The framework is designed to be **domain-agnostic**, allowing you to train a powerful detection and classification model for **any insect species** by simply providing a list of names.
4
-
5
- Using the `Bplusplus` library, this pipeline automates the entire machine learning workflow, from data collection to video inference.
6
-
7
- ## Key Features
8
-
9
- - **Automated Data Collection**: Downloads hundreds of images for any species from the GBIF database.
10
- - **Intelligent Data Preparation**: Uses a pre-trained model to automatically find, crop, and resize insects from raw images, ensuring high-quality training data.
11
- - **Hierarchical Classification**: Trains a model to identify insects at three taxonomic levels: **family, genus, and species**.
12
- - **Video Inference & Tracking**: Processes video files to detect, classify, and track individual insects over time, providing aggregated predictions.
13
- ## Pipeline Overview
14
-
15
- The process is broken down into six main steps, all detailed in the `full_pipeline.ipynb` notebook:
16
-
17
- 1. **Collect Data**: Select your target species and fetch raw insect images from the web.
18
- 2. **Prepare Data**: Filter, clean, and prepare images for training.
19
- 3. **Train Model**: Train the hierarchical classification model.
20
- 4. **Download Weights**: Fetch pre-trained weights for the detection model.
21
- 5. **Test Model**: Evaluate the performance of the trained model.
22
- 6. **Run Inference**: Run the full pipeline on a video file for real-world application.
23
-
24
- ## How to Use
25
-
26
- ### Prerequisites
27
-
28
- - Python 3.8+
29
- - `venv` for creating a virtual environment (recommended)
30
-
31
- ### Setup
32
-
33
- 1. **Create and activate a virtual environment:**
34
- ```bash
35
- python3 -m venv venv
36
- source venv/bin/activate
37
- ```
38
-
39
- 2. **Install the required packages:**
40
- ```bash
41
- pip install bplusplus
42
- ```
43
-
44
- ### Running the Pipeline
45
-
46
- The entire workflow is contained within **`full_pipeline.ipynb`**. Open it with a Jupyter Notebook or JupyterLab environment and run the cells sequentially to execute the full pipeline.
47
-
48
- ### Customization
49
-
50
- To train the model on different insect species, simply modify the `names` list in **Step 1** of the notebook:
51
-
52
- ```python
53
- # a/full_pipeline.ipynb
54
-
55
- # To use your own species, change the names in this list
56
- names = [
57
- "Vespa crabro", "Vespula vulgaris", "Dolichovespula media"
58
- ]
59
- ```
60
-
61
- The pipeline will automatically handle the rest, from data collection to training, for your new set of species.
62
-
63
- ## Directory Structure
64
-
65
- The pipeline will create the following directories to store artifacts:
66
-
67
- - `GBIF_data/`: Stores the raw images downloaded from GBIF.
68
- - `prepared_data/`: Contains the cleaned, cropped, and resized images ready for training.
69
- - `trained_model/`: Saves the trained model weights (`best_multitask.pt`) and pre-trained detection weights.
@@ -1,5 +0,0 @@
1
- from .collect import Group, collect
2
- from .prepare import prepare
3
- from .train import train
4
- from .test import test
5
- from .inference import inference