deepal6 1.0.0__tar.gz

deepal6-1.0.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Bob Philip Aila — AIMS Rwanda
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
deepal6-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,530 @@
1
+ Metadata-Version: 2.4
2
+ Name: deepal6
3
+ Version: 1.0.0
4
+ Summary: DeepAL6 — Deep Active Learning — 6 query strategies vs random baseline (AIMS Rwanda Thesis)
5
+ Home-page: https://github.com/bobphilip/deepal6
6
+ Author: Bob Philip Aila
7
+ Author-email:
8
+ License: MIT
9
+ Keywords: active learning,deep learning,machine learning,uncertainty sampling
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
13
+ Requires-Python: >=3.8
14
+ Description-Content-Type: text/markdown
15
+ License-File: LICENSE
16
+ Requires-Dist: numpy>=1.21
17
+ Requires-Dist: torch>=1.12
18
+ Requires-Dist: scikit-learn>=1.0
19
+ Requires-Dist: matplotlib>=3.4
20
+ Provides-Extra: image
21
+ Requires-Dist: torchvision>=0.13; extra == "image"
22
+ Requires-Dist: Pillow>=9.0; extra == "image"
23
+ Provides-Extra: dev
24
+ Requires-Dist: pytest>=7.0; extra == "dev"
25
+ Requires-Dist: pytest-cov; extra == "dev"
26
+ Requires-Dist: twine; extra == "dev"
27
+ Requires-Dist: build; extra == "dev"
28
+ Dynamic: home-page
29
+ Dynamic: license-file
30
+ Dynamic: requires-python
31
+
32
+ # DeepAL6 — Deep Active Learning Library
33
+
34
+ <p align="center">
35
+ <img src="https://img.shields.io/badge/python-3.8%2B-blue" alt="Python">
36
+ <img src="https://img.shields.io/badge/PyTorch-1.12%2B-orange" alt="PyTorch">
37
+ <img src="https://img.shields.io/badge/license-MIT-green" alt="License">
38
+ <img src="https://img.shields.io/badge/version-1.0.0-purple" alt="Version">
39
+ <img src="https://img.shields.io/badge/strategies-6-red" alt="Strategies">
40
+ </p>
41
+
42
+ <p align="center">
43
+ <b>Bob Philip Aila · AIMS Rwanda</b><br>
44
+ Pool-based deep active learning: 5 informed query strategies benchmarked against a Random baseline (6 strategies in total),<br>
45
+ for tabular and image domains.
46
+ </p>
47
+
48
+ ---
49
+
50
+ ## Table of Contents
51
+
52
+ - [Overview](#overview)
53
+ - [Installation](#installation)
54
+ - [Quick Start](#quick-start)
55
+ - [Tabular Data](#tabular-german-credit--any-binary-classification)
56
+ - [Image Data](#image-nih-chest-x-ray--any-binary-image-dataset)
57
+ - [Strategies](#strategies)
58
+ - [ALConfig Parameters](#alconfig-parameters)
59
+ - [Available Plots](#available-plots)
60
+ - [Custom Strategies](#custom-strategies)
61
+ - [Batch Size Ablation](#batch-size-ablation)
62
+ - [Loading Any Data Type](#loading-any-data-type)
63
+ - [Design Principles](#design-principles)
64
+ - [Project Structure](#project-structure)
65
+ - [Common Errors](#common-errors)
66
+ - [Citation](#citation)
67
+ - [License](#license)
68
+
69
+ ---
70
+
71
+ ## Overview
72
+
73
+ `deepal6` is a flexible, research-grade active learning framework built for reproducible experimentation. It lets you:
74
+
75
+ - Run **6 query strategies** — Random, Entropy, Margin, BALD, CoreSet, BADGE — under identical budget protocols
76
+ - Work with **tabular data** (CreditNet) or **image data** (ResNet-18 + Dropout head)
77
+ - Control every parameter: initial size, batch size, rounds, seeds, augmentations, and more
78
+ - Get **publication-quality plots** — learning curves, strategy-vs-random gap, calibration (ECE)
79
+ - Extend with **custom strategies** via a one-line registration API
80
+ - Compare strategies fairly with **stratified initial draws** and **class-weighted loss**
81
+
82
+ ---
83
+
84
+ ## Installation
85
+
86
+ ### Standard install (recommended)
87
+
88
+ ```bash
89
+ pip install deepal6
90
+ ```
91
+
92
+ ### With image support (ResNet-18 / NIH Chest X-ray)
93
+
94
+ ```bash
95
+ pip install "deepal6[image]"
96
+ ```
97
+
98
+ ### Development install (editable, from source)
99
+
100
+ ```bash
101
+ git clone https://github.com/bob-aila/deepal6.git
102
+ cd deepal6
103
+ pip install -e .
104
+
105
+ # With image support
106
+ pip install -e ".[image]"
107
+ ```
108
+
109
+ ### Verify the install
110
+
111
+ ```python
112
+ from deepal6 import list_strategies
113
+ print(list_strategies())
114
+ # ['Random', 'Entropy', 'Margin', 'BALD', 'CoreSet', 'BADGE']
115
+ ```
116
+
117
+ ### Requirements
118
+
119
+ | Package | Minimum version |
120
+ |---|---|
121
+ | Python | 3.8+ |
122
+ | PyTorch | 1.12+ |
123
+ | scikit-learn | 1.0+ |
124
+ | numpy | 1.21+ |
125
+ | matplotlib | 3.4+ |
126
+ | torchvision *(image only)* | 0.13+ |
127
+ | Pillow *(image only)* | 9.0+ |
128
+
129
+ ---
130
+
131
+ ## Quick Start
132
+
133
+ ### Tabular (German Credit / any binary classification)
134
+
135
+ ```python
136
+ import numpy as np
137
+ from sklearn.preprocessing import StandardScaler
138
+ from sklearn.model_selection import train_test_split
139
+ from deepal6 import ActiveLearner, TabularDataModule, ALConfig
140
+
141
+ # 1. Prepare data: X is your 2-D feature array and y your binary 0/1 labels (both assumed to be defined already)
142
+ X_train, X_test, y_train, y_test = train_test_split(
143
+ X, y, test_size=0.2, stratify=y, random_state=42)
144
+ scaler = StandardScaler()
145
+ X_train = scaler.fit_transform(X_train)
146
+ X_test = scaler.transform(X_test)
147
+
148
+ # 2. Wrap in DataModule
149
+ data = TabularDataModule(X_train, y_train, X_test, y_test, pos_label=0)
150
+
151
+ # 3. Configure experiment
152
+ config = ALConfig(
153
+ strategy = ['Random', 'Entropy', 'BALD', 'CoreSet', 'BADGE'],
154
+ initial_size = 50, # stratified initial labeled set
155
+ batch_size = 20, # samples queried per round
156
+ n_rounds = 20, # AL rounds
157
+ n_seeds = 5, # independent runs for mean ± std
158
+ train_epochs = 50,
159
+ )
160
+
161
+ # 4. Run
162
+ learner = ActiveLearner(data, config)
163
+ results = learner.run()
164
+
165
+ # 5. Visualise
166
+ learner.plot(results, metric='auc')
167
+ learner.summary_table(results)
168
+ ```
169
+
170
+ ---
171
+
172
+ ### Image (NIH Chest X-ray / any binary image dataset)
173
+
174
+ ```python
175
+ import pandas as pd
176
+ from deepal6 import ActiveLearner, ImageDataModule, ALConfig
177
+
178
+ # DataFrame with 'filepath' and 'label' columns
179
+ train_df = pd.DataFrame({'filepath': [...], 'label': [0, 1, ...]})
180
+ test_df = pd.DataFrame({'filepath': [...], 'label': [0, 1, ...]})
181
+
182
+ # Optional: custom augmentations
183
+ from torchvision import transforms
184
+ my_aug = transforms.Compose([
185
+ transforms.Resize((256, 256)),
186
+ transforms.RandomHorizontalFlip(),
187
+ transforms.RandomRotation(10),
188
+ transforms.ColorJitter(brightness=0.2, contrast=0.2),
189
+ transforms.ToTensor(),
190
+ transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
191
+ ])
192
+
193
+ data = ImageDataModule(
194
+ train_df, test_df,
195
+ train_transform = my_aug,
196
+ img_size = 256,
197
+ pos_label = 1, # minority / positive class
198
+ )
199
+
200
+ config = ALConfig(
201
+ strategy = ['Random', 'BALD', 'CoreSet', 'BADGE'],
202
+ initial_size = 50,
203
+ batch_size = 20,
204
+ n_rounds = 20,
205
+ n_seeds = 5,
206
+ train_epochs = 10, # fewer epochs — fine-tuning pretrained weights
207
+ lr = 1e-4, # lower LR for ResNet-18
208
+ dropout_rate = 0.4,
209
+ )
210
+
211
+ learner = ActiveLearner(data, config)
212
+ results = learner.run()
213
+ learner.plot(results, metric='auc', show_std=True)
214
+ ```
215
+
216
+ ---
217
+
218
+ ## Strategies
219
+
220
+ | Strategy | Type | Selection Criterion | When it shines |
221
+ |---|---|---|---|
222
+ | **Random** | Baseline | Uniform draw — no model used | Always run as lower bound |
223
+ | **Entropy** | Uncertainty | Highest Shannon entropy H[y\|x] | Well-calibrated models |
224
+ | **Margin** | Uncertainty | Smallest gap \|p − 0.5\| | Fast, low compute |
225
+ | **BALD** | Bayesian | Mutual information I[y;θ\|x,D] via MC Dropout | Best epistemic uncertainty estimate |
226
+ | **CoreSet** | Diversity | Greedy k-center in embedding space | Rich embeddings (image > tabular) |
227
+ | **BADGE** | Hybrid | Gradient embeddings + k-means++ | Combines uncertainty + diversity |
228
+
229
+ > **Note:** BALD requires `dropout_rate > 0` (default 0.3). All strategies use the same
230
+ > budget, training procedure, and evaluation protocol so results are directly comparable.
231
+
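The three uncertainty criteria in the table can be written down in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the code in `deepal6/strategies/query.py`: the function names are hypothetical, `probs` is assumed to be the positive-class probability per pool sample, and `mc_probs` is assumed to hold one row per MC Dropout pass.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a Bernoulli(p) prediction."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def entropy_scores(probs):
    """Higher score = more uncertain."""
    return binary_entropy(np.asarray(probs, dtype=float))

def margin_scores(probs):
    """Lower score = closer to the 0.5 decision boundary."""
    return np.abs(np.asarray(probs, dtype=float) - 0.5)

def bald_scores(mc_probs):
    """Mutual information estimated from MC Dropout samples.

    mc_probs has shape (mc_passes, n_samples); each row is one stochastic
    forward pass. Score = H[mean prediction] - mean of per-pass entropies.
    """
    mc_probs = np.asarray(mc_probs, dtype=float)
    return binary_entropy(mc_probs.mean(axis=0)) - binary_entropy(mc_probs).mean(axis=0)

probs = np.array([0.95, 0.52, 0.10, 0.40])
top2_entropy = np.argsort(-entropy_scores(probs))[:2]  # most uncertain first
top2_margin = np.argsort(margin_scores(probs))[:2]     # smallest margin first

# Sample 0: both passes say 0.5 (uncertain, but the passes agree).
# Sample 1: the passes disagree (0.1 vs 0.9), so mutual information is high.
mc = np.array([[0.5, 0.1],
               [0.5, 0.9]])
bald = bald_scores(mc)
```

For binary problems, entropy and margin rank samples identically, since entropy decreases monotonically in |p − 0.5|; here both pick samples 1 and 3. BALD differs because it only rewards disagreement between passes: sample 0 scores about zero even though its mean prediction is 0.5.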
232
+ ---
233
+
234
+ ## ALConfig Parameters
235
+
236
+ ```python
237
+ from deepal6 import ALConfig
238
+
239
+ config = ALConfig(
240
+ strategy = ['Random', 'BALD'], # or a single string: 'BALD'
241
+ initial_size = 50,
242
+ batch_size = 20,
243
+ n_rounds = 20,
244
+ n_seeds = 5,
245
+ train_epochs = 50,
246
+ lr = 1e-3,
247
+ weight_decay = 1e-4,
248
+ dropout_rate = 0.3,
249
+ mc_passes = 20,
250
+ train_batch_size = 32,
251
+ device = None, # auto-detect GPU/CPU
252
+ seed = 42,
253
+ verbose = True,
254
+ save_checkpoints = False,
255
+ checkpoint_dir = './checkpoints',
256
+ )
257
+ print(config.summary()) # prints a formatted parameter table
258
+ print(config.total_budget) # initial_size + n_rounds * batch_size
259
+ ```
260
+
261
+ | Parameter | Default | Description |
262
+ |---|---|---|
263
+ | `strategy` | all 6 | Strategy name(s) to run |
264
+ | `initial_size` | 50 | Stratified initial labeled set size |
265
+ | `batch_size` | 20 | Samples queried per AL round |
266
+ | `n_rounds` | 20 | Maximum AL rounds |
267
+ | `n_seeds` | 5 | Independent runs per strategy (for mean ± std) |
268
+ | `train_epochs` | 50 | Training epochs per round |
269
+ | `lr` | 1e-3 | Adam learning rate |
270
+ | `weight_decay` | 1e-4 | L2 regularisation |
271
+ | `dropout_rate` | 0.3 | Dropout probability (also controls BALD MC stochasticity) |
272
+ | `mc_passes` | 20 | MC Dropout forward passes for BALD |
273
+ | `train_batch_size` | 32 | Mini-batch size during model training |
274
+ | `device` | auto | `'cpu'`, `'cuda'`, or `None` (auto-detect) |
275
+ | `seed` | 42 | Base seed; each of `n_seeds` runs uses `seed + i` |
276
+ | `verbose` | True | Print per-round metrics during experiment |
277
+ | `save_checkpoints` | False | Save best model checkpoint per strategy per seed |
278
+ | `checkpoint_dir` | `'./checkpoints'` | Directory for saved checkpoints |
279
+ | `extra_strategy_kwargs` | `{}` | Extra kwargs forwarded to strategy functions |
280
+
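The budget and seed rules in the table reduce to simple arithmetic. The sketch below uses hypothetical helper names and assumes the per-run offset `i` starts at 0, which the table does not state explicitly.

```python
def labeled_set_sizes(initial_size=50, batch_size=20, n_rounds=20):
    """Labeled-set size before the first query and after each of n_rounds queries."""
    return [initial_size + r * batch_size for r in range(n_rounds + 1)]

def run_seeds(seed=42, n_seeds=5):
    """One seed per independent run: seed + i (assuming i = 0 .. n_seeds - 1)."""
    return [seed + i for i in range(n_seeds)]

sizes = labeled_set_sizes()
total_budget = sizes[-1]   # initial_size + n_rounds * batch_size
seeds = run_seeds()
print(total_budget)  # 450
print(seeds)         # [42, 43, 44, 45, 46]
```

With the defaults, each strategy ends with 450 labeled samples per seed, so the unlabeled pool needs at least that many rows.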
281
+ ---
282
+
283
+ ## Available Plots
284
+
285
+ ```python
286
+ # Learning curves — metric vs labelling budget
287
+ learner.plot(results, metric='auc') # AUC-ROC
288
+ learner.plot(results, metric='bal_acc') # Balanced accuracy
289
+ learner.plot(results, metric='recall') # Recall (minority class)
290
+ learner.plot(results, metric='ece') # Calibration error
291
+
292
+ # Strategy vs random gap (automatically shown when 'Random' is in results)
293
+ # — positive gap means the strategy beats random at that budget point
294
+
295
+ # ECE calibration detail
296
+ learner.plot_calibration(results)
297
+
298
+ # Printed summary table
299
+ learner.summary_table(results, metric='auc')
300
+
301
+ # Save any plot to file
302
+ learner.plot(results, metric='auc', save_path='results/auc_curves.png')
303
+ ```
304
+
305
+ All plots accept `show_std` (default `True`), which shades a ±1 standard deviation band across seeds.
306
+
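For readers unfamiliar with the `ece` metric and the shaded band, here is a minimal NumPy sketch. It is an assumption about the general technique, not the implementation in `deepal6/metrics.py`: it uses equal-width bins on the positive-class probability and compares the mean prediction with the observed positive rate in each bin. The AUC values are made up for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary classification on the positive-class probability."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin ids 0 .. n_bins - 1 (p == 1.0 lands in the last bin).
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

well_calibrated = expected_calibration_error(
    [0.25] * 4 + [0.75] * 4, [0, 0, 0, 1, 1, 1, 1, 0])
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])

# Shaded band: per-budget mean and standard deviation across seeds.
curves = np.array([[0.60, 0.70, 0.75],   # seed 0 (illustrative values)
                   [0.64, 0.72, 0.79]])  # seed 1 (illustrative values)
mean, std = curves.mean(axis=0), curves.std(axis=0)
```

A lower ECE means predicted probabilities track observed frequencies more closely; the first example is perfectly calibrated within its bins, the second is off by 0.45.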
307
+ ---
308
+
309
+ ## Custom Strategies
310
+
311
+ Register your own query strategy with one line and use it alongside the built-ins:
312
+
313
+ ```python
314
+ import numpy as np
315
+ from deepal6.strategies import register_strategy
316
+ from deepal6 import ActiveLearner, ALConfig
317
+
318
+ def my_strategy(model, data, pool_idx, n_query, **kwargs):
319
+     """
320
+     Custom query strategy.
321
+
322
+     Parameters
323
+     ----------
324
+     model : trained PyTorch model
325
+     data : TabularDataModule or ImageDataModule
326
+     pool_idx : list of global indices into the unlabeled pool
327
+     n_query : number of samples to select
328
+     **kwargs : extra kwargs from ALConfig.extra_strategy_kwargs
329
+
330
+     Returns
331
+     -------
332
+     np.ndarray of LOCAL indices into pool_idx (not global indices)
333
+     """
334
+     probs = data.predict_proba(model, pool_idx)
335
+     # your selection logic here ...
336
+     scores = np.abs(probs - 0.5)
337
+     return np.argsort(scores)[:n_query]
338
+
339
+ register_strategy('MyStrategy', my_strategy)
340
+
341
+ # Use exactly like any built-in strategy
342
+ config = ALConfig(strategy=['Random', 'BALD', 'MyStrategy'])
343
+ results = ActiveLearner(data, config).run()
344
+ ```
345
+
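The local-versus-global index convention is easy to get wrong, so here is a self-contained sketch of a diversity-style selection that returns local indices. It illustrates greedy k-center on plain arrays; `k_center_greedy` and the embedding arrays are hypothetical and do not use any `deepal6` API.

```python
import numpy as np

def k_center_greedy(pool_emb, labeled_emb, n_query):
    """Greedy k-center: repeatedly pick the pool point farthest from all centers.

    Returns LOCAL indices (positions within pool_emb), matching the
    return convention described in the docstring above.
    """
    # Distance from each pool point to its nearest labeled point.
    diffs = pool_emb[:, None, :] - labeled_emb[None, :, :]
    min_dist = np.linalg.norm(diffs, axis=2).min(axis=1)
    chosen = []
    for _ in range(n_query):
        i = int(np.argmax(min_dist))
        chosen.append(i)
        # The new center may now be the nearest one for some pool points.
        new_dist = np.linalg.norm(pool_emb - pool_emb[i], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.array(chosen)

pool_idx = np.array([10, 11, 12, 13])  # global indices of the unlabeled rows
pool_emb = np.array([[0.0, 0.1], [5.0, 5.0], [0.1, 0.0], [-4.0, 0.0]])
labeled_emb = np.array([[0.0, 0.0]])

local = k_center_greedy(pool_emb, labeled_emb, n_query=2)
global_idx = pool_idx[local]  # what the learner would move into the labeled set
print(local, global_idx)      # [1 3] [11 13]
```

A strategy function should return `local`; mapping back to global indices with `pool_idx[local]` is shown only to make the distinction concrete.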
346
+ ---
347
+
348
+ ## Batch Size Ablation
349
+
350
+ Study how query batch size affects learning efficiency for your best strategy:
351
+
352
+ ```python
353
+ from deepal6 import ActiveLearner, ALConfig
354
+ from deepal6.plotting import plot_batch_size_ablation
355
+
356
+ ablation = {}
357
+ for b in [10, 20, 50]:
358
+     cfg = ALConfig(strategy='BALD', batch_size=b, n_seeds=3)
359
+     r = ActiveLearner(data, cfg).run()
360
+     ablation[b] = r['BALD']
361
+
362
+ plot_batch_size_ablation(ablation, metric='auc')
363
+ ```
364
+
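One thing to watch in the loop above: with `n_rounds` left at its default of 20, each batch size ends at a different total budget. The sketch below, using hypothetical helper names, shows the arithmetic and how to choose `n_rounds` per batch size if you want every run to stop at the same budget. Whether `plot_batch_size_ablation` aligns curves by budget is not stated here, so treat this as a precaution rather than a requirement.

```python
def total_budget(initial_size, batch_size, n_rounds):
    """Labels consumed by one run: initial set plus all queried batches."""
    return initial_size + n_rounds * batch_size

def rounds_for_budget(target_budget, initial_size, batch_size):
    """Largest n_rounds whose total budget does not exceed target_budget."""
    return (target_budget - initial_size) // batch_size

defaults = {b: total_budget(50, b, 20) for b in [10, 20, 50]}
matched = {b: rounds_for_budget(450, 50, b) for b in [10, 20, 50]}
print(defaults)  # {10: 250, 20: 450, 50: 1050}
print(matched)   # {10: 40, 20: 20, 50: 8}
```

Passing `n_rounds=matched[b]` to `ALConfig` would make all three runs end at 450 labeled samples.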
365
+ ---
366
+
367
+ ## Loading Any Data Type
368
+
369
+ ### Tabular
370
+
371
+ `TabularDataModule` requires **numpy arrays**. The snippets below show how to convert from common sources:
372
+
373
+ ```python
374
+ # From a CSV file
375
+ import pandas as pd
376
+ from sklearn.preprocessing import LabelEncoder, StandardScaler
377
+ from sklearn.model_selection import train_test_split
378
+ from deepal6 import TabularDataModule
379
+
380
+ df = pd.read_csv('your_data.csv')
381
+ df['label'] = df['label'].map({'Good': 1, 'Bad': 0}) # encode target to 0/1
382
+ for col in df.select_dtypes(include='object').columns:
383
+     if col != 'label':
384
+         df[col] = LabelEncoder().fit_transform(df[col])
385
+
386
+ X = df.drop('label', axis=1).values # .values → numpy
387
+ y = df['label'].values
388
+
389
+ X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
390
+ scaler = StandardScaler()
391
+ X_train = scaler.fit_transform(X_train) # fit on train only — no leakage
392
+ X_test = scaler.transform(X_test)
393
+
394
+ data = TabularDataModule(X_train, y_train, X_test, y_test)
395
+ ```
396
+
397
+ | Source | One-liner |
398
+ |---|---|
399
+ | pandas DataFrame | `X_train_df.values` |
400
+ | PyTorch tensor | `X_train_tensor.detach().cpu().numpy()` |
401
+ | sklearn dataset | `bunch.data`, `bunch.target` directly |
402
+ | `.npy` / `.npz` file | `np.load('file.npz')['X_train']` |
403
+
404
+ ### Image
405
+
406
+ `ImageDataModule` accepts two formats:
407
+
408
+ **A) DataFrame with `filepath` + `label` columns** (images on disk):
409
+
410
+ ```python
411
+ import pandas as pd
412
+ from deepal6 import ImageDataModule
413
+
414
+ train_df = pd.DataFrame({
415
+ 'filepath': ['/data/train/img001.png', ...],
416
+ 'label': [0, 1, ...]
417
+ })
418
+ data = ImageDataModule(train_df, test_df)
419
+ ```
420
+
421
+ **B) Any PyTorch `Dataset` with a `.get_labels()` method** (in-memory / MedMNIST):
422
+
423
+ ```python
424
+ from torch.utils.data import Dataset
425
+ from PIL import Image
426
+ from torchvision import transforms
427
+ import numpy as np
428
+
429
+ class NumpyImageDataset(Dataset):
430
+     def __init__(self, images, labels):
431
+         self.images = images.astype(np.uint8)
432
+         self._labels = labels.astype(int)
433
+         self.tf = transforms.Compose([
434
+             transforms.Resize((64, 64)),
435
+             transforms.ToTensor(),
436
+             transforms.Normalize([0.5]*3, [0.5]*3),
437
+         ])
438
+     def __len__(self): return len(self.images)
439
+     def __getitem__(self, i):
440
+         return self.tf(Image.fromarray(self.images[i]).convert('RGB')), int(self._labels[i])  # RGB: Normalize expects 3 channels
441
+     def get_labels(self): # ← required by deepal6
442
+         return self._labels
443
+
444
+ data = ImageDataModule(
445
+ NumpyImageDataset(X_train_imgs, y_train),
446
+ NumpyImageDataset(X_test_imgs, y_test),
447
+ )
448
+ ```
449
+
450
+ ---
451
+
452
+ ## Design Principles
453
+
454
+ | Decision | Reason |
455
+ |---|---|
456
+ | **Stratified initial draw** | Prevents the random baseline from winning through a lucky class balance in the initial labeled set L₀ |
457
+ | **Class-weighted BCE loss** | Prevents majority-class collapse on small labeled sets |
458
+ | **Fresh model each round** | No representation drift — tabular: random init, image: ImageNet weights |
459
+ | **Unified strategy interface** | `fn(model, data, pool_idx, n_query, **kwargs) → np.ndarray` |
460
+ | **DataModule pattern** | Decouples data I/O from strategy logic — easy to extend |
461
+ | **Fail-fast validation** | All config/data errors surface before the experiment starts |
462
+
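The first two design decisions can be illustrated with NumPy alone. This is a sketch under assumptions, not the library's code: the helper names are hypothetical, the positive class is assumed to be labelled 1, and the class weight uses the common `n_negative / n_positive` rule.

```python
import numpy as np

def stratified_initial_draw(y, initial_size, rng):
    """Sample indices so each class appears in roughly its pool proportion."""
    y = np.asarray(y)
    chosen = []
    classes, counts = np.unique(y, return_counts=True)
    for cls, count in zip(classes, counts):
        # At least one example per class, otherwise proportional to frequency.
        n_cls = max(1, int(round(initial_size * count / len(y))))
        cls_idx = np.flatnonzero(y == cls)
        chosen.extend(rng.choice(cls_idx, size=min(n_cls, count), replace=False))
    return np.array(chosen)

def positive_class_weight(y_labeled):
    """Weight for the positive class in a weighted BCE loss: n_negative / n_positive."""
    y_labeled = np.asarray(y_labeled)
    n_pos = int((y_labeled == 1).sum())
    n_neg = int((y_labeled == 0).sum())
    return n_neg / max(n_pos, 1)

rng = np.random.default_rng(42)
y = np.array([0] * 700 + [1] * 300)           # 70/30 class split
init_idx = stratified_initial_draw(y, initial_size=50, rng=rng)
weight = positive_class_weight(y[init_idx])   # 35 negatives / 15 positives
```

Because of rounding, the drawn set can differ from `initial_size` by a sample or two for other class ratios; a production version would need to correct for that.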
463
+ ---
464
+
465
+ ## Project Structure
466
+
467
+ ```
468
+ deepal6/
469
+ ├── __init__.py ← Public API — import everything from here
470
+ ├── config.py ← ALConfig — all experiment parameters + validation
471
+ ├── learner.py ← ActiveLearner — main experiment loop
472
+ ├── metrics.py ← ECE, AULC, aggregate_seeds, summary table
473
+ ├── plotting.py ← Learning curves, gap plot, calibration plots
474
+ ├── exceptions.py ← DeepAL6Error, ConfigurationError, DataError, ...
475
+ ├── data/
476
+ │ ├── base.py ← BaseDataModule (abstract interface)
477
+ │ ├── tabular.py ← TabularDataModule + CreditNet architecture
478
+ │ └── image.py ← ImageDataModule + ResNet-18 with Dropout head
479
+ ├── strategies/
480
+ │ ├── __init__.py ← STRATEGIES registry + register_strategy()
481
+ │ ├── registry.py ← Re-export for config
482
+ │ └── query.py ← All 6 strategy implementations
483
+ └── models/
484
+ └── __init__.py ← CreditNet, build_resnet18 re-exports
485
+ ```
486
+
487
+ ---
488
+
489
+ ## Common Errors
490
+
491
+ | Error | Cause | Fix |
492
+ |---|---|---|
493
+ | `ConfigurationError: Unknown strategy` | Typo in strategy name | Names are case-sensitive: `'BALD'` not `'bald'` |
494
+ | `DataError: NaN values` | Missing values in features | Impute or drop NaNs before passing to `TabularDataModule` |
495
+ | `DataError: length mismatch` | X and y have different lengths | Check your train/test split code |
496
+ | `DataError: feature counts` | X_train and X_test have different columns | Apply the same scaler/encoder to both |
497
+ | `ModelError: build_model failed` | CUDA out of memory | Set `device='cpu'` or reduce `train_batch_size` |
498
+ | `StrategyError: no Dropout layers` | BALD with `dropout_rate=0` | Set `dropout_rate > 0` (default: 0.3) |
499
+ | `ImportError: torchvision` | Image support not installed | `pip install "deepal6[image]"` |
500
+ | `DataError: missing get_labels()` | Custom Dataset missing method | Add `def get_labels(self): return self._labels` |
501
+
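Several of the `DataError` rows above can be caught before constructing a `TabularDataModule`. The following pre-flight check is a hypothetical helper written for illustration; its messages echo the table but it is not the validation that `deepal6` itself performs.

```python
import numpy as np

def check_tabular_inputs(X_train, y_train, X_test, y_test):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    if np.isnan(X_train).any() or np.isnan(X_test).any():
        problems.append("NaN values in features")
    if len(X_train) != len(y_train) or len(X_test) != len(y_test):
        problems.append("X and y length mismatch")
    if X_train.ndim != 2 or X_test.ndim != 2 or X_train.shape[1] != X_test.shape[1]:
        problems.append("train/test feature counts differ")
    if not set(np.unique(y_train)) <= {0, 1}:
        problems.append("labels are not binary 0/1")
    return problems

good = check_tabular_inputs(np.zeros((4, 3)), [0, 1, 0, 1], np.zeros((2, 3)), [0, 1])
bad = check_tabular_inputs([[1.0, np.nan]], [0, 1], np.zeros((2, 3)), [0, 1])
print(good)  # []
print(bad)
```

The second call trips three checks at once: a NaN feature, one training row against two labels, and two training columns against three test columns.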
502
+ ---
503
+
504
+ ## Citation
505
+
506
+ If you use DeepAL6 in your research, please cite:
507
+
508
+ ```bibtex
509
+ @misc{aila2025deepal6,
510
+ author = {Aila, Bob Philip},
511
+ title = {DeepAL6: A Deep Active Learning Library for Tabular and Image Domains},
512
+ year = {2025},
513
+ publisher = {GitHub},
514
+ howpublished = {\url{https://github.com/YOUR_USERNAME/deepal6}},
515
+ note = {AIMS Rwanda Master's Thesis}
516
+ }
517
+ ```
518
+
519
+ > The full citation will be updated once the thesis is published. Check back here or contact the author.
520
+
521
+ ---
522
+
523
+ ## License
524
+
525
+ This project is licensed under the **MIT License** — see the [LICENSE](LICENSE) file for details.
526
+
527
+ ---
528
+ <p align="center">
529
+ Let's save the budget by labeling data informatively.
530
+ </p>