audiotimm 1.0.0__tar.gz

@@ -0,0 +1,5 @@
1
+ include README.md
2
+ include PLAN.md
3
+ include pyproject.toml
4
+ recursive-include audiotimm/data *
5
+ recursive-include tests *.py
@@ -0,0 +1,426 @@
1
+ Metadata-Version: 2.4
2
+ Name: audiotimm
3
+ Version: 1.0.0
4
+ Summary: The model hub for audio intelligence - timm for audio classification
5
+ License: Apache-2.0
6
+ Keywords: audio,classification,deep-learning,sound,machine-learning,audiotimm
7
+ Classifier: Development Status :: 5 - Production/Stable
8
+ Classifier: Intended Audience :: Developers
9
+ Classifier: Intended Audience :: Science/Research
10
+ Classifier: License :: OSI Approved :: Apache Software License
11
+ Classifier: Programming Language :: Python :: 3
12
+ Classifier: Programming Language :: Python :: 3.9
13
+ Classifier: Programming Language :: Python :: 3.10
14
+ Classifier: Programming Language :: Python :: 3.11
15
+ Classifier: Programming Language :: Python :: 3.12
16
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
17
+ Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
18
+ Requires-Python: >=3.9
19
+ Description-Content-Type: text/markdown
20
+ Requires-Dist: torch>=2.0
21
+ Requires-Dist: torchaudio>=2.0
22
+ Requires-Dist: numpy>=1.24
23
+ Requires-Dist: huggingface_hub>=0.20
24
+ Requires-Dist: torchlibrosa>=0.1.0
25
+ Provides-Extra: transformers
26
+ Requires-Dist: transformers>=4.35; extra == "transformers"
27
+ Provides-Extra: clap
28
+ Requires-Dist: transformers>=4.35; extra == "clap"
29
+ Requires-Dist: laion-clap>=1.1.0; extra == "clap"
30
+ Provides-Extra: speech
31
+ Requires-Dist: transformers>=4.35; extra == "speech"
32
+ Provides-Extra: whisper
33
+ Requires-Dist: transformers>=4.35; extra == "whisper"
34
+ Provides-Extra: train
35
+ Requires-Dist: torchmetrics>=1.0; extra == "train"
36
+ Requires-Dist: tqdm>=4.0; extra == "train"
37
+ Provides-Extra: onnx
38
+ Requires-Dist: onnxruntime>=1.16; extra == "onnx"
39
+ Requires-Dist: onnx>=1.14; extra == "onnx"
40
+ Provides-Extra: stream
41
+ Requires-Dist: sounddevice>=0.4; extra == "stream"
42
+ Provides-Extra: domains
43
+ Provides-Extra: dev
44
+ Requires-Dist: pytest>=7.0; extra == "dev"
45
+ Requires-Dist: pytest-cov; extra == "dev"
46
+ Requires-Dist: ruff; extra == "dev"
47
+ Requires-Dist: mypy; extra == "dev"
48
+ Requires-Dist: tqdm>=4.0; extra == "dev"
49
+
50
+ <div align="center">
51
+
52
+ # 🎧 audiotimm
53
+
54
+ **The Model Hub for Audio Intelligence**
55
+
56
+ *`timm` for audio: one registry, every architecture, one clean API.*
57
+
58
+ [![PyPI](https://img.shields.io/pypi/v/audiotimm?style=flat-square&color=orange&label=PyPI)](https://pypi.org/project/audiotimm/)
59
+ [![Downloads](https://img.shields.io/pypi/dm/audiotimm?style=flat-square&label=downloads%2Fmonth&color=blue)](https://pypi.org/project/audiotimm/)
60
+ [![Python](https://img.shields.io/badge/python-3.9%2B-blue?style=flat-square&logo=python)](https://www.python.org)
61
+ [![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-EE4C2C?style=flat-square&logo=pytorch)](https://pytorch.org)
62
+ [![License](https://img.shields.io/badge/license-Apache%202.0-green?style=flat-square)](LICENSE)
63
+ [![Version](https://img.shields.io/badge/version-1.0.0-blueviolet?style=flat-square)]()
64
+ [![Phase](https://img.shields.io/badge/v1.0.0%20%E2%80%94%20stable-brightgreen?style=flat-square)]()
65
+
66
+ </div>
67
+
68
+ ---
69
+
70
+ ## What is audiotimm?
71
+
72
+ `audiotimm` is a standalone Python library that lets you classify, tag, detect events in, and extract embeddings from audio, in one line, using state-of-the-art pretrained models. It is modeled on the philosophy of [`timm`](https://github.com/huggingface/pytorch-image-models): a unified registry where every model family (PANNs, AST, BEATs, HTS-AT, CLAP, Wav2Vec2, WavLM, Whisper, …) is accessible through a single, stable API.
73
+
74
+ ```python
75
+ from audiotimm import Classifier
76
+
77
+ clf = Classifier.load() # default: panns-cnn14
78
+ result = clf.predict("dog.wav")
79
+
80
+ result.top(5) # [(label, score), ...]
81
+ result.label # "Dog"
82
+ result.scores # {"Dog": 0.94, "Animal": 0.72, ...}
83
+ ```
84
+
85
+ ---
86
+
87
+ ## Highlights
88
+
89
+ | | |
90
+ |---|---|
91
+ | **One line to classify** | `Classifier.load().predict("x.wav").top(3)` (weights download and cache automatically) |
92
+ | **Every major architecture** | PANNs, YAMNet, AST, BEATs, HTS-AT, AudioMAE, CLAP, Wav2Vec2, HuBERT, WavLM, Whisper |
93
+ | **Lean core** | Zero heavy deps at import time: `torch + torchaudio + numpy` only for the default model |
94
+ | **Rich result object** | `.top(k)`, `.above(thresh)`, `.label`, `.scores`, `.as_dict()`, `.embed()` |
95
+ | **Extensible** | `@register_model` decorator to plug in custom architectures |
96
+ | **CLI included** | `audiotimm predict dog.wav --top 5` |
97
+
98
+ ---
99
+
100
+ ## Installation
101
+
102
+ ```bash
103
+ # Core (PANNs CNN-family, Wave M0)
+ pip install audiotimm
+
+ # Extras are quoted so that shells such as zsh do not treat the brackets as a glob.
+ # + Transformer taggers: AST, BEATs, HTS-AT, AudioMAE (Wave M1)
+ pip install "audiotimm[transformers]"
+
+ # + Zero-shot classification via CLAP (Wave M2)
+ pip install "audiotimm[clap]"
+
+ # + Speech SSL backbones: Wav2Vec2, HuBERT, WavLM (Wave M3)
+ pip install "audiotimm[speech]"
+
+ # + Whisper ASR + encoder embeddings (Wave M4)
+ pip install "audiotimm[whisper]"
+
+ # + Training utilities
+ pip install "audiotimm[train]"
+
+ # + ONNX edge export
+ pip install "audiotimm[onnx]"
+
+ # Everything
+ pip install "audiotimm[transformers,clap,speech,whisper,train,onnx]"
126
+ ```
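
Since each model wave sits behind an optional extra, it can be useful to check which extras are importable before requesting a model. The snippet below is only a sketch and not part of the audiotimm API: `EXTRA_MODULES` and `missing_modules` are hypothetical names, and the module names are assumed to be the import names of the packages listed under `Requires-Dist` above.

```python
import importlib.util

# Hypothetical mapping from an extra to the top-level modules it should provide.
EXTRA_MODULES = {
    "transformers": ["transformers"],
    "clap": ["transformers", "laion_clap"],
    "train": ["torchmetrics", "tqdm"],
    "onnx": ["onnxruntime", "onnx"],
}


def missing_modules(extra):
    """Return the modules of `extra` that cannot be found in this environment."""
    return [
        name
        for name in EXTRA_MODULES[extra]
        if importlib.util.find_spec(name) is None
    ]


for extra in EXTRA_MODULES:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"audiotimm[{extra}]: {status}")
```

`importlib.util.find_spec` only looks the module up; it does not import it, so the check stays cheap even for heavy packages.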
127
+
128
+ ---
129
+
130
+ ## Quick Start
131
+
132
+ ### Classify a file
133
+
134
+ ```python
135
+ from audiotimm import Classifier
136
+
137
+ clf = Classifier.load() # panns-cnn14 by default
138
+ result = clf.predict("siren.wav")
139
+
140
+ print(result.top(5))
141
+ # [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74), ...]
142
+
143
+ print(result.label) # "Siren"
144
+ print(result.score) # 0.93
145
+ ```
146
+
147
+ ### Batch classification
148
+
149
+ ```python
150
+ results = clf.predict(["a.wav", "b.wav", "c.wav"])
151
+ print(results.labels()) # ["Dog", "Car horn", "Rain"]
152
+ ```
153
+
154
+ ### Only results above a threshold
155
+
156
+ ```python
157
+ result.above(0.5)
158
+ # [("Siren", 0.93), ("Emergency vehicle", 0.81), ("Vehicle", 0.74)]
159
+ ```
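
To make the semantics of `.top(k)` and `.above(thresh)` concrete, here is a minimal sketch that operates on a plain `{label: score}` dict. It illustrates the idea and is not the library's implementation; in particular, treating the threshold as inclusive and sorting the output best-first are assumptions.

```python
def top(scores, k=5):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def above(scores, threshold):
    """Return every (label, score) pair with score >= threshold, best first."""
    return [(label, s) for label, s in top(scores, k=len(scores)) if s >= threshold]


scores = {"Siren": 0.93, "Emergency vehicle": 0.81, "Vehicle": 0.74, "Speech": 0.05}
print(top(scores, 2))      # [('Siren', 0.93), ('Emergency vehicle', 0.81)]
print(above(scores, 0.5))  # [('Siren', 0.93), ('Emergency vehicle', 0.81), ('Vehicle', 0.74)]
```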
160
+
161
+ ### Get embeddings
162
+
163
+ ```python
164
+ emb = clf.embed("dog.wav") # np.ndarray shape (2048,) for panns-cnn14
165
+ ```
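
A common use of such embeddings is comparing two clips by cosine similarity. The sketch below uses only the standard library and toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` is a hypothetical helper, not something the package is known to export.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings.
dog_a = [0.9, 0.1, 0.0]
dog_b = [0.8, 0.2, 0.1]
rain = [0.0, 0.1, 0.9]
print(cosine_similarity(dog_a, dog_b) > cosine_similarity(dog_a, rain))  # True
```

The function accepts any sequence of floats, so it also works on a 1-D NumPy array, although a vectorized version would be faster for large batches.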
166
+
167
+ ### Switch models
168
+
169
+ ```python
170
+ # High accuracy transformer (requires pip install audiotimm[transformers])
171
+ clf = Classifier.load("ast-10-10")
172
+
173
+ # Lightweight 16 kHz variant of PANNs
174
+ clf = Classifier.load("panns-cnn14-16k")
175
+ ```
176
+
177
+ ### CLI
178
+
179
+ ```bash
180
+ # ── predict ──────────────────────────────────────────────────────────
181
+ # Basic classification
182
+ audiotimm predict siren.wav
183
+
184
+ # Top-10 results
185
+ audiotimm predict siren.wav --top 10
186
+
187
+ # Show only labels above a confidence threshold
188
+ audiotimm predict siren.wav --threshold 0.3
189
+
190
+ # Use a specific model
191
+ audiotimm predict siren.wav --model ast-10-10
192
+
193
+ # Batch: processes all files, shows per-file results
194
+ audiotimm predict audio/*.wav --model panns-cnn14
195
+
196
+ # JSON output (single file or batch)
197
+ audiotimm predict siren.wav --json
198
+ audiotimm predict audio/*.wav --json --output results.jsonl
199
+
200
+ # Run on GPU
201
+ audiotimm predict siren.wav --model beats-iter3plus-as2m-cpt2 --device cuda
202
+
203
+ # ── embed ────────────────────────────────────────────────────────────
204
+ # Print embedding stats to stdout
205
+ audiotimm embed dog.wav
206
+
207
+ # Save single embedding as .npy
208
+ audiotimm embed dog.wav --output dog.npy
209
+
210
+ # Save batch as compressed .npz (keys = file stems)
211
+ audiotimm embed audio/*.wav --output embeddings.npz
212
+
213
+ # Save as CSV (filename, dim_0, dim_1, …)
214
+ audiotimm embed audio/*.wav --output embeddings.csv
215
+
216
+ # ── list / info ──────────────────────────────────────────────────────
217
+ # List all models
218
+ audiotimm list
219
+
220
+ # Filter by wave or task
221
+ audiotimm list --wave M1
222
+ audiotimm list --task tagging
223
+ audiotimm list --family beats
224
+
225
+ # Machine-readable JSON
226
+ audiotimm list --json
227
+
228
+ # Detailed card for one model
229
+ audiotimm info beats-iter3plus-as2m-cpt2
230
+ audiotimm info ast-10-10
231
+
232
+ # ── benchmark ────────────────────────────────────────────────────────
233
+ # Time 20 inference runs and print mean/median/min/max/std
234
+ audiotimm benchmark siren.wav --model panns-cnn14 --runs 20
235
+ audiotimm benchmark siren.wav --model ast-10-10 --device cuda
236
+
237
+ # ── version ──────────────────────────────────────────────────────────
238
+ audiotimm --version
239
+ ```
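
The `benchmark` subcommand reports mean/median/min/max/std over repeated runs. A rough Python equivalent of that measurement loop might look like the sketch below. The `benchmark` function, its `warmup` parameter, and the stand-in workload are assumptions for illustration; the CLI's internals may differ.

```python
import statistics
import time


def benchmark(fn, runs=20, warmup=2):
    """Time `fn()` over `runs` calls and return summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
        "std": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }


# Stand-in workload; in practice this would wrap a call such as clf.predict("siren.wav").
stats = benchmark(lambda: sum(i * i for i in range(10_000)), runs=20)
print({name: round(value, 3) for name, value in stats.items()})
```

`time.perf_counter` is used because it is monotonic and has the highest available resolution, which matters for short inference calls.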
240
+
241
+ ---
242
+
243
+ ## Available Models
244
+
245
+ ### Wave M0: CNN Taggers `(core, no extras)`
246
+
247
+ | Zoo ID | Architecture | SR | Classes | mAP | Notes |
+ |---|---|---|---|---|---|
+ | `panns-cnn14` ⭐ | CNN14 | 32 kHz | 527 | 0.431 | **Default model** |
+ | `panns-cnn14-16k` | CNN14 | 16 kHz | 527 | 0.438 | Slightly higher mAP |
+ | `yamnet` | MobileNetV1 | 16 kHz | 521 | n/a | PyTorch path coming in v0.2 |
252
+
253
+ ### Wave M1: Transformer Taggers `pip install audiotimm[transformers]`
254
+
255
+ | Zoo ID | Architecture | SR | Classes | mAP | Notes |
+ |---|---|---|---|---|---|
+ | `ast-10-10` ⭐ | Audio Spectrogram Transformer | 16 kHz | 527 | 0.459 | Default AST |
+ | `ast-16-16` | AST (larger patches) | 16 kHz | 527 | 0.442 | Faster |
+ | `ast-speechcommands` | AST | 16 kHz | 35 | n/a | Keyword spotting |
+ | `htsat-audioset` | HTS-AT (Swin-style) | 32 kHz | 527 | 0.471 | Also CLAP encoder |
+ | `htsat-desed` | HTS-AT | 32 kHz | n/a | n/a | Sound event detection |
+ | `audiomae-base-ft` | AudioMAE (ViT-Base) | 16 kHz | 527 | 0.473 | Facebook MAE |
+ | `beats-iter3plus-as2m-cpt2` | BEATs | 16 kHz | 527 | 0.486 | SOTA mAP |
264
+
265
+ ### Wave M2: Zero-Shot CLAP `pip install audiotimm[clap]`
266
+
267
+ | Zoo ID | Variant | SR | Notes |
+ |---|---|---|---|
+ | `clap-laion-fused` ⭐ | LAION HTSAT + feature fusion | 48 kHz | Handles long audio |
+ | `clap-laion-unfused` | LAION HTSAT | 48 kHz | |
+ | `clap-laion-music-audioset` | Music + AudioSet trained | 48 kHz | ESC-50 ≈ 90.1% |
+ | `clap-ms-2023` ⭐ | MS-CLAP HTSAT + GPT-2 | 44.1 kHz | Stronger text encoder |
+ | `clap-ms-2022` | MS-CLAP CNN14 + BERT | 44.1 kHz | |
+ | `clap-ms-clapcap` | MS-CLAP + captioning head | 44.1 kHz | Audio → text captions |
275
+
276
+ ### Wave M3: Speech SSL Backbones `pip install audiotimm[speech]`
277
+
278
+ | Zoo ID | Architecture | SR | Output |
+ |---|---|---|---|
+ | `wav2vec2-base` | Wav2Vec2 Base | 16 kHz | Frame embeddings |
+ | `wav2vec2-large-xlsr` | XLS-R 300M (128 languages) | 16 kHz | Multilingual |
+ | `hubert-large-ll60k` | HuBERT Large | 16 kHz | Strong SER backbone |
+ | `wavlm-large` ⭐ | WavLM Large | 16 kHz | Best for speaker tasks |
+ | `wavlm-base-plus-sv` | WavLM + SV head | 16 kHz | Speaker verification |
285
+
286
+ ### Wave M4: Whisper `pip install audiotimm[whisper]`
287
+
288
+ | Zoo ID | Size | Languages | Notes |
+ |---|---|---|---|
+ | `whisper-base` | Base | 99 | Fast, general |
+ | `whisper-large-v3` ⭐ | Large v3 | 99 | Best accuracy |
+ | `whisper-large-v3-turbo` | Large v3 Turbo | 99 | Fast + accurate |
+ | `whisper-distil-large-v3` | Distil Large v3 | 1 (EN) | ~2× faster |
294
+
295
+ ---
296
+
297
+ ## Zero-Shot Classification (Wave M2)
298
+
299
+ Classify audio into **any labels you define**, with no training needed:
300
+
301
+ ```python
302
+ from audiotimm import ZeroShotClassifier  # coming in Phase 3
+
+ zs = ZeroShotClassifier.load("clap-laion-fused")
+ result = zs.classify(
+     "clip.wav",
+     labels=["dog barking", "car horn", "rain", "crowd applause"],
+ )
+ # -> [("rain", 0.81), ("crowd applause", 0.10), ...]
310
+ ```
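
CLAP-style models place audio and text in a shared embedding space, so zero-shot classification reduces to comparing one audio vector against one text vector per label. The sketch below shows that scoring step with a softmax over cosine similarities. It is an illustration under assumptions: `zero_shot_scores` is a hypothetical helper, the `temperature` value is arbitrary, and the toy 2-D vectors stand in for real encoder outputs.

```python
import math


def zero_shot_scores(audio_emb, label_embs, temperature=0.07):
    """Softmax over cosine similarities between one audio vector and each label vector."""

    def unit(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    audio = unit(audio_emb)
    logits = {
        label: sum(a * t for a, t in zip(audio, unit(emb))) / temperature
        for label, emb in label_embs.items()
    }
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(label, value / total) for label, value in ranked]


# Toy 2-D embeddings; real CLAP embeddings have hundreds of dimensions.
labels = {"rain": [0.9, 0.1], "dog barking": [0.1, 0.9]}
print(zero_shot_scores([1.0, 0.2], labels)[0][0])  # rain
```

Because the scores come from a softmax, they sum to 1 across the supplied labels; they rank the candidates against each other rather than giving absolute detection probabilities.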
311
+
312
+ ---
313
+
314
+ ## Plugin API: Register Custom Models
315
+
316
+ ```python
317
+ from audiotimm import register_model
+ from audiotimm.models._base import ModelAdapter
+ from audiotimm.core.registry import ModelSpec
+
+ @register_model("my-bird-net")
+ class BirdNet(ModelAdapter):
+
+     @classmethod
+     def spec(cls):
+         return ModelSpec(
+             name="",  # filled by decorator
+             family="custom",
+             adapter_factory=cls,
+             checkpoint="./weights/birdnet.pt",
+             sample_rate=22050,
+             n_classes=500,
+             embed_dim=512,
+             task="tagging",
+             wave="M0",
+         )
+
+     def predict(self, waveform):
+         ...  # return {label: score} dict
+
+ # Now available everywhere
+ from audiotimm import Classifier
+ clf = Classifier.load("my-bird-net")
344
+ ```
345
+
346
+ ---
347
+
348
+ ## Project Roadmap
349
+
350
+ ```
351
+ Phase 1   ✅  Core engine + PANNs CNN family (Wave M0)
+ Phase 2   ✅  Wave M1: AST, AudioMAE, HTS-AT, BEATs (transformer taggers)
+ Phase 3   ·   Wave M2: CLAP zero-shot (LAION + MS)
+ Phase 4   ·   Embeddings & similarity search
+ Phase 5   ·   Sound Event Detection timeline
+ Phase 6   ·   Wave M3: Wav2Vec2, HuBERT, WavLM speech SSL
+ Phase 7   ·   Training & fine-tuning (Trainer API)
+ Phase 8   ·   Wave M4: Whisper ASR + encoder embeddings
+ Phase 9   ·   Evaluation & explainability (Grad-CAM on mel-spectrogram)
+ Phase 10  ·   Domain packs (bioacoustics, security, health, music, speech)
+ Phase 11  ·   Streaming / real-time inference
+ Phase 12  ·   ONNX / TFLite edge export
+ Phase 13  ·   XenAudio integration + plugin API
364
+ ```
365
+
366
+ ---
367
+
368
+ ## Architecture
369
+
370
+ ```
371
+ audiotimm/
+ ├── core/
+ │   ├── classifier.py   # Classifier.load(), predict(), embed()
+ │   ├── result.py       # PredictionResult, BatchResult
+ │   └── registry.py     # ModelRegistry singleton + @register_model
+ ├── models/
+ │   ├── _base.py        # ModelAdapter ABC
+ │   ├── panns.py        # Wave M0: CNN14 family
+ │   ├── yamnet.py       # Wave M0: YAMNet (stub)
+ │   ├── ast.py          # Wave M1: AST (coming)
+ │   ├── beats.py        # Wave M1: BEATs (coming)
+ │   ├── htsat.py        # Wave M1+M2: HTS-AT (coming)
+ │   ├── audiomae.py     # Wave M1: AudioMAE (coming)
+ │   ├── clap.py         # Wave M2: LAION + MS-CLAP (coming)
+ │   ├── wav2vec2.py     # Wave M3 (coming)
+ │   ├── hubert.py       # Wave M3 (coming)
+ │   ├── wavlm.py        # Wave M3 (coming)
+ │   └── whisper.py      # Wave M4 (coming)
+ ├── utils/
+ │   ├── audio.py        # load_audio(), pad_or_trim()
+ │   └── download.py     # cached downloader (~/.cache/audiotimm/)
+ └── cli.py              # `audiotimm predict` / `audiotimm list`
393
+ ```
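
The tree lists a `pad_or_trim()` helper in `utils/audio.py`. Its exact behavior is not documented here, so the sketch below only illustrates the usual idea of forcing a clip to a fixed number of samples. Right-padding with zeros and trimming from the end are assumptions, and this version works on plain Python lists rather than tensors.

```python
def pad_or_trim(samples, target_len, pad_value=0.0):
    """Return a new list with exactly `target_len` items."""
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    if len(samples) >= target_len:
        return list(samples[:target_len])  # drop everything past target_len
    # Too short: append pad_value until the target length is reached.
    return list(samples) + [pad_value] * (target_len - len(samples))


print(pad_or_trim([0.1, 0.2, 0.3], 5))  # [0.1, 0.2, 0.3, 0.0, 0.0]
print(pad_or_trim([0.1, 0.2, 0.3], 2))  # [0.1, 0.2]
```

Fixed-length inputs make batching straightforward, since every clip in a batch then has the same shape.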
394
+
395
+ ---
396
+
397
+ ## Design Principles
398
+
399
+ - **Lazy everything**: weights download on first `predict()`, not on `import`.
+ - **One result type**: `PredictionResult` everywhere; switching models never breaks your code.
+ - **Lean core**: `torch + torchaudio + numpy` only for the default model; every heavy dep is behind an optional extra.
+ - **Registry-first**: every model is a registry entry; custom models slot in with `@register_model`.
+ - **Immutable results**: `PredictionResult` is read-only; safe to cache and pass around.
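
Two of the principles above, lazy loading and immutable results, can be illustrated together in a few lines. This is a sketch of the general pattern only: `LazyModel` and `Result` are hypothetical names, and the real `Classifier` and `PredictionResult` may be structured differently.

```python
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Result:
    label: str
    score: float
    scores: MappingProxyType  # read-only view of the label -> score mapping


class LazyModel:
    """Defers the expensive load until the first predict() call."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self.load_count = 0

    def predict(self, x):
        if self._model is None:  # first call: load once and remember
            self._model = self._loader()
            self.load_count += 1
        scores = self._model(x)
        label = max(scores, key=scores.get)
        return Result(label, scores[label], MappingProxyType(dict(scores)))


# The loader returns a fake "model" that always produces the same scores.
model = LazyModel(lambda: (lambda x: {"Dog": 0.9, "Cat": 0.1}))
result = model.predict("dog.wav")
model.predict("dog.wav")
print(result.label, model.load_count)  # Dog 1

try:
    result.label = "Cat"
except FrozenInstanceError:
    print("results are read-only")
```

A frozen dataclass blocks attribute reassignment, while `MappingProxyType` blocks in-place edits to the score mapping; both are needed for a result that is safe to cache.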
404
+
405
+ ---
406
+
407
+ ## Contributing
408
+
409
+ ```bash
410
+ git clone https://github.com/shubham10divakar/audiotimm
411
+ cd audiotimm
412
+ pip install -e ".[dev]"
413
+ pytest tests/
414
+ ```
415
+
416
+ ---
417
+
418
+ ## License
419
+
420
+ Apache 2.0. Model weights are subject to their respective upstream licenses; see [PLAN.md](PLAN.md) Appendix A for per-checkpoint license notes.
421
+
422
+ ---
423
+
424
+ <div align="center">
425
+ <sub>Built with ❤️ · <b>audiotimm: Teach Machines to Listen.</b></sub>
426
+ </div>