libseraph-0.1.1-py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- libseraph-0.1.1.dist-info/METADATA +157 -0
- libseraph-0.1.1.dist-info/RECORD +28 -0
- libseraph-0.1.1.dist-info/WHEEL +5 -0
- libseraph-0.1.1.dist-info/entry_points.txt +2 -0
- libseraph-0.1.1.dist-info/licenses/LICENSE +20 -0
- libseraph-0.1.1.dist-info/top_level.txt +1 -0
- seraph/__init__.py +3 -0
- seraph/exec/__init__.py +0 -0
- seraph/exec/audio.py +975 -0
- seraph/exec/classes.py +456 -0
- seraph/exec/integrations.py +296 -0
- seraph/exec/meta.py +308 -0
- seraph/exec/provenance.py +211 -0
- seraph/exec/prune.py +215 -0
- seraph/exec/splits.py +303 -0
- seraph/exec/version.py +296 -0
- seraph/lib/__init__.py +9 -0
- seraph/lib/author.py +174 -0
- seraph/lib/common.py +146 -0
- seraph/lib/data/media_types/audio.csv +164 -0
- seraph/lib/data/media_types/image.csv +87 -0
- seraph/lib/data/media_types/text.csv +98 -0
- seraph/lib/data/media_types/video.csv +97 -0
- seraph/lib/dataset.py +404 -0
- seraph/lib/history.py +318 -0
- seraph/lib/license.py +134 -0
- seraph/lib/media_type.py +121 -0
- seraph/main.py +41 -0
libseraph-0.1.1.dist-info/METADATA
ADDED
@@ -0,0 +1,157 @@
Metadata-Version: 2.4
Name: libseraph
Version: 0.1.1
Summary: A multimedia dataset management tool for ML training
Author-email: Ryan Quinn <ryan.quinn@certusinnovations.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Stonewall-Defense/libseraph
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click~=8.2
Requires-Dist: numpy~=2.3
Requires-Dist: requests~=2.32
Requires-Dist: rich~=14.1
Requires-Dist: soundfile~=0.13
Requires-Dist: tqdm~=4.67
Requires-Dist: tinytag~=2.1
Requires-Dist: torch~=2.8
Requires-Dist: torchaudio~=2.8
Requires-Dist: As-A-Person~=0.1
Dynamic: license-file

# libseraph

A hot new dataset management tool that's crazy easy!

## Motivation

There is no generally accepted metadata standard for multimedia data, and no tooling for multimedia dataset (meta)data management. At the outset of TEAM-ML, creating a training dataset was an error-prone process that typically required 8-20 hours of work from an ML expert, at a total cost of $1,200-$3,000.

To expedite the creation and refinement of training datasets, we developed an absolute minimum metadata standard for our multimedia datasets and a management tool that covers all our common use cases.

In our experience, it requires 0.5-3 hours to prepare a component dataset for Seraph management with a bespoke script. This is a non-recurring cost that depends on the original dataset format and the delta between that format and the Seraph format. Once all component datasets are configured, a training dataset can be assembled in 5-30 _minutes_, depending on (1) the user’s understanding of the tooling and the desired end composition and (2) whether the component dataset(s) can be copied in as-is or if the data needs to be resampled/resized/etc. Seraph is particularly useful for creating special-purpose or exploratory datasets from existing components; we were able to create a “9mm Parabellum Cartridge Dataset” at a cost of <$10.

Seraph currently supports only audio datasets, as anything else is out of scope for TEAM-ML.

## Installation

```bash
conda create --name seraph python=3.12
conda activate seraph

pip install -r requirements.txt
pip install .

conda deactivate
```

### Compatibility Note

Other Python ML libraries from Certus Innovations use PyTorch 2.10 and the `torchcodec` library for loading audio. Unfortunately, the `torchcodec` library does not support enough options to save files for `seraph`, so it currently must rely on `torchaudio`, which limits the PyTorch version to 2.8. Using Conda or `venv`, this isn't too hard to work around, but we are actively working on a path to upgrade this library so that our packages are compatible with each other.

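The note above ties `seraph` to PyTorch 2.8, while the package metadata declares `torch~=2.8`. Under PEP 440, `~=2.8` is shorthand for `>=2.8, ==2.*`, so the specifier by itself would also accept 2.9 or 2.10; `~=2.8.0` is the form that stays within 2.8.x. The sketch below illustrates that rule for plain dotted numeric versions only. The function name is hypothetical, and real tooling should rely on the `packaging` library rather than this simplification.

```python
def satisfies_compatible_release(version: str, spec: str) -> bool:
    """Return True if `version` satisfies the PEP 440 clause `~=spec`.

    Simplified sketch: handles only plain dotted integers
    (no pre-, post-, or dev-release suffixes).
    """
    v = [int(part) for part in version.split(".")]
    s = [int(part) for part in spec.split(".")]
    if len(s) < 2:
        raise ValueError("~= needs at least two release segments")
    # Pad with zeros so that 2.8 and 2.8.0 compare as equal.
    width = max(len(v), len(s))
    v_padded = v + [0] * (width - len(v))
    s_padded = s + [0] * (width - len(s))
    # Lower bound: the version must be at least the spec.
    if v_padded < s_padded:
        return False
    # Prefix match: every spec segment except the last must match exactly.
    prefix = s[:-1]
    return v_padded[: len(prefix)] == prefix


for candidate in ("2.8.0", "2.8.1", "2.10.0", "3.0.0"):
    print(
        candidate,
        satisfies_compatible_release(candidate, "2.8"),
        satisfies_compatible_release(candidate, "2.8.0"),
    )
```

If the intent is to stay on the 2.8 series, installing into a dedicated environment with an explicit `torch==2.8.*` constraint is one way to keep a resolver from selecting a newer 2.x release.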
## Usage

The most used features of the Seraph tool are:

- Audio
  - Import audio data from other datasets, with class selection and exclusion
  - Generate duration metadata
  - Clip audio data to a set length while preserving original track identity data
  - Resample audio
  - Prune empty audio files
- Classes
  - Switch class columns
  - Rename, merge, regex merge, and drop classes by name
  - Check class balance, including by fold/split
  - Compose class metadata from existing column(s)
- Metadata
  - Initialize a new seraph dataset
  - Verify all data items against the dataset contract specified in the metadata file
- Provenance
  - Prototype OpenIRIS integration for showing and submitting provenance
- Prune
  - Remove records with no corresponding files and vice versa
  - Drop data by row value
  - Drop metadata columns
- Splits
  - Automatically generate train/test/validate splits or cross-validation folds with respect to class balance and optionally avoid pseudoreplication
- Version
  - Prototype dataset version management by at least one [community standard](https://github.com/dslp/dslp/blob/main/semantic-versioning.md)
- Integrations
  - Prototype Fuel AI metadata format export

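The Splits feature above combines two constraints: keep each class balanced across splits, and avoid pseudoreplication by never letting related items (for example, clips cut from the same original track) land in different splits. The following is a minimal sketch of one way to do that, not Seraph's actual algorithm. The function name, the record layout, and the greedy strategy are all assumptions made for illustration, and each group is assumed to carry a single class label.

```python
import random
from collections import defaultdict


def grouped_stratified_split(records, fractions, seed=0):
    """Assign each record to a split while keeping groups intact.

    records: list of (item_id, class_label, group_id) tuples.
    fractions: dict such as {"train": 0.8, "test": 0.2}.
    Returns a dict mapping item_id -> split name.
    """
    # Collect items per group; a group is never divided between splits.
    groups = defaultdict(list)
    label_of = {}
    for item_id, label, group_id in records:
        groups[group_id].append(item_id)
        label_of.setdefault(group_id, label)  # assumes one class per group

    # Bucket the groups by class so each class is balanced separately.
    by_class = defaultdict(list)
    for group_id, items in groups.items():
        by_class[label_of[group_id]].append(items)

    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_class):
        class_groups = by_class[label]
        rng.shuffle(class_groups)
        # Place larger groups first so the greedy fill stays near the target.
        class_groups.sort(key=len, reverse=True)
        total = sum(len(items) for items in class_groups)
        counts = {name: 0 for name in fractions}
        for items in class_groups:
            # Choose the split that is furthest below its target item count.
            name = max(fractions, key=lambda n: fractions[n] * total - counts[n])
            counts[name] += len(items)
            for item_id in items:
                assignment[item_id] = name
    return assignment


# Two classes, five source tracks each, two clips per track.
demo = [
    (f"{cls}-{track}-{clip}", cls, f"{cls}-track{track}")
    for cls in ("9x19", "45acp")
    for track in range(5)
    for clip in range(2)
]
split = grouped_stratified_split(demo, {"train": 0.8, "test": 0.2})
print(sorted(set(split.values())))
```

Because whole groups are assigned at once, the realized proportions only approximate the requested fractions when groups are large or few.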
### Examples

```bash
# Activate environment
conda activate seraph

# Initialize new dataset
seraph meta init

# Import audio datasets
seraph audio import --import_dir ~/Desktop/Kaggle_Gunshots/
seraph audio import --import_dir ~/Desktop/Cadre_Forensics/ --channel_merge_strat mix_down --sample_rate_merge_strat mix_down

# Switch classes from `gun_type` to `caliber`
seraph classes switch --new_class_col caliber --new_name_for_current_class_col gun_type

# Merge degenerate classes
seraph classes merge --target_class_name 9x19 --classes_to_merge "9mm Luger" --classes_to_merge "9mm"

# Add durations to columns and clip to 1 sec
seraph audio duration --metadata_column_conflict_strat replace
seraph audio clip --clip_duration_secs 1 --dry_run

# Show provenance data (WIP)
seraph prov show
seraph prov submit --activity_label "Make new gunshot dataset"

# Show versioning data (WIP)
seraph version show

# Cleanup
conda deactivate
```

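To make the `classes merge` step in the example above concrete, here is a rough sketch of what merging by name, and the regex variant listed under Usage, might do to a column of class labels. The function names are hypothetical and this is not Seraph's internal code; it only illustrates the relabeling idea on a plain Python list.

```python
import re


def merge_classes(labels, target, classes_to_merge):
    """Relabel every class listed in `classes_to_merge` as `target`."""
    to_merge = set(classes_to_merge)
    return [target if label in to_merge else label for label in labels]


def regex_merge_classes(labels, target, pattern):
    """Relabel every class whose name matches `pattern` as `target`."""
    compiled = re.compile(pattern)
    return [target if compiled.search(label) else label for label in labels]


labels = ["9mm Luger", "9mm", ".45 ACP", "9x19", "9mm"]
print(merge_classes(labels, "9x19", ["9mm Luger", "9mm"]))
print(regex_merge_classes(labels, "9x19", r"^9\s?mm"))
```

Listing names explicitly is the safer choice when class names overlap; a regex is convenient when many spelling variants map to one class, at the cost of possible unintended matches.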
## Testing

```bash
python3 -m coverage run -m unittest discover -s test -p "*_test.py" && python -m coverage report --skip-covered
python -m coverage html
```

### Tests to Write

- No Coverage
  - integrations
  - provenance
- Partial Coverage
  - meta
  - version

## Feature Wish-List

- **IDEMPOTENCE**
  - Prevent a dataset from being "double-tapped"
- Pipe dreams
  - Undo

## Versioning

We use [SemVer](http://semver.org/) for versioning. For the versions available, see the [tags on this repository](https://github.com/Stonewall-Defense/libseraph/tags).

## Authors

- **Ryan Quinn** - _Initial work_

## License

MIT.
libseraph-0.1.1.dist-info/RECORD
ADDED
@@ -0,0 +1,28 @@
libseraph-0.1.1.dist-info/licenses/LICENSE,sha256=ZGxGE7xzaYRti8pi4TDgcCHpYwx5QjObSWpUp7H-AE0,1084
seraph/__init__.py,sha256=SjD55C4w8RGyn2wqcaCsEM6qjlC9FIAYBLT8BdkAJ84,256
seraph/main.py,sha256=cdMLthF0bGp8mT8SMhmNxDyiBpszwOhUkDYpPhAbVKc,1242
seraph/exec/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
seraph/exec/audio.py,sha256=uWFwf3OXCuVPco5TRUaJl4eJkHXepsT-JFnTw6YFQ74,41844
seraph/exec/classes.py,sha256=CKrpBKWE7fiJfd_n0QlygAHsGCtOovtOO9ysLH7m1J4,16500
seraph/exec/integrations.py,sha256=AN7FHrNfnvADeWp1QpMpI73bbgzqJAXSm6aFT_V5ROQ,10225
seraph/exec/meta.py,sha256=Loyy4wO3LH4Ngx_BnYH2OKN8JQ0uEGdSlLbm8ImC4B8,11699
seraph/exec/provenance.py,sha256=IZwI7swGXU1qbCcT8F0y1zULeCf7bTGa_ZnPUyKpZII,7121
seraph/exec/prune.py,sha256=-7UvkNEra5kEi5YozU-fxsUvP4hZnaPeplit3L_IBpc,8404
seraph/exec/splits.py,sha256=I0P3HwJFpC9ccfS-uC65n3wpLVMiNW_dVCAwM_L-pGA,11605
seraph/exec/version.py,sha256=nEKj9Yb9Yp7UAgycHUAZgYJpzAEn6tKkpAhIh6uo4R4,10883
seraph/lib/__init__.py,sha256=ZHZAOQgxeJ7z7o8i1QO-u4YRq5VhQYZnKhlT-XWiiXI,973
seraph/lib/author.py,sha256=kCVuR0eiv9VLIJj_bGoUTfe9dRWgPrQ-7HzPjE_nI1c,4865
seraph/lib/common.py,sha256=JwxIpLYOWWbze4jjqSr6Os45dCftTzMw11sF3L3oJ8Y,4260
seraph/lib/dataset.py,sha256=4PLdNLPecgSG2YTRF-dSr5MiPp5glvju7oHOdgHJylM,14786
seraph/lib/history.py,sha256=O-_wY2Z5de3RMx4x29wtNOaglgx5ISGnl_QZineBDtU,11088
seraph/lib/license.py,sha256=zglBCLMGy2uAVqYisF0jf8FXqp5ypCimCDpVxUisPq4,4778
seraph/lib/media_type.py,sha256=kkjlB2ae73FYGkpsDrmxRja0S0GdzoW7t_caLKEq570,4379
seraph/lib/data/media_types/audio.csv,sha256=4oCPCQwvjcH5C1ROmVWsXEefJoh9cOtrusU_TFscOe0,6476
seraph/lib/data/media_types/image.csv,sha256=d6xxX3p_bWNuL0xQmEUsQQiHiAqoJTp9ESgDUQhjICo,4080
seraph/lib/data/media_types/text.csv,sha256=Qxprf_vZnZp5Qq0K2xkk7HZ7M7YFEpxMhRXWFsLgAd0,4527
seraph/lib/data/media_types/video.csv,sha256=Zf62LvoW-xNUyVH9zWVA7MUvkvdmi_nSUEDjc3MH2LM,4336
libseraph-0.1.1.dist-info/METADATA,sha256=TG1TSOZp9EKNCWBMfnrv70cZjlJmlQzi_BqBknTQ9UI,5773
libseraph-0.1.1.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
libseraph-0.1.1.dist-info/entry_points.txt,sha256=gtcSZlnkiVTdSvjppidnGgms2s2MGxRL5AmUiRkGkMc,43
libseraph-0.1.1.dist-info/top_level.txt,sha256=WRG4cJzgKBBsFPMc6BtV8kXeOIIFLnq9Z07pbu9ZDdE,7
libseraph-0.1.1.dist-info/RECORD,,
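Each RECORD entry above follows the wheel format: the file path, then `sha256=` followed by the URL-safe base64 encoding of the SHA-256 digest with trailing `=` padding removed, then the size in bytes. As a sketch of how a single entry could be recomputed for verification (the helper name `record_entry` is hypothetical, not part of any packaging tool):

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build one RECORD line: path, unpadded URL-safe base64 SHA-256, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# The empty seraph/exec/__init__.py is the easiest entry to check by hand.
print(record_entry("seraph/exec/__init__.py", b""))
```

For the zero-byte `seraph/exec/__init__.py`, this reproduces the entry listed above; the other entries would need the actual file contents from the wheel. The RECORD file's own line carries no hash or size, which is why it ends in two bare commas.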
libseraph-0.1.1.dist-info/licenses/LICENSE
ADDED
@@ -0,0 +1,20 @@
The MIT License (MIT)
Copyright (c) 2026 Certus Innovations

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
libseraph-0.1.1.dist-info/top_level.txt
ADDED
@@ -0,0 +1 @@
seraph
seraph/__init__.py
ADDED
seraph/exec/__init__.py
ADDED
File without changes