libseraph 0.1.1__py3-none-any.whl

@@ -0,0 +1,157 @@
1
+ Metadata-Version: 2.4
2
+ Name: libseraph
3
+ Version: 0.1.1
4
+ Summary: A multimedia dataset management tool for ML training
5
+ Author-email: Ryan Quinn <ryan.quinn@certusinnovations.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/Stonewall-Defense/libseraph
8
+ Classifier: Development Status :: 4 - Beta
9
+ Classifier: Environment :: Console
10
+ Classifier: Intended Audience :: Developers
11
+ Classifier: Natural Language :: English
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Programming Language :: Python :: 3
14
+ Requires-Python: >=3.12
15
+ Description-Content-Type: text/markdown
16
+ License-File: LICENSE
17
+ Requires-Dist: click~=8.2
18
+ Requires-Dist: numpy~=2.3
19
+ Requires-Dist: requests~=2.32
20
+ Requires-Dist: rich~=14.1
21
+ Requires-Dist: soundfile~=0.13
22
+ Requires-Dist: tqdm~=4.67
23
+ Requires-Dist: tinytag~=2.1
24
+ Requires-Dist: torch~=2.8
25
+ Requires-Dist: torchaudio~=2.8
26
+ Requires-Dist: As-A-Person~=0.1
27
+ Dynamic: license-file
28
+
29
+ # libseraph
30
+
31
+ A hot new dataset management tool that's crazy easy!
32
+
33
+ ## Motivation
34
+
35
+ There is no generally accepted metadata standard for multimedia data, and no tooling for multimedia dataset (meta)data management. At the outset of TEAM-ML, creating a training dataset was an error-prone process that typically required 8-20 hours of work from an ML expert, at a total cost of $1,200-$3,000.
36
+
37
+ To expedite the creation and refinement of training datasets, we developed an absolute minimum metadata standard for our multimedia datasets and a management tool that covers all our common use cases.
38
+
39
+ In our experience, preparing a component dataset for Seraph management with a bespoke script takes 0.5-3 hours. This is a non-recurring cost that depends on how far the original dataset format is from the Seraph format. Once all component datasets are configured, a training dataset can be assembled in 5-30 _minutes_, depending on (1) the user’s understanding of the tooling and the desired end composition and (2) whether the component dataset(s) can be copied in as-is or must first be resampled, resized, etc. Seraph is particularly useful for creating special-purpose or exploratory datasets from existing components; we were able to create a “9mm Parabellum Cartridge Dataset” at a cost of <$10.
40
+
41
+ Seraph currently supports only audio datasets, as anything else is out of scope for TEAM-ML.
42
+
43
+ ## Installation
44
+
45
+ ```bash
46
+ conda create --name seraph python=3.12
47
+ conda activate seraph
48
+
49
+ pip install -r requirements.txt
50
+ pip install .
51
+
52
+ conda deactivate
53
+ ```
54
+
55
+ ### Compatibility Note
56
+
57
+ Other Python ML libraries from Certus Innovations use PyTorch 2.10 and the `torchcodec` library for loading audio. Unfortunately, `torchcodec` does not yet support enough file-saving options for `seraph`, so `seraph` currently relies on `torchaudio`, which limits the PyTorch version to 2.8. Installing `seraph` in its own Conda or `venv` environment works around the conflict, and we are actively working on an upgrade path that will make our packages compatible.
58
+
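The dependency pins in the metadata above (for example `torch~=2.8`) are PEP 440 compatible-release clauses. As a rough illustration of how such a clause is evaluated, the sketch below implements a simplified check that handles plain numeric versions only; the function names are hypothetical and are not part of `seraph`. Under PEP 440, a two-segment clause such as `~=2.8` means `>=2.8, ==2.*`, so it also admits later 2.x releases, whereas a three-segment clause such as `~=2.8.0` stays within 2.8.x.

```python
def parse_release(version: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as '2.8.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(version: str, spec: str) -> bool:
    """Return True if `version` satisfies the clause `~=spec`.

    Simplified sketch: numeric release segments only (no pre-, post-,
    dev-, or local-version parts).
    """
    target = parse_release(spec)
    if len(target) < 2:
        raise ValueError("~= requires at least two release segments")
    candidate = parse_release(version)

    # Everything except the last segment of the spec must match exactly.
    prefix = target[:-1]

    # Pad with zeros so that '2.8' and '2.8.0' compare as equal.
    width = max(len(candidate), len(target))
    candidate_padded = candidate + (0,) * (width - len(candidate))
    target_padded = target + (0,) * (width - len(target))

    return (
        candidate_padded >= target_padded
        and candidate_padded[: len(prefix)] == prefix
    )
```

For instance, `satisfies_compatible_release("2.10.0", "2.8")` is true under these rules, while `satisfies_compatible_release("2.10.0", "2.8.0")` is false.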
59
+ ## Usage
60
+
61
+ The most used features of the Seraph tool are:
62
+
63
+ - Audio
64
+   - Import audio data from other datasets, including allowing class selection and exclusion
65
+   - Generate duration metadata
66
+   - Clip audio data to a set length while preserving original track identity data
67
+   - Resample audio
68
+   - Prune empty audio files
69
+ - Classes
70
+   - Switch class columns
71
+   - Rename, merge, regex merge, and drop classes by name
72
+   - Check class balance, including by fold/split
73
+   - Compose class metadata from existing column(s)
74
+ - Metadata
75
+   - Initialize a new seraph dataset
76
+   - Verify all data items against the dataset contract specified in the metadata file
77
+ - Provenance
78
+   - Prototype OpenIRIS integration for showing and submitting provenance
79
+ - Prune
80
+   - Remove records with no corresponding files and vice versa
81
+   - Drop data by row value
82
+   - Drop metadata columns
83
+ - Splits
84
+   - Automatically generate train/test/validate splits or cross-validation folds with respect to class balance and optionally avoid pseudoreplication
85
+ - Version
86
+   - Prototype dataset version management by at least one [community standard](https://github.com/dslp/dslp/blob/main/semantic-versioning.md)
87
+ - Integrations
88
+   - Prototype Fuel AI metadata format export
89
+
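The "Splits" feature above combines two ideas: keeping the class balance similar across splits, and avoiding pseudoreplication, which arises when clips cut from the same original track end up in different splits. The following is a minimal sketch of that idea, not Seraph's actual implementation. The record keys `class` and `track_id`, the function name, and the default fractions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def grouped_stratified_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each record to 'train', 'test', or 'validate'.

    records: list of dicts with hypothetical keys 'class' and 'track_id'.
    All clips sharing a track_id land in the same split, and tracks are
    distributed class by class so each split roughly mirrors the class
    balance. A track is assumed to belong to exactly one class.
    """
    names = ("train", "test", "validate")
    rng = random.Random(seed)

    # Collect the distinct tracks available for each class.
    tracks_by_class = defaultdict(set)
    for rec in records:
        tracks_by_class[rec["class"]].add(rec["track_id"])

    split_of_track = {}
    for cls in sorted(tracks_by_class):
        tracks = sorted(tracks_by_class[cls])
        rng.shuffle(tracks)
        train_end = round(len(tracks) * fractions[0])
        test_end = train_end + round(len(tracks) * fractions[1])
        for i, track in enumerate(tracks):
            if i < train_end:
                split_of_track[track] = names[0]
            elif i < test_end:
                split_of_track[track] = names[1]
            else:
                split_of_track[track] = names[2]

    # Every clip inherits the split of its source track.
    return [split_of_track[rec["track_id"]] for rec in records]
```

Splitting by track rather than by clip is the design choice that matters here: a random per-clip split would look balanced but would let near-duplicate audio leak from training into evaluation.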
90
+ ### Examples
91
+
92
+ ```bash
93
+ # Activate environment
94
+ conda activate seraph
95
+
96
+ # Initialize new dataset
97
+ seraph meta init
98
+
99
+ # Import audio datasets
100
+ seraph audio import --import_dir ~/Desktop/Kaggle_Gunshots/
101
+ seraph audio import --import_dir ~/Desktop/Cadre_Forensics/ --channel_merge_strat mix_down --sample_rate_merge_strat mix_down
102
+
103
+ # Switch classes from `gun_type` to `caliber`
104
+ seraph classes switch --new_class_col caliber --new_name_for_current_class_col gun_type
105
+
106
+ # Merge degenerate classes
107
+ seraph classes merge --target_class_name 9x19 --classes_to_merge "9mm Luger" --classes_to_merge "9mm"
108
+
109
+ # Add durations to columns and clip to 1 sec
110
+ seraph audio duration --metadata_column_conflict_strat replace
111
+ seraph audio clip --clip_duration_secs 1 --dry_run
112
+
113
+ # Show provenance data (WIP)
114
+ seraph prov show
115
+ seraph prov submit --activity_label "Make new gunshot dataset"
116
+
117
+ # Show versioning data (WIP)
118
+ seraph version show
119
+
120
+ # Cleanup
121
+ conda deactivate
122
+ ```
123
+
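To make the `seraph classes merge` step above more concrete, here is a small sketch of what merging class labels amounts to, by explicit name or by regular expression. It works on a plain list of labels and does not reflect Seraph's internal data model; the function name and its parameters are hypothetical.

```python
import re


def merge_classes(labels, target, classes_to_merge=(), pattern=None):
    """Return a copy of `labels` with selected class names replaced by `target`.

    A label is replaced if it is listed in `classes_to_merge` or if the
    optional regular expression `pattern` matches the whole label.
    """
    names = set(classes_to_merge)
    regex = re.compile(pattern) if pattern else None
    merged = []
    for label in labels:
        if label in names or (regex and regex.fullmatch(label)):
            merged.append(target)
        else:
            merged.append(label)
    return merged
```

With the labels from the example, `merge_classes(["9mm Luger", "9mm", ".45 ACP"], "9x19", ["9mm Luger", "9mm"])` folds both 9mm variants into `9x19` and leaves `.45 ACP` untouched.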
124
+ ## Testing
125
+
126
+ ```bash
127
+ python -m coverage run -m unittest discover -s test -p "*_test.py" && python -m coverage report --skip-covered
128
+ python -m coverage html
129
+ ```
130
+
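The discovery command above collects files under `test/` whose names match `*_test.py`. As a sketch of what such a file could look like, the example below tests a small stand-in helper defined in the same file. Both the file name `test/clip_bounds_test.py` and the `clip_bounds` helper are hypothetical and are not taken from the Seraph code base.

```python
# test/clip_bounds_test.py (hypothetical name; matches the "*_test.py" pattern)
import unittest


def clip_bounds(total_secs: float, clip_secs: float) -> list[tuple[float, float]]:
    """Hypothetical helper: split a track into full-length (start, end) clips."""
    if clip_secs <= 0:
        raise ValueError("clip_secs must be positive")
    count = int(total_secs // clip_secs)
    return [(i * clip_secs, (i + 1) * clip_secs) for i in range(count)]


class ClipBoundsTest(unittest.TestCase):
    def test_drops_trailing_partial_clip(self):
        # A 2.5 s track yields two full 1 s clips; the last 0.5 s is dropped.
        self.assertEqual(clip_bounds(2.5, 1.0), [(0.0, 1.0), (1.0, 2.0)])

    def test_rejects_non_positive_length(self):
        with self.assertRaises(ValueError):
            clip_bounds(2.5, 0)
```

In a real test module the helper would be imported from the package under test rather than defined alongside the test case.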
131
+ ### Tests to Write
132
+
133
+ - No Coverage
134
+   - integrations
135
+   - provenance
136
+ - Partial Coverage
137
+   - meta
138
+   - version
139
+
140
+ ## Feature Wish-List
141
+
142
+ - **IDEMPOTENCE**
143
+   - Prevent a dataset from being "double-tapped"
144
+ - Pipe dreams
145
+   - Undo
146
+
147
+ ## Versioning
148
+
149
+ We use [SemVer](https://semver.org/) for versioning. For the versions available, see the [tags on this repository](https://github.com/Stonewall-Defense/libseraph/tags).
150
+
151
+ ## Authors
152
+
153
+ - **Ryan Quinn** - _Initial work_
154
+
155
+ ## License
156
+
157
+ MIT.
@@ -0,0 +1,28 @@
1
+ libseraph-0.1.1.dist-info/licenses/LICENSE,sha256=ZGxGE7xzaYRti8pi4TDgcCHpYwx5QjObSWpUp7H-AE0,1084
2
+ seraph/__init__.py,sha256=SjD55C4w8RGyn2wqcaCsEM6qjlC9FIAYBLT8BdkAJ84,256
3
+ seraph/main.py,sha256=cdMLthF0bGp8mT8SMhmNxDyiBpszwOhUkDYpPhAbVKc,1242
4
+ seraph/exec/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
5
+ seraph/exec/audio.py,sha256=uWFwf3OXCuVPco5TRUaJl4eJkHXepsT-JFnTw6YFQ74,41844
6
+ seraph/exec/classes.py,sha256=CKrpBKWE7fiJfd_n0QlygAHsGCtOovtOO9ysLH7m1J4,16500
7
+ seraph/exec/integrations.py,sha256=AN7FHrNfnvADeWp1QpMpI73bbgzqJAXSm6aFT_V5ROQ,10225
8
+ seraph/exec/meta.py,sha256=Loyy4wO3LH4Ngx_BnYH2OKN8JQ0uEGdSlLbm8ImC4B8,11699
9
+ seraph/exec/provenance.py,sha256=IZwI7swGXU1qbCcT8F0y1zULeCf7bTGa_ZnPUyKpZII,7121
10
+ seraph/exec/prune.py,sha256=-7UvkNEra5kEi5YozU-fxsUvP4hZnaPeplit3L_IBpc,8404
11
+ seraph/exec/splits.py,sha256=I0P3HwJFpC9ccfS-uC65n3wpLVMiNW_dVCAwM_L-pGA,11605
12
+ seraph/exec/version.py,sha256=nEKj9Yb9Yp7UAgycHUAZgYJpzAEn6tKkpAhIh6uo4R4,10883
13
+ seraph/lib/__init__.py,sha256=ZHZAOQgxeJ7z7o8i1QO-u4YRq5VhQYZnKhlT-XWiiXI,973
14
+ seraph/lib/author.py,sha256=kCVuR0eiv9VLIJj_bGoUTfe9dRWgPrQ-7HzPjE_nI1c,4865
15
+ seraph/lib/common.py,sha256=JwxIpLYOWWbze4jjqSr6Os45dCftTzMw11sF3L3oJ8Y,4260
16
+ seraph/lib/dataset.py,sha256=4PLdNLPecgSG2YTRF-dSr5MiPp5glvju7oHOdgHJylM,14786
17
+ seraph/lib/history.py,sha256=O-_wY2Z5de3RMx4x29wtNOaglgx5ISGnl_QZineBDtU,11088
18
+ seraph/lib/license.py,sha256=zglBCLMGy2uAVqYisF0jf8FXqp5ypCimCDpVxUisPq4,4778
19
+ seraph/lib/media_type.py,sha256=kkjlB2ae73FYGkpsDrmxRja0S0GdzoW7t_caLKEq570,4379
20
+ seraph/lib/data/media_types/audio.csv,sha256=4oCPCQwvjcH5C1ROmVWsXEefJoh9cOtrusU_TFscOe0,6476
21
+ seraph/lib/data/media_types/image.csv,sha256=d6xxX3p_bWNuL0xQmEUsQQiHiAqoJTp9ESgDUQhjICo,4080
22
+ seraph/lib/data/media_types/text.csv,sha256=Qxprf_vZnZp5Qq0K2xkk7HZ7M7YFEpxMhRXWFsLgAd0,4527
23
+ seraph/lib/data/media_types/video.csv,sha256=Zf62LvoW-xNUyVH9zWVA7MUvkvdmi_nSUEDjc3MH2LM,4336
24
+ libseraph-0.1.1.dist-info/METADATA,sha256=TG1TSOZp9EKNCWBMfnrv70cZjlJmlQzi_BqBknTQ9UI,5773
25
+ libseraph-0.1.1.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
26
+ libseraph-0.1.1.dist-info/entry_points.txt,sha256=gtcSZlnkiVTdSvjppidnGgms2s2MGxRL5AmUiRkGkMc,43
27
+ libseraph-0.1.1.dist-info/top_level.txt,sha256=WRG4cJzgKBBsFPMc6BtV8kXeOIIFLnq9Z07pbu9ZDdE,7
28
+ libseraph-0.1.1.dist-info/RECORD,,
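Each RECORD line above has the form `path,sha256=<digest>,size`, where the digest is the URL-safe base64 encoding of the file's SHA-256 hash with the trailing `=` padding removed. The sketch below shows how one entry could be checked against file contents. The function names are illustrative, and the line parsing is simplified: it assumes the path contains no quoted commas.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the 'sha256=...' field as it is written in a wheel RECORD file."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def matches_record_line(line: str, data: bytes) -> bool:
    """Check file contents against one 'path,hash,size' RECORD line."""
    _path, expected_hash, expected_size = line.rsplit(",", 2)
    if not expected_hash:
        # RECORD lists itself with an empty hash and size; nothing to verify.
        return True
    return record_digest(data) == expected_hash and len(data) == int(expected_size)
```

The entry for `seraph/exec/__init__.py` is a convenient sanity check: its size is 0, and its digest is the SHA-256 hash of empty input, so `matches_record_line(line, b"")` holds for that line.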
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (82.0.1)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ seraph = seraph.main:cli
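The `entry_points.txt` file above uses an INI-style layout: each line in `[console_scripts]` maps a command name to a `module:attribute` target, so installing the wheel creates a `seraph` command that calls `cli` from `seraph.main`. The sketch below parses that layout with the standard library only. The helper name `console_scripts` is an assumption for illustration; installers and `importlib.metadata` have their own parsers.

```python
import configparser

ENTRY_POINTS_TXT = """\
[console_scripts]
seraph = seraph.main:cli
"""


def console_scripts(text: str) -> dict[str, tuple[str, str]]:
    """Map each console script name to its (module, attribute) target."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(text)

    scripts = {}
    for name, target in parser.items("console_scripts"):
        module, _, attr = target.partition(":")
        scripts[name] = (module.strip(), attr.strip())
    return scripts
```

Calling `console_scripts(ENTRY_POINTS_TXT)` on the contents shown in this diff returns a single entry mapping `seraph` to the module `seraph.main` and the attribute `cli`.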
@@ -0,0 +1,20 @@
1
+ The MIT License (MIT)
2
+ Copyright (c) 2026 Certus Innovations
3
+
4
+ Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ of this software and associated documentation files (the "Software"), to deal
6
+ in the Software without restriction, including without limitation the rights
7
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ copies of the Software, and to permit persons to whom the Software is
9
+ furnished to do so, subject to the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be included in all
12
+ copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ SOFTWARE.
@@ -0,0 +1 @@
1
+ seraph
seraph/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ from .lib import read_csv, read_json, write_csv, write_json # noqa
2
+ from .lib import DatasetAuthor, SeraphMetadata, SeraphDataset, SeraphMetadataError # noqa
3
+ from .lib import HistoryManager, ChangeType, VersionBumpType, ImportRecord, ChangeRecord # noqa
File without changes