bdm-tool 0.2__tar.gz

bdm_tool-0.2/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Andrei Khobnia
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
bdm_tool-0.2/PKG-INFO ADDED
@@ -0,0 +1,405 @@
1
+ Metadata-Version: 2.4
2
+ Name: bdm-tool
3
+ Version: 0.2
4
+ Summary: Simple lightweight dataset versioning utility based purely on the file system and symbolic links
5
+ Author-email: Andrei Khobnia <andrei.khobnia@gmail.com>
6
+ License-Expression: MIT
7
+ Keywords: version-control,data-versioning,versioning,machine-learning,ai,data,developer-tools
8
+ Classifier: Operating System :: OS Independent
9
+ Classifier: Development Status :: 4 - Beta
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.9
12
+ Classifier: Programming Language :: Python :: 3.10
13
+ Classifier: Programming Language :: Python :: 3.11
14
+ Classifier: Programming Language :: Python :: 3.12
15
+ Requires-Python: >=3.9
16
+ Description-Content-Type: text/markdown
17
+ License-File: LICENSE
18
+ Provides-Extra: docs
19
+ Requires-Dist: sphinx; extra == "docs"
20
+ Requires-Dist: sphinx-rtd-theme; extra == "docs"
21
+ Dynamic: license-file
22
+
23
+ # BDM Tool
24
+ __BDM__ (Big Dataset Management) Tool is a __simple__, lightweight dataset versioning utility based purely on the file system and symbolic links.
25
+
26
+ BDM Tool Features:
27
+ * __No full downloads required__: Switch to any dataset version without downloading the entire dataset to your local machine.
28
+ * __Independent of external VCS__: Does not rely on external version control systems like Git or Mercurial, and does not require integrating with one.
29
+ * __Easy dataset sharing__: Supports sharing datasets via remote file systems on a data server.
30
+ * __Fast version switching__: Switching between dataset versions does not require long synchronization processes.
31
+ * __Transparent version access__: Different dataset versions are accessed through simple and intuitive paths (e.g., dataset/v1.0/, dataset/v2.0/, etc.), making versioning fully transparent to configuration files, MLflow parameters, and other tooling.
32
+ * __Storage optimization__: Efficiently stores multiple dataset versions using symbolic links to avoid duplication.
33
+ * __Designed for large, complex datasets__: Well-suited for managing big datasets with intricate directory and subdirectory structures.
34
+ * __Python API for automation__: Provides a simple Python API to automatically create new dataset versions within MLOps pipelines, workflows, ETL jobs, and other automated processes.
35
+
36
+ ## General Principles
37
+ * Each version of a dataset is a path like `dataset/v1.0/`, `dataset/v2.0/`.
38
+ * A new dataset version is generated whenever modifications are made.
39
+ * Each dataset version is immutable and read-only.
40
+ * A new version includes only the files that have been added or modified, while unchanged files and directories are stored as symbolic links.
41
+ * Each version contains a readme.txt file with a summary of changes.
42
+
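The principles above can be illustrated with a minimal sketch of symlink-based versioning. It is not the implementation used by `bdm-tool`; the function `link_unchanged` and the choice to work only on top-level entries are assumptions made for illustration. It shows how a new version directory can hold real copies of changed entries while pointing at the previous version for everything else.

```python
import os
import tempfile
from pathlib import Path


def link_unchanged(prev_version: Path, new_version: Path, changed: set) -> int:
    """Fill new_version with relative symlinks to every top-level entry of
    prev_version whose name is not in `changed`. Returns the link count.
    (Hypothetical helper, not part of bdm-tool.)"""
    new_version.mkdir(parents=True, exist_ok=True)
    linked = 0
    for entry in sorted(prev_version.iterdir()):
        if entry.name in changed:
            continue  # the caller stores fresh copies of changed entries
        # Relative target such as ../v0.1/data keeps the dataset relocatable.
        target = os.path.relpath(entry, start=new_version)
        (new_version / entry.name).symlink_to(target)
        linked += 1
    return linked


with tempfile.TemporaryDirectory() as root:
    v1 = Path(root) / "dataset" / "v0.1"
    (v1 / "data").mkdir(parents=True)
    (v1 / "data" / "image01.png").write_bytes(b"old")
    (v1 / "annotation").mkdir()
    (v1 / "annotation" / "regions01.json").write_text("{}")

    v2 = Path(root) / "dataset" / "v0.2"
    count = link_unchanged(v1, v2, changed={"annotation"})
    # Only the changed directory is stored as real files in the new version.
    (v2 / "annotation").mkdir()
    (v2 / "annotation" / "regions01.json").write_text('{"updated": true}')

    print(count, (v2 / "data").is_symlink(), os.readlink(v2 / "data"))
```

Relative link targets are used here because they match the form of the links shown in the directory listings of this README (for example `data -> ../v0.2/data`).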
43
+ ## Installation
44
+ ### Installation from PyPI (Recommended)
45
+ Use `pip` to install the tool with the following command:
46
+ ```shell
47
+ pip install bdm-tool
48
+ ```
49
+
50
+ ### Installation from Sources
51
+ Use `pip` to install the tool directly from the Git repository with the following command:
52
+ ```shell
53
+ pip install git+https://github.com/aikho/bdm-tool.git
54
+ ```
55
+
56
+ ## Usage
57
+ ### Start Versioning Dataset
58
+ Let's assume we have a dataset with the following structure:
59
+ ```shell
60
+ tree testdata
61
+ testdata
62
+ ├── annotation
63
+ │ ├── part01
64
+ │ │ ├── regions01.json
65
+ │ │ ├── regions02.json
66
+ │ │ ├── regions03.json
67
+ │ │ ├── regions04.json
68
+ │ │ └── regions05.json
69
+ │ ├── part02
70
+ │ │ ├── regions01.json
71
+ │ │ ├── regions02.json
72
+ │ │ ├── regions03.json
73
+ │ │ ├── regions04.json
74
+ │ │ └── regions05.json
75
+ │ └── part03
76
+ │ ├── regions01.json
77
+ │ ├── regions02.json
78
+ │ ├── regions03.json
79
+ │ ├── regions04.json
80
+ │ └── regions05.json
81
+ └── data
82
+ ├── part01
83
+ │ ├── image01.png
84
+ │ ├── image02.png
85
+ │ ├── image03.png
86
+ │ ├── image04.png
87
+ │ └── image05.png
88
+ ├── part02
89
+ │ ├── image01.png
90
+ │ ├── image02.png
91
+ │ ├── image03.png
92
+ │ ├── image04.png
93
+ │ └── image05.png
94
+ └── part03
95
+ ├── image01.png
96
+ ├── image02.png
97
+ ├── image03.png
98
+ ├── image04.png
99
+ └── image05.png
100
+
101
+ 9 directories, 30 files
102
+ ```
103
+ To put it under `bdm-tool` version control, use the `bdm init` command:
104
+ ```shell
105
+ bdm init testdata
106
+ Version v0.1 of dataset has been created.
107
+ Files added: 3, updated: 0, removed: 0, symlinked: 0
108
+ ```
109
+ The first version `v0.1` of the dataset has been created. Let’s take a look at the file structure:
110
+ ```shell
111
+ tree testdata
112
+ testdata
113
+ ├── current -> ./v0.1
114
+ └── v0.1
115
+ ├── annotation
116
+ │ ├── part01
117
+ │ │ ├── regions01.json
118
+ │ │ ├── regions02.json
119
+ │ │ ├── regions03.json
120
+ │ │ ├── regions04.json
121
+ │ │ └── regions05.json
122
+ │ ├── part02
123
+ │ │ ├── regions01.json
124
+ │ │ ├── regions02.json
125
+ │ │ ├── regions03.json
126
+ │ │ ├── regions04.json
127
+ │ │ └── regions05.json
128
+ │ └── part03
129
+ │ ├── regions01.json
130
+ │ ├── regions02.json
131
+ │ ├── regions03.json
132
+ │ ├── regions04.json
133
+ │ └── regions05.json
134
+ ├── data
135
+ │ ├── part01
136
+ │ │ ├── image01.png
137
+ │ │ ├── image02.png
138
+ │ │ ├── image03.png
139
+ │ │ ├── image04.png
140
+ │ │ └── image05.png
141
+ │ ├── part02
142
+ │ │ ├── image01.png
143
+ │ │ ├── image02.png
144
+ │ │ ├── image03.png
145
+ │ │ ├── image04.png
146
+ │ │ └── image05.png
147
+ │ └── part03
148
+ │ ├── image01.png
149
+ │ ├── image02.png
150
+ │ ├── image03.png
151
+ │ ├── image04.png
152
+ │ └── image05.png
153
+ └── readme.txt
154
+
155
+ 11 directories, 31 files
156
+ ```
157
+ We can see that version `v0.1` contains all the initial files along with a `readme.txt` file. Let’s take a look inside `readme.txt`:
158
+ ```shell
159
+ cat testdata/v0.1/readme.txt
160
+ Dataset version v0.1 has been created!
161
+ Created timestamp: 2023-08-07 19:40:19.498656, OS user: rock-star-ml-engineer
162
+ Files added: 2, updated: 0, removed: 0, symlinked: 0
163
+
164
+ Files added:
165
+ annotation/
166
+ data/
167
+ ```
168
+ The file shows the creation date, operating system user, relevant statistics, and a summary of performed operations.
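Because the statistics line has a regular shape, an automated job could read it back from `readme.txt`. The following is a small sketch under that assumption; the helper `parse_stats` and the regular expression are illustrative and are not part of `bdm-tool`, and the line format is taken only from the sample output shown above.

```python
import re

# Pattern derived from the sample line "Files added: 2, updated: 0, ...".
STATS_RE = re.compile(
    r"Files added: (\d+), updated: (\d+), removed: (\d+), symlinked: (\d+)"
)


def parse_stats(readme_text: str) -> dict:
    """Return the four counters from the first statistics line found."""
    match = STATS_RE.search(readme_text)
    if match is None:
        raise ValueError("no statistics line found")
    keys = ("added", "updated", "removed", "symlinked")
    return dict(zip(keys, map(int, match.groups())))


sample = """Dataset version v0.1 has been created!
Created timestamp: 2023-08-07 19:40:19.498656, OS user: rock-star-ml-engineer
Files added: 2, updated: 0, removed: 0, symlinked: 0
"""
print(parse_stats(sample))
# {'added': 2, 'updated': 0, 'removed': 0, 'symlinked': 0}
```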
169
+
170
+ ### Add New Files
171
+ Suppose we have additional data stored in the `new_data` directory:
172
+ ```shell
173
+ tree new_data/
174
+ new_data/
175
+ ├── annotation
176
+ │   ├── regions06.json
177
+ │   └── regions07.json
178
+ └── data
179
+ ├── image06.png
180
+ └── image07.png
181
+
182
+ 2 directories, 4 files
183
+ ```
184
+ New files can be added to a new dataset version using the `bdm change` command. Use the `--add` flag to add individual files, or `--add-all` to add all files from a specified directory:
185
+ ```shell
186
+ bdm change --add_all new_data/annotation/:annotation/part03/ --add_all new_data/data/:data/part03/ -c -m "add new files" testdata
187
+ Version v0.2 of dataset has been created.
188
+ Files added: 4, updated: 0, removed: 0, symlinked: 14
189
+ ```
190
+ The `:` character is used as a separator between the source path and the target subpath inside the dataset where the files should be added.
191
+
192
+ The `-c` flag stands for copy: when it is used, source files are copied into the new version instead of being moved. Moving files can be faster, so you may prefer to omit `-c` for performance reasons.
193
+
194
+ The `-m` flag allows you to add a message, which is then stored in the `readme.txt` file of the new dataset version.
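In an automated pipeline, the same command line can be assembled programmatically. The sketch below only builds the argument list and prints it. It assumes the `bdm` executable is on `PATH` and uses the flags exactly as spelled in the example command above (`--add_all`, `-c`, `-m`). The helper `build_change_command` is hypothetical and is not the package's Python API.

```python
import shlex
import subprocess


def build_change_command(dataset, add_all=(), message=None, copy=True):
    """Assemble the argv list for `bdm change`; nothing is executed here.
    (Hypothetical helper; flag spellings follow the README example.)"""
    argv = ["bdm", "change"]
    for source, target in add_all:
        # ':' separates the source path from the target subpath in the dataset.
        argv += ["--add_all", f"{source}:{target}"]
    if copy:
        argv.append("-c")
    if message is not None:
        argv += ["-m", message]
    argv.append(dataset)
    return argv


cmd = build_change_command(
    "testdata",
    add_all=[("new_data/annotation/", "annotation/part03/"),
             ("new_data/data/", "data/part03/")],
    message="add new files",
)
print(shlex.join(cmd))
# To run it for real (requires bdm-tool to be installed):
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with messages that contain spaces.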
195
+
196
+ Let’s take a look inside the `readme.txt` file of the new version:
197
+ ```shell
198
+ cat testdata/current/readme.txt
199
+ Dataset version v0.2 has been created from previous version v0.1!
200
+ add new files
201
+ Created timestamp: 2023-08-07 19:38:39.758828, OS user: rock-star-ml-engineer
202
+ Files added: 4, updated: 0, removed: 0, symlinked: 14
203
+
204
+ Files added:
205
+ annotation/part03/regions06.json
206
+ annotation/part03/regions07.json
207
+ data/part03/image06.png
208
+ data/part03/image07.png
209
+ ```
210
+ Next, let’s examine the updated file structure:
211
+ ```shell
212
+ tree testdata
213
+ testdata
214
+ ├── current -> ./v0.2
215
+ ├── v0.1
216
+ │ ├── annotation
217
+ │ │ ├── part01
218
+ │ │ │ ├── regions01.json
219
+ │ │ │ ├── regions02.json
220
+ │ │ │ ├── regions03.json
221
+ │ │ │ ├── regions04.json
222
+ │ │ │ └── regions05.json
223
+ │ │ ├── part02
224
+ │ │ │ ├── regions01.json
225
+ │ │ │ ├── regions02.json
226
+ │ │ │ ├── regions03.json
227
+ │ │ │ ├── regions04.json
228
+ │ │ │ └── regions05.json
229
+ │ │ └── part03
230
+ │ │ ├── regions01.json
231
+ │ │ ├── regions02.json
232
+ │ │ ├── regions03.json
233
+ │ │ ├── regions04.json
234
+ │ │ └── regions05.json
235
+ │ ├── data
236
+ │ │ ├── part01
237
+ │ │ │ ├── image01.png
238
+ │ │ │ ├── image02.png
239
+ │ │ │ ├── image03.png
240
+ │ │ │ ├── image04.png
241
+ │ │ │ └── image05.png
242
+ │ │ ├── part02
243
+ │ │ │ ├── image01.png
244
+ │ │ │ ├── image02.png
245
+ │ │ │ ├── image03.png
246
+ │ │ │ ├── image04.png
247
+ │ │ │ └── image05.png
248
+ │ │ └── part03
249
+ │ │ ├── image01.png
250
+ │ │ ├── image02.png
251
+ │ │ ├── image03.png
252
+ │ │ ├── image04.png
253
+ │ │ └── image05.png
254
+ │ └── readme.txt
255
+ └── v0.2
256
+ ├── annotation
257
+ │ ├── part01 -> ../../v0.1/annotation/part01
258
+ │ ├── part02 -> ../../v0.1/annotation/part02
259
+ │ └── part03
260
+ │ ├── regions01.json -> ../../../v0.1/annotation/part03/regions01.json
261
+ │ ├── regions02.json -> ../../../v0.1/annotation/part03/regions02.json
262
+ │ ├── regions03.json -> ../../../v0.1/annotation/part03/regions03.json
263
+ │ ├── regions04.json -> ../../../v0.1/annotation/part03/regions04.json
264
+ │ ├── regions05.json -> ../../../v0.1/annotation/part03/regions05.json
265
+ │ ├── regions06.json
266
+ │ └── regions07.json
267
+ ├── data
268
+ │ ├── part01 -> ../../v0.1/data/part01
269
+ │ ├── part02 -> ../../v0.1/data/part02
270
+ │ └── part03
271
+ │ ├── image01.png -> ../../../v0.1/data/part03/image01.png
272
+ │ ├── image02.png -> ../../../v0.1/data/part03/image02.png
273
+ │ ├── image03.png -> ../../../v0.1/data/part03/image03.png
274
+ │ ├── image04.png -> ../../../v0.1/data/part03/image04.png
275
+ │ ├── image05.png -> ../../../v0.1/data/part03/image05.png
276
+ │ ├── image06.png
277
+ │ └── image07.png
278
+ └── readme.txt
279
+
280
+ 20 directories, 46 files
281
+ ```
282
+
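The layout above can also be inspected from Python using only the standard library. The sketch below is an illustration rather than part of `bdm-tool`; the helpers `current_version` and `count_entries` are hypothetical names. It reads the target of the `current` link and counts real files and symbolic links inside one version directory. Counting the `v0.2` tree shown above in this way gives 14 symbolic links, in line with the reported `symlinked: 14`.

```python
import os
import tempfile
from pathlib import Path


def current_version(dataset: Path) -> str:
    """Name of the version directory that the `current` symlink points to."""
    return Path(os.readlink(dataset / "current")).name


def count_entries(version_dir: Path) -> tuple:
    """Walk one version without following symlinks.
    Returns (real_files, symlinks); plain directories are not counted."""
    real_files = symlinks = 0
    for dirpath, dirnames, filenames in os.walk(version_dir, followlinks=False):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                symlinks += 1
            elif path.is_file():
                real_files += 1
    return real_files, symlinks


with tempfile.TemporaryDirectory() as root:
    # Tiny stand-in for the dataset layout shown above.
    dataset = Path(root) / "testdata"
    (dataset / "v0.1" / "data").mkdir(parents=True)
    (dataset / "v0.1" / "data" / "image01.png").write_bytes(b"png")
    (dataset / "v0.2").mkdir()
    (dataset / "v0.2" / "data").symlink_to("../v0.1/data")
    (dataset / "v0.2" / "readme.txt").write_text("demo")
    (dataset / "current").symlink_to("./v0.2")

    print(current_version(dataset), count_entries(dataset / "v0.2"))
```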
283
+ ### Update Files
284
+ Files can be updated in a new dataset version using the `bdm change` command. Use the `--update` flag to update individual files, or `--update-all` to update all files in a given directory:
285
+ ```shell
286
+ bdm change --update data_update/regions05.json:annotation/part03/ -c -m "update" testdata
287
+ Version v0.3 of dataset has been created.
288
+ Files added: 0, updated: 1, removed: 0, symlinked: 9
289
+ ```
290
+ Let’s take a look inside the `readme.txt` file of the new version:
291
+ ```shell
292
+ cat testdata/current/readme.txt
293
+ Dataset version v0.3 has been created from previous version v0.2!
294
+ update
295
+ Created timestamp: 2023-08-07 19:40:01.753345, OS user: rock-star-data-scientist
296
+ Files added: 0, updated: 1, removed: 0, symlinked: 9
297
+
298
+ Files updated:
299
+ annotation/part03/regions05.json
300
+ ```
301
+ Let’s take a look at the file structure:
302
+ ```shell
303
+ tree testdata
304
+ testdata
305
+ ├── current -> ./v0.3
306
+ ├── v0.1
307
+ │ ├── annotation
308
+ │ │ ├── part01
309
+ │ │ │ ├── regions01.json
310
+ │ │ │ ├── regions02.json
311
+ │ │ │ ├── regions03.json
312
+ │ │ │ ├── regions04.json
313
+ │ │ │ └── regions05.json
314
+ │ │ ├── part02
315
+ │ │ │ ├── regions01.json
316
+ │ │ │ ├── regions02.json
317
+ │ │ │ ├── regions03.json
318
+ │ │ │ ├── regions04.json
319
+ │ │ │ └── regions05.json
320
+ │ │ └── part03
321
+ │ │ ├── regions01.json
322
+ │ │ ├── regions02.json
323
+ │ │ ├── regions03.json
324
+ │ │ ├── regions04.json
325
+ │ │ └── regions05.json
326
+ │ ├── data
327
+ │ │ ├── part01
328
+ │ │ │ ├── image01.png
329
+ │ │ │ ├── image02.png
330
+ │ │ │ ├── image03.png
331
+ │ │ │ ├── image04.png
332
+ │ │ │ └── image05.png
333
+ │ │ ├── part02
334
+ │ │ │ ├── image01.png
335
+ │ │ │ ├── image02.png
336
+ │ │ │ ├── image03.png
337
+ │ │ │ ├── image04.png
338
+ │ │ │ └── image05.png
339
+ │ │ └── part03
340
+ │ │ ├── image01.png
341
+ │ │ ├── image02.png
342
+ │ │ ├── image03.png
343
+ │ │ ├── image04.png
344
+ │ │ └── image05.png
345
+ │ └── readme.txt
346
+ ├── v0.2
347
+ │ ├── annotation
348
+ │ │ ├── part01 -> ../../v0.1/annotation/part01
349
+ │ │ ├── part02 -> ../../v0.1/annotation/part02
350
+ │ │ └── part03
351
+ │ │ ├── regions01.json -> ../../../v0.1/annotation/part03/regions01.json
352
+ │ │ ├── regions02.json -> ../../../v0.1/annotation/part03/regions02.json
353
+ │ │ ├── regions03.json -> ../../../v0.1/annotation/part03/regions03.json
354
+ │ │ ├── regions04.json -> ../../../v0.1/annotation/part03/regions04.json
355
+ │ │ ├── regions05.json -> ../../../v0.1/annotation/part03/regions05.json
356
+ │ │ ├── regions06.json
357
+ │ │ └── regions07.json
358
+ │ ├── data
359
+ │ │ ├── part01 -> ../../v0.1/data/part01
360
+ │ │ ├── part02 -> ../../v0.1/data/part02
361
+ │ │ └── part03
362
+ │ │ ├── image01.png -> ../../../v0.1/data/part03/image01.png
363
+ │ │ ├── image02.png -> ../../../v0.1/data/part03/image02.png
364
+ │ │ ├── image03.png -> ../../../v0.1/data/part03/image03.png
365
+ │ │ ├── image04.png -> ../../../v0.1/data/part03/image04.png
366
+ │ │ ├── image05.png -> ../../../v0.1/data/part03/image05.png
367
+ │ │ ├── image06.png
368
+ │ │ └── image07.png
369
+ │ └── readme.txt
370
+ └── v0.3
371
+ ├── annotation
372
+ │ ├── part01 -> ../../v0.2/annotation/part01
373
+ │ ├── part02 -> ../../v0.2/annotation/part02
374
+ │ └── part03
375
+ │ ├── regions01.json -> ../../../v0.2/annotation/part03/regions01.json
376
+ │ ├── regions02.json -> ../../../v0.2/annotation/part03/regions02.json
377
+ │ ├── regions03.json -> ../../../v0.2/annotation/part03/regions03.json
378
+ │ ├── regions04.json -> ../../../v0.2/annotation/part03/regions04.json
379
+ │ ├── regions05.json
380
+ │ ├── regions06.json -> ../../../v0.2/annotation/part03/regions06.json
381
+ │ └── regions07.json -> ../../../v0.2/annotation/part03/regions07.json
382
+ ├── data -> ../v0.2/data
383
+ └── readme.txt
384
+
385
+ 26 directories, 54 files
386
+ ```
387
+
388
+ ### Remove Files
389
+ Files or directories can be removed from the dataset using the `bdm change` command with the `--remove` flag:
390
+ ```shell
391
+ bdm change --remove annotation/part01/regions05.json --remove annotation/part01/regions04.json -c -m "remove obsolete data" testdata
392
+ Version v0.4 of dataset has been created.
393
+ Files added: 0, updated: 0, removed: 2, symlinked: 8
394
+
395
+ ```
396
+ ### Combining Operations
397
+ Adding, updating, and removing operations can be freely combined within a single dataset version. Use the `bdm change -h` command to get detailed information on the available flags and options:
398
+ ```shell
399
+ bdm change -h
400
+ ```
401
+
402
+ ## License
403
+ See the `LICENSE` file in the repository.
404
+
405
+