dvc-utils 0.0.8__tar.gz → 0.1.0__tar.gz
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/PKG-INFO +108 -4
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/README.md +107 -3
- dvc-utils-0.1.0/dvc_utils/__init__.py +2 -0
- dvc-utils-0.0.8/dvc_utils/main.py → dvc-utils-0.1.0/dvc_utils/cli.py +26 -84
- dvc-utils-0.1.0/dvc_utils/path.py +60 -0
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/dvc_utils.egg-info/PKG-INFO +108 -4
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/dvc_utils.egg-info/SOURCES.txt +3 -1
- dvc-utils-0.1.0/dvc_utils.egg-info/entry_points.txt +3 -0
- dvc-utils-0.1.0/dvc_utils.egg-info/requires.txt +4 -0
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/setup.py +4 -3
- dvc-utils-0.0.8/dvc_utils/__init__.py +0 -0
- dvc-utils-0.0.8/dvc_utils.egg-info/entry_points.txt +0 -3
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/LICENSE +0 -0
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/dvc_utils.egg-info/dependency_links.txt +0 -0
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/dvc_utils.egg-info/top_level.txt +0 -0
- {dvc-utils-0.0.8 → dvc-utils-0.1.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
|
|
1
1
|
Metadata-Version: 2.1
|
2
2
|
Name: dvc-utils
|
3
|
-
Version: 0.0
|
3
|
+
Version: 0.1.0
|
4
4
|
Summary: CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
|
5
5
|
Home-page: https://github.com/runsascoded/dvc-utils
|
6
6
|
Author: Ryan Williams
|
@@ -10,17 +10,20 @@ Description-Content-Type: text/markdown
|
|
10
10
|
License-File: LICENSE
|
11
11
|
|
12
12
|
# dvc-utils
|
13
|
-
|
13
|
+
Diff [DVC] files, optionally piping through other commands first.
|
14
|
+
|
15
|
+
[][PyPI]
|
14
16
|
|
15
17
|
<!-- toc -->
|
16
18
|
- [Installation](#installation)
|
17
19
|
- [Usage](#usage)
|
18
20
|
- [`dvc-diff`](#dvc-diff)
|
19
21
|
- [Examples](#examples)
|
20
|
-
- [Parquet
|
22
|
+
- [Parquet](#parquet-diff)
|
21
23
|
- [Schema diff](#parquet-schema-diff)
|
22
24
|
- [Row diff](#parquet-row-diff)
|
23
25
|
- [Row count diff](#parquet-row-count-diff)
|
26
|
+
- [GZipped CSVs](#csv-gz)
|
24
27
|
<!-- /toc -->
|
25
28
|
|
26
29
|
## Installation <a id="installation"></a>
|
@@ -83,7 +86,7 @@ dvc-diff --help
|
|
83
86
|
|
84
87
|
## Examples <a id="examples"></a>
|
85
88
|
|
86
|
-
### Parquet
|
89
|
+
### Parquet <a id="parquet-diff"></a>
|
87
90
|
See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
|
88
91
|
|
89
92
|
Setup:
|
@@ -222,11 +225,112 @@ dvc-diff -r $commit^..$commit parquet_row_count $path
|
|
222
225
|
|
223
226
|
This time we get no output; [the given `$commit`][commit] didn't change the row count in the DVC-tracked Parquet file [`$path`][commit path].
|
224
227
|
|
228
|
+
### GZipped CSVs <a id="csv-gz"></a>
|
229
|
+
|
230
|
+
Here's a "one-liner" I used in [ctbk.dev][ctbk.dev gh], to normalize and compare headers of `.csv.gz.dvc` files between two commits:
|
231
|
+
|
232
|
+
```bash
|
233
|
+
# Save some `sed` substitution commands to file `seds`:
|
234
|
+
cat <<EOF >seds
|
235
|
+
s/station_//
|
236
|
+
s/latitude/lat/
|
237
|
+
s/longitude/lng/
|
238
|
+
s/starttime/started_at/
|
239
|
+
s/stoptime/ended_at/
|
240
|
+
s/usertype/member_casual/
|
241
|
+
EOF
|
242
|
+
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
|
243
|
+
r=c0..c1
|
244
|
+
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
|
245
|
+
gdno $r s3/ctbk/csvs/ | \
|
246
|
+
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
|
247
|
+
```
|
248
|
+
|
249
|
+
<details>
|
250
|
+
<summary>
|
251
|
+
Explanation of aliases
|
252
|
+
</summary>
|
253
|
+
|
254
|
+
- [`gdno`] (`git diff --name-only`): list files changed in the given commit range and directory
|
255
|
+
- [`pel`]: [`parallel`] alias that prepends an `echo {}` to the command
|
256
|
+
- [`ddcr`] (`dvc-diff -cr`): colorized `diff` output, revision range `$r`
|
257
|
+
- [`guc`] (`gunzip -c`): uncompress the `.csv.gz` files
|
258
|
+
- [`h1`] (`head -n1`): only examine each file's header line
|
259
|
+
- [`spc`] (`tr , $'\n'`): **sp**lit the header line by **c**ommas (so each column name will be on one line, for easier `diff`ing below)
|
260
|
+
- [`kq`] (`tr -d '"'`): **k**ill **q**uote characters (in this case, header-column name quoting changed, but I don't care about that)
|
261
|
+
- [`kcr`] (`tr -d '\r'`): **k**ill **c**arriage **r**eturns (line endings also changed)
|
262
|
+
- [`snc`] (`sed -f 'snake_case.sed'`): snake-case column names
|
263
|
+
- [`sdf`] (`sed -f`): execute the `sed` substitution commands defined in the `seds` file above
|
264
|
+
- `sort`: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
|
265
|
+
|
266
|
+
Note:
|
267
|
+
- Most of these are exported Bash functions, allowing them to be used inside the [`parallel`] command.
|
268
|
+
- I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the `seds` commands).
|
269
|
+
</details>
|
270
|
+
|
271
|
+
Example output:
|
272
|
+
```diff
|
273
|
+
…
|
274
|
+
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
|
275
|
+
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
|
276
|
+
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
|
277
|
+
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
|
278
|
+
1,2d0
|
279
|
+
< bikeid
|
280
|
+
< birth_year
|
281
|
+
8d5
|
282
|
+
< gender
|
283
|
+
9a7,8
|
284
|
+
> ride_id
|
285
|
+
> rideable_type
|
286
|
+
15d13
|
287
|
+
< tripduration
|
288
|
+
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
|
289
|
+
1,2d0
|
290
|
+
< bikeid
|
291
|
+
< birth_year
|
292
|
+
8d5
|
293
|
+
< gender
|
294
|
+
9a7,8
|
295
|
+
> ride_id
|
296
|
+
> rideable_type
|
297
|
+
15d13
|
298
|
+
< tripduration
|
299
|
+
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
|
300
|
+
1,2d0
|
301
|
+
< bikeid
|
302
|
+
< birth_year
|
303
|
+
8d5
|
304
|
+
< gender
|
305
|
+
9a7,8
|
306
|
+
> ride_id
|
307
|
+
> rideable_type
|
308
|
+
15d13
|
309
|
+
< tripduration
|
310
|
+
…
|
311
|
+
```
|
312
|
+
|
313
|
+
This helped me see that the data update in question (`c0..c1`) dropped some fields (`bikeid`, `birth_year`, `gender`, `tripduration`) and added others (`ride_id`, `rideable_type`), for `202001` and later.
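For readers without those shell aliases, the same header normalization can be sketched in Python. This is an illustration only, not part of dvc-utils: `SUBS`, `normalize_header`, and `header_changes` are hypothetical names, the substitutions mirror the `seds` file above, and the snake-casing step (`snc`) is left out because the contents of `snake_case.sed` are not shown here.

```python
from typing import List, Set, Tuple

# Substitutions copied from the `seds` file above, applied in the same order.
SUBS = [
    ("station_", ""),
    ("latitude", "lat"),
    ("longitude", "lng"),
    ("starttime", "started_at"),
    ("stoptime", "ended_at"),
    ("usertype", "member_casual"),
]


def normalize_header(header_line: str) -> List[str]:
    """Split a CSV header line into sorted, normalized column names."""
    # Drop quotes and carriage returns (the `kq` and `kcr` steps).
    cleaned = header_line.replace('"', "").replace("\r", "").rstrip("\n")
    cols = []
    # Split on commas (the `spc` step), then rewrite each column name.
    for col in cleaned.split(","):
        # Like `sed 's/a/b/'`, replace only the first match per column.
        for pattern, replacement in SUBS:
            col = col.replace(pattern, replacement, 1)
        cols.append(col)
    # Sorting hides column reordering, as the final `sort` step does.
    return sorted(cols)


def header_changes(before: str, after: str) -> Tuple[Set[str], Set[str]]:
    """Return (dropped, added) column names between two header lines."""
    old = set(normalize_header(before))
    new = set(normalize_header(after))
    return old - new, new - old
```

Calling `header_changes` on the first line of each version of a file would give the dropped and added columns directly, at the cost of losing the per-line `diff` context shown in the example output.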
|
314
|
+
|
225
315
|
[DVC]: https://dvc.org/
|
316
|
+
[PyPI]: https://pypi.org/project/dvc-utils/
|
226
317
|
[`parquet2json`]: https://github.com/jupiter/parquet2json
|
227
318
|
[hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
|
228
319
|
[Parquet]: https://parquet.apache.org/
|
229
320
|
[commit]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7
|
230
321
|
[commit path]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7#diff-7f812dce61e0996354f4af414203e0933ccdfe9613cb406c40c1c41a14b9769c
|
231
322
|
[hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
|
323
|
+
[ctbk.dev gh]: https://github.com/neighbor-ryan/ctbk.dev
|
232
324
|
[`jq`]: https://jqlang.github.io/jq/
|
325
|
+
[`parallel`]: https://www.gnu.org/software/parallel/
|
326
|
+
|
327
|
+
[`gdno`]: https://github.com/ryan-williams/git-helpers/blob/96560df1406f41676f293becefb423895a755faf/diff/.gitconfig#L31
|
328
|
+
[`pel`]: https://github.com/ryan-williams/parallel-helpers/blob/e7ee109c4085c04036840ea78999cff73fcf9502/.parallel-rc#L6-L17
|
329
|
+
[`ddcr`]: https://github.com/ryan-williams/aws-helpers/blob/8a314f1e6b336833c772459de6b739f5c06a51a3/.dvc-rc#L84
|
330
|
+
[`guc`]: https://github.com/ryan-williams/zip-helpers/blob/c67d84fb06c0ab3609dacb68d900344d3b8e8f04/.zip-rc#L16
|
331
|
+
[`h1`]: https://github.com/ryan-williams/head-tail-helpers/blob/9715690f47ceeff6b6948b2093901f2b0830114b/.head-tail-rc#L3
|
332
|
+
[`spc`]: https://github.com/ryan-williams/col-helpers/blob/9493d003224249ee240d023f71ab03bdd4174b88/.cols-rc#L8
|
333
|
+
[`kq`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L115
|
334
|
+
[`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
|
335
|
+
[`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
|
336
|
+
[`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
|
@@ -1,15 +1,18 @@
|
|
1
1
|
# dvc-utils
|
2
|
-
|
2
|
+
Diff [DVC] files, optionally piping through other commands first.
|
3
|
+
|
4
|
+
[][PyPI]
|
3
5
|
|
4
6
|
<!-- toc -->
|
5
7
|
- [Installation](#installation)
|
6
8
|
- [Usage](#usage)
|
7
9
|
- [`dvc-diff`](#dvc-diff)
|
8
10
|
- [Examples](#examples)
|
9
|
-
- [Parquet
|
11
|
+
- [Parquet](#parquet-diff)
|
10
12
|
- [Schema diff](#parquet-schema-diff)
|
11
13
|
- [Row diff](#parquet-row-diff)
|
12
14
|
- [Row count diff](#parquet-row-count-diff)
|
15
|
+
- [GZipped CSVs](#csv-gz)
|
13
16
|
<!-- /toc -->
|
14
17
|
|
15
18
|
## Installation <a id="installation"></a>
|
@@ -72,7 +75,7 @@ dvc-diff --help
|
|
72
75
|
|
73
76
|
## Examples <a id="examples"></a>
|
74
77
|
|
75
|
-
### Parquet
|
78
|
+
### Parquet <a id="parquet-diff"></a>
|
76
79
|
See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
|
77
80
|
|
78
81
|
Setup:
|
@@ -211,11 +214,112 @@ dvc-diff -r $commit^..$commit parquet_row_count $path
|
|
211
214
|
|
212
215
|
This time we get no output; [the given `$commit`][commit] didn't change the row count in the DVC-tracked Parquet file [`$path`][commit path].
|
213
216
|
|
217
|
+
### GZipped CSVs <a id="csv-gz"></a>
|
218
|
+
|
219
|
+
Here's a "one-liner" I used in [ctbk.dev][ctbk.dev gh], to normalize and compare headers of `.csv.gz.dvc` files between two commits:
|
220
|
+
|
221
|
+
```bash
|
222
|
+
# Save some `sed` substitution commands to file `seds`:
|
223
|
+
cat <<EOF >seds
|
224
|
+
s/station_//
|
225
|
+
s/latitude/lat/
|
226
|
+
s/longitude/lng/
|
227
|
+
s/starttime/started_at/
|
228
|
+
s/stoptime/ended_at/
|
229
|
+
s/usertype/member_casual/
|
230
|
+
EOF
|
231
|
+
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
|
232
|
+
r=c0..c1
|
233
|
+
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
|
234
|
+
gdno $r s3/ctbk/csvs/ | \
|
235
|
+
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
|
236
|
+
```
|
237
|
+
|
238
|
+
<details>
|
239
|
+
<summary>
|
240
|
+
Explanation of aliases
|
241
|
+
</summary>
|
242
|
+
|
243
|
+
- [`gdno`] (`git diff --name-only`): list files changed in the given commit range and directory
|
244
|
+
- [`pel`]: [`parallel`] alias that prepends an `echo {}` to the command
|
245
|
+
- [`ddcr`] (`dvc-diff -cr`): colorized `diff` output, revision range `$r`
|
246
|
+
- [`guc`] (`gunzip -c`): uncompress the `.csv.gz` files
|
247
|
+
- [`h1`] (`head -n1`): only examine each file's header line
|
248
|
+
- [`spc`] (`tr , $'\n'`): **sp**lit the header line by **c**ommas (so each column name will be on one line, for easier `diff`ing below)
|
249
|
+
- [`kq`] (`tr -d '"'`): **k**ill **q**uote characters (in this case, header-column name quoting changed, but I don't care about that)
|
250
|
+
- [`kcr`] (`tr -d '\r'`): **k**ill **c**arriage **r**eturns (line endings also changed)
|
251
|
+
- [`snc`] (`sed -f 'snake_case.sed'`): snake-case column names
|
252
|
+
- [`sdf`] (`sed -f`): execute the `sed` substitution commands defined in the `seds` file above
|
253
|
+
- `sort`: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
|
254
|
+
|
255
|
+
Note:
|
256
|
+
- Most of these are exported Bash functions, allowing them to be used inside the [`parallel`] command.
|
257
|
+
- I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the `seds` commands).
|
258
|
+
</details>
|
259
|
+
|
260
|
+
Example output:
|
261
|
+
```diff
|
262
|
+
…
|
263
|
+
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
|
264
|
+
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
|
265
|
+
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
|
266
|
+
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
|
267
|
+
1,2d0
|
268
|
+
< bikeid
|
269
|
+
< birth_year
|
270
|
+
8d5
|
271
|
+
< gender
|
272
|
+
9a7,8
|
273
|
+
> ride_id
|
274
|
+
> rideable_type
|
275
|
+
15d13
|
276
|
+
< tripduration
|
277
|
+
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
|
278
|
+
1,2d0
|
279
|
+
< bikeid
|
280
|
+
< birth_year
|
281
|
+
8d5
|
282
|
+
< gender
|
283
|
+
9a7,8
|
284
|
+
> ride_id
|
285
|
+
> rideable_type
|
286
|
+
15d13
|
287
|
+
< tripduration
|
288
|
+
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
|
289
|
+
1,2d0
|
290
|
+
< bikeid
|
291
|
+
< birth_year
|
292
|
+
8d5
|
293
|
+
< gender
|
294
|
+
9a7,8
|
295
|
+
> ride_id
|
296
|
+
> rideable_type
|
297
|
+
15d13
|
298
|
+
< tripduration
|
299
|
+
…
|
300
|
+
```
|
301
|
+
|
302
|
+
This helped me see that the data update in question (`c0..c1`) dropped some fields (`bikeid`, `birth_year`, `gender`, `tripduration`) and added others (`ride_id`, `rideable_type`), for `202001` and later.
|
303
|
+
|
214
304
|
[DVC]: https://dvc.org/
|
305
|
+
[PyPI]: https://pypi.org/project/dvc-utils/
|
215
306
|
[`parquet2json`]: https://github.com/jupiter/parquet2json
|
216
307
|
[hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
|
217
308
|
[Parquet]: https://parquet.apache.org/
|
218
309
|
[commit]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7
|
219
310
|
[commit path]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7#diff-7f812dce61e0996354f4af414203e0933ccdfe9613cb406c40c1c41a14b9769c
|
220
311
|
[hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
|
312
|
+
[ctbk.dev gh]: https://github.com/neighbor-ryan/ctbk.dev
|
221
313
|
[`jq`]: https://jqlang.github.io/jq/
|
314
|
+
[`parallel`]: https://www.gnu.org/software/parallel/
|
315
|
+
|
316
|
+
[`gdno`]: https://github.com/ryan-williams/git-helpers/blob/96560df1406f41676f293becefb423895a755faf/diff/.gitconfig#L31
|
317
|
+
[`pel`]: https://github.com/ryan-williams/parallel-helpers/blob/e7ee109c4085c04036840ea78999cff73fcf9502/.parallel-rc#L6-L17
|
318
|
+
[`ddcr`]: https://github.com/ryan-williams/aws-helpers/blob/8a314f1e6b336833c772459de6b739f5c06a51a3/.dvc-rc#L84
|
319
|
+
[`guc`]: https://github.com/ryan-williams/zip-helpers/blob/c67d84fb06c0ab3609dacb68d900344d3b8e8f04/.zip-rc#L16
|
320
|
+
[`h1`]: https://github.com/ryan-williams/head-tail-helpers/blob/9715690f47ceeff6b6948b2093901f2b0830114b/.head-tail-rc#L3
|
321
|
+
[`spc`]: https://github.com/ryan-williams/col-helpers/blob/9493d003224249ee240d023f71ab03bdd4174b88/.cols-rc#L8
|
322
|
+
[`kq`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L115
|
323
|
+
[`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
|
324
|
+
[`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
|
325
|
+
[`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
|
@@ -1,13 +1,13 @@
|
|
1
|
-
from functools import cache
|
2
|
-
from os import environ as env, getcwd
|
3
|
-
from os.path import join, relpath
|
4
1
|
import shlex
|
5
|
-
from
|
2
|
+
from os import environ as env
|
3
|
+
from typing import Tuple
|
6
4
|
|
7
|
-
from click import option, argument, group
|
8
5
|
import click
|
9
|
-
import
|
10
|
-
from utz import
|
6
|
+
from click import option, argument, group
|
7
|
+
from utz import process, err
|
8
|
+
from qmdx import join_pipelines
|
9
|
+
|
10
|
+
from dvc_utils.path import dvc_paths, dvc_path as dvc_cache_path
|
11
11
|
|
12
12
|
|
13
13
|
@group()
|
@@ -15,57 +15,6 @@ def cli():
|
|
15
15
|
pass
|
16
16
|
|
17
17
|
|
18
|
-
def dvc_paths(path: str) -> Tuple[str, str]:
|
19
|
-
if path.endswith('.dvc'):
|
20
|
-
dvc_path = path
|
21
|
-
path = dvc_path[:-len('.dvc')]
|
22
|
-
else:
|
23
|
-
dvc_path = f'{path}.dvc'
|
24
|
-
return path, dvc_path
|
25
|
-
|
26
|
-
|
27
|
-
@cache
|
28
|
-
def get_git_root() -> str:
|
29
|
-
return process.line('git', 'rev-parse', '--show-toplevel', log=False)
|
30
|
-
|
31
|
-
|
32
|
-
@cache
|
33
|
-
def get_dir_path() -> str:
|
34
|
-
return relpath(getcwd(), get_git_root())
|
35
|
-
|
36
|
-
|
37
|
-
@cache
|
38
|
-
def dvc_cache_dir(log: bool = False) -> str:
|
39
|
-
dvc_cache_relpath = env.get('DVC_UTILS_CACHE_DIR')
|
40
|
-
if dvc_cache_relpath:
|
41
|
-
return join(get_git_root(), dvc_cache_relpath)
|
42
|
-
else:
|
43
|
-
return process.line('dvc', 'cache', 'dir', log=log)
|
44
|
-
|
45
|
-
|
46
|
-
def dvc_md5(git_ref: str, dvc_path: str, log: bool = False) -> str:
|
47
|
-
dir_path = get_dir_path()
|
48
|
-
dir_path = '' if dir_path == '.' else f'{dir_path}/'
|
49
|
-
dvc_spec = process.output('git', 'show', f'{git_ref}:{dir_path}{dvc_path}', log=err if log else None)
|
50
|
-
dvc_obj = yaml.safe_load(dvc_spec)
|
51
|
-
out = singleton(dvc_obj['outs'], dedupe=False)
|
52
|
-
md5 = out['md5']
|
53
|
-
return md5
|
54
|
-
|
55
|
-
|
56
|
-
def dvc_cache_path(ref: str, dvc_path: Optional[str] = None, log: bool = False) -> str:
|
57
|
-
if dvc_path:
|
58
|
-
md5 = dvc_md5(ref, dvc_path, log=log)
|
59
|
-
elif ':' in ref:
|
60
|
-
git_ref, dvc_path = ref.split(':', 1)
|
61
|
-
md5 = dvc_md5(git_ref, dvc_path, log=log)
|
62
|
-
else:
|
63
|
-
md5 = ref
|
64
|
-
dirname = md5[:2]
|
65
|
-
basename = md5[2:]
|
66
|
-
return join(dvc_cache_dir(log=log), 'files', 'md5', dirname, basename)
|
67
|
-
|
68
|
-
|
69
18
|
@cli.command('diff', short_help='Diff a DVC-tracked file at two commits (or one commit vs. current worktree), optionally passing both through another command first')
|
70
19
|
@option('-c', '--color', is_flag=True, help='Colorize the output')
|
71
20
|
@option('-r', '--refspec', default='HEAD', help='<commit 1>..<commit 2> (compare two commits) or <commit> (compare <commit> to the worktree)')
|
@@ -114,39 +63,32 @@ def dvc_utils_diff(
|
|
114
63
|
raise ValueError(f"Invalid refspec: {refspec}")
|
115
64
|
|
116
65
|
log = err if verbose else False
|
117
|
-
|
118
|
-
|
119
|
-
|
66
|
+
path1 = dvc_cache_path(before, dvc_path, log=log)
|
67
|
+
path2 = path if after is None else dvc_cache_path(after, dvc_path, log=log)
|
68
|
+
|
69
|
+
diff_args = [
|
70
|
+
*(['-w'] if ignore_whitespace else []),
|
71
|
+
*(['-U', str(unified)] if unified is not None else []),
|
72
|
+
*(['--color=always'] if color else []),
|
73
|
+
]
|
120
74
|
if cmds:
|
121
75
|
cmd, *sub_cmds = cmds
|
76
|
+
cmds1 = [ f'{cmd} {path1}', *sub_cmds ]
|
77
|
+
cmds2 = [ f'{cmd} {path2}', *sub_cmds ]
|
122
78
|
if not shell:
|
123
|
-
|
124
|
-
|
125
|
-
|
126
|
-
|
127
|
-
]
|
128
|
-
|
129
|
-
|
130
|
-
*sub_cmds,
|
131
|
-
]
|
132
|
-
shell_kwargs = {}
|
133
|
-
else:
|
134
|
-
before_cmds = [ f'{cmd} {before_path}', *sub_cmds ]
|
135
|
-
after_cmds = [ f'{cmd} {after_path}', *sub_cmds ]
|
136
|
-
shell_kwargs = dict(shell=shell)
|
137
|
-
|
138
|
-
diff_cmds(
|
139
|
-
before_cmds,
|
140
|
-
after_cmds,
|
79
|
+
cmds1 = [ shlex.split(cmd) for cmd in cmds1 ]
|
80
|
+
cmds2 = [ shlex.split(cmd) for cmd in cmds2 ]
|
81
|
+
|
82
|
+
join_pipelines(
|
83
|
+
base_cmd=['diff', *diff_args],
|
84
|
+
cmds1=cmds1,
|
85
|
+
cmds2=cmds2,
|
141
86
|
verbose=verbose,
|
142
|
-
|
143
|
-
unified=unified,
|
144
|
-
ignore_whitespace=ignore_whitespace,
|
87
|
+
shell=shell,
|
145
88
|
shell_executable=shell_executable,
|
146
|
-
**shell_kwargs,
|
147
89
|
)
|
148
90
|
else:
|
149
|
-
process.run('diff',
|
91
|
+
process.run('diff', *diff_args, path1, path2, log=log)
|
150
92
|
|
151
93
|
|
152
94
|
if __name__ == '__main__':
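The new `cli.py` logic above can be read as two small steps: build the `diff` flags once, then build one command pipeline per side of the comparison. The sketch below isolates those steps for illustration; `build_diff_args` and `build_pipelines` are hypothetical helper names (the package inlines this logic in `dvc_utils_diff`), and the call into `join_pipelines` is omitted.

```python
import shlex
from typing import List, Optional, Sequence, Tuple


def build_diff_args(
    color: bool = False,
    unified: Optional[int] = None,
    ignore_whitespace: bool = False,
) -> List[str]:
    """Translate CLI options into `diff` flags, skipping unset options."""
    return [
        *(['-w'] if ignore_whitespace else []),
        *(['-U', str(unified)] if unified is not None else []),
        *(['--color=always'] if color else []),
    ]


def build_pipelines(
    cmds: Sequence[str],
    path1: str,
    path2: str,
    shell: bool = False,
) -> Tuple[list, list]:
    """Append each side's path to the first command; later commands are shared."""
    cmd, *sub_cmds = cmds
    cmds1 = [f'{cmd} {path1}', *sub_cmds]
    cmds2 = [f'{cmd} {path2}', *sub_cmds]
    if not shell:
        # Without a shell, each command string is split into an argv list.
        cmds1 = [shlex.split(c) for c in cmds1]
        cmds2 = [shlex.split(c) for c in cmds2]
    return cmds1, cmds2
```

Only the first command receives the file path; every later stage reads from the previous stage's output, which is why the two pipelines differ in a single argument.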
|
@@ -0,0 +1,60 @@
|
|
1
|
+
from functools import cache
|
2
|
+
from os import environ as env, getcwd
|
3
|
+
from os.path import join, relpath
|
4
|
+
from typing import Optional, Tuple
|
5
|
+
|
6
|
+
import yaml
|
7
|
+
from utz import process, err, singleton
|
8
|
+
|
9
|
+
|
10
|
+
def dvc_paths(path: str) -> Tuple[str, str]:
|
11
|
+
if path.endswith('.dvc'):
|
12
|
+
dvc_path = path
|
13
|
+
path = dvc_path[:-len('.dvc')]
|
14
|
+
else:
|
15
|
+
dvc_path = f'{path}.dvc'
|
16
|
+
return path, dvc_path
|
17
|
+
|
18
|
+
|
19
|
+
@cache
|
20
|
+
def get_git_root() -> str:
|
21
|
+
return process.line('git', 'rev-parse', '--show-toplevel', log=False)
|
22
|
+
|
23
|
+
|
24
|
+
@cache
|
25
|
+
def get_dir_path() -> str:
|
26
|
+
return relpath(getcwd(), get_git_root())
|
27
|
+
|
28
|
+
|
29
|
+
@cache
|
30
|
+
def dvc_cache_dir(log: bool = False) -> str:
|
31
|
+
dvc_cache_relpath = env.get('DVC_UTILS_CACHE_DIR')
|
32
|
+
if dvc_cache_relpath:
|
33
|
+
return join(get_git_root(), dvc_cache_relpath)
|
34
|
+
else:
|
35
|
+
return process.line('dvc', 'cache', 'dir', log=log)
|
36
|
+
|
37
|
+
|
38
|
+
def dvc_md5(git_ref: str, dvc_path: str, log: bool = False) -> str:
|
39
|
+
dir_path = get_dir_path()
|
40
|
+
dir_path = '' if dir_path == '.' else f'{dir_path}/'
|
41
|
+
dvc_spec = process.output('git', 'show', f'{git_ref}:{dir_path}{dvc_path}', log=err if log else None)
|
42
|
+
dvc_obj = yaml.safe_load(dvc_spec)
|
43
|
+
out = singleton(dvc_obj['outs'], dedupe=False)
|
44
|
+
md5 = out['md5']
|
45
|
+
return md5
|
46
|
+
|
47
|
+
|
48
|
+
def dvc_path(ref: str, dvc_path: Optional[str] = None, log: bool = False) -> str:
|
49
|
+
if dvc_path and not dvc_path.endswith('.dvc'):
|
50
|
+
dvc_path += '.dvc'
|
51
|
+
if dvc_path:
|
52
|
+
md5 = dvc_md5(ref, dvc_path, log=log)
|
53
|
+
elif ':' in ref:
|
54
|
+
git_ref, dvc_path = ref.split(':', 1)
|
55
|
+
md5 = dvc_md5(git_ref, dvc_path, log=log)
|
56
|
+
else:
|
57
|
+
md5 = ref
|
58
|
+
dirname = md5[:2]
|
59
|
+
basename = md5[2:]
|
60
|
+
return join(dvc_cache_dir(log=log), 'files', 'md5', dirname, basename)
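As a reading aid for `path.py`, the sketch below reproduces only its pure string handling: pairing a data path with its `.dvc` file, and mapping an MD5 to a file under the cache directory. `split_dvc_path` and `md5_to_cache_path` are hypothetical names, and the Git and DVC lookups (`git show`, `dvc cache dir`) are deliberately left out, so the cache directory is passed in as a plain argument.

```python
from os.path import join
from typing import Tuple


def split_dvc_path(path: str) -> Tuple[str, str]:
    """Return (data path, .dvc path), whichever of the two was given."""
    if path.endswith('.dvc'):
        return path[:-len('.dvc')], path
    return path, f'{path}.dvc'


def md5_to_cache_path(cache_dir: str, md5: str) -> str:
    """Map an MD5 to `<cache_dir>/files/md5/<first 2 chars>/<remaining chars>`."""
    return join(cache_dir, 'files', 'md5', md5[:2], md5[2:])
```

The two-character directory prefix matches the `dirname = md5[:2]` / `basename = md5[2:]` split in `dvc_path` above.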
|
@@ -1,6 +1,6 @@
|
|
1
1
|
Metadata-Version: 2.1
|
2
2
|
Name: dvc-utils
|
3
|
-
Version: 0.0
|
3
|
+
Version: 0.1.0
|
4
4
|
Summary: CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
|
5
5
|
Home-page: https://github.com/runsascoded/dvc-utils
|
6
6
|
Author: Ryan Williams
|
@@ -10,17 +10,20 @@ Description-Content-Type: text/markdown
|
|
10
10
|
License-File: LICENSE
|
11
11
|
|
12
12
|
# dvc-utils
|
13
|
-
|
13
|
+
Diff [DVC] files, optionally piping through other commands first.
|
14
|
+
|
15
|
+
[][PyPI]
|
14
16
|
|
15
17
|
<!-- toc -->
|
16
18
|
- [Installation](#installation)
|
17
19
|
- [Usage](#usage)
|
18
20
|
- [`dvc-diff`](#dvc-diff)
|
19
21
|
- [Examples](#examples)
|
20
|
-
- [Parquet
|
22
|
+
- [Parquet](#parquet-diff)
|
21
23
|
- [Schema diff](#parquet-schema-diff)
|
22
24
|
- [Row diff](#parquet-row-diff)
|
23
25
|
- [Row count diff](#parquet-row-count-diff)
|
26
|
+
- [GZipped CSVs](#csv-gz)
|
24
27
|
<!-- /toc -->
|
25
28
|
|
26
29
|
## Installation <a id="installation"></a>
|
@@ -83,7 +86,7 @@ dvc-diff --help
|
|
83
86
|
|
84
87
|
## Examples <a id="examples"></a>
|
85
88
|
|
86
|
-
### Parquet
|
89
|
+
### Parquet <a id="parquet-diff"></a>
|
87
90
|
See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
|
88
91
|
|
89
92
|
Setup:
|
@@ -222,11 +225,112 @@ dvc-diff -r $commit^..$commit parquet_row_count $path
|
|
222
225
|
|
223
226
|
This time we get no output; [the given `$commit`][commit] didn't change the row count in the DVC-tracked Parquet file [`$path`][commit path].
|
224
227
|
|
228
|
+
### GZipped CSVs <a id="csv-gz"></a>
|
229
|
+
|
230
|
+
Here's a "one-liner" I used in [ctbk.dev][ctbk.dev gh], to normalize and compare headers of `.csv.gz.dvc` files between two commits:
|
231
|
+
|
232
|
+
```bash
|
233
|
+
# Save some `sed` substitution commands to file `seds`:
|
234
|
+
cat <<EOF >seds
|
235
|
+
s/station_//
|
236
|
+
s/latitude/lat/
|
237
|
+
s/longitude/lng/
|
238
|
+
s/starttime/started_at/
|
239
|
+
s/stoptime/ended_at/
|
240
|
+
s/usertype/member_casual/
|
241
|
+
EOF
|
242
|
+
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
|
243
|
+
r=c0..c1
|
244
|
+
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
|
245
|
+
gdno $r s3/ctbk/csvs/ | \
|
246
|
+
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
|
247
|
+
```
|
248
|
+
|
249
|
+
<details>
|
250
|
+
<summary>
|
251
|
+
Explanation of aliases
|
252
|
+
</summary>
|
253
|
+
|
254
|
+
- [`gdno`] (`git diff --name-only`): list files changed in the given commit range and directory
|
255
|
+
- [`pel`]: [`parallel`] alias that prepends an `echo {}` to the command
|
256
|
+
- [`ddcr`] (`dvc-diff -cr`): colorized `diff` output, revision range `$r`
|
257
|
+
- [`guc`] (`gunzip -c`): uncompress the `.csv.gz` files
|
258
|
+
- [`h1`] (`head -n1`): only examine each file's header line
|
259
|
+
- [`spc`] (`tr , $'\n'`): **sp**lit the header line by **c**ommas (so each column name will be on one line, for easier `diff`ing below)
|
260
|
+
- [`kq`] (`tr -d '"'`): **k**ill **q**uote characters (in this case, header-column name quoting changed, but I don't care about that)
|
261
|
+
- [`kcr`] (`tr -d '\r'`): **k**ill **c**arriage **r**eturns (line endings also changed)
|
262
|
+
- [`snc`] (`sed -f 'snake_case.sed'`): snake-case column names
|
263
|
+
- [`sdf`] (`sed -f`): execute the `sed` substitution commands defined in the `seds` file above
|
264
|
+
- `sort`: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
|
265
|
+
|
266
|
+
Note:
|
267
|
+
- Most of these are exported Bash functions, allowing them to be used inside the [`parallel`] command.
|
268
|
+
- I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the `seds` commands).
|
269
|
+
</details>
|
270
|
+
|
271
|
+
Example output:
|
272
|
+
```diff
|
273
|
+
…
|
274
|
+
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
|
275
|
+
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
|
276
|
+
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
|
277
|
+
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
|
278
|
+
1,2d0
|
279
|
+
< bikeid
|
280
|
+
< birth_year
|
281
|
+
8d5
|
282
|
+
< gender
|
283
|
+
9a7,8
|
284
|
+
> ride_id
|
285
|
+
> rideable_type
|
286
|
+
15d13
|
287
|
+
< tripduration
|
288
|
+
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
|
289
|
+
1,2d0
|
290
|
+
< bikeid
|
291
|
+
< birth_year
|
292
|
+
8d5
|
293
|
+
< gender
|
294
|
+
9a7,8
|
295
|
+
> ride_id
|
296
|
+
> rideable_type
|
297
|
+
15d13
|
298
|
+
< tripduration
|
299
|
+
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
|
300
|
+
1,2d0
|
301
|
+
< bikeid
|
302
|
+
< birth_year
|
303
|
+
8d5
|
304
|
+
< gender
|
305
|
+
9a7,8
|
306
|
+
> ride_id
|
307
|
+
> rideable_type
|
308
|
+
15d13
|
309
|
+
< tripduration
|
310
|
+
…
|
311
|
+
```
|
312
|
+
|
313
|
+
This helped me see that the data update in question (`c0..c1`) dropped some fields (`bikeid`, `birth_year`, `gender`, `tripduration`) and added others (`ride_id`, `rideable_type`), for `202001` and later.
|
314
|
+
|
225
315
|
[DVC]: https://dvc.org/
|
316
|
+
[PyPI]: https://pypi.org/project/dvc-utils/
|
226
317
|
[`parquet2json`]: https://github.com/jupiter/parquet2json
|
227
318
|
[hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
|
228
319
|
[Parquet]: https://parquet.apache.org/
|
229
320
|
[commit]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7
|
230
321
|
[commit path]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7#diff-7f812dce61e0996354f4af414203e0933ccdfe9613cb406c40c1c41a14b9769c
|
231
322
|
[hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
|
323
|
+
[ctbk.dev gh]: https://github.com/neighbor-ryan/ctbk.dev
|
232
324
|
[`jq`]: https://jqlang.github.io/jq/
|
325
|
+
[`parallel`]: https://www.gnu.org/software/parallel/
|
326
|
+
|
327
|
+
[`gdno`]: https://github.com/ryan-williams/git-helpers/blob/96560df1406f41676f293becefb423895a755faf/diff/.gitconfig#L31
|
328
|
+
[`pel`]: https://github.com/ryan-williams/parallel-helpers/blob/e7ee109c4085c04036840ea78999cff73fcf9502/.parallel-rc#L6-L17
|
329
|
+
[`ddcr`]: https://github.com/ryan-williams/aws-helpers/blob/8a314f1e6b336833c772459de6b739f5c06a51a3/.dvc-rc#L84
|
330
|
+
[`guc`]: https://github.com/ryan-williams/zip-helpers/blob/c67d84fb06c0ab3609dacb68d900344d3b8e8f04/.zip-rc#L16
|
331
|
+
[`h1`]: https://github.com/ryan-williams/head-tail-helpers/blob/9715690f47ceeff6b6948b2093901f2b0830114b/.head-tail-rc#L3
|
332
|
+
[`spc`]: https://github.com/ryan-williams/col-helpers/blob/9493d003224249ee240d023f71ab03bdd4174b88/.cols-rc#L8
|
333
|
+
[`kq`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L115
|
334
|
+
[`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
|
335
|
+
[`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
|
336
|
+
[`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
|
@@ -2,9 +2,11 @@ LICENSE
|
|
2
2
|
README.md
|
3
3
|
setup.py
|
4
4
|
dvc_utils/__init__.py
|
5
|
-
dvc_utils/
|
5
|
+
dvc_utils/cli.py
|
6
|
+
dvc_utils/path.py
|
6
7
|
dvc_utils.egg-info/PKG-INFO
|
7
8
|
dvc_utils.egg-info/SOURCES.txt
|
8
9
|
dvc_utils.egg-info/dependency_links.txt
|
9
10
|
dvc_utils.egg-info/entry_points.txt
|
11
|
+
dvc_utils.egg-info/requires.txt
|
10
12
|
dvc_utils.egg-info/top_level.txt
|
@@ -2,15 +2,16 @@ from setuptools import setup
|
|
2
2
|
|
3
3
|
setup(
|
4
4
|
name='dvc-utils',
|
5
|
-
version="0.0
|
5
|
+
version="0.1.0",
|
6
6
|
description="CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first",
|
7
7
|
long_description=open("README.md").read(),
|
8
8
|
long_description_content_type="text/markdown",
|
9
9
|
packages=['dvc_utils'],
|
10
|
+
install_requires=open("requirements.txt").read(),
|
10
11
|
entry_points={
|
11
12
|
'console_scripts': [
|
12
|
-
'dvc-utils = dvc_utils.
|
13
|
-
'dvc-diff = dvc_utils.
|
13
|
+
'dvc-utils = dvc_utils.cli:cli',
|
14
|
+
'dvc-diff = dvc_utils.cli:dvc_utils_diff',
|
14
15
|
],
|
15
16
|
},
|
16
17
|
license="MIT",
|
File without changes
|