dvc-utils 0.0.8__tar.gz → 0.1.0__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: dvc-utils
- Version: 0.0.8
+ Version: 0.1.0
  Summary: CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
  Home-page: https://github.com/runsascoded/dvc-utils
  Author: Ryan Williams
@@ -10,17 +10,20 @@ Description-Content-Type: text/markdown
10
10
  License-File: LICENSE
11
11
 
12
12
  # dvc-utils
13
- CLI for diffing [DVC] files, optionally passing both through another command first
13
+ Diff [DVC] files, optionally piping through other commands first.
14
+
15
+ [![dvc-utils on PyPI](https://img.shields.io/pypi/v/dvc-utils?label=dvc-utils)][PyPI]
14
16
 
15
17
  <!-- toc -->
16
18
  - [Installation](#installation)
17
19
  - [Usage](#usage)
18
20
  - [`dvc-diff`](#dvc-diff)
19
21
  - [Examples](#examples)
20
- - [Parquet file](#parquet-diff)
22
+ - [Parquet](#parquet-diff)
21
23
  - [Schema diff](#parquet-schema-diff)
22
24
  - [Row diff](#parquet-row-diff)
23
25
  - [Row count diff](#parquet-row-count-diff)
26
+ - [GZipped CSVs](#csv-gz)
24
27
  <!-- /toc -->
25
28
 
26
29
  ## Installation <a id="installation"></a>
@@ -83,7 +86,7 @@ dvc-diff --help
83
86
 
84
87
  ## Examples <a id="examples"></a>
85
88
 
86
- ### Parquet file <a id="parquet-diff"></a>
89
+ ### Parquet <a id="parquet-diff"></a>
87
90
  See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
88
91
 
89
92
  Setup:
@@ -222,11 +225,112 @@ dvc-diff -r $commit^..$commit parquet_row_count $path
222
225
 
223
226
  This time we get no output; [the given `$commit`][commit] didn't change the row count in the DVC-tracked Parquet file [`$path`][commit path].
224
227
 
228
+ ### GZipped CSVs <a id="csv-gz"></a>
229
+
230
+ Here's a "one-liner" I used in [ctbk.dev][ctbk.dev gh], to normalize and compare headers of `.csv.gz.dvc` files between two commits:
231
+
232
+ ```bash
233
+ # Save some `sed` substitution commands to file `seds`:
234
+ cat <<EOF >seds
235
+ s/station_//
236
+ s/latitude/lat/
237
+ s/longitude/lng/
238
+ s/starttime/started_at/
239
+ s/stoptime/ended_at/
240
+ s/usertype/member_casual/
241
+ EOF
242
+ # Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
243
+ r=c0..c1
244
+ # List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
245
+ gdno $r s3/ctbk/csvs/ | \
246
+ pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
247
+ ```
248
+
249
+ <details>
250
+ <summary>
251
+ Explanation of aliases
252
+ </summary>
253
+
254
+ - [`gdno`] (`git diff --name-only`): list files changed in the given commit range and directory
255
+ - [`pel`]: [`parallel`] alias that prepends an `echo {}` to the command
256
+ - [`ddcr`] (`dvc-diff -cr`): colorized `diff` output, revision range `$r`
257
+ - [`guc`] (`gunzip -c`): uncompress the `.csv.gz` files
258
+ - [`h1`] (`head -n1`): only examine each file's header line
259
+ - [`spc`] (`tr , $'\n'`): **sp**lit the header line by **c**ommas (so each column name will be on one line, for easier `diff`ing below)
260
+ - [`kq`] (`tr -d '"'`): **k**ill **q**uote characters (in this case, header-column name quoting changed, but I don't care about that)
261
+ - [`kcr`] (`tr -d '\r'`): **k**ill **c**arriage **r**eturns (line endings also changed)
262
+ - [`snc`] (`sed -f 'snake_case.sed'`): snake-case column names
263
+ - [`sdf`] (`sed -f`): execute the `sed` substitution commands defined in the `seds` file above
264
+ - `sort`: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
265
+
266
+ Note:
267
+ - Most of these are exported Bash functions, allowing them to be used inside the [`parallel`] command.
268
+ - I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the `seds` commands).
269
+ </details>
270
+
271
+ Example output:
272
+ ```diff
273
+
274
+ s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
275
+ s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
276
+ s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
277
+ s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
278
+ 1,2d0
279
+ < bikeid
280
+ < birth_year
281
+ 8d5
282
+ < gender
283
+ 9a7,8
284
+ > ride_id
285
+ > rideable_type
286
+ 15d13
287
+ < tripduration
288
+ s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
289
+ 1,2d0
290
+ < bikeid
291
+ < birth_year
292
+ 8d5
293
+ < gender
294
+ 9a7,8
295
+ > ride_id
296
+ > rideable_type
297
+ 15d13
298
+ < tripduration
299
+ s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
300
+ 1,2d0
301
+ < bikeid
302
+ < birth_year
303
+ 8d5
304
+ < gender
305
+ 9a7,8
306
+ > ride_id
307
+ > rideable_type
308
+ 15d13
309
+ < tripduration
310
+
311
+ ```
312
+
313
+ This helped me see that the data update in question (`c0..c1`) dropped some fields (`bikeid`, `birth_year`, `gender`, `tripduration`) and added others (`ride_id`, `rideable_type`), for `202001` and later.
314
+
225
315
  [DVC]: https://dvc.org/
316
+ [PyPI]: https://pypi.org/project/dvc-utils/
226
317
  [`parquet2json`]: https://github.com/jupiter/parquet2json
227
318
  [hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
228
319
  [Parquet]: https://parquet.apache.org/
229
320
  [commit]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7
230
321
  [commit path]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7#diff-7f812dce61e0996354f4af414203e0933ccdfe9613cb406c40c1c41a14b9769c
231
322
  [hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
323
+ [ctbk.dev gh]: https://github.com/neighbor-ryan/ctbk.dev
232
324
  [`jq`]: https://jqlang.github.io/jq/
325
+ [`parallel`]: https://www.gnu.org/software/parallel/
326
+
327
+ [`gdno`]: https://github.com/ryan-williams/git-helpers/blob/96560df1406f41676f293becefb423895a755faf/diff/.gitconfig#L31
328
+ [`pel`]: https://github.com/ryan-williams/parallel-helpers/blob/e7ee109c4085c04036840ea78999cff73fcf9502/.parallel-rc#L6-L17
329
+ [`ddcr`]: https://github.com/ryan-williams/aws-helpers/blob/8a314f1e6b336833c772459de6b739f5c06a51a3/.dvc-rc#L84
330
+ [`guc`]: https://github.com/ryan-williams/zip-helpers/blob/c67d84fb06c0ab3609dacb68d900344d3b8e8f04/.zip-rc#L16
331
+ [`h1`]: https://github.com/ryan-williams/head-tail-helpers/blob/9715690f47ceeff6b6948b2093901f2b0830114b/.head-tail-rc#L3
332
+ [`spc`]: https://github.com/ryan-williams/col-helpers/blob/9493d003224249ee240d023f71ab03bdd4174b88/.cols-rc#L8
333
+ [`kq`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L115
334
+ [`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
335
+ [`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
336
+ [`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
@@ -1,15 +1,18 @@
1
1
  # dvc-utils
2
- CLI for diffing [DVC] files, optionally passing both through another command first
2
+ Diff [DVC] files, optionally piping through other commands first.
3
+
4
+ [![dvc-utils on PyPI](https://img.shields.io/pypi/v/dvc-utils?label=dvc-utils)][PyPI]
3
5
 
4
6
  <!-- toc -->
5
7
  - [Installation](#installation)
6
8
  - [Usage](#usage)
7
9
  - [`dvc-diff`](#dvc-diff)
8
10
  - [Examples](#examples)
9
- - [Parquet file](#parquet-diff)
11
+ - [Parquet](#parquet-diff)
10
12
  - [Schema diff](#parquet-schema-diff)
11
13
  - [Row diff](#parquet-row-diff)
12
14
  - [Row count diff](#parquet-row-count-diff)
15
+ - [GZipped CSVs](#csv-gz)
13
16
  <!-- /toc -->
14
17
 
15
18
  ## Installation <a id="installation"></a>
@@ -72,7 +75,7 @@ dvc-diff --help
72
75
 
73
76
  ## Examples <a id="examples"></a>
74
77
 
75
- ### Parquet file <a id="parquet-diff"></a>
78
+ ### Parquet <a id="parquet-diff"></a>
76
79
  See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
77
80
 
78
81
  Setup:
@@ -211,11 +214,112 @@ dvc-diff -r $commit^..$commit parquet_row_count $path
211
214
 
212
215
  This time we get no output; [the given `$commit`][commit] didn't change the row count in the DVC-tracked Parquet file [`$path`][commit path].
213
216
 
217
+ ### GZipped CSVs <a id="csv-gz"></a>
218
+
219
+ Here's a "one-liner" I used in [ctbk.dev][ctbk.dev gh], to normalize and compare headers of `.csv.gz.dvc` files between two commits:
220
+
221
+ ```bash
222
+ # Save some `sed` substitution commands to file `seds`:
223
+ cat <<EOF >seds
224
+ s/station_//
225
+ s/latitude/lat/
226
+ s/longitude/lng/
227
+ s/starttime/started_at/
228
+ s/stoptime/ended_at/
229
+ s/usertype/member_casual/
230
+ EOF
231
+ # Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
232
+ r=c0..c1
233
+ # List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
234
+ gdno $r s3/ctbk/csvs/ | \
235
+ pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
236
+ ```
237
+
238
+ <details>
239
+ <summary>
240
+ Explanation of aliases
241
+ </summary>
242
+
243
+ - [`gdno`] (`git diff --name-only`): list files changed in the given commit range and directory
244
+ - [`pel`]: [`parallel`] alias that prepends an `echo {}` to the command
245
+ - [`ddcr`] (`dvc-diff -cr`): colorized `diff` output, revision range `$r`
246
+ - [`guc`] (`gunzip -c`): uncompress the `.csv.gz` files
247
+ - [`h1`] (`head -n1`): only examine each file's header line
248
+ - [`spc`] (`tr , $'\n'`): **sp**lit the header line by **c**ommas (so each column name will be on one line, for easier `diff`ing below)
249
+ - [`kq`] (`tr -d '"'`): **k**ill **q**uote characters (in this case, header-column name quoting changed, but I don't care about that)
250
+ - [`kcr`] (`tr -d '\r'`): **k**ill **c**arriage **r**eturns (line endings also changed)
251
+ - [`snc`] (`sed -f 'snake_case.sed'`): snake-case column names
252
+ - [`sdf`] (`sed -f`): execute the `sed` substitution commands defined in the `seds` file above
253
+ - `sort`: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
254
+
255
+ Note:
256
+ - Most of these are exported Bash functions, allowing them to be used inside the [`parallel`] command.
257
+ - I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the `seds` commands).
258
+ </details>
259
+
260
+ Example output:
261
+ ```diff
262
+
263
+ s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
264
+ s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
265
+ s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
266
+ s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
267
+ 1,2d0
268
+ < bikeid
269
+ < birth_year
270
+ 8d5
271
+ < gender
272
+ 9a7,8
273
+ > ride_id
274
+ > rideable_type
275
+ 15d13
276
+ < tripduration
277
+ s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
278
+ 1,2d0
279
+ < bikeid
280
+ < birth_year
281
+ 8d5
282
+ < gender
283
+ 9a7,8
284
+ > ride_id
285
+ > rideable_type
286
+ 15d13
287
+ < tripduration
288
+ s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
289
+ 1,2d0
290
+ < bikeid
291
+ < birth_year
292
+ 8d5
293
+ < gender
294
+ 9a7,8
295
+ > ride_id
296
+ > rideable_type
297
+ 15d13
298
+ < tripduration
299
+
300
+ ```
301
+
302
+ This helped me see that the data update in question (`c0..c1`) dropped some fields (`bikeid`, `birth_year`, `gender`, `tripduration`) and added others (`ride_id`, `rideable_type`), for `202001` and later.
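For readers without the shell aliases, the header-normalization steps described above can be approximated in plain Python. This is only a sketch under stated assumptions: `normalize_header`, `read_header`, and `snake_case` are hypothetical names, the substitutions are copied from the `seds` file in the example, and `snake_case` is a simplistic stand-in because the contents of `snake_case.sed` are not shown in this diff.

```python
import gzip
import io
from typing import List

# Substitutions copied from the `seds` file in the README example.
# Each is applied to the first match per column name, like `sed` without the `g` flag.
SUBSTITUTIONS = [
    ("station_", ""),
    ("latitude", "lat"),
    ("longitude", "lng"),
    ("starttime", "started_at"),
    ("stoptime", "ended_at"),
    ("usertype", "member_casual"),
]


def snake_case(name: str) -> str:
    # Assumption: a simplistic stand-in for `snake_case.sed`, whose rules are not shown here.
    return name.strip().lower().replace(" ", "_")


def read_header(gz_bytes: bytes) -> str:
    # Roughly `gunzip -c | head -n1`: decompress and keep only the first line.
    with gzip.open(io.BytesIO(gz_bytes), "rt", newline="") as f:
        return f.readline()


def normalize_header(header_line: str) -> List[str]:
    # Drop carriage returns and quote characters, then split on commas.
    cleaned = header_line.replace("\r", "").replace('"', "").rstrip("\n")
    columns = []
    for col in cleaned.split(","):
        col = snake_case(col)
        for old, new in SUBSTITUTIONS:
            col = col.replace(old, new, 1)
        columns.append(col)
    # Sort so that column reordering does not show up as a difference.
    return sorted(columns)
```

Comparing `normalize_header(...)` for the same file at two commits (for example with `difflib.unified_diff`) would then surface added or dropped columns, much as the `diff` output above does.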
303
+
214
304
  [DVC]: https://dvc.org/
305
+ [PyPI]: https://pypi.org/project/dvc-utils/
215
306
  [`parquet2json`]: https://github.com/jupiter/parquet2json
216
307
  [hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
217
308
  [Parquet]: https://parquet.apache.org/
218
309
  [commit]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7
219
310
  [commit path]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7#diff-7f812dce61e0996354f4af414203e0933ccdfe9613cb406c40c1c41a14b9769c
220
311
  [hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
312
+ [ctbk.dev gh]: https://github.com/neighbor-ryan/ctbk.dev
221
313
  [`jq`]: https://jqlang.github.io/jq/
314
+ [`parallel`]: https://www.gnu.org/software/parallel/
315
+
316
+ [`gdno`]: https://github.com/ryan-williams/git-helpers/blob/96560df1406f41676f293becefb423895a755faf/diff/.gitconfig#L31
317
+ [`pel`]: https://github.com/ryan-williams/parallel-helpers/blob/e7ee109c4085c04036840ea78999cff73fcf9502/.parallel-rc#L6-L17
318
+ [`ddcr`]: https://github.com/ryan-williams/aws-helpers/blob/8a314f1e6b336833c772459de6b739f5c06a51a3/.dvc-rc#L84
319
+ [`guc`]: https://github.com/ryan-williams/zip-helpers/blob/c67d84fb06c0ab3609dacb68d900344d3b8e8f04/.zip-rc#L16
320
+ [`h1`]: https://github.com/ryan-williams/head-tail-helpers/blob/9715690f47ceeff6b6948b2093901f2b0830114b/.head-tail-rc#L3
321
+ [`spc`]: https://github.com/ryan-williams/col-helpers/blob/9493d003224249ee240d023f71ab03bdd4174b88/.cols-rc#L8
322
+ [`kq`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L115
323
+ [`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
324
+ [`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
325
+ [`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
@@ -0,0 +1,2 @@
+ from . import cli, path
+ from .path import dvc_cache_dir, dvc_md5, dvc_paths, dvc_path
@@ -1,13 +1,13 @@
- from functools import cache
- from os import environ as env, getcwd
- from os.path import join, relpath
  import shlex
- from typing import Optional, Tuple
+ from os import environ as env
+ from typing import Tuple

- from click import option, argument, group
  import click
- import yaml
- from utz import diff_cmds, process, err, singleton
+ from click import option, argument, group
+ from utz import process, err
+ from qmdx import join_pipelines
+
+ from dvc_utils.path import dvc_paths, dvc_path as dvc_cache_path


  @group()
@@ -15,57 +15,6 @@ def cli():
      pass


- def dvc_paths(path: str) -> Tuple[str, str]:
-     if path.endswith('.dvc'):
-         dvc_path = path
-         path = dvc_path[:-len('.dvc')]
-     else:
-         dvc_path = f'{path}.dvc'
-     return path, dvc_path
-
-
- @cache
- def get_git_root() -> str:
-     return process.line('git', 'rev-parse', '--show-toplevel', log=False)
-
-
- @cache
- def get_dir_path() -> str:
-     return relpath(getcwd(), get_git_root())
-
-
- @cache
- def dvc_cache_dir(log: bool = False) -> str:
-     dvc_cache_relpath = env.get('DVC_UTILS_CACHE_DIR')
-     if dvc_cache_relpath:
-         return join(get_git_root(), dvc_cache_relpath)
-     else:
-         return process.line('dvc', 'cache', 'dir', log=log)
-
-
- def dvc_md5(git_ref: str, dvc_path: str, log: bool = False) -> str:
-     dir_path = get_dir_path()
-     dir_path = '' if dir_path == '.' else f'{dir_path}/'
-     dvc_spec = process.output('git', 'show', f'{git_ref}:{dir_path}{dvc_path}', log=err if log else None)
-     dvc_obj = yaml.safe_load(dvc_spec)
-     out = singleton(dvc_obj['outs'], dedupe=False)
-     md5 = out['md5']
-     return md5
-
-
- def dvc_cache_path(ref: str, dvc_path: Optional[str] = None, log: bool = False) -> str:
-     if dvc_path:
-         md5 = dvc_md5(ref, dvc_path, log=log)
-     elif ':' in ref:
-         git_ref, dvc_path = ref.split(':', 1)
-         md5 = dvc_md5(git_ref, dvc_path, log=log)
-     else:
-         md5 = ref
-     dirname = md5[:2]
-     basename = md5[2:]
-     return join(dvc_cache_dir(log=log), 'files', 'md5', dirname, basename)
-
-
  @cli.command('diff', short_help='Diff a DVC-tracked file at two commits (or one commit vs. current worktree), optionally passing both through another command first')
  @option('-c', '--color', is_flag=True, help='Colorize the output')
  @option('-r', '--refspec', default='HEAD', help='<commit 1>..<commit 2> (compare two commits) or <commit> (compare <commit> to the worktree)')
@@ -114,39 +63,32 @@ def dvc_utils_diff(
          raise ValueError(f"Invalid refspec: {refspec}")

      log = err if verbose else False
-     before_path = dvc_cache_path(before, dvc_path, log=log)
-     after_path = path if after is None else dvc_cache_path(after, dvc_path, log=log)
-
+     path1 = dvc_cache_path(before, dvc_path, log=log)
+     path2 = path if after is None else dvc_cache_path(after, dvc_path, log=log)
+
+     diff_args = [
+         *(['-w'] if ignore_whitespace else []),
+         *(['-U', str(unified)] if unified is not None else []),
+         *(['--color=always'] if color else []),
+     ]
      if cmds:
          cmd, *sub_cmds = cmds
+         cmds1 = [ f'{cmd} {path1}', *sub_cmds ]
+         cmds2 = [ f'{cmd} {path2}', *sub_cmds ]
          if not shell:
-             sub_cmds = [ shlex.split(c) for c in sub_cmds ]
-             before_cmds = [
-                 shlex.split(f'{cmd} {before_path}'),
-                 *sub_cmds,
-             ]
-             after_cmds = [
-                 shlex.split(f'{cmd} {after_path}'),
-                 *sub_cmds,
-             ]
-             shell_kwargs = {}
-         else:
-             before_cmds = [ f'{cmd} {before_path}', *sub_cmds ]
-             after_cmds = [ f'{cmd} {after_path}', *sub_cmds ]
-             shell_kwargs = dict(shell=shell)
-
-         diff_cmds(
-             before_cmds,
-             after_cmds,
+             cmds1 = [ shlex.split(cmd) for cmd in cmds1 ]
+             cmds2 = [ shlex.split(cmd) for cmd in cmds2 ]
+
+         join_pipelines(
+             base_cmd=['diff', *diff_args],
+             cmds1=cmds1,
+             cmds2=cmds2,
              verbose=verbose,
-             color=color,
-             unified=unified,
-             ignore_whitespace=ignore_whitespace,
+             shell=shell,
              shell_executable=shell_executable,
-             **shell_kwargs,
          )
      else:
-         process.run('diff', before_path, after_path, log=log)
+         process.run('diff', *diff_args, path1, path2, log=log)


  if __name__ == '__main__':
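The argument handling added in this hunk can be exercised in isolation. The following is a minimal sketch, not the package's API: `build_diff_args` and `build_pipelines` are hypothetical helper names, and the bodies simply restate the list-building logic visible in the `+` lines above, without calling `qmdx.join_pipelines` or running any subprocess.

```python
import shlex
from typing import List, Optional, Sequence, Tuple, Union

# A pipeline stage is either a shell string or an argv list.
Cmd = Union[str, List[str]]


def build_diff_args(
    ignore_whitespace: bool = False,
    unified: Optional[int] = None,
    color: bool = False,
) -> List[str]:
    # Hypothetical helper: same flag-to-argument mapping as `diff_args` in the hunk above.
    return [
        *(['-w'] if ignore_whitespace else []),
        *(['-U', str(unified)] if unified is not None else []),
        *(['--color=always'] if color else []),
    ]


def build_pipelines(
    cmds: Sequence[str],
    path1: str,
    path2: str,
    shell: bool = False,
) -> Tuple[List[Cmd], List[Cmd]]:
    # Hypothetical helper: the first command receives the file path as a trailing
    # argument; later commands are shared by both sides of the diff.
    cmd, *sub_cmds = cmds
    cmds1: List[Cmd] = [f'{cmd} {path1}', *sub_cmds]
    cmds2: List[Cmd] = [f'{cmd} {path2}', *sub_cmds]
    if not shell:
        # Without a shell, each stage is tokenized into an argv list.
        cmds1 = [shlex.split(c) for c in cmds1]
        cmds2 = [shlex.split(c) for c in cmds2]
    return cmds1, cmds2
```

One visible consequence of the change: the plain `diff` branch (no `cmds`) now receives `diff_args` as well, whereas the removed `process.run('diff', before_path, after_path, ...)` call passed no flags.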
@@ -0,0 +1,60 @@
+ from functools import cache
+ from os import environ as env, getcwd
+ from os.path import join, relpath
+ from typing import Optional, Tuple
+
+ import yaml
+ from utz import process, err, singleton
+
+
+ def dvc_paths(path: str) -> Tuple[str, str]:
+     if path.endswith('.dvc'):
+         dvc_path = path
+         path = dvc_path[:-len('.dvc')]
+     else:
+         dvc_path = f'{path}.dvc'
+     return path, dvc_path
+
+
+ @cache
+ def get_git_root() -> str:
+     return process.line('git', 'rev-parse', '--show-toplevel', log=False)
+
+
+ @cache
+ def get_dir_path() -> str:
+     return relpath(getcwd(), get_git_root())
+
+
+ @cache
+ def dvc_cache_dir(log: bool = False) -> str:
+     dvc_cache_relpath = env.get('DVC_UTILS_CACHE_DIR')
+     if dvc_cache_relpath:
+         return join(get_git_root(), dvc_cache_relpath)
+     else:
+         return process.line('dvc', 'cache', 'dir', log=log)
+
+
+ def dvc_md5(git_ref: str, dvc_path: str, log: bool = False) -> str:
+     dir_path = get_dir_path()
+     dir_path = '' if dir_path == '.' else f'{dir_path}/'
+     dvc_spec = process.output('git', 'show', f'{git_ref}:{dir_path}{dvc_path}', log=err if log else None)
+     dvc_obj = yaml.safe_load(dvc_spec)
+     out = singleton(dvc_obj['outs'], dedupe=False)
+     md5 = out['md5']
+     return md5
+
+
+ def dvc_path(ref: str, dvc_path: Optional[str] = None, log: bool = False) -> str:
+     if dvc_path and not dvc_path.endswith('.dvc'):
+         dvc_path += '.dvc'
+     if dvc_path:
+         md5 = dvc_md5(ref, dvc_path, log=log)
+     elif ':' in ref:
+         git_ref, dvc_path = ref.split(':', 1)
+         md5 = dvc_md5(git_ref, dvc_path, log=log)
+     else:
+         md5 = ref
+     dirname = md5[:2]
+     basename = md5[2:]
+     return join(dvc_cache_dir(log=log), 'files', 'md5', dirname, basename)
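The two pure path computations in this new module can be tried without a Git or DVC checkout. The sketch below restates them under hypothetical names (`split_dvc_path`, `cache_path_for_md5`); the cache directory and MD5 values are made-up inputs, and the Git and `dvc cache dir` lookups from the real module are deliberately left out.

```python
from os.path import join
from typing import Tuple


def split_dvc_path(path: str) -> Tuple[str, str]:
    # Hypothetical restatement of `dvc_paths`: return (data path, `.dvc` pointer path),
    # whichever of the two forms was passed in.
    if path.endswith('.dvc'):
        return path[:-len('.dvc')], path
    return path, f'{path}.dvc'


def cache_path_for_md5(cache_dir: str, md5: str) -> str:
    # Hypothetical restatement of the last three lines of `dvc_path`:
    # the first two hex characters form a directory, the remainder the file name.
    return join(cache_dir, 'files', 'md5', md5[:2], md5[2:])
```

In the module itself the cache directory comes from `dvc_cache_dir()`, which prefers the `DVC_UTILS_CACHE_DIR` environment variable (joined to the Git root) and otherwise shells out to `dvc cache dir`.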
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: dvc-utils
- Version: 0.0.8
+ Version: 0.1.0
  Summary: CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
  Home-page: https://github.com/runsascoded/dvc-utils
  Author: Ryan Williams
@@ -10,17 +10,20 @@ Description-Content-Type: text/markdown
10
10
  License-File: LICENSE
11
11
 
12
12
  # dvc-utils
13
- CLI for diffing [DVC] files, optionally passing both through another command first
13
+ Diff [DVC] files, optionally piping through other commands first.
14
+
15
+ [![dvc-utils on PyPI](https://img.shields.io/pypi/v/dvc-utils?label=dvc-utils)][PyPI]
14
16
 
15
17
  <!-- toc -->
16
18
  - [Installation](#installation)
17
19
  - [Usage](#usage)
18
20
  - [`dvc-diff`](#dvc-diff)
19
21
  - [Examples](#examples)
20
- - [Parquet file](#parquet-diff)
22
+ - [Parquet](#parquet-diff)
21
23
  - [Schema diff](#parquet-schema-diff)
22
24
  - [Row diff](#parquet-row-diff)
23
25
  - [Row count diff](#parquet-row-count-diff)
26
+ - [GZipped CSVs](#csv-gz)
24
27
  <!-- /toc -->
25
28
 
26
29
  ## Installation <a id="installation"></a>
@@ -83,7 +86,7 @@ dvc-diff --help
83
86
 
84
87
  ## Examples <a id="examples"></a>
85
88
 
86
- ### Parquet file <a id="parquet-diff"></a>
89
+ ### Parquet <a id="parquet-diff"></a>
87
90
  See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
88
91
 
89
92
  Setup:
@@ -222,11 +225,112 @@ dvc-diff -r $commit^..$commit parquet_row_count $path
222
225
 
223
226
  This time we get no output; [the given `$commit`][commit] didn't change the row count in the DVC-tracked Parquet file [`$path`][commit path].
224
227
 
228
+ ### GZipped CSVs <a id="csv-gz"></a>
229
+
230
+ Here's a "one-liner" I used in [ctbk.dev][ctbk.dev gh], to normalize and compare headers of `.csv.gz.dvc` files between two commits:
231
+
232
+ ```bash
233
+ # Save some `sed` substitution commands to file `seds`:
234
+ cat <<EOF >seds
235
+ s/station_//
236
+ s/latitude/lat/
237
+ s/longitude/lng/
238
+ s/starttime/started_at/
239
+ s/stoptime/ended_at/
240
+ s/usertype/member_casual/
241
+ EOF
242
+ # Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
243
+ r=c0..c1
244
+ # List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
245
+ gdno $r s3/ctbk/csvs/ | \
246
+ pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
247
+ ```
248
+
249
+ <details>
250
+ <summary>
251
+ Explanation of aliases
252
+ </summary>
253
+
254
+ - [`gdno`] (`git diff --name-only`): list files changed in the given commit range and directory
255
+ - [`pel`]: [`parallel`] alias that prepends an `echo {}` to the command
256
+ - [`ddcr`] (`dvc-diff -cr`): colorized `diff` output, revision range `$r`
257
+ - [`guc`] (`gunzip -c`): uncompress the `.csv.gz` files
258
+ - [`h1`] (`head -n1`): only examine each file's header line
259
+ - [`spc`] (`tr , $'\n'`): **sp**lit the header line by **c**ommas (so each column name will be on one line, for easier `diff`ing below)
260
+ - [`kq`] (`tr -d '"'`): **k**ill **q**uote characters (in this case, header-column name quoting changed, but I don't care about that)
261
+ - [`kcr`] (`tr -d '\r'`): **k**ill **c**arriage **r**eturns (line endings also changed)
262
+ - [`snc`] (`sed -f 'snake_case.sed'`): snake-case column names
263
+ - [`sdf`] (`sed -f`): execute the `sed` substitution commands defined in the `seds` file above
264
+ - `sort`: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
265
+
266
+ Note:
267
+ - Most of these are exported Bash functions, allowing them to be used inside the [`parallel`] command.
268
+ - I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the `seds` commands).
269
+ </details>
270
+
271
+ Example output:
272
+ ```diff
273
+
274
+ s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
275
+ s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
276
+ s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
277
+ s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
278
+ 1,2d0
279
+ < bikeid
280
+ < birth_year
281
+ 8d5
282
+ < gender
283
+ 9a7,8
284
+ > ride_id
285
+ > rideable_type
286
+ 15d13
287
+ < tripduration
288
+ s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
289
+ 1,2d0
290
+ < bikeid
291
+ < birth_year
292
+ 8d5
293
+ < gender
294
+ 9a7,8
295
+ > ride_id
296
+ > rideable_type
297
+ 15d13
298
+ < tripduration
299
+ s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
300
+ 1,2d0
301
+ < bikeid
302
+ < birth_year
303
+ 8d5
304
+ < gender
305
+ 9a7,8
306
+ > ride_id
307
+ > rideable_type
308
+ 15d13
309
+ < tripduration
310
+
311
+ ```
312
+
313
+ This helped me see that the data update in question (`c0..c1`) dropped some fields (`bikeid`, `birth_year`, `gender`, `tripduration`) and added others (`ride_id`, `rideable_type`), for `202001` and later.
314
+
225
315
  [DVC]: https://dvc.org/
316
+ [PyPI]: https://pypi.org/project/dvc-utils/
226
317
  [`parquet2json`]: https://github.com/jupiter/parquet2json
227
318
  [hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
228
319
  [Parquet]: https://parquet.apache.org/
229
320
  [commit]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7
230
321
  [commit path]: https://github.com/hudcostreets/nj-crashes/commit/c8ae28e64f4917895d84074913f48e0a7afbc3d7#diff-7f812dce61e0996354f4af414203e0933ccdfe9613cb406c40c1c41a14b9769c
231
322
  [hudcostreets/nj-crashes]: https://github.com/hudcostreets/nj-crashes
323
+ [ctbk.dev gh]: https://github.com/neighbor-ryan/ctbk.dev
232
324
  [`jq`]: https://jqlang.github.io/jq/
325
+ [`parallel`]: https://www.gnu.org/software/parallel/
326
+
327
+ [`gdno`]: https://github.com/ryan-williams/git-helpers/blob/96560df1406f41676f293becefb423895a755faf/diff/.gitconfig#L31
328
+ [`pel`]: https://github.com/ryan-williams/parallel-helpers/blob/e7ee109c4085c04036840ea78999cff73fcf9502/.parallel-rc#L6-L17
329
+ [`ddcr`]: https://github.com/ryan-williams/aws-helpers/blob/8a314f1e6b336833c772459de6b739f5c06a51a3/.dvc-rc#L84
330
+ [`guc`]: https://github.com/ryan-williams/zip-helpers/blob/c67d84fb06c0ab3609dacb68d900344d3b8e8f04/.zip-rc#L16
331
+ [`h1`]: https://github.com/ryan-williams/head-tail-helpers/blob/9715690f47ceeff6b6948b2093901f2b0830114b/.head-tail-rc#L3
332
+ [`spc`]: https://github.com/ryan-williams/col-helpers/blob/9493d003224249ee240d023f71ab03bdd4174b88/.cols-rc#L8
333
+ [`kq`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L115
334
+ [`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
335
+ [`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
336
+ [`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
@@ -2,9 +2,11 @@ LICENSE
  README.md
  setup.py
  dvc_utils/__init__.py
- dvc_utils/main.py
+ dvc_utils/cli.py
+ dvc_utils/path.py
  dvc_utils.egg-info/PKG-INFO
  dvc_utils.egg-info/SOURCES.txt
  dvc_utils.egg-info/dependency_links.txt
  dvc_utils.egg-info/entry_points.txt
+ dvc_utils.egg-info/requires.txt
  dvc_utils.egg-info/top_level.txt
@@ -0,0 +1,3 @@
+ [console_scripts]
+ dvc-diff = dvc_utils.cli:dvc_utils_diff
+ dvc-utils = dvc_utils.cli:cli
@@ -0,0 +1,4 @@
+ click
+ pyyaml
+ qmdx
+ utz>=0.11.3
@@ -2,15 +2,16 @@ from setuptools import setup

  setup(
      name='dvc-utils',
-     version="0.0.8",
+     version="0.1.0",
      description="CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first",
      long_description=open("README.md").read(),
      long_description_content_type="text/markdown",
      packages=['dvc_utils'],
+     install_requires=open("requirements.txt").read(),
      entry_points={
          'console_scripts': [
-             'dvc-utils = dvc_utils.main:cli',
-             'dvc-diff = dvc_utils.main:dvc_utils_diff',
+             'dvc-utils = dvc_utils.cli:cli',
+             'dvc-diff = dvc_utils.cli:dvc_utils_diff',
          ],
      },
      license="MIT",
File without changes
@@ -1,3 +0,0 @@
- [console_scripts]
- dvc-diff = dvc_utils.main:dvc_utils_diff
- dvc-utils = dvc_utils.main:cli
File without changes
File without changes