dvc-utils 0.1.0__tar.gz → 0.2.0__tar.gz

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: dvc-utils
- Version: 0.1.0
+ Version: 0.2.0
  Summary: CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
  Home-page: https://github.com/runsascoded/dvc-utils
  Author: Ryan Williams
@@ -66,25 +66,146 @@ dvc-diff --help
  # optional) at HEAD (last committed value) vs. the current worktree content.
  #
  # Options:
- # -c, --color Colorize the output
- # -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits) or
- # <commit> (compare <commit> to the worktree)
- # -s, --shell-executable TEXT Shell to use for executing commands; defaults
- # to $SHELL (/bin/bash)
- # -S, --no-shell Don't pass `shell=True` to Python
- # `subprocess`es
- # -U, --unified INTEGER Number of lines of context to show (passes
- # through to `diff`)
- # -v, --verbose Log intermediate commands to stderr
- # -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
- # `diff`)
- # -x, --exec-cmd TEXT Command(s) to execute before diffing; alternate
- # syntax to passing commands as positional
- # arguments
- # --help Show this message and exit.
+ # -c, --color / -C, --no-color Force or prevent colorized output
+ # -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits)
+ # or <commit> (compare <commit> to the worktree)
+ # -R, --ref TEXT Shorthand for `-r <ref>^..<ref>`, i.e. inspect
+ # a specific commit (vs. its parent)
+ # -s, --shell-executable TEXT Shell to use for executing commands; defaults
+ # to $SHELL
+ # -S, --no-shell Don't pass `shell=True` to Python
+ # `subprocess`es
+ # -U, --unified INTEGER Number of lines of context to show (passes
+ # through to `diff`)
+ # -v, --verbose Log intermediate commands to stderr
+ # -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
+ # `diff`)
+ # -x, --exec-cmd TEXT Command(s) to execute before diffing;
+ # alternate syntax to passing commands as
+ # positional arguments
+ # --help Show this message and exit.
  ```

  ## Examples <a id="examples"></a>
+ These examples are verified with [`mdcmd`] and `$BMDF_WORKDIR=test/data`
+
+ ([`test/data`] is a clone of [ryan-williams/dvc-helpers@test], which contains simple DVC-tracked files used for testing [`git-diff-dvc.sh`])
+
+ [`8ec2060`] added a DVC-tracked text file, `test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R 8ec2060 test.txt` -->
+ ```bash
+ dvc-diff -R 8ec2060 test.txt
+ # 0a1,10
+ # > 1
+ # > 2
+ # > 3
+ # > 4
+ # > 5
+ # > 6
+ # > 7
+ # > 8
+ # > 9
+ # > 10
+ ```
+
+ [`0455b50`] appended some lines to `test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R 0455b50 test.txt` -->
+ ```bash
+ dvc-diff -R 0455b50 test.txt
+ # 10a11,15
+ # > 11
+ # > 12
+ # > 13
+ # > 14
+ # > 15
+ ```
+
+ [`f92c1d2`] added `test.parquet`:
+
+ <!-- `bmdf -- dvc-diff -R f92c1d2 pqa test.parquet` -->
+ ```bash
+ dvc-diff -R f92c1d2 pqa test.parquet
+ # 0a1,27
+ # > MD5: 4379600b26647a50dfcd0daa824e8219
+ # > 1635 bytes
+ # > 5 rows
+ # > message schema {
+ # > OPTIONAL INT64 num;
+ # > OPTIONAL BYTE_ARRAY str (STRING);
+ # > }
+ # > {
+ # > "num": 111,
+ # > "str": "aaa"
+ # > }
+ # > {
+ # > "num": 222,
+ # > "str": "bbb"
+ # > }
+ # > {
+ # > "num": 333,
+ # > "str": "ccc"
+ # > }
+ # > {
+ # > "num": 444,
+ # > "str": "ddd"
+ # > }
+ # > {
+ # > "num": 555,
+ # > "str": "eee"
+ # > }
+ ```
+
+ [`f29e52a`] updated `test.parquet`:
+
+ <!-- `bmdf -- dvc-diff -R f29e52a pqa test.parquet` -->
+ ```bash
+ dvc-diff -R f29e52a pqa test.parquet
+ # 1,3c1,3
+ # < MD5: 4379600b26647a50dfcd0daa824e8219
+ # < 1635 bytes
+ # < 5 rows
+ # ---
+ # > MD5: be082c87786f3364ca9efec061a3cc21
+ # > 1622 bytes
+ # > 8 rows
+ # 5c5
+ # < OPTIONAL INT64 num;
+ # ---
+ # > OPTIONAL INT32 num;
+ # 26a27,38
+ # > }
+ # > {
+ # > "num": 666,
+ # > "str": "fff"
+ # > }
+ # > {
+ # > "num": 777,
+ # > "str": "ggg"
+ # > }
+ # > {
+ # > "num": 888,
+ # > "str": "hhh"
+ ```
+
+ [`3257258`] added a DVC-tracked directory `data/`, including `test.{txt,parquet}`), and removed the top-level `test.{txt,parquet}`.
+
+ <!-- `bmdf -- dvc-diff -R 3257258 data` -->
+ ```bash
+ dvc-diff -R 3257258 data
+ # test.parquet: None -> c07bba3fae2b64207aa92f422506e4a2
+ # test.txt: None -> e20b902b49a98b1a05ed62804c757f94
+ ```
+
+ [`ae8638a`] changed values in `data/test.parquet`, and added rows to `data/test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R ae8638a data` -->
+ ```bash
+ dvc-diff -R ae8638a data
+ # test.parquet: c07bba3fae2b64207aa92f422506e4a2 -> f46dd86f608b1dc00993056c9fc55e6e
+ # test.txt: e20b902b49a98b1a05ed62804c757f94 -> 9306ec0709cc72558045559ada26573b
+ ```

  ### Parquet <a id="parquet-diff"></a>
  See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
@@ -334,3 +455,15 @@ This helped me see that the data update in question (`c0..c1`) dropped some fiel
  [`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
  [`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
  [`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
+
+ [`mdcmd`]: https://github.com/runsascoded/bash-markdown-fence?tab=readme-ov-file#bmdf
+ [`test/data`]: test/data
+ [ryan-williams/dvc-helpers@test]: https://github.com/ryan-williams/dvc-helpers/tree/test
+ [`git-diff-dvc.sh`]: https://github.com/ryan-williams/dvc-helpers/blob/main/git-diff-dvc.sh
+
+ [`8ec2060`]: https://github.com/ryan-williams/dvc-helpers/commit/8ec2060
+ [`0455b50`]: https://github.com/ryan-williams/dvc-helpers/commit/0455b50
+ [`f92c1d2`]: https://github.com/ryan-williams/dvc-helpers/commit/f92c1d2
+ [`f29e52a`]: https://github.com/ryan-williams/dvc-helpers/commit/f29e52a
+ [`3257258`]: https://github.com/ryan-williams/dvc-helpers/commit/3257258
+ [`ae8638a`]: https://github.com/ryan-williams/dvc-helpers/commit/ae8638a
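The two directory examples in the README (`dvc-diff -R 3257258 data` and `dvc-diff -R ae8638a data`) print one line per entry whose MD5 differs between the two sides, with `None` standing in for a missing entry. A minimal sketch of that comparison is below, assuming each side has already been reduced to a `{relpath: md5}` mapping; the function name `diff_dir_listings` is hypothetical and is not part of the package.

```python
from typing import Dict, List


def diff_dir_listings(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return one "<relpath>: <old md5> -> <new md5>" line per changed entry.

    Hypothetical helper: entries missing on one side are reported as None.
    """
    lines = []
    for relpath in sorted(set(before) | set(after)):
        md5_1 = before.get(relpath)
        md5_2 = after.get(relpath)
        if md5_1 != md5_2:
            lines.append(f"{relpath}: {md5_1} -> {md5_2}")
    return lines


# MD5s copied from the README example for commit 3257258 (directory newly added).
added = diff_dir_listings(
    {},
    {
        "test.parquet": "c07bba3fae2b64207aa92f422506e4a2",
        "test.txt": "e20b902b49a98b1a05ed62804c757f94",
    },
)
print("\n".join(added))
```

Unchanged entries produce no output, which is why the README output lists only the files that were added or modified.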
@@ -55,25 +55,146 @@ dvc-diff --help
  # optional) at HEAD (last committed value) vs. the current worktree content.
  #
  # Options:
- # -c, --color Colorize the output
- # -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits) or
- # <commit> (compare <commit> to the worktree)
- # -s, --shell-executable TEXT Shell to use for executing commands; defaults
- # to $SHELL (/bin/bash)
- # -S, --no-shell Don't pass `shell=True` to Python
- # `subprocess`es
- # -U, --unified INTEGER Number of lines of context to show (passes
- # through to `diff`)
- # -v, --verbose Log intermediate commands to stderr
- # -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
- # `diff`)
- # -x, --exec-cmd TEXT Command(s) to execute before diffing; alternate
- # syntax to passing commands as positional
- # arguments
- # --help Show this message and exit.
+ # -c, --color / -C, --no-color Force or prevent colorized output
+ # -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits)
+ # or <commit> (compare <commit> to the worktree)
+ # -R, --ref TEXT Shorthand for `-r <ref>^..<ref>`, i.e. inspect
+ # a specific commit (vs. its parent)
+ # -s, --shell-executable TEXT Shell to use for executing commands; defaults
+ # to $SHELL
+ # -S, --no-shell Don't pass `shell=True` to Python
+ # `subprocess`es
+ # -U, --unified INTEGER Number of lines of context to show (passes
+ # through to `diff`)
+ # -v, --verbose Log intermediate commands to stderr
+ # -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
+ # `diff`)
+ # -x, --exec-cmd TEXT Command(s) to execute before diffing;
+ # alternate syntax to passing commands as
+ # positional arguments
+ # --help Show this message and exit.
  ```

  ## Examples <a id="examples"></a>
+ These examples are verified with [`mdcmd`] and `$BMDF_WORKDIR=test/data`
+
+ ([`test/data`] is a clone of [ryan-williams/dvc-helpers@test], which contains simple DVC-tracked files used for testing [`git-diff-dvc.sh`])
+
+ [`8ec2060`] added a DVC-tracked text file, `test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R 8ec2060 test.txt` -->
+ ```bash
+ dvc-diff -R 8ec2060 test.txt
+ # 0a1,10
+ # > 1
+ # > 2
+ # > 3
+ # > 4
+ # > 5
+ # > 6
+ # > 7
+ # > 8
+ # > 9
+ # > 10
+ ```
+
+ [`0455b50`] appended some lines to `test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R 0455b50 test.txt` -->
+ ```bash
+ dvc-diff -R 0455b50 test.txt
+ # 10a11,15
+ # > 11
+ # > 12
+ # > 13
+ # > 14
+ # > 15
+ ```
+
+ [`f92c1d2`] added `test.parquet`:
+
+ <!-- `bmdf -- dvc-diff -R f92c1d2 pqa test.parquet` -->
+ ```bash
+ dvc-diff -R f92c1d2 pqa test.parquet
+ # 0a1,27
+ # > MD5: 4379600b26647a50dfcd0daa824e8219
+ # > 1635 bytes
+ # > 5 rows
+ # > message schema {
+ # > OPTIONAL INT64 num;
+ # > OPTIONAL BYTE_ARRAY str (STRING);
+ # > }
+ # > {
+ # > "num": 111,
+ # > "str": "aaa"
+ # > }
+ # > {
+ # > "num": 222,
+ # > "str": "bbb"
+ # > }
+ # > {
+ # > "num": 333,
+ # > "str": "ccc"
+ # > }
+ # > {
+ # > "num": 444,
+ # > "str": "ddd"
+ # > }
+ # > {
+ # > "num": 555,
+ # > "str": "eee"
+ # > }
+ ```
+
+ [`f29e52a`] updated `test.parquet`:
+
+ <!-- `bmdf -- dvc-diff -R f29e52a pqa test.parquet` -->
+ ```bash
+ dvc-diff -R f29e52a pqa test.parquet
+ # 1,3c1,3
+ # < MD5: 4379600b26647a50dfcd0daa824e8219
+ # < 1635 bytes
+ # < 5 rows
+ # ---
+ # > MD5: be082c87786f3364ca9efec061a3cc21
+ # > 1622 bytes
+ # > 8 rows
+ # 5c5
+ # < OPTIONAL INT64 num;
+ # ---
+ # > OPTIONAL INT32 num;
+ # 26a27,38
+ # > }
+ # > {
+ # > "num": 666,
+ # > "str": "fff"
+ # > }
+ # > {
+ # > "num": 777,
+ # > "str": "ggg"
+ # > }
+ # > {
+ # > "num": 888,
+ # > "str": "hhh"
+ ```
+
+ [`3257258`] added a DVC-tracked directory `data/`, including `test.{txt,parquet}`), and removed the top-level `test.{txt,parquet}`.
+
+ <!-- `bmdf -- dvc-diff -R 3257258 data` -->
+ ```bash
+ dvc-diff -R 3257258 data
+ # test.parquet: None -> c07bba3fae2b64207aa92f422506e4a2
+ # test.txt: None -> e20b902b49a98b1a05ed62804c757f94
+ ```
+
+ [`ae8638a`] changed values in `data/test.parquet`, and added rows to `data/test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R ae8638a data` -->
+ ```bash
+ dvc-diff -R ae8638a data
+ # test.parquet: c07bba3fae2b64207aa92f422506e4a2 -> f46dd86f608b1dc00993056c9fc55e6e
+ # test.txt: e20b902b49a98b1a05ed62804c757f94 -> 9306ec0709cc72558045559ada26573b
+ ```

  ### Parquet <a id="parquet-diff"></a>
  See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
@@ -323,3 +444,15 @@ This helped me see that the data update in question (`c0..c1`) dropped some fiel
  [`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
  [`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
  [`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
+
+ [`mdcmd`]: https://github.com/runsascoded/bash-markdown-fence?tab=readme-ov-file#bmdf
+ [`test/data`]: test/data
+ [ryan-williams/dvc-helpers@test]: https://github.com/ryan-williams/dvc-helpers/tree/test
+ [`git-diff-dvc.sh`]: https://github.com/ryan-williams/dvc-helpers/blob/main/git-diff-dvc.sh
+
+ [`8ec2060`]: https://github.com/ryan-williams/dvc-helpers/commit/8ec2060
+ [`0455b50`]: https://github.com/ryan-williams/dvc-helpers/commit/0455b50
+ [`f92c1d2`]: https://github.com/ryan-williams/dvc-helpers/commit/f92c1d2
+ [`f29e52a`]: https://github.com/ryan-williams/dvc-helpers/commit/f29e52a
+ [`3257258`]: https://github.com/ryan-williams/dvc-helpers/commit/3257258
+ [`ae8638a`]: https://github.com/ryan-williams/dvc-helpers/commit/ae8638a
@@ -1,12 +1,13 @@
- from setuptools import setup
+ from setuptools import setup, find_packages

  setup(
      name='dvc-utils',
-     version="0.1.0",
+     version="0.2.0",
      description="CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first",
      long_description=open("README.md").read(),
      long_description_content_type="text/markdown",
-     packages=['dvc_utils'],
+     package_dir={"": "src"},
+     packages=find_packages(where="src"),
      install_requires=open("requirements.txt").read(),
      entry_points={
          'console_scripts': [
@@ -0,0 +1,128 @@
+ import json
+ import shlex
+ from os import listdir
+ from os.path import isdir, join
+ from typing import Tuple
+
+ import click
+ from click import option, argument, group
+ from qmdx import join_pipelines
+ from utz import process, err, hash_file
+
+ from dvc_utils.path import dvc_paths, dvc_cache_path
+
+
+ @group()
+ def cli():
+     pass
+
+
+ @cli.command('diff', short_help='Diff a DVC-tracked file at two commits (or one commit vs. current worktree), optionally passing both through another command first')
+ @option('-c/-C', '--color/--no-color', default=None, help='Force or prevent colorized output')
+ @option('-r', '--refspec', help='<commit 1>..<commit 2> (compare two commits) or <commit> (compare <commit> to the worktree)')
+ @option('-R', '--ref', help='Shorthand for `-r <ref>^..<ref>`, i.e. inspect a specific commit (vs. its parent)')
+ @option('-s', '--shell-executable', help=f'Shell to use for executing commands; defaults to $SHELL')
+ @option('-S', '--no-shell', is_flag=True, help="Don't pass `shell=True` to Python `subprocess`es")
+ @option('-U', '--unified', type=int, help='Number of lines of context to show (passes through to `diff`)')
+ @option('-v', '--verbose', is_flag=True, help="Log intermediate commands to stderr")
+ @option('-w', '--ignore-whitespace', is_flag=True, help="Ignore whitespace differences (pass `-w` to `diff`)")
+ @option('-x', '--exec-cmd', 'exec_cmds', multiple=True, help='Command(s) to execute before diffing; alternate syntax to passing commands as positional arguments')
+ @argument('args', metavar='[exec_cmd...] <path>', nargs=-1)
+ def dvc_utils_diff(
+     color: bool | None,
+     refspec: str | None,
+     ref: str | None,
+     shell_executable: str | None,
+     no_shell: bool,
+     unified: int | None,
+     verbose: bool,
+     ignore_whitespace: bool,
+     exec_cmds: Tuple[str, ...],
+     args: Tuple[str, ...],
+ ):
+     """Diff a file at two commits (or one commit vs. current worktree), optionally passing both through `cmd` first
+
+     Examples:
+
+     dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc # Compare the number of lines (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs. current commit (`HEAD^..HEAD`).
+
+     dvc-utils diff md5sum foo # Diff the `md5sum` of `foo` (".dvc" extension is optional) at HEAD (last committed value) vs. the current worktree content.
+     """
+     if not args:
+         raise click.UsageError('Must specify [cmd...] <path>')
+
+     shell = not no_shell
+     *cmds, path = args
+     cmds = list(exec_cmds) + cmds
+
+     path, dvc_path = dvc_paths(path)
+
+     if refspec and ref:
+         raise ValueError("Specify -r/--refspec xor -R/--ref")
+     if ref:
+         refspec = f'{ref}^..{ref}'
+     elif not refspec:
+         refspec = 'HEAD'
+
+     pcs = refspec.split('..', 1)
+     if len(pcs) == 1:
+         before = pcs[0]
+         after = None
+     elif len(pcs) == 2:
+         before, after = pcs
+     else:
+         raise ValueError(f"Invalid refspec: {refspec}")
+
+     log = err if verbose else False
+     path1 = dvc_cache_path(before, dvc_path, log=log)
+     path2 = (path if after is None else dvc_cache_path(after, dvc_path, log=log))
+
+     if isdir(path):
+         dir_json1 = dir_json2 = {}
+         if path1:
+             with open(path1, 'r') as f:
+                 obj = json.load(f)
+             dir_json1 = { e["relpath"]: e["md5"] for e in obj }
+         if path2:
+             if path2 == path and after is None:
+                 dir_json2 = {}
+                 for file in listdir(path2):
+                     md5 = hash_file(join(path2, file), hash_name='md5')
+                     dir_json2[file] = md5
+             else:
+                 with open(path2, 'r') as f:
+                     dir_json2 = { obj["relpath"]: obj["md5"] for obj in json.load(f) }
+         for relpath in sorted(set(dir_json1) | set(dir_json2)):
+             md5_1 = dir_json1.get(relpath)
+             md5_2 = dir_json2.get(relpath)
+             if md5_1 != md5_2:
+                 print(f'{relpath}: {md5_1} -> {md5_2}')
+     else:
+         diff_args = [
+             *(['-w'] if ignore_whitespace else []),
+             *(['-U', str(unified)] if unified is not None else []),
+             *(['--color=always'] if color is True else ['--color=never'] if color is False else []),
+         ]
+         if cmds:
+             cmd, *sub_cmds = cmds
+             cmds1 = [ 'cat /dev/null' ] if path1 is None else [ f'{cmd} {path1 or "/dev/null"}', *sub_cmds ]
+             cmds2 = [ 'cat /dev/null' ] if path2 is None else [ f'{cmd} {path2 or "/dev/null"}', *sub_cmds ]
+             if not shell:
+                 cmds1 = [ shlex.split(cmd) for cmd in cmds1 ]
+                 cmds2 = [ shlex.split(cmd) for cmd in cmds2 ]
+
+             join_pipelines(
+                 base_cmd=['diff', *diff_args],
+                 cmds1=cmds1,
+                 cmds2=cmds2,
+                 verbose=verbose,
+                 shell=shell,
+                 executable=shell_executable,
+             )
+         else:
+             res = process.run('diff', *diff_args, path1 or '/dev/null', path2 or '/dev/null', log=log, check=False)
+             exit(res.returncode)
+
+
+ if __name__ == '__main__':
+     cli()
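The new `-R/--ref` option is resolved into a refspec before the existing `..` split runs. The sketch below isolates that step so the three cases (explicit refspec, `--ref` shorthand, default `HEAD`) are easy to see. The function name `parse_refspec` is hypothetical; the added `cli.py` does this inline inside `dvc_utils_diff`. The sketch also uses `str.partition` where the package uses `split('..', 1)`, which should yield the same pieces.

```python
from typing import Optional, Tuple


def parse_refspec(
    refspec: Optional[str] = None,
    ref: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (before, after); `after` is None when comparing against the worktree.

    Hypothetical helper mirroring the -r/--refspec and -R/--ref handling.
    """
    if refspec and ref:
        raise ValueError("Specify -r/--refspec xor -R/--ref")
    if ref:
        # `-R <ref>` is shorthand for `-r <ref>^..<ref>` (a commit vs. its parent)
        refspec = f"{ref}^..{ref}"
    elif not refspec:
        refspec = "HEAD"
    before, sep, after = refspec.partition("..")
    return before, (after if sep else None)


print(parse_refspec(ref="8ec2060"))
print(parse_refspec(refspec="HEAD^..HEAD"))
print(parse_refspec())
```

Because a split with a maximum of one separator can only produce one or two pieces, the `else: raise ValueError(f"Invalid refspec: ...")` branch in the added file appears to be unreachable.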
@@ -0,0 +1,101 @@
+ from __future__ import annotations
+
+ import json
+ from functools import cache
+ from os import environ as env, getcwd
+ from os.path import join, relpath, dirname, basename, sep
+ from subprocess import DEVNULL
+ from typing import Tuple
+
+ import yaml
+ from utz import process, err, singleton
+
+
+ def dvc_paths(path: str) -> Tuple[str, str]:
+     if path.endswith(sep):
+         path = path[:-len(sep)]
+     if path.endswith('.dvc'):
+         dvc_path = path
+         path = dvc_path[:-len('.dvc')]
+     else:
+         dvc_path = f'{path}.dvc'
+     return path, dvc_path
+
+
+ @cache
+ def get_git_root() -> str:
+     return process.line('git', 'rev-parse', '--show-toplevel', log=False)
+
+
+ @cache
+ def get_dir_path() -> str:
+     return relpath(getcwd(), get_git_root())
+
+
+ @cache
+ def dvc_cache_dir(log: bool = False) -> str:
+     dvc_cache_relpath = env.get('DVC_UTILS_CACHE_DIR')
+     if dvc_cache_relpath:
+         return join(get_git_root(), dvc_cache_relpath)
+     else:
+         return process.line('dvc', 'cache', 'dir', log=log)
+
+
+ def dvc_md5(
+     git_ref: str,
+     dvc_path: str,
+     log: bool = False,
+ ) -> str | None:
+     dir_path = get_dir_path()
+     dir_path = '' if dir_path == '.' else f'{dir_path}{sep}'
+     dvc_path = f"{dir_path}{dvc_path}"
+     dvc_spec = process.output('git', 'show', f'{git_ref}:{dvc_path}', log=err if log else None, err_ok=True, stderr=DEVNULL)
+     if dvc_spec is None:
+         cur_dir = dirname(dvc_path)
+         relpath = basename(dvc_path)
+         if relpath.endswith(".dvc"):
+             relpath = relpath[:-len(".dvc")]
+         while cur_dir and cur_dir != '.':
+             dir_cache_path = dvc_cache_path(ref=git_ref, dvc_path=f"{cur_dir}.dvc", log=log)
+             if dir_cache_path:
+                 with open(dir_cache_path, 'r') as f:
+                     dir_entries = json.load(f)
+                 md5s = [ e["md5"] for e in dir_entries if e["relpath"] == relpath ]
+                 if len(md5s) == 1:
+                     return md5s[0]
+                 else:
+                     raise RuntimeError(f"{relpath=} not found in DVC-tracked dir {cur_dir}")
+             relpath = join(basename(cur_dir), relpath)
+             cur_dir = dirname(cur_dir)
+         return None
+     dvc_obj = yaml.safe_load(dvc_spec)
+     out = singleton(dvc_obj['outs'], dedupe=False)
+     md5 = out['md5']
+     return md5
+
+
+ def dvc_path(
+     ref: str,
+     dvc_path: str | None = None,
+     log: bool = False,
+ ) -> str | None:
+     if dvc_path and not dvc_path.endswith('.dvc'):
+         dvc_path += '.dvc'
+
+     if dvc_path:
+         md5 = dvc_md5(ref, dvc_path, log=log)
+     elif ':' in ref:
+         git_ref, dvc_path = ref.split(':', 1)
+         md5 = dvc_md5(git_ref, dvc_path, log=log)
+     else:
+         md5 = ref
+
+     if md5 is None:
+         return None
+     else:
+         dirname = md5[:2]
+         basename = md5[2:]
+         return join(dvc_cache_dir(log=log), 'files', 'md5', dirname, basename)
+
+
+ dvc_cache_path = dvc_path
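The end of `dvc_path` maps an MD5 to a location under the DVC cache directory: the first two hex characters become a subdirectory and the remainder the file name, beneath `files/md5`. A minimal sketch of that mapping, together with the `.dvc`-suffix normalization done by `dvc_paths`, is below. The names `split_dvc_suffix` and `cache_path_for_md5` and the `.dvc/cache` directory are illustrative assumptions, not part of the package. `posixpath` and a literal `/` are used here (instead of `os.path` and `sep`) so the printed result does not depend on the host OS.

```python
import posixpath
from typing import Tuple


def split_dvc_suffix(path: str) -> Tuple[str, str]:
    """Return (data path, .dvc path); hypothetical stand-in for `dvc_paths`."""
    if path.endswith("/"):
        # A trailing separator (e.g. `data/`) is dropped before adding `.dvc`
        path = path[:-1]
    if path.endswith(".dvc"):
        return path[: -len(".dvc")], path
    return path, f"{path}.dvc"


def cache_path_for_md5(cache_dir: str, md5: str) -> str:
    """Build <cache_dir>/files/md5/<first 2 chars>/<remaining chars> (hypothetical name)."""
    return posixpath.join(cache_dir, "files", "md5", md5[:2], md5[2:])


print(split_dvc_suffix("data/"))
print(cache_path_for_md5(".dvc/cache", "4379600b26647a50dfcd0daa824e8219"))
```

The MD5 in the usage line is the `test.parquet` hash shown in the README examples; the cache directory itself would normally come from `dvc cache dir` or `$DVC_UTILS_CACHE_DIR`, as in `dvc_cache_dir`.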
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: dvc-utils
- Version: 0.1.0
+ Version: 0.2.0
  Summary: CLI for diffing DVC files at two commits (or one commit vs. current worktree), optionally passing both through another command first
  Home-page: https://github.com/runsascoded/dvc-utils
  Author: Ryan Williams
@@ -66,25 +66,146 @@ dvc-diff --help
  # optional) at HEAD (last committed value) vs. the current worktree content.
  #
  # Options:
- # -c, --color Colorize the output
- # -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits) or
- # <commit> (compare <commit> to the worktree)
- # -s, --shell-executable TEXT Shell to use for executing commands; defaults
- # to $SHELL (/bin/bash)
- # -S, --no-shell Don't pass `shell=True` to Python
- # `subprocess`es
- # -U, --unified INTEGER Number of lines of context to show (passes
- # through to `diff`)
- # -v, --verbose Log intermediate commands to stderr
- # -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
- # `diff`)
- # -x, --exec-cmd TEXT Command(s) to execute before diffing; alternate
- # syntax to passing commands as positional
- # arguments
- # --help Show this message and exit.
+ # -c, --color / -C, --no-color Force or prevent colorized output
+ # -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits)
+ # or <commit> (compare <commit> to the worktree)
+ # -R, --ref TEXT Shorthand for `-r <ref>^..<ref>`, i.e. inspect
+ # a specific commit (vs. its parent)
+ # -s, --shell-executable TEXT Shell to use for executing commands; defaults
+ # to $SHELL
+ # -S, --no-shell Don't pass `shell=True` to Python
+ # `subprocess`es
+ # -U, --unified INTEGER Number of lines of context to show (passes
+ # through to `diff`)
+ # -v, --verbose Log intermediate commands to stderr
+ # -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
+ # `diff`)
+ # -x, --exec-cmd TEXT Command(s) to execute before diffing;
+ # alternate syntax to passing commands as
+ # positional arguments
+ # --help Show this message and exit.
  ```

  ## Examples <a id="examples"></a>
+ These examples are verified with [`mdcmd`] and `$BMDF_WORKDIR=test/data`
+
+ ([`test/data`] is a clone of [ryan-williams/dvc-helpers@test], which contains simple DVC-tracked files used for testing [`git-diff-dvc.sh`])
+
+ [`8ec2060`] added a DVC-tracked text file, `test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R 8ec2060 test.txt` -->
+ ```bash
+ dvc-diff -R 8ec2060 test.txt
+ # 0a1,10
+ # > 1
+ # > 2
+ # > 3
+ # > 4
+ # > 5
+ # > 6
+ # > 7
+ # > 8
+ # > 9
+ # > 10
+ ```
+
+ [`0455b50`] appended some lines to `test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R 0455b50 test.txt` -->
+ ```bash
+ dvc-diff -R 0455b50 test.txt
+ # 10a11,15
+ # > 11
+ # > 12
+ # > 13
+ # > 14
+ # > 15
+ ```
+
+ [`f92c1d2`] added `test.parquet`:
+
+ <!-- `bmdf -- dvc-diff -R f92c1d2 pqa test.parquet` -->
+ ```bash
+ dvc-diff -R f92c1d2 pqa test.parquet
+ # 0a1,27
+ # > MD5: 4379600b26647a50dfcd0daa824e8219
+ # > 1635 bytes
+ # > 5 rows
+ # > message schema {
+ # > OPTIONAL INT64 num;
+ # > OPTIONAL BYTE_ARRAY str (STRING);
+ # > }
+ # > {
+ # > "num": 111,
+ # > "str": "aaa"
+ # > }
+ # > {
+ # > "num": 222,
+ # > "str": "bbb"
+ # > }
+ # > {
+ # > "num": 333,
+ # > "str": "ccc"
+ # > }
+ # > {
+ # > "num": 444,
+ # > "str": "ddd"
+ # > }
+ # > {
+ # > "num": 555,
+ # > "str": "eee"
+ # > }
+ ```
+
+ [`f29e52a`] updated `test.parquet`:
+
+ <!-- `bmdf -- dvc-diff -R f29e52a pqa test.parquet` -->
+ ```bash
+ dvc-diff -R f29e52a pqa test.parquet
+ # 1,3c1,3
+ # < MD5: 4379600b26647a50dfcd0daa824e8219
+ # < 1635 bytes
+ # < 5 rows
+ # ---
+ # > MD5: be082c87786f3364ca9efec061a3cc21
+ # > 1622 bytes
+ # > 8 rows
+ # 5c5
+ # < OPTIONAL INT64 num;
+ # ---
+ # > OPTIONAL INT32 num;
+ # 26a27,38
+ # > }
+ # > {
+ # > "num": 666,
+ # > "str": "fff"
+ # > }
+ # > {
+ # > "num": 777,
+ # > "str": "ggg"
+ # > }
+ # > {
+ # > "num": 888,
+ # > "str": "hhh"
+ ```
+
+ [`3257258`] added a DVC-tracked directory `data/`, including `test.{txt,parquet}`), and removed the top-level `test.{txt,parquet}`.
+
+ <!-- `bmdf -- dvc-diff -R 3257258 data` -->
+ ```bash
+ dvc-diff -R 3257258 data
+ # test.parquet: None -> c07bba3fae2b64207aa92f422506e4a2
+ # test.txt: None -> e20b902b49a98b1a05ed62804c757f94
+ ```
+
+ [`ae8638a`] changed values in `data/test.parquet`, and added rows to `data/test.txt`:
+
+ <!-- `bmdf -- dvc-diff -R ae8638a data` -->
+ ```bash
+ dvc-diff -R ae8638a data
+ # test.parquet: c07bba3fae2b64207aa92f422506e4a2 -> f46dd86f608b1dc00993056c9fc55e6e
+ # test.txt: e20b902b49a98b1a05ed62804c757f94 -> 9306ec0709cc72558045559ada26573b
+ ```

  ### Parquet <a id="parquet-diff"></a>
  See sample commands and output below for inspecting changes to [a DVC-tracked Parquet file][commit path] in [a given commit][commit].
@@ -334,3 +455,15 @@ This helped me see that the data update in question (`c0..c1`) dropped some fiel
  [`kcr`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L118
  [`snc`]: https://github.com/ryan-williams/case-helpers/blob/c40a62a9656f0d52d68fb3a108ae6bb3eed3c7bd/.case-rc#L9
  [`sdf`]: https://github.com/ryan-williams/arg-helpers/blob/a8c60809f8878fa38b3c03614778fcf29132538e/.arg-rc#L138
+
+ [`mdcmd`]: https://github.com/runsascoded/bash-markdown-fence?tab=readme-ov-file#bmdf
+ [`test/data`]: test/data
+ [ryan-williams/dvc-helpers@test]: https://github.com/ryan-williams/dvc-helpers/tree/test
+ [`git-diff-dvc.sh`]: https://github.com/ryan-williams/dvc-helpers/blob/main/git-diff-dvc.sh
+
+ [`8ec2060`]: https://github.com/ryan-williams/dvc-helpers/commit/8ec2060
+ [`0455b50`]: https://github.com/ryan-williams/dvc-helpers/commit/0455b50
+ [`f92c1d2`]: https://github.com/ryan-williams/dvc-helpers/commit/f92c1d2
+ [`f29e52a`]: https://github.com/ryan-williams/dvc-helpers/commit/f29e52a
+ [`3257258`]: https://github.com/ryan-williams/dvc-helpers/commit/3257258
+ [`ae8638a`]: https://github.com/ryan-williams/dvc-helpers/commit/ae8638a
@@ -0,0 +1,12 @@
+ LICENSE
+ README.md
+ setup.py
+ src/dvc_utils/__init__.py
+ src/dvc_utils/cli.py
+ src/dvc_utils/path.py
+ src/dvc_utils.egg-info/PKG-INFO
+ src/dvc_utils.egg-info/SOURCES.txt
+ src/dvc_utils.egg-info/dependency_links.txt
+ src/dvc_utils.egg-info/entry_points.txt
+ src/dvc_utils.egg-info/requires.txt
+ src/dvc_utils.egg-info/top_level.txt
@@ -0,0 +1,4 @@
+ click
+ pyyaml
+ qmdx>=0.0.5
+ utz>=0.13.0
@@ -1,95 +0,0 @@
- import shlex
- from os import environ as env
- from typing import Tuple
-
- import click
- from click import option, argument, group
- from utz import process, err
- from qmdx import join_pipelines
-
- from dvc_utils.path import dvc_paths, dvc_path as dvc_cache_path
-
-
- @group()
- def cli():
-     pass
-
-
- @cli.command('diff', short_help='Diff a DVC-tracked file at two commits (or one commit vs. current worktree), optionally passing both through another command first')
- @option('-c', '--color', is_flag=True, help='Colorize the output')
- @option('-r', '--refspec', default='HEAD', help='<commit 1>..<commit 2> (compare two commits) or <commit> (compare <commit> to the worktree)')
- @option('-s', '--shell-executable', help=f'Shell to use for executing commands; defaults to $SHELL ({env.get("SHELL")})')
- @option('-S', '--no-shell', is_flag=True, help="Don't pass `shell=True` to Python `subprocess`es")
- @option('-U', '--unified', type=int, help='Number of lines of context to show (passes through to `diff`)')
- @option('-v', '--verbose', is_flag=True, help="Log intermediate commands to stderr")
- @option('-w', '--ignore-whitespace', is_flag=True, help="Ignore whitespace differences (pass `-w` to `diff`)")
- @option('-x', '--exec-cmd', 'exec_cmds', multiple=True, help='Command(s) to execute before diffing; alternate syntax to passing commands as positional arguments')
- @argument('args', metavar='[exec_cmd...] <path>', nargs=-1)
- def dvc_utils_diff(
-     color: bool,
-     refspec: str | None,
-     shell_executable: str | None,
-     no_shell: bool,
-     unified: int | None,
-     verbose: bool,
-     ignore_whitespace: bool,
-     exec_cmds: Tuple[str, ...],
-     args: Tuple[str, ...],
- ):
-     """Diff a file at two commits (or one commit vs. current worktree), optionally passing both through `cmd` first
-
-     Examples:
-
-     dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc # Compare the number of lines (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs. current commit (`HEAD^..HEAD`).
-
-     dvc-utils diff md5sum foo # Diff the `md5sum` of `foo` (".dvc" extension is optional) at HEAD (last committed value) vs. the current worktree content.
-     """
-     if not args:
-         raise click.UsageError('Must specify [cmd...] <path>')
-
-     shell = not no_shell
-     *cmds, path = args
-     cmds = list(exec_cmds) + cmds
-
-     path, dvc_path = dvc_paths(path)
-
-     pcs = refspec.split('..', 1)
-     if len(pcs) == 1:
-         before = pcs[0]
-         after = None
-     elif len(pcs) == 2:
-         before, after = pcs
-     else:
-         raise ValueError(f"Invalid refspec: {refspec}")
-
-     log = err if verbose else False
-     path1 = dvc_cache_path(before, dvc_path, log=log)
-     path2 = path if after is None else dvc_cache_path(after, dvc_path, log=log)
-
-     diff_args = [
-         *(['-w'] if ignore_whitespace else []),
-         *(['-U', str(unified)] if unified is not None else []),
-         *(['--color=always'] if color else []),
-     ]
-     if cmds:
-         cmd, *sub_cmds = cmds
-         cmds1 = [ f'{cmd} {path1}', *sub_cmds ]
-         cmds2 = [ f'{cmd} {path2}', *sub_cmds ]
-         if not shell:
-             cmds1 = [ shlex.split(cmd) for cmd in cmds1 ]
-             cmds2 = [ shlex.split(cmd) for cmd in cmds2 ]
-
-         join_pipelines(
-             base_cmd=['diff', *diff_args],
-             cmds1=cmds1,
-             cmds2=cmds2,
-             verbose=verbose,
-             shell=shell,
-             shell_executable=shell_executable,
-         )
-     else:
-         process.run('diff', *diff_args, path1, path2, log=log)
-
-
- if __name__ == '__main__':
-     cli()
@@ -1,60 +0,0 @@
- from functools import cache
- from os import environ as env, getcwd
- from os.path import join, relpath
- from typing import Optional, Tuple
-
- import yaml
- from utz import process, err, singleton
-
-
- def dvc_paths(path: str) -> Tuple[str, str]:
-     if path.endswith('.dvc'):
-         dvc_path = path
-         path = dvc_path[:-len('.dvc')]
-     else:
-         dvc_path = f'{path}.dvc'
-     return path, dvc_path
-
-
- @cache
- def get_git_root() -> str:
-     return process.line('git', 'rev-parse', '--show-toplevel', log=False)
-
-
- @cache
- def get_dir_path() -> str:
-     return relpath(getcwd(), get_git_root())
-
-
- @cache
- def dvc_cache_dir(log: bool = False) -> str:
-     dvc_cache_relpath = env.get('DVC_UTILS_CACHE_DIR')
-     if dvc_cache_relpath:
-         return join(get_git_root(), dvc_cache_relpath)
-     else:
-         return process.line('dvc', 'cache', 'dir', log=log)
-
-
- def dvc_md5(git_ref: str, dvc_path: str, log: bool = False) -> str:
-     dir_path = get_dir_path()
-     dir_path = '' if dir_path == '.' else f'{dir_path}/'
-     dvc_spec = process.output('git', 'show', f'{git_ref}:{dir_path}{dvc_path}', log=err if log else None)
-     dvc_obj = yaml.safe_load(dvc_spec)
-     out = singleton(dvc_obj['outs'], dedupe=False)
-     md5 = out['md5']
-     return md5
-
-
- def dvc_path(ref: str, dvc_path: Optional[str] = None, log: bool = False) -> str:
-     if dvc_path and not dvc_path.endswith('.dvc'):
-         dvc_path += '.dvc'
-     if dvc_path:
-         md5 = dvc_md5(ref, dvc_path, log=log)
-     elif ':' in ref:
-         git_ref, dvc_path = ref.split(':', 1)
-         md5 = dvc_md5(git_ref, dvc_path, log=log)
-     else:
-         md5 = ref
-     dirname = md5[:2]
-     basename = md5[2:]
-     return join(dvc_cache_dir(log=log), 'files', 'md5', dirname, basename)
@@ -1,12 +0,0 @@
- LICENSE
- README.md
- setup.py
- dvc_utils/__init__.py
- dvc_utils/cli.py
- dvc_utils/path.py
- dvc_utils.egg-info/PKG-INFO
- dvc_utils.egg-info/SOURCES.txt
- dvc_utils.egg-info/dependency_links.txt
- dvc_utils.egg-info/entry_points.txt
- dvc_utils.egg-info/requires.txt
- dvc_utils.egg-info/top_level.txt
@@ -1,4 +0,0 @@
- click
- pyyaml
- qmdx
- utz>=0.11.3
File without changes
File without changes