git2xml 0.1.0__tar.gz

git2xml-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Amit Ben-Ari
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
git2xml-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,349 @@
1
+ Metadata-Version: 2.4
2
+ Name: git2xml
3
+ Version: 0.1.0
4
+ Summary: Generate contextual XML briefs for Git Commits and PRs.
5
+ Author-email: Amit Ben-Ari <amit@hivetrail.com>
6
+ License-Expression: MIT
7
+ Project-URL: Homepage, https://github.com/a-benari/git2xml
8
+ Project-URL: Repository, https://github.com/a-benari/git2xml
9
+ Project-URL: Issues, https://github.com/a-benari/git2xml/issues
10
+ Project-URL: Changelog, https://github.com/a-benari/git2xml/blob/main/CHANGELOG.md
11
+ Keywords: git,xml,llm,context,diff,pull-request,cli
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Environment :: Console
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Operating System :: OS Independent
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.9
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Programming Language :: Python :: 3.13
22
+ Classifier: Topic :: Software Development :: Version Control :: Git
23
+ Requires-Python: >=3.9
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Provides-Extra: dev
27
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
28
+ Requires-Dist: pyright>=1.1.350; extra == "dev"
29
+ Requires-Dist: ruff>=0.6; extra == "dev"
30
+ Dynamic: license-file
31
+
32
+ # git2xml
33
+
34
+ A zero-dependency CLI that generates structured XML briefs of your Git commits and pull requests - ready to paste into Claude, ChatGPT, or any LLM that benefits from clean context.
35
+
36
+ ## Why this exists
37
+
38
+ LLMs work better with structured context than with raw blobs of text. When you ask Claude to write a PR description, pasting `git diff` output is workable but lossy - it strips staging information, mixes binary and text files into a mess, and doesn't separate file contents from their diffs. `git2xml` solves this by formatting your git state into XML that LLMs can parse cleanly, with explicit file paths, statuses, diffs, and content sections.
39
+
40
+ The result: better-quality output from your AI assistant with less prompt engineering on your end.
41
+
42
+ ## Features
43
+
44
+ - **AI-ready output**: Produces XML structured specifically for LLM consumption, with explicit file paths, change statuses, diffs, and content sections that models parse reliably.
45
+ - **One command per use case**: `git2xml commit` for current changes, `git2xml pr` for branch-vs-base - no flag-juggling for common workflows.
46
+ - **Zero dependencies**: Built entirely on the Python standard library. No supply-chain surface beyond Python itself.
47
+ - **Robust binary detection**: Automatically excludes binary files using BOM detection and statistical character analysis to prevent XML corruption.
48
+ - **Smart XML escaping**: Safely wraps code containing CDATA terminators using dynamic Markdown fencing.
49
+ - **Staging-aware**: Differentiates between staged, unstaged, and untracked files for accurate commit briefs.
50
+ - **Context-budget controls**: Per-file content and diff size caps (`--max-size`, `--max-diff-size`) keep oversized files and runaway diffs out of your prompt while still recording that the change happened.
51
+ - **Usable as a library**: A small typed Python API returns the brief as a string (sync or async) for use inside scripts, agents, and LLM pipelines - not just the CLI.
52
+
53
+ ## Requirements
54
+
55
+ Python 3.9 or higher. No other dependencies.
56
+
57
+ ## Installation
58
+
59
+ ```bash
60
+ pip install git2xml
61
+ ```
62
+
63
+ ## Usage
64
+
65
+ Run from inside any local Git repository, or target one with `--repo PATH`.
66
+
67
+ ### Generate a commit brief
68
+
69
+ Summarizes your currently modified files (or staged files if using the `--staged` flag) against `HEAD`.
70
+
71
+ ```bash
72
+ git2xml commit
73
+ ```
74
+
75
+ Outputs to `commit_brief.xml` by default, written to the directory you ran the command from.
76
+
77
+ ### Generate a pull request brief
78
+
79
+ Summarizes all changes on your current branch against a base branch (defaults to `main`).
80
+
81
+ ```bash
82
+ git2xml pr --base main --output my_pr_summary.xml
83
+ ```
84
+
85
+ ### Content-control flags
86
+
87
+ Five optional flags let you shape what ends up in the brief:
88
+
89
+ | Flag | Description |
90
+ | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
91
+ | `--no-untracked` | Exclude untracked (new, un-`git add`ed) files from a commit brief. No-op when `--staged` is set (staged mode already excludes them) or in PR mode (no untracked files exist there). |
92
+ | `--max-size N` | Override the per-file **content** size threshold (in bytes), above which file _content_ is omitted and replaced with a reason string. Does not apply to diffs - that is `--max-diff-size` (see "Size limits: content vs. diffs"). The file's `<diff>` is still emitted, so the change stays visible. Defaults to 5 MiB (`5242880`). Must be a positive integer; `--max-size 0` or a negative value exits with an error. |
93
+ | `--max-diff-size N` | Override the per-file **diff** size threshold (in bytes, UTF-8). A diff larger than this is dropped from the output - its `<diff>` slot renders `status="omitted"` with a reason while the `<content>` stays. Unlike `--max-size`, this is output-shaping, not a pre-fetch guard (a diff has no size git can report before computing it). Defaults to 1 MiB (`1048576`). Must be `>= 0`; `--max-diff-size 0` disables the cap (diffs are always included in full). |
94
+ | `--no-content` | Produce a **diff-only** brief - all `<content>` bodies are suppressed and every file is represented by its `<diff>`. For newly added and untracked files (which have no prior version to diff against), the diff _is_ the full file content shown as added (`+`) lines - so a diff-only brief still captures new files completely. |
95
+ | `--strict-xml` | Generate strict XML 1.0 output - escape control characters and split CDATA terminators. Without this flag (the default), git2xml prioritizes exact file fidelity and falls back to Markdown fencing when a CDATA terminator is present. See the **XML Compliance vs. File Fidelity** section below for more details. |
96
+
97
+ These flags compose freely with each other and with `--staged`:
98
+
99
+ ```bash
100
+ git2xml commit --no-untracked # omit untracked files
101
+ git2xml commit --max-size 102400 # cap content at 100 KiB
102
+ git2xml commit --max-diff-size 262144 # drop any single diff over 256 KiB
103
+ git2xml commit --no-content # diffs only, no file bodies
104
+ git2xml commit --no-untracked --no-content # combine: drop untracked, diffs only
105
+ ```
106
+
107
+ > **Note - new files under `--no-content`:** Normally a brand-new file's change is
108
+ > carried by its `<content>`. Because `--no-content` suppresses content, git2xml
109
+ > instead emits the file's **add-diff** (every line shown as an added `+` line), so
110
+ > the file's contents are still present in the brief - just rendered as a diff rather
111
+ > than a content block. Untracked files (not yet `git add`ed) are diffed against an
112
+ > empty file to produce the same result. This applies only under `--no-content`; in
113
+ > the default mode new files render as normal `<content>`.
114
+
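To make the add-diff idea concrete, here is a minimal sketch that diffs an empty input against a new file's lines using Python's `difflib`. It only illustrates the shape of the result: git2xml gets its diffs from git itself, and the helper name `add_diff` is hypothetical.

```python
import difflib


def add_diff(path: str, text: str) -> str:
    """Render `text` as a unified diff against an empty file (hypothetical helper)."""
    new_lines = text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        [],                    # the "old" side is empty: the file did not exist before
        new_lines,             # the "new" side is the whole file
        fromfile="/dev/null",
        tofile=f"b/{path}",
    )
    return "".join(diff)


if __name__ == "__main__":
    print(add_diff("src/hello.py", "print('hi')\nprint('bye')\n"))
```

Every body line of the result starts with `+`, which is why a diff-only brief still carries a new file's full contents.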
115
+ ### Size limits: content vs. diffs
116
+
117
+ git2xml caps two things independently - file **content** (`--max-size`) and a
118
+ single file's **diff** (`--max-diff-size`). Same unit (bytes), different mechanics.
119
+
120
+ `--max-size` caps **file content**. Content size is read from git's metadata
121
+ (`ls-tree` / `cat-file`) or the filesystem _before_ the file is loaded, so an
122
+ oversized file is detected and skipped without ever being read into memory - the
123
+ guard prevents the work. The file's `<file>` element and `<diff>` are still
124
+ emitted, so the change stays visible.
125
+
126
+ `--max-diff-size` caps a single file's **diff**. Unlike content, a diff has no
127
+ size git can report in advance - it exists only once git computes it - so the cap
128
+ can't prevent the work the way `--max-size` does. Instead the diff is streamed and
129
+ abandoned once it crosses the limit (git2xml stops reading rather than buffering
130
+ the whole thing), then dropped from the output: its `<diff>` slot renders
131
+ `status="omitted"` with a reason while the `<content>` stays. This keeps a runaway
132
+ diff - a big generated or vendored file, or a large _deleted_ file whose only
133
+ payload is its diff - out of your context budget. Defaults to 1 MiB; pass
134
+ `--max-diff-size 0` to disable it and always include diffs in full.
135
+
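The "stream and abandon" behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not git2xml's actual code: `read_capped` and its `chunk_size` parameter are hypothetical, and an in-memory stream stands in for git's output.

```python
import io
from typing import BinaryIO, Optional


def read_capped(stream: BinaryIO, limit: int, chunk_size: int = 65536) -> Optional[bytes]:
    """Read `stream` fully, or return None as soon as it exceeds `limit` bytes.

    A limit of 0 disables the cap, mirroring `--max-diff-size 0`.
    """
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks)  # reached the end within the limit
        total += len(chunk)
        if limit and total > limit:
            return None  # abandon: stop reading instead of buffering the rest
        chunks.append(chunk)


if __name__ == "__main__":
    small = read_capped(io.BytesIO(b"@@ -1 +1 @@\n-a\n+b\n"), limit=1024)
    huge = read_capped(io.BytesIO(b"+" * 10_000), limit=1024, chunk_size=256)
    print(small is not None, huge is None)
```

A `None` result would correspond to the `<diff status="omitted" ...>` slot, while the file's `<content>` is handled separately.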
136
+ ### Execution options
137
+
138
+ | Flag | Description |
139
+ | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
140
+ | `--git-timeout N` | Per-git-command timeout in seconds. Raise it for very large repos where a single diff/log can take a while. Default: 30. |
141
+ | `--diff-semaphore-limit N` | Max number of diffs fetched concurrently. Default: 20. Lower it to reduce load; raise it for more parallelism on fast disks. |
142
+ | `--verbose`/`-v` | Verbose logging. Logs per-file and per-commit progress, as well as debug log messages. |
143
+ | `--hide-repo-path` | Emit only the repository's directory name in the root `<{commit,pr}_brief repo="...">` attribute instead of its absolute local path. Use when sharing briefs externally. Individual file `path` attributes are always repo-relative and unaffected. Default: off (the absolute path is emitted). |
144
+
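For readers unfamiliar with the pattern behind `--diff-semaphore-limit`, the sketch below shows one common way to bound concurrency with `asyncio.Semaphore`. It assumes nothing about git2xml's internals: `fetch_all` and `fetch_one` are hypothetical names, and `fetch_one` stands in for whatever coroutine fetches a single diff.

```python
import asyncio


async def fetch_all(paths, fetch_one, limit=20):
    """Run `fetch_one(path)` for every path with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await fetch_one(path)

    # gather() preserves input order in its result list.
    return await asyncio.gather(*(bounded(p) for p in paths))


if __name__ == "__main__":
    async def fake_diff(path):
        await asyncio.sleep(0.01)  # placeholder for a real `git diff` call
        return f"diff for {path}"

    print(asyncio.run(fetch_all(["a.py", "b.py", "c.py"], fake_diff, limit=2)))
```

Lowering the limit trades throughput for fewer simultaneous git processes; raising it does the opposite.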
145
+ ## Output location
146
+
147
+ By default, the brief is written to the directory you ran the command from, using the name from
148
+ `--output` (or the `{command}_brief.xml` default). A relative `--output` is resolved
149
+ against your current directory; an absolute path is honored as given. Note this is
150
+ independent of `--repo`: pointing `--repo` at another repository still writes the
151
+ brief to where you invoked the command, not into that repository.
152
+
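The resolution rule above matches how `pathlib` joins paths, so it can be sketched in a few lines. The helper `resolve_output` is hypothetical and uses `PurePosixPath` only so the example behaves the same on every platform; note that the repository path plays no part in it.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_output(cwd: str, command: str, output: Optional[str] = None) -> PurePosixPath:
    """Return where the brief would land for a given working directory.

    `command` is "commit" or "pr"; `output` mirrors the `--output` flag.
    """
    name = output or f"{command}_brief.xml"
    # Joining with an absolute path discards `cwd`, so absolute outputs are honored as given.
    return PurePosixPath(cwd) / name


if __name__ == "__main__":
    print(resolve_output("/home/dev/work", "commit"))
    print(resolve_output("/home/dev/work", "pr", "/tmp/pr.xml"))
```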
153
+ ## Use as a Python library
154
+
155
+ Beyond the CLI, `git2xml` exposes a small programmatic API that **returns the brief
156
+ as a string** (nothing is written to disk), so you can feed it straight into an LLM
157
+ call, an agent pipeline, or any tool that assembles context.
158
+
159
+ ```python
160
+ import git2xml
161
+ from git2xml import Git2xmlConfig
162
+
163
+ # Synchronous - for plain scripts
164
+ xml = git2xml.generate_commit_brief_sync(Git2xmlConfig(repo="/path/to/repo"))
165
+
166
+ # A PR brief against a base branch
167
+ xml = git2xml.generate_pr_brief_sync(Git2xmlConfig(repo=".", base="develop"))
168
+ ```
169
+
170
+ The engine is `asyncio`-based, so async callers (agents, web handlers) can await the
171
+ native coroutines directly instead of blocking their event loop:
172
+
173
+ ```python
174
+ import asyncio
175
+ import git2xml
176
+ from git2xml import Git2xmlConfig
177
+
178
+ async def main():
179
+ xml = await git2xml.generate_commit_brief(Git2xmlConfig(repo=".", staged=True))
180
+ # ... hand `xml` to your model / agent ...
181
+
182
+ asyncio.run(main())
183
+ ```
184
+
185
+ > **Windows note:** the async functions spawn `git` via asyncio subprocesses,
186
+ > which on Windows require the `ProactorEventLoop`. `asyncio.run(...)` (above)
187
+ > selects it for you, so the normal case needs no action. Only if you supply
188
+ > your own event loop on Windows must it be a `ProactorEventLoop` - the
189
+ > `SelectorEventLoop` cannot create subprocesses and the call will fail. The
190
+ > sync wrappers and the CLI are unaffected.
191
+
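If you do manage the event loop yourself, a minimal sketch of picking a subprocess-capable loop might look like this. The helper name `run_on_subprocess_capable_loop` is hypothetical; `asyncio.ProactorEventLoop` exists only on Windows, which is why it is referenced behind a platform check.

```python
import asyncio
import sys


def run_on_subprocess_capable_loop(coro):
    """Run `coro` on an explicitly created loop that can spawn subprocesses."""
    if sys.platform == "win32":
        loop = asyncio.ProactorEventLoop()  # required for subprocesses on Windows
    else:
        loop = asyncio.new_event_loop()     # the default loop type is fine elsewhere
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


if __name__ == "__main__":
    async def demo():
        await asyncio.sleep(0)
        return "done"

    print(run_on_subprocess_capable_loop(demo()))
```

In practice you would pass one of the async functions, for example `git2xml.generate_commit_brief(Git2xmlConfig(repo="."))`, in place of `demo()`.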
192
+ All options live on the typed `Git2xmlConfig` object - the same settings the CLI
193
+ flags map to (`repo`, `base`, `staged`, `strict_xml`, `no_untracked`, `max_size`,
194
+ `max_diff_size`, `no_content`, `git_timeout`, `diff_semaphore_limit`, `hide_repo_path`). The **function name selects the
195
+ mode**, so you never set `command` yourself:
196
+
197
+ ```python
198
+ config = Git2xmlConfig(repo=".", base="main", strict_xml=True, max_size=100_000)
199
+ xml = git2xml.generate_pr_brief_sync(config)
200
+ ```
201
+
202
+ ### API reference
203
+
204
+ | Function | Sync/Async | Returns |
205
+ | ------------------------------------ | ---------- | ---------- |
206
+ | `generate_commit_brief(config)` | async | XML string |
207
+ | `generate_pr_brief(config)` | async | XML string |
208
+ | `generate_commit_brief_sync(config)` | sync | XML string |
209
+ | `generate_pr_brief_sync(config)` | sync | XML string |
210
+
211
+ - Each returns the brief as a string, or an **empty string `""`** when there is
212
+ nothing to summarize (a clean working tree, or no commits between the branch and
213
+ its base).
214
+ - Failures raise `git2xml.Git2xmlError`, or a more specific subclass:
215
+ `NotAGitRepositoryError`, `GitNotInstalledError`, `GitCommandError`.
216
+ - The `*_sync` helpers cannot be called from inside a running event loop (e.g. a
217
+ Jupyter cell or an async handler) and raise a clear `RuntimeError` if they are;
218
+ use the async variants there instead.
219
+
220
+ ## Example output
221
+
222
+ A commit brief for a new file, a modified file, a symlink, and a binary file looks like this. Content and
223
+ diffs are wrapped in `CDATA` so source is embedded verbatim; the repository name is
224
+ emitted as a `<name>` element, and added files carry their full contents as `<content>`
225
+ (no diff is needed since the content is the whole change):
226
+
227
+ ```xml
228
+ <commit_brief repo="/Users/dev/myapp">
229
+ <name>myapp</name>
230
+ <file path="src/tests/test_auth.py" status="added">
231
+ <content format="cdata"><![CDATA[# New file contents
232
+ ]]></content>
233
+ </file>
234
+ <file path="src/auth.py" status="modified">
235
+ <content format="cdata"><![CDATA[def verify_token(token):
236
+ return token in VALID_TOKENS and not is_expired(token)
237
+ ]]></content>
238
+ <diff format="cdata"><![CDATA[@@ -1,2 +1,2 @@
239
+ def verify_token(token):
240
+ - return token in VALID_TOKENS
241
+ + return token in VALID_TOKENS and not is_expired(token)]]></diff>
242
+ </file>
243
+ <file path="src/config_loader.py" status="added">
244
+ <content format="cdata"><![CDATA[Symlink pointing to: ../shared/config_loader.py]]></content>
245
+ </file>
246
+ <file path="assets/logo.png" status="modified" reason="omitted - binary file detected" />
247
+ </commit_brief>
248
+ ```
249
+
250
+ > The `repo` attribute shows the absolute path by default; run with
251
+ > `--hide-repo-path` to emit just the directory name (`repo="myapp"`) when
252
+ > sharing the brief externally.
253
+
254
+ A file whose content is omitted by `--max-size` still carries its `<diff>` and an
255
+ explanatory `reason`:
256
+
257
+ ```xml
258
+ <file path="data/big.csv" status="modified" reason="omitted - file exceeds 5242880 bytes">
259
+ <diff format="cdata"><![CDATA[@@ ... @@]]></diff>
260
+ </file>
261
+ ```
262
+
263
+ A file whose diff is dropped by `--max-diff-size` keeps its `<content>` and marks
264
+ the omission on the diff slot:
265
+
266
+ ```xml
267
+ <file path="vendor/bundle.js" status="modified">
268
+ <content format="cdata"><![CDATA[/* ... file contents ... */]]></content>
269
+ <diff status="omitted" reason="diff exceeds 1048576 bytes" />
270
+ </file>
271
+ ```
272
+
273
+ PR mode wraps the same `<file>` elements and additionally emits a `<commit_log>` of the
274
+ branch's commits.
275
+
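If you want to post-process a brief in code, the element layout shown above can be walked with the standard library. The sketch below is an illustration only: the sample is a trimmed version of the examples in this section, `summarize` is a hypothetical helper, and real-world briefs are safest to parse when generated with `--strict-xml`.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the examples in this section.
SAMPLE = """<commit_brief repo="myapp">
  <name>myapp</name>
  <file path="src/auth.py" status="modified">
    <content format="cdata"><![CDATA[def verify_token(token): ...
]]></content>
    <diff format="cdata"><![CDATA[@@ -1,2 +1,2 @@]]></diff>
  </file>
  <file path="vendor/bundle.js" status="modified">
    <content format="cdata"><![CDATA[/* ... */]]></content>
    <diff status="omitted" reason="diff exceeds 1048576 bytes" />
  </file>
  <file path="assets/logo.png" status="modified" reason="omitted - binary file detected" />
</commit_brief>"""


def summarize(xml_text: str):
    """Return (path, status, has_content, has_diff) for each <file> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for file_el in root.iter("file"):
        diff = file_el.find("diff")
        # A <diff> slot marked status="omitted" carries no diff body.
        has_diff = diff is not None and diff.get("status") != "omitted"
        has_content = file_el.find("content") is not None
        rows.append((file_el.get("path"), file_el.get("status"), has_content, has_diff))
    return rows


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```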
276
+ ## XML Compliance vs. File Fidelity
277
+
278
+ By default, `git2xml` prioritizes **exact file fidelity** over strict XML 1.0 compliance. AI models (like Claude) read raw token streams and do not use strict XML parsers.
279
+
280
+ - **Control Characters:** Literal control bytes (e.g., `0x00–0x08`, `0x0B`, `0x0C`, `0x0E–0x1F`) in your source code are passed through exactly as they appear in `<content>` and `<diff>` bodies. This also applies to control bytes inside _attribute_ values (a file path or commit author): in default mode they pass through unescaped, so a control byte there (such as a stray newline in a path) can break attribute well-formedness. `--strict-xml` escapes control characters in attributes too.
281
+ - **CDATA Terminators:** If a file contains the literal string `]]>`, `git2xml` avoids splitting the tag (which alters the raw text the LLM sees) and instead falls back to dynamic Markdown fencing (`format="fenced"`).
282
+ - **Invalid UTF-8:** Text is decoded as UTF-8 on a best-effort basis. Bytes that aren't valid UTF-8 are replaced with the Unicode replacement character (U+FFFD, `�`) rather than causing an error. Files git detects as binary are omitted entirely, so this affects only text files containing occasional malformed bytes.
283
+
284
+ If you are piping this output into a strict XML parser (such as Python's `xml.etree`, for example inside a CI/CD pipeline) rather than an LLM, use the `--strict-xml` flag. It forces strict XML 1.0 compliance by replacing control characters with their string representations (e.g., `\x1b`) and safely splitting CDATA terminators (`]]]]><![CDATA[>`).
285
+
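The two strategies for CDATA terminators can be sketched as below. `cdata_strict` applies the split shown above; `fence_for` is only a guess at what "dynamic" fencing could mean (choosing a backtick fence longer than any backtick run in the text), so treat it as an assumption rather than git2xml's actual rule. Both helper names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def cdata_strict(text: str) -> str:
    """Wrap `text` in CDATA, splitting any `]]>` so the section never ends early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def fence_for(text: str) -> str:
    """Pick a backtick fence longer than any backtick run in `text` (minimum 3)."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    return "`" * max(3, longest + 1)


if __name__ == "__main__":
    source = "if a[b[0]]> 1: pass"
    element = ET.fromstring("<content>" + cdata_strict(source) + "</content>")
    # A strict parser rejoins the split sections into the original text.
    print(element.text == source)
    print(len(fence_for("no backticks here")))
```

The split keeps the document parseable at the cost of altering the raw bytes a model sees, which is the trade-off the default mode avoids.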
286
+ ## Symlinks and file content
287
+
288
+ `git2xml` mirrors git's own behavior for symbolic links: it emits the link's
289
+ **target path** as the content (e.g. `Symlink pointing to: ../shared/config.py`),
290
+ never the contents of the file the link points to. It does not follow or
291
+ dereference symlinks, so a link pointing outside the repository is recorded as
292
+ a path, not read.
293
+
294
+ More broadly, `git2xml` includes file contents and diffs **verbatim, exactly as
295
+ git sees them**. It does not scan, filter, or redact content for secrets or
296
+ sensitive data - that is deliberately out of scope for a zero-dependency git
297
+ formatter. Review generated briefs before pasting them into any external tool,
298
+ and use `.gitignore` (git2xml respects it for untracked files) or `--no-untracked`
299
+ to keep files out of the output.
300
+
301
+ One path is included by default: the root element's `repo` attribute carries the
302
+ **absolute local path** of the repository (e.g. `repo="/Users/dev/myapp"`), so a
303
+ brief records where on your machine it was generated. The per-file `path`
304
+ attributes are always repo-relative and never absolute. If you are pasting briefs
305
+ into a third-party tool and would rather not disclose your local path (which can
306
+ reveal a username or directory layout), pass `--hide-repo-path` to emit only the
307
+ repository's directory name (`repo="myapp"`) instead. The repository name is also
308
+ always available separately in the `<name>` element.
309
+
310
+ ## Security: run against repositories you trust
311
+
312
+ `git2xml` works by invoking your local `git` to read a repository's diffs, blobs,
313
+ and status. It therefore inherits git's normal behavior of running
314
+ repository-defined programs during otherwise read-only operations - for example a
315
+ `textconv` or external-diff driver referenced from `.gitattributes` and defined in
316
+ the repository's `.git/config`, or an `fsmonitor` command. A repository you don't
317
+ control - especially one delivered as an archive that ships its own `.git`
318
+ directory, rather than a fresh clone - can therefore cause code to run on your
319
+ machine when you point `git2xml` at it. This is a property of git itself, not
320
+ specific to `git2xml`.
321
+
322
+ One caveat to the "read-only" framing: in `--staged` mode, `git2xml` runs
323
+ `git write-tree` to read staged-file metadata in a single batch. This writes a
324
+ tree object into `.git/objects`, but it leaves your index, working tree, and
325
+ `HEAD` untouched, and the unreferenced object is reclaimed by git's normal `gc`.
326
+
327
+ The practical guidance is the same as for running any git command: **only run
328
+ `git2xml` against repositories you trust.** To inspect an untrusted repository, do
329
+ it in a throwaway sandbox (a container or VM) rather than on your primary machine.
330
+
331
+ ## Why XML (not JSON)?
332
+
333
+ XML was chosen because LLMs - Claude in particular - parse structured XML tags more reliably than nested JSON when the content includes code with embedded quotes, brackets, and special characters. CDATA sections let you embed raw source code verbatim without escaping, which matters when you're feeding diffs and file contents into a prompt.
334
+
335
+ If you have a strong reason to want JSON output, open an issue - it's a reasonable addition.
336
+
337
+ ## Origin
338
+
339
+ I built `git2xml` while working on [HiveTrail Mesh](https://hivetrail.com/mesh) - a desktop app that assembles structured LLM context from multiple sources (Notion, GitHub Issues, local files, git repos). The git-handling component turned out to be useful as a standalone CLI, so I extracted it and released it under the MIT license.
340
+
341
+ If you find this tool helpful and want the same approach applied across your full developer context - not just git - check out HiveTrail Mesh.
342
+
343
+ ## Contributing
344
+
345
+ Issues and PRs welcome. This is a small utility, so expect light maintenance - but reasonable bug reports and improvements will be reviewed and merged.
346
+
347
+ ## License
348
+
349
+ MIT - see [LICENSE](./LICENSE) for details.