xmlcst 0.1.0__tar.gz
- xmlcst-0.1.0/.github/workflows/ci.yml +24 -0
- xmlcst-0.1.0/.github/workflows/publish.yml +37 -0
- xmlcst-0.1.0/.gitignore +6 -0
- xmlcst-0.1.0/DEV.md +171 -0
- xmlcst-0.1.0/LICENSE +21 -0
- xmlcst-0.1.0/PKG-INFO +217 -0
- xmlcst-0.1.0/README.md +193 -0
- xmlcst-0.1.0/SPEC.md +718 -0
- xmlcst-0.1.0/dev.sh +58 -0
- xmlcst-0.1.0/pyproject.toml +42 -0
- xmlcst-0.1.0/samples/bump_pom_version/bump_pom_version.py +54 -0
- xmlcst-0.1.0/samples/bump_pom_version/pom.xml +36 -0
- xmlcst-0.1.0/src/xmlcst/__init__.py +42 -0
- xmlcst-0.1.0/src/xmlcst/_api.py +47 -0
- xmlcst-0.1.0/src/xmlcst/_builder.py +319 -0
- xmlcst-0.1.0/src/xmlcst/_entities.py +41 -0
- xmlcst-0.1.0/src/xmlcst/_errors.py +19 -0
- xmlcst-0.1.0/src/xmlcst/_nodes.py +811 -0
- xmlcst-0.1.0/src/xmlcst/_tokenizer.py +315 -0
- xmlcst-0.1.0/src/xmlcst/_tokens.py +47 -0
- xmlcst-0.1.0/src/xmlcst/py.typed +0 -0
- xmlcst-0.1.0/tests/__init__.py +0 -0
- xmlcst-0.1.0/tests/test_encoding.py +44 -0
- xmlcst-0.1.0/tests/test_mutation.py +158 -0
- xmlcst-0.1.0/tests/test_roundtrip.py +116 -0
- xmlcst-0.1.0/tests/test_tokenizer.py +134 -0
- xmlcst-0.1.0/tests/test_tree.py +169 -0
|
@@ -0,0 +1,24 @@
|
|
|
1
|
+
name: CI
|
|
2
|
+
|
|
3
|
+
on:
|
|
4
|
+
push:
|
|
5
|
+
branches: [main]
|
|
6
|
+
pull_request:
|
|
7
|
+
branches: [main]
|
|
8
|
+
|
|
9
|
+
jobs:
|
|
10
|
+
check:
|
|
11
|
+
runs-on: ubuntu-latest
|
|
12
|
+
strategy:
|
|
13
|
+
matrix:
|
|
14
|
+
python-version: ["3.12", "3.13"]
|
|
15
|
+
steps:
|
|
16
|
+
- uses: actions/checkout@v4
|
|
17
|
+
- uses: actions/setup-python@v5
|
|
18
|
+
with:
|
|
19
|
+
python-version: ${{ matrix.python-version }}
|
|
20
|
+
- run: pip install -e ".[dev]"
|
|
21
|
+
- run: ruff format --check src/ tests/ samples/
|
|
22
|
+
- run: ruff check src/ tests/ samples/
|
|
23
|
+
- run: basedpyright
|
|
24
|
+
- run: pytest tests/ -v
|
|
@@ -0,0 +1,37 @@
|
|
|
1
|
+
name: Publish
|
|
2
|
+
|
|
3
|
+
on:
|
|
4
|
+
push:
|
|
5
|
+
tags: ["v*"]
|
|
6
|
+
|
|
7
|
+
jobs:
|
|
8
|
+
check:
|
|
9
|
+
runs-on: ubuntu-latest
|
|
10
|
+
strategy:
|
|
11
|
+
matrix:
|
|
12
|
+
python-version: ["3.12", "3.13"]
|
|
13
|
+
steps:
|
|
14
|
+
- uses: actions/checkout@v4
|
|
15
|
+
- uses: actions/setup-python@v5
|
|
16
|
+
with:
|
|
17
|
+
python-version: ${{ matrix.python-version }}
|
|
18
|
+
- run: pip install -e ".[dev]"
|
|
19
|
+
- run: ruff format --check src/ tests/ samples/
|
|
20
|
+
- run: ruff check src/ tests/ samples/
|
|
21
|
+
- run: basedpyright
|
|
22
|
+
- run: pytest tests/ -v
|
|
23
|
+
|
|
24
|
+
publish:
|
|
25
|
+
needs: check
|
|
26
|
+
runs-on: ubuntu-latest
|
|
27
|
+
permissions:
|
|
28
|
+
id-token: write
|
|
29
|
+
environment: pypi
|
|
30
|
+
steps:
|
|
31
|
+
- uses: actions/checkout@v4
|
|
32
|
+
- uses: actions/setup-python@v5
|
|
33
|
+
with:
|
|
34
|
+
python-version: "3.13"
|
|
35
|
+
- run: pip install build
|
|
36
|
+
- run: python -m build
|
|
37
|
+
- uses: pypa/gh-action-pypi-publish@release/v1
|
xmlcst-0.1.0/.gitignore
ADDED
xmlcst-0.1.0/DEV.md
ADDED
|
@@ -0,0 +1,171 @@
|
|
|
1
|
+
# Developer Guide
|
|
2
|
+
|
|
3
|
+
This guide is for contributors who want to work on `xmlcst` itself.
|
|
4
|
+
|
|
5
|
+
## Prerequisites
|
|
6
|
+
|
|
7
|
+
- Python 3.12+
|
|
8
|
+
- Git
|
|
9
|
+
|
|
10
|
+
No other system dependencies are required. The project is pure Python.
|
|
11
|
+
|
|
12
|
+
## Setup
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
git clone <repo-url>
|
|
16
|
+
cd xmlcst
|
|
17
|
+
./dev.sh test
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
On first run, `dev.sh` automatically creates a virtual environment at `.venv/`, installs the package in editable mode with the dev dependencies (`pytest`, `ruff`, `basedpyright`), and runs the test suite.
|
|
21
|
+
|
|
22
|
+
## Development Commands
|
|
23
|
+
|
|
24
|
+
All development tasks go through `dev.sh`:
|
|
25
|
+
|
|
26
|
+
| Command | Description |
|
|
27
|
+
|---|---|
|
|
28
|
+
| `./dev.sh check` | Run all checks: format, lint, typecheck, test (fails on first error) |
|
|
29
|
+
| `./dev.sh test` | Run the test suite with pytest (verbose output) |
|
|
30
|
+
| `./dev.sh lint` | Run ruff check on `src/` and `tests/` |
|
|
31
|
+
| `./dev.sh format` | Run ruff format on `src/` and `tests/` |
|
|
32
|
+
| `./dev.sh typecheck` | Run basedpyright on `src/` and `tests/` |
|
|
33
|
+
| `./dev.sh clean` | Remove `__pycache__/`, `.pytest_cache/`, `dist/`, `*.egg-info` |
|
|
34
|
+
|
|
35
|
+
Extra arguments are forwarded. For example, to run a single test:
|
|
36
|
+
|
|
37
|
+
```bash
|
|
38
|
+
./dev.sh test -k test_roundtrip_simple_element
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
## Project Structure
|
|
42
|
+
|
|
43
|
+
```
|
|
44
|
+
pyproject.toml Build config (hatchling), dev dependencies, pytest settings
|
|
45
|
+
dev.sh Development task runner
|
|
46
|
+
SPEC.md Full specification (source of truth for behaviour)
|
|
47
|
+
README.md Consumer-facing documentation
|
|
48
|
+
src/xmlcst/
|
|
49
|
+
__init__.py Public API exports
|
|
50
|
+
py.typed PEP 561 marker (ships inline types)
|
|
51
|
+
_api.py parse(), parse_bytes(), parse_file()
|
|
52
|
+
_tokenizer.py Lexical tokenizer: str -> list[Token]
|
|
53
|
+
_tokens.py Token dataclass and TokenType enum (22 types)
|
|
54
|
+
_builder.py Tree builder: list[Token] -> Document
|
|
55
|
+
_nodes.py All node types (Document, Element, Attribute, etc.)
|
|
56
|
+
_entities.py Entity/character reference encoding and decoding
|
|
57
|
+
_errors.py ParseError exception
|
|
58
|
+
tests/
|
|
59
|
+
test_roundtrip.py Core invariant: parse(x).to_string() == x
|
|
60
|
+
test_tokenizer.py Token stream correctness
|
|
61
|
+
test_tree.py Tree navigation and property access
|
|
62
|
+
test_mutation.py Editing, dirty-tracking, formatting ownership
|
|
63
|
+
test_encoding.py BOM handling, bytes vs string input
|
|
64
|
+
samples/
|
|
65
|
+
bump_pom_version/ Maven POM version bumper (see README.md)
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
## Architecture
|
|
69
|
+
|
|
70
|
+
The implementation follows the dual-layer model described in [SPEC.md](SPEC.md) sections 6-8.
|
|
71
|
+
|
|
72
|
+
### Layer 1: Tokenization
|
|
73
|
+
|
|
74
|
+
`_tokenizer.py` converts a source string into a flat list of `Token` objects. Each token stores its type, raw text, and byte offset. The fundamental invariant:
|
|
75
|
+
|
|
76
|
+
```python
|
|
77
|
+
assert source == "".join(t.text for t in tokenize(source))
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
Every byte of the input belongs to exactly one token. The 22 token types are defined in `_tokens.py`.
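
The coverage invariant can be illustrated with a deliberately tiny tokenizer. This is only a sketch under stated assumptions: the four token kinds, the `Token` fields, and the regular expression are hypothetical and far coarser than the 22 types in `_tokens.py`, and offsets here count characters of the `str` input rather than encoded bytes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str    # hypothetical coarse category, not xmlcst's TokenType
    text: str    # raw source text, never normalized
    offset: int  # index of the token's first character in the source


# Alternatives are tried in order. TEXT matches any run without "<", and the
# lone "<" fallback ensures that every position in the input is matched.
_PATTERN = re.compile(
    r"(?P<COMMENT><!--.*?-->)"
    r"|(?P<TAG><[^<>]*>)"
    r"|(?P<TEXT>[^<]+)"
    r"|(?P<STRAY><)",
    re.DOTALL,
)


def tokenize(source: str) -> list[Token]:
    """Split source into tokens that together cover it exactly once."""
    return [
        Token(m.lastgroup or "UNKNOWN", m.group(), m.start())
        for m in _PATTERN.finditer(source)
    ]


source = '<root  id = "1">\n  <!-- a < b -->\n  <a/>text\n</root>'
tokens = tokenize(source)
assert source == "".join(t.text for t in tokens)
```

Because no branch rewrites or drops characters, the join reproduces the input regardless of how odd the formatting is.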
|
|
81
|
+
|
|
82
|
+
### Layer 2: Tree Building
|
|
83
|
+
|
|
84
|
+
`_builder.py` walks the token list and constructs a tree of `Node` objects (defined in `_nodes.py`). Each node stores:
|
|
85
|
+
|
|
86
|
+
- A **token span** `(start, end)` -- indices into the token list identifying the tokens that produced this node
|
|
87
|
+
- A **dirty flag** -- initially `False`, set to `True` when any property is modified
|
|
88
|
+
|
|
89
|
+
### Serialization
|
|
90
|
+
|
|
91
|
+
When serializing, each node checks whether it can use its token span:
|
|
92
|
+
|
|
93
|
+
- **Clean nodes** (unmodified): join their original tokens -- guaranteed byte-identical to source
|
|
94
|
+
- **Dirty nodes** (modified): rebuild from current properties and formatting metadata
|
|
95
|
+
- **New nodes** (created programmatically): rebuild with inferred or default formatting
|
|
96
|
+
|
|
97
|
+
This is the mechanism that produces minimal diffs on edit.
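
The clean/dirty/new decision above can be sketched as follows. Everything here is an assumption made for illustration: `AttrNode`, `serialize`, and the list of token strings are hypothetical, and the rebuild path uses fixed default formatting, whereas the real implementation is described as consulting formatting metadata.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AttrNode:
    """Hypothetical stand-in for a tree node; not xmlcst's real class."""

    name: str
    value: str
    span: tuple[int, int] | None  # token indices [start, end); None if new
    dirty: bool = False

    def set_value(self, value: str) -> None:
        self.value = value
        self.dirty = True  # a modification invalidates the original tokens


def serialize(node: AttrNode, tokens: list[str]) -> str:
    if node.span is not None and not node.dirty:
        start, end = node.span
        return "".join(tokens[start:end])  # clean: replay the source verbatim
    # Dirty or new: rebuild from current properties (default formatting here).
    return f'{node.name}="{node.value}"'


tokens = ["<root", " ", "id", " = ", "'1'", " ", "name", "=", '"foo"', "/>"]
id_attr = AttrNode("id", "1", span=(2, 5))
name_attr = AttrNode("name", "foo", span=(6, 9))

name_attr.set_value("bar")
print(serialize(id_attr, tokens))    # untouched node keeps its odd spacing
print(serialize(name_attr, tokens))  # edited node is rebuilt
```

Only the edited node leaves the replay path, which is why the surrounding text stays byte-identical.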
|
|
98
|
+
|
|
99
|
+
## Conventions
|
|
100
|
+
|
|
101
|
+
### Code Style
|
|
102
|
+
|
|
103
|
+
- Formatted with `ruff format` (default line length: 88)
|
|
104
|
+
- Linted with `ruff check`
|
|
105
|
+
- Type-checked with `basedpyright` (strict mode with select rules disabled in `pyproject.toml`)
|
|
106
|
+
- Run `./dev.sh check` before committing to validate formatting, lint, types, and tests in one step
|
|
107
|
+
|
|
108
|
+
### Module Naming
|
|
109
|
+
|
|
110
|
+
- Public API is exported from `__init__.py`
|
|
111
|
+
- Internal modules use an underscore prefix: `_tokenizer.py`, `_nodes.py`, `_builder.py`, etc.
|
|
112
|
+
- Private methods and attributes use a single underscore prefix
|
|
113
|
+
|
|
114
|
+
### Testing
|
|
115
|
+
|
|
116
|
+
- Tests live in `tests/` and use `pytest`
|
|
117
|
+
- Test files are grouped by feature area
|
|
118
|
+
- The round-trip test (`test_roundtrip.py`) is the most important invariant: any change must not break `parse(x).to_string() == x`
|
|
119
|
+
- Prefer testing through the public API (`xmlcst.parse`, `doc.to_string()`) rather than reaching into internal methods
|
|
120
|
+
- When adding a new feature, add both a round-trip test case and specific behavioural tests
|
|
121
|
+
|
|
122
|
+
## Common Workflows
|
|
123
|
+
|
|
124
|
+
### Adding a New Node Type
|
|
125
|
+
|
|
126
|
+
1. Define the class in `_nodes.py` following existing patterns (`__slots__`, property accessors with dirty-tracking, `_rebuild()` method)
|
|
127
|
+
2. Add any new token types to the `TokenType` enum in `_tokens.py`
|
|
128
|
+
3. Update the tokenizer in `_tokenizer.py` to emit the new tokens
|
|
129
|
+
4. Update the tree builder in `_builder.py` to construct the new node
|
|
130
|
+
5. Export from `__init__.py` and add to `__all__`
|
|
131
|
+
6. Add round-trip and tree-level tests
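
As a rough picture of the class shape that step 1 refers to, here is a minimal sketch. `CommentNode` and its attributes are hypothetical and only show how `__slots__`, a dirty-tracking property, and `_rebuild()` might fit together; the real classes in `_nodes.py` may be organized differently.

```python
from __future__ import annotations


class CommentNode:
    """Hypothetical node following the pattern above; not part of xmlcst."""

    __slots__ = ("_content", "_dirty", "_original")

    def __init__(self, content: str, original: str | None = None) -> None:
        self._content = content
        self._original = original  # source text when parsed, None when new
        self._dirty = False

    @property
    def content(self) -> str:
        return self._content

    @content.setter
    def content(self, value: str) -> None:
        if value != self._content:
            self._content = value
            self._dirty = True  # the setter is the one place that marks dirty

    def _rebuild(self) -> str:
        return f"<!--{self._content}-->"

    def to_string(self) -> str:
        if self._original is not None and not self._dirty:
            return self._original
        return self._rebuild()


node = CommentNode(" hi ", original="<!-- hi -->")
node.content = " bye "
print(node.to_string())
```

Routing every mutation through a property setter keeps the dirty flag from drifting out of sync with the stored value.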
|
|
132
|
+
|
|
133
|
+
### Adding a New Feature
|
|
134
|
+
|
|
135
|
+
1. Check whether the behaviour is already specified in [SPEC.md](SPEC.md)
|
|
136
|
+
2. If not, update the specification first (spec-driven development)
|
|
137
|
+
3. Implement against the spec
|
|
138
|
+
4. Verify that all existing round-trip tests still pass
|
|
139
|
+
|
|
140
|
+
## Build and Packaging
|
|
141
|
+
|
|
142
|
+
- **Build system**: [hatchling](https://hatch.pypa.io/)
|
|
143
|
+
- **Package layout**: `src/xmlcst/` (src layout)
|
|
144
|
+
- **Build**: `python -m build` (produces wheel and sdist in `dist/`)
|
|
145
|
+
- **No compiled extensions** -- the wheel is pure Python and platform-independent
|
|
146
|
+
- **Type annotations** -- the package ships inline types with a `py.typed` marker (PEP 561); type checkers recognize annotations without stubs
|
|
147
|
+
|
|
148
|
+
## CI
|
|
149
|
+
|
|
150
|
+
GitHub Actions runs the full check suite (format, lint, typecheck, test) on every push to `main` and every pull request, across Python 3.12 and 3.13. See `.github/workflows/ci.yml`.
|
|
151
|
+
|
|
152
|
+
## Releasing
|
|
153
|
+
|
|
154
|
+
1. Update the version in `pyproject.toml`
|
|
155
|
+
2. Commit: `git commit -am "Bump version to X.Y.Z"`
|
|
156
|
+
3. Tag: `git tag vX.Y.Z`
|
|
157
|
+
4. Push: `git push && git push --tags`
|
|
158
|
+
|
|
159
|
+
The [`publish`](.github/workflows/publish.yml) workflow runs CI checks, builds the package, and publishes to PyPI automatically via [trusted publishing](https://docs.pypi.org/trusted-publishers/).
|
|
160
|
+
|
|
161
|
+
## Specification
|
|
162
|
+
|
|
163
|
+
[SPEC.md](SPEC.md) is the source of truth for all behavioural decisions:
|
|
164
|
+
|
|
165
|
+
- Fidelity guarantees (what must be preserved)
|
|
166
|
+
- Token model (every token type and its semantics)
|
|
167
|
+
- Tree model (node types, dirty-tracking, token spans)
|
|
168
|
+
- Mutation semantics (formatting ownership, inferred formatting)
|
|
169
|
+
- Serialization modes (exact, normalized)
|
|
170
|
+
- Error handling
|
|
171
|
+
- Future roadmap
|
xmlcst-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Richard Cook
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
xmlcst-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,217 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: xmlcst
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: Full-fidelity XML parser with lossless round-trip editing
|
|
5
|
+
Project-URL: Repository, https://github.com/rcook/xmlcst
|
|
6
|
+
Project-URL: Issues, https://github.com/rcook/xmlcst/issues
|
|
7
|
+
Author: Richard Cook
|
|
8
|
+
License-Expression: MIT
|
|
9
|
+
License-File: LICENSE
|
|
10
|
+
Keywords: cst,editing,parser,round-trip,xml
|
|
11
|
+
Classifier: Development Status :: 3 - Alpha
|
|
12
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
15
|
+
Classifier: Programming Language :: Python :: 3.13
|
|
16
|
+
Classifier: Topic :: Text Processing :: Markup :: XML
|
|
17
|
+
Classifier: Typing :: Typed
|
|
18
|
+
Requires-Python: >=3.12
|
|
19
|
+
Provides-Extra: dev
|
|
20
|
+
Requires-Dist: basedpyright; extra == 'dev'
|
|
21
|
+
Requires-Dist: pytest; extra == 'dev'
|
|
22
|
+
Requires-Dist: ruff; extra == 'dev'
|
|
23
|
+
Description-Content-Type: text/markdown
|
|
24
|
+
|
|
25
|
+
# xmlcst
|
|
26
|
+
|
|
27
|
+
Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.
|
|
28
|
+
|
|
29
|
+
## Why xmlcst?
|
|
30
|
+
|
|
31
|
+
Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
|
|
32
|
+
|
|
33
|
+
`xmlcst` takes a different approach. It treats XML as **source text first** and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.
|
|
34
|
+
|
|
35
|
+
This makes `xmlcst` ideal for programmatic editing of XML configuration files (Maven POMs, `.csproj` files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.
|
|
36
|
+
|
|
37
|
+
### How xmlcst compares
|
|
38
|
+
|
|
39
|
+
| Feature | ElementTree | lxml | minidom | xmlcst |
|
|
40
|
+
|---|---|---|---|---|
|
|
41
|
+
| Attribute order | Partial | Partial | Partial | Preserved |
|
|
42
|
+
| Quote style (`'` vs `"`) | No | No | No | Preserved |
|
|
43
|
+
| Whitespace / indentation | No | No | No | Preserved |
|
|
44
|
+
| Comments | No | Yes | Yes | Preserved |
|
|
45
|
+
| Entity reference form | No | No | No | Preserved |
|
|
46
|
+
| CDATA vs escaped text | No | Yes | Yes | Preserved |
|
|
47
|
+
| Empty-element syntax (`<x/>` vs `<x />`) | No | No | No | Preserved |
|
|
48
|
+
| Byte-identical round-trip | No | No | No | Yes |
|
|
49
|
+
|
|
50
|
+
The closest conceptual analogue is [ruamel.yaml](https://yaml.readthedocs.io/) -- a round-trip-capable YAML library -- applied to XML.
|
|
51
|
+
|
|
52
|
+
## Installation
|
|
53
|
+
|
|
54
|
+
```
|
|
55
|
+
pip install xmlcst
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
Requires Python 3.12+. Pure Python -- no compiled dependencies. Ships [PEP 561](https://peps.python.org/pep-0561/) type annotations for full mypy / pyright support.
|
|
59
|
+
|
|
60
|
+
## Quick Start
|
|
61
|
+
|
|
62
|
+
### Parse and round-trip
|
|
63
|
+
|
|
64
|
+
```python
|
|
65
|
+
import xmlcst
|
|
66
|
+
|
|
67
|
+
source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n <version>1.0</version>\n</project>'
|
|
68
|
+
|
|
69
|
+
doc = xmlcst.parse(source)
|
|
70
|
+
assert doc.to_string() == source # byte-identical round-trip
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
### Edit an attribute (minimal diff)
|
|
74
|
+
|
|
75
|
+
```python
|
|
76
|
+
doc = xmlcst.parse('<root version="1.0" author="alice"/>')
|
|
77
|
+
doc.root.attributes["version"] = "2.0"
|
|
78
|
+
print(doc.to_string())
|
|
79
|
+
# <root version="2.0" author="alice"/>
|
|
80
|
+
# Only the value changed -- quotes, whitespace, other attributes untouched
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
### Navigate the tree
|
|
84
|
+
|
|
85
|
+
```python
|
|
86
|
+
doc = xmlcst.parse("""\
|
|
87
|
+
<project>
|
|
88
|
+
<dependencies>
|
|
89
|
+
<dependency>
|
|
90
|
+
<groupId>junit</groupId>
|
|
91
|
+
</dependency>
|
|
92
|
+
</dependencies>
|
|
93
|
+
</project>""")
|
|
94
|
+
|
|
95
|
+
deps = doc.root.find("dependencies")
|
|
96
|
+
dep = deps.find("dependency")
|
|
97
|
+
group = dep.find("groupId")
|
|
98
|
+
print(group.children[0].content) # "junit" (a Text node)
|
|
99
|
+
|
|
100
|
+
# Or search recursively
|
|
101
|
+
dep2 = doc.root.find_recursive("dependency")
|
|
102
|
+
all_deps = doc.root.findall_recursive("dependency")
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
### Add and remove elements
|
|
106
|
+
|
|
107
|
+
```python
|
|
108
|
+
doc = xmlcst.parse("<root>\n <a/>\n <b/>\n</root>")
|
|
109
|
+
doc.root.append(xmlcst.Element("c"))
|
|
110
|
+
print(doc.to_string())
|
|
111
|
+
# <root>
|
|
112
|
+
# <a/>
|
|
113
|
+
# <b/>
|
|
114
|
+
# <c/>
|
|
115
|
+
# </root>
|
|
116
|
+
```
|
|
117
|
+
|
|
118
|
+
### Access formatting metadata
|
|
119
|
+
|
|
120
|
+
```python
|
|
121
|
+
doc = xmlcst.parse('<root id = "1" name=\'foo\'/>')
|
|
122
|
+
attr = doc.root.attributes["id"]
|
|
123
|
+
print(attr.raw_value) # "1"
|
|
124
|
+
print(attr.quote) # '"'
|
|
125
|
+
print(attr.leading_whitespace) # " "
|
|
126
|
+
print(attr.eq_whitespace) # (" ", " ")
|
|
127
|
+
```
|
|
128
|
+
|
|
129
|
+
### Work with entity references
|
|
130
|
+
|
|
131
|
+
```python
|
|
132
|
+
doc = xmlcst.parse("<root>a &amp; b</root>")
|
|
133
|
+
text = doc.root.children[0]
|
|
134
|
+
print(text.content) # "a &amp; b" (raw, as in the source)
|
|
135
|
+
print(text.decoded_content()) # "a & b" (entities resolved)
|
|
136
|
+
|
|
137
|
+
text.set_content("x < y") # auto-escapes
|
|
138
|
+
print(text.content) # "x &lt; y"
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
## Sample Application
|
|
142
|
+
|
|
143
|
+
The [`samples/bump_pom_version/`](samples/bump_pom_version/) directory contains a complete example: a Maven POM version bumper that reads a `pom.xml`, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.
|
|
144
|
+
|
|
145
|
+
```bash
|
|
146
|
+
python samples/bump_pom_version/bump_pom_version.py
|
|
147
|
+
# 1.2.3 -> 1.2.4
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
The script accepts an optional path argument to operate on any POM file:
|
|
151
|
+
|
|
152
|
+
```bash
|
|
153
|
+
python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
## API Overview
|
|
157
|
+
|
|
158
|
+
### Parsing
|
|
159
|
+
|
|
160
|
+
| Function | Input | Returns |
|
|
161
|
+
|---|---|---|
|
|
162
|
+
| `xmlcst.parse(text)` | `str` | `Document` |
|
|
163
|
+
| `xmlcst.parse_bytes(data)` | `bytes` | `Document` |
|
|
164
|
+
| `xmlcst.parse_file(path)` | `str \| Path` | `Document` |
|
|
165
|
+
|
|
166
|
+
All parse functions raise `xmlcst.ParseError` on malformed input. The error includes `message`, `line`, `column`, and `offset` attributes.
|
|
167
|
+
|
|
168
|
+
### Node Types
|
|
169
|
+
|
|
170
|
+
| Type | Description |
|
|
171
|
+
|---|---|
|
|
172
|
+
| `Document` | Root container; holds all top-level nodes |
|
|
173
|
+
| `Element` | An XML element with tag, attributes, and children |
|
|
174
|
+
| `Attribute` | Name-value pair with formatting metadata (quote style, whitespace) |
|
|
175
|
+
| `AttributeList` | Ordered collection with dict-like access by name |
|
|
176
|
+
| `Text` | Character data (entity references preserved in raw form) |
|
|
177
|
+
| `Whitespace` | Whitespace-only character data between markup |
|
|
178
|
+
| `Comment` | `<!-- ... -->` |
|
|
179
|
+
| `ProcessingInstruction` | `<?target data?>` |
|
|
180
|
+
| `CData` | `<![CDATA[...]]>` |
|
|
181
|
+
| `Doctype` | `<!DOCTYPE ...>` (preserved verbatim) |
|
|
182
|
+
| `XmlDeclaration` | `<?xml version="1.0" ...?>` |
|
|
183
|
+
|
|
184
|
+
### Serialization
|
|
185
|
+
|
|
186
|
+
| Method | Description |
|
|
187
|
+
|---|---|
|
|
188
|
+
| `doc.to_string()` | Exact round-trip serialization (default) |
|
|
189
|
+
| `doc.to_string(mode="normalized")` | Pretty-printed with consistent formatting |
|
|
190
|
+
| `doc.to_bytes()` | UTF-8 encoded; BOM preserved if present in input |
|
|
191
|
+
| `doc.write(path)` | Write to file |
|
|
192
|
+
|
|
193
|
+
## Design
|
|
194
|
+
|
|
195
|
+
`xmlcst` uses a dual-layer architecture:
|
|
196
|
+
|
|
197
|
+
1. **Token stream** (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: `"".join(t.text for t in tokens) == source`.
|
|
198
|
+
2. **Tree API** (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
|
|
199
|
+
|
|
200
|
+
Unmodified nodes serialize by replaying their original tokens (byte-identical). Modified nodes rebuild from their current properties. This guarantees that edits produce the smallest possible diff.
|
|
201
|
+
|
|
202
|
+
See [SPEC.md](SPEC.md) for the full specification.
|
|
203
|
+
|
|
204
|
+
## Limitations (v1)
|
|
205
|
+
|
|
206
|
+
- UTF-8 encoding only
|
|
207
|
+
- XML 1.0 well-formed documents only (no error recovery)
|
|
208
|
+
- No DTD validation or schema support
|
|
209
|
+
- No XPath query engine
|
|
210
|
+
- No streaming / SAX-style parsing
|
|
211
|
+
- Pure Python (no compiled acceleration)
|
|
212
|
+
|
|
213
|
+
See the [future roadmap](SPEC.md#15-future-roadmap) in the specification for planned enhancements.
|
|
214
|
+
|
|
215
|
+
## License
|
|
216
|
+
|
|
217
|
+
MIT
|
xmlcst-0.1.0/README.md
ADDED
|
@@ -0,0 +1,193 @@
|
|
|
1
|
+
# xmlcst
|
|
2
|
+
|
|
3
|
+
Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.
|
|
4
|
+
|
|
5
|
+
## Why xmlcst?
|
|
6
|
+
|
|
7
|
+
Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
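
The effect is easy to reproduce with the standard library alone. The snippet below is a small demonstration using `xml.etree.ElementTree`; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

source = "<root  id = '1'><!-- note --><a/></root>"

# Parse into ElementTree's semantic tree and serialize straight back,
# without modifying anything in between.
tree = ET.fromstring(source)
output = ET.tostring(tree, encoding="unicode")

print(output)
print(output == source)  # False: lexical details were not retained
```

With the default parser settings the comment, the single quotes, and the spacing around `=` do not survive the round trip.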
|
|
8
|
+
|
|
9
|
+
`xmlcst` takes a different approach. It treats XML as **source text first** and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.
|
|
10
|
+
|
|
11
|
+
This makes `xmlcst` ideal for programmatic editing of XML configuration files (Maven POMs, `.csproj` files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.
|
|
12
|
+
|
|
13
|
+
### How xmlcst compares
|
|
14
|
+
|
|
15
|
+
| Feature | ElementTree | lxml | minidom | xmlcst |
|
|
16
|
+
|---|---|---|---|---|
|
|
17
|
+
| Attribute order | Partial | Partial | Partial | Preserved |
|
|
18
|
+
| Quote style (`'` vs `"`) | No | No | No | Preserved |
|
|
19
|
+
| Whitespace / indentation | No | No | No | Preserved |
|
|
20
|
+
| Comments | No | Yes | Yes | Preserved |
|
|
21
|
+
| Entity reference form | No | No | No | Preserved |
|
|
22
|
+
| CDATA vs escaped text | No | Yes | Yes | Preserved |
|
|
23
|
+
| Empty-element syntax (`<x/>` vs `<x />`) | No | No | No | Preserved |
|
|
24
|
+
| Byte-identical round-trip | No | No | No | Yes |
|
|
25
|
+
|
|
26
|
+
The closest conceptual analogue is [ruamel.yaml](https://yaml.readthedocs.io/) -- a round-trip-capable YAML library -- applied to XML.
|
|
27
|
+
|
|
28
|
+
## Installation
|
|
29
|
+
|
|
30
|
+
```
|
|
31
|
+
pip install xmlcst
|
|
32
|
+
```
|
|
33
|
+
|
|
34
|
+
Requires Python 3.12+. Pure Python -- no compiled dependencies. Ships [PEP 561](https://peps.python.org/pep-0561/) type annotations for full mypy / pyright support.
|
|
35
|
+
|
|
36
|
+
## Quick Start
|
|
37
|
+
|
|
38
|
+
### Parse and round-trip
|
|
39
|
+
|
|
40
|
+
```python
|
|
41
|
+
import xmlcst
|
|
42
|
+
|
|
43
|
+
source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n <version>1.0</version>\n</project>'
|
|
44
|
+
|
|
45
|
+
doc = xmlcst.parse(source)
|
|
46
|
+
assert doc.to_string() == source # byte-identical round-trip
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
### Edit an attribute (minimal diff)
|
|
50
|
+
|
|
51
|
+
```python
|
|
52
|
+
doc = xmlcst.parse('<root version="1.0" author="alice"/>')
|
|
53
|
+
doc.root.attributes["version"] = "2.0"
|
|
54
|
+
print(doc.to_string())
|
|
55
|
+
# <root version="2.0" author="alice"/>
|
|
56
|
+
# Only the value changed -- quotes, whitespace, other attributes untouched
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
### Navigate the tree
|
|
60
|
+
|
|
61
|
+
```python
|
|
62
|
+
doc = xmlcst.parse("""\
|
|
63
|
+
<project>
|
|
64
|
+
<dependencies>
|
|
65
|
+
<dependency>
|
|
66
|
+
<groupId>junit</groupId>
|
|
67
|
+
</dependency>
|
|
68
|
+
</dependencies>
|
|
69
|
+
</project>""")
|
|
70
|
+
|
|
71
|
+
deps = doc.root.find("dependencies")
|
|
72
|
+
dep = deps.find("dependency")
|
|
73
|
+
group = dep.find("groupId")
|
|
74
|
+
print(group.children[0].content) # "junit" (a Text node)
|
|
75
|
+
|
|
76
|
+
# Or search recursively
|
|
77
|
+
dep2 = doc.root.find_recursive("dependency")
|
|
78
|
+
all_deps = doc.root.findall_recursive("dependency")
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
### Add and remove elements
|
|
82
|
+
|
|
83
|
+
```python
|
|
84
|
+
doc = xmlcst.parse("<root>\n <a/>\n <b/>\n</root>")
|
|
85
|
+
doc.root.append(xmlcst.Element("c"))
|
|
86
|
+
print(doc.to_string())
|
|
87
|
+
# <root>
|
|
88
|
+
# <a/>
|
|
89
|
+
# <b/>
|
|
90
|
+
# <c/>
|
|
91
|
+
# </root>
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
### Access formatting metadata
|
|
95
|
+
|
|
96
|
+
```python
|
|
97
|
+
doc = xmlcst.parse('<root id = "1" name=\'foo\'/>')
|
|
98
|
+
attr = doc.root.attributes["id"]
|
|
99
|
+
print(attr.raw_value) # "1"
|
|
100
|
+
print(attr.quote) # '"'
|
|
101
|
+
print(attr.leading_whitespace) # " "
|
|
102
|
+
print(attr.eq_whitespace) # (" ", " ")
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
### Work with entity references
|
|
106
|
+
|
|
107
|
+
```python
|
|
108
|
+
doc = xmlcst.parse("<root>a &amp; b</root>")
|
|
109
|
+
text = doc.root.children[0]
|
|
110
|
+
print(text.content) # "a &amp; b" (raw, as in the source)
|
|
111
|
+
print(text.decoded_content()) # "a & b" (entities resolved)
|
|
112
|
+
|
|
113
|
+
text.set_content("x < y") # auto-escapes
|
|
114
|
+
print(text.content) # "x &lt; y"
|
|
115
|
+
```
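
The raw/decoded distinction can be tried out without `xmlcst` by using the standard library's `xml.sax.saxutils` helpers. This is only an approximation of the behaviour described above: by default these helpers cover `&amp;`, `&lt;`, and `&gt;`, and they do not handle character references.

```python
from xml.sax.saxutils import escape, unescape

raw = "a &amp; b"          # the text as it appears in the XML source
decoded = unescape(raw)    # the logical text with the entity resolved
encoded = escape("x < y")  # logical text escaped for storage in XML

print(decoded)
print(encoded)
```

Keeping the raw form on the node and decoding on demand is what allows an untouched text node to serialize back exactly as written.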
|
|
116
|
+
|
|
117
|
+
## Sample Application
|
|
118
|
+
|
|
119
|
+
The [`samples/bump_pom_version/`](samples/bump_pom_version/) directory contains a complete example: a Maven POM version bumper that reads a `pom.xml`, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
python samples/bump_pom_version/bump_pom_version.py
|
|
123
|
+
# 1.2.3 -> 1.2.4
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
The script accepts an optional path argument to operate on any POM file:
|
|
127
|
+
|
|
128
|
+
```bash
|
|
129
|
+
python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
|
|
130
|
+
```
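
The version arithmetic itself is simple. The function below is a hypothetical sketch of a patch bump, not the code in `bump_pom_version.py`; preserving a trailing qualifier such as `-SNAPSHOT` is an assumption made for the example.

```python
import re

# MAJOR.MINOR.PATCH followed by an optional qualifier that is kept as-is.
_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(.*)", re.DOTALL)


def bump_patch(version: str) -> str:
    """Return version with its patch component incremented by one."""
    m = _VERSION.fullmatch(version)
    if m is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch, suffix = m.groups()
    return f"{major}.{minor}.{int(patch) + 1}{suffix}"


print(bump_patch("1.2.3"))
```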
|
|
131
|
+
|
|
132
|
+
## API Overview
|
|
133
|
+
|
|
134
|
+
### Parsing
|
|
135
|
+
|
|
136
|
+
| Function | Input | Returns |
|
|
137
|
+
|---|---|---|
|
|
138
|
+
| `xmlcst.parse(text)` | `str` | `Document` |
|
|
139
|
+
| `xmlcst.parse_bytes(data)` | `bytes` | `Document` |
|
|
140
|
+
| `xmlcst.parse_file(path)` | `str \| Path` | `Document` |
|
|
141
|
+
|
|
142
|
+
All parse functions raise `xmlcst.ParseError` on malformed input. The error includes `message`, `line`, `column`, and `offset` attributes.
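
One way to derive `line` and `column` from an `offset` is sketched below. `SketchParseError` is a hypothetical class written for this example; it mirrors the four documented attribute names, but the 1-based numbering and the message format are assumptions, not the library's definition.

```python
class SketchParseError(Exception):
    """Hypothetical error type mirroring the four documented attributes."""

    def __init__(self, message: str, source: str, offset: int) -> None:
        before = source[:offset]
        self.message = message
        self.offset = offset
        # 1-based line and column; the base is an assumption of this sketch.
        self.line = before.count("\n") + 1
        self.column = offset - (before.rfind("\n") + 1) + 1
        super().__init__(f"{message} (line {self.line}, column {self.column})")


source = "<a>\n  <b>\n</a>"
err = SketchParseError("mismatched closing tag", source, source.index("</a>"))
print(err)
```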
|
|
143
|
+
|
|
144
|
+
### Node Types
|
|
145
|
+
|
|
146
|
+
| Type | Description |
|
|
147
|
+
|---|---|
|
|
148
|
+
| `Document` | Root container; holds all top-level nodes |
|
|
149
|
+
| `Element` | An XML element with tag, attributes, and children |
|
|
150
|
+
| `Attribute` | Name-value pair with formatting metadata (quote style, whitespace) |
|
|
151
|
+
| `AttributeList` | Ordered collection with dict-like access by name |
|
|
152
|
+
| `Text` | Character data (entity references preserved in raw form) |
|
|
153
|
+
| `Whitespace` | Whitespace-only character data between markup |
|
|
154
|
+
| `Comment` | `<!-- ... -->` |
|
|
155
|
+
| `ProcessingInstruction` | `<?target data?>` |
|
|
156
|
+
| `CData` | `<![CDATA[...]]>` |
|
|
157
|
+
| `Doctype` | `<!DOCTYPE ...>` (preserved verbatim) |
|
|
158
|
+
| `XmlDeclaration` | `<?xml version="1.0" ...?>` |
|
|
159
|
+
|
|
160
|
+
### Serialization
|
|
161
|
+
|
|
162
|
+
| Method | Description |
|
|
163
|
+
|---|---|
|
|
164
|
+
| `doc.to_string()` | Exact round-trip serialization (default) |
|
|
165
|
+
| `doc.to_string(mode="normalized")` | Pretty-printed with consistent formatting |
|
|
166
|
+
| `doc.to_bytes()` | UTF-8 encoded; BOM preserved if present in input |
|
|
167
|
+
| `doc.write(path)` | Write to file |
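
BOM preservation amounts to remembering whether the input started with one. The helpers below are a hypothetical sketch of that idea using `codecs.BOM_UTF8`; the function names are invented and do not correspond to `xmlcst` internals.

```python
import codecs


def decode_utf8(data: bytes) -> tuple[str, bool]:
    """Return the decoded text and whether a UTF-8 BOM was present."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    return body.decode("utf-8"), has_bom


def encode_utf8(text: str, has_bom: bool) -> bytes:
    """Re-encode, restoring the BOM only if the input had one."""
    prefix = codecs.BOM_UTF8 if has_bom else b""
    return prefix + text.encode("utf-8")


data = codecs.BOM_UTF8 + "<root/>".encode("utf-8")
text, has_bom = decode_utf8(data)
assert encode_utf8(text, has_bom) == data
```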
|
|
168
|
+
|
|
169
|
+
## Design
|
|
170
|
+
|
|
171
|
+
`xmlcst` uses a dual-layer architecture:
|
|
172
|
+
|
|
173
|
+
1. **Token stream** (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: `"".join(t.text for t in tokens) == source`.
|
|
174
|
+
2. **Tree API** (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
|
|
175
|
+
|
|
176
|
+
Unmodified nodes serialize by replaying their original tokens (byte-identical). Modified nodes rebuild from their current properties. This guarantees that edits produce the smallest possible diff.
|
|
177
|
+
|
|
178
|
+
See [SPEC.md](SPEC.md) for the full specification.
|
|
179
|
+
|
|
180
|
+
## Limitations (v1)
|
|
181
|
+
|
|
182
|
+
- UTF-8 encoding only
|
|
183
|
+
- XML 1.0 well-formed documents only (no error recovery)
|
|
184
|
+
- No DTD validation or schema support
|
|
185
|
+
- No XPath query engine
|
|
186
|
+
- No streaming / SAX-style parsing
|
|
187
|
+
- Pure Python (no compiled acceleration)
|
|
188
|
+
|
|
189
|
+
See the [future roadmap](SPEC.md#15-future-roadmap) in the specification for planned enhancements.
|
|
190
|
+
|
|
191
|
+
## License
|
|
192
|
+
|
|
193
|
+
MIT
|