xmlcst 0.1.0__tar.gz

@@ -0,0 +1,24 @@
+ name: CI
+
+ on:
+   push:
+     branches: [main]
+   pull_request:
+     branches: [main]
+
+ jobs:
+   check:
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.12", "3.13"]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+       - run: pip install -e ".[dev]"
+       - run: ruff format --check src/ tests/ samples/
+       - run: ruff check src/ tests/ samples/
+       - run: basedpyright
+       - run: pytest tests/ -v
@@ -0,0 +1,37 @@
+ name: Publish
+
+ on:
+   push:
+     tags: ["v*"]
+
+ jobs:
+   check:
+     runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.12", "3.13"]
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+       - run: pip install -e ".[dev]"
+       - run: ruff format --check src/ tests/ samples/
+       - run: ruff check src/ tests/ samples/
+       - run: basedpyright
+       - run: pytest tests/ -v
+
+   publish:
+     needs: check
+     runs-on: ubuntu-latest
+     permissions:
+       id-token: write
+     environment: pypi
+     steps:
+       - uses: actions/checkout@v4
+       - uses: actions/setup-python@v5
+         with:
+           python-version: "3.13"
+       - run: pip install build
+       - run: python -m build
+       - uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,6 @@
+ __pycache__/
+ *.pyc
+ .venv/
+ .pytest_cache/
+ dist/
+ *.egg-info
xmlcst-0.1.0/DEV.md ADDED
@@ -0,0 +1,171 @@
1
+ # Developer Guide
2
+
3
+ This guide is for contributors who want to work on `xmlcst` itself.
4
+
5
+ ## Prerequisites
6
+
7
+ - Python 3.12+
8
+ - Git
9
+
10
+ No other system dependencies are required. The project is pure Python.
11
+
12
+ ## Setup
13
+
14
+ ```bash
15
+ git clone <repo-url>
16
+ cd xmlcst
17
+ ./dev.sh test
18
+ ```
19
+
20
+ On first run, `dev.sh` automatically creates a virtual environment at `.venv/`, installs the package in editable mode with dev dependencies (`pytest`, `ruff`, `basedpyright`), and runs the test suite.
21
+
22
+ ## Development Commands
23
+
24
+ All development tasks go through `dev.sh`:
25
+
26
+ | Command | Description |
27
+ |---|---|
28
+ | `./dev.sh check` | Run all checks: format, lint, typecheck, test (fails on first error) |
29
+ | `./dev.sh test` | Run the test suite with pytest (verbose output) |
30
+ | `./dev.sh lint` | Run ruff check on `src/` and `tests/` |
31
+ | `./dev.sh format` | Run ruff format on `src/` and `tests/` |
32
+ | `./dev.sh typecheck` | Run basedpyright on `src/` and `tests/` |
33
+ | `./dev.sh clean` | Remove `__pycache__/`, `.pytest_cache/`, `dist/`, `*.egg-info` |
34
+
35
+ Extra arguments are forwarded. For example, to run a single test:
36
+
37
+ ```bash
38
+ ./dev.sh test -k test_roundtrip_simple_element
39
+ ```
40
+
41
+ ## Project Structure
42
+
43
+ ```
+ pyproject.toml            Build config (hatchling), dev dependencies, pytest settings
+ dev.sh                    Development task runner
+ SPEC.md                   Full specification (source of truth for behaviour)
+ README.md                 Consumer-facing documentation
+ src/xmlcst/
+   __init__.py             Public API exports
+   py.typed                PEP 561 marker (ships inline types)
+   _api.py                 parse(), parse_bytes(), parse_file()
+   _tokenizer.py           Lexical tokenizer: str -> list[Token]
+   _tokens.py              Token dataclass and TokenType enum (22 types)
+   _builder.py             Tree builder: list[Token] -> Document
+   _nodes.py               All node types (Document, Element, Attribute, etc.)
+   _entities.py            Entity/character reference encoding and decoding
+   _errors.py              ParseError exception
+ tests/
+   test_roundtrip.py       Core invariant: parse(x).to_string() == x
+   test_tokenizer.py       Token stream correctness
+   test_tree.py            Tree navigation and property access
+   test_mutation.py        Editing, dirty-tracking, formatting ownership
+   test_encoding.py        BOM handling, bytes vs string input
+ samples/
+   bump_pom_version/       Maven POM version bumper (see README.md)
+ ```
67
+
68
+ ## Architecture
69
+
70
+ The implementation follows the dual-layer model described in [SPEC.md](SPEC.md) sections 6-8.
71
+
72
+ ### Layer 1: Tokenization
73
+
74
+ `_tokenizer.py` converts a source string into a flat list of `Token` objects. Each token stores its type, raw text, and byte offset. The fundamental invariant:
75
+
76
+ ```python
77
+ assert source == "".join(t.text for t in tokenize(source))
78
+ ```
79
+
80
+ Every byte of the input belongs to exactly one token. The 22 token types are defined in `_tokens.py`.
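The invariant above can be illustrated with a toy tokenizer. This is a minimal sketch, not the code in `_tokenizer.py`: the pattern, the five token kinds, and the name `toy_tokenize` are all hypothetical, and offsets here are character offsets into the `str`. What it shows is one way to make coverage hold by construction: the final alternative matches any single character, so no part of the input can fall between tokens.

```python
import re

# Hypothetical, simplified token pattern: NOT xmlcst's 22 token types.
# Alternatives are tried left to right; the last one matches any single
# character, so nothing in the input can be skipped.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<tag><[^<>]*>)"
    r"|(?P<whitespace>[ \t\r\n]+)"
    r"|(?P<text>[^<\s]+)"
    r"|(?P<other>.)",
    re.DOTALL,
)


def toy_tokenize(source: str) -> list[tuple[str, str, int]]:
    """Split source into (kind, text, offset) triples that cover it fully."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        tokens.append((match.lastgroup, match.group(), match.start()))
    return tokens


source = "<root>\n  <!-- note -->\n  <a x='1'/>text &amp; more\n</root>"
tokens = toy_tokenize(source)

# The round-trip invariant: concatenating token texts reproduces the source.
assert "".join(text for _, text, _ in tokens) == source
```

Because the invariant is a property of the token list alone, it can be checked before any tree is built, which is presumably why the tokenizer has its own test file.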
81
+
82
+ ### Layer 2: Tree Building
83
+
84
+ `_builder.py` walks the token list and constructs a tree of `Node` objects (defined in `_nodes.py`). Each node stores:
85
+
86
+ - A **token span** `(start, end)` -- indices into the token list identifying the tokens that produced this node
87
+ - A **dirty flag** -- initially `False`, set to `True` when any property is modified
88
+
89
+ ### Serialization
90
+
91
+ When serializing, each node checks whether it can use its token span:
92
+
93
+ - **Clean nodes** (unmodified): join their original tokens -- guaranteed byte-identical to source
94
+ - **Dirty nodes** (modified): rebuild from current properties and formatting metadata
95
+ - **New nodes** (created programmatically): rebuild with inferred or default formatting
96
+
97
+ This is the mechanism that produces minimal diffs on edit.
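A rough sketch of the clean/dirty decision, under simplifying assumptions: `ToyNode` is a hypothetical flat node (no children), the span is treated as half-open `[start, end)`, and a dirty node rebuilds from a single `text` property. The real node classes in `_nodes.py` may differ on all three points.

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical node: a token span plus a dirty flag (not xmlcst's class)."""

    start: int  # index of the first token belonging to this node
    end: int  # index one past the last token (assumed half-open span)
    text: str  # current value, used only when rebuilding
    dirty: bool = False

    def set_text(self, value: str) -> None:
        self.text = value
        self.dirty = True

    def serialize(self, tokens: list[str]) -> str:
        if not self.dirty:
            # Clean: replay the original tokens verbatim.
            return "".join(tokens[self.start : self.end])
        # Dirty: rebuild from current properties.
        return self.text


tokens = ["<v>", "1.0", "</v>", "\n", "<!-- keep  me -->"]
nodes = [
    ToyNode(0, 1, "<v>"),
    ToyNode(1, 2, "1.0"),
    ToyNode(2, 3, "</v>"),
    ToyNode(3, 4, "\n"),
    ToyNode(4, 5, "<!-- keep  me -->"),
]

nodes[1].set_text("2.0")
output = "".join(node.serialize(tokens) for node in nodes)
assert output == "<v>2.0</v>\n<!-- keep  me -->"
```

Only the edited node takes the rebuild path; the comment, including its double space, is replayed from the token list untouched.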
98
+
99
+ ## Conventions
100
+
101
+ ### Code Style
102
+
103
+ - Formatted with `ruff format` (default line length: 88)
104
+ - Linted with `ruff check`
105
+ - Type-checked with `basedpyright` (strict mode with select rules disabled in `pyproject.toml`)
106
+ - Run `./dev.sh check` before committing to validate formatting, lint, types, and tests in one step
107
+
108
+ ### Module Naming
109
+
110
+ - Public API is exported from `__init__.py`
111
+ - Internal modules use an underscore prefix: `_tokenizer.py`, `_nodes.py`, `_builder.py`, etc.
112
+ - Private methods and attributes use a single underscore prefix
113
+
114
+ ### Testing
115
+
116
+ - Tests live in `tests/` and use `pytest`
117
+ - Test files are grouped by feature area
118
+ - The round-trip test (`test_roundtrip.py`) is the most important invariant: any change must not break `parse(x).to_string() == x`
119
+ - Prefer testing through the public API (`xmlcst.parse`, `doc.to_string()`) rather than reaching into internal methods
120
+ - When adding a new feature, add both a round-trip test case and specific behavioural tests
121
+
122
+ ## Common Workflows
123
+
124
+ ### Adding a New Node Type
125
+
126
+ 1. Define the class in `_nodes.py` following existing patterns (`__slots__`, property accessors with dirty-tracking, `_rebuild()` method)
127
+ 2. Add any new token types to the `TokenType` enum in `_tokens.py`
128
+ 3. Update the tokenizer in `_tokenizer.py` to emit the new tokens
129
+ 4. Update the tree builder in `_builder.py` to construct the new node
130
+ 5. Export from `__init__.py` and add to `__all__`
131
+ 6. Add round-trip and tree-level tests
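Step 1 of the checklist above might look roughly like the following. The class name `ToyComment` and its attribute names are hypothetical; only the three ingredients (`__slots__`, a property that sets the dirty flag, and a `_rebuild()` method) come from the checklist, and the existing classes in `_nodes.py` remain the pattern to copy.

```python
class ToyComment:
    """Hypothetical node type; names mirror the checklist, not xmlcst's code."""

    # __slots__ fixes the attribute set, so typos such as node.contnet = "x"
    # raise AttributeError instead of silently creating a new attribute.
    __slots__ = ("_content", "_dirty")

    def __init__(self, content: str) -> None:
        self._content = content
        self._dirty = False

    @property
    def content(self) -> str:
        return self._content

    @content.setter
    def content(self, value: str) -> None:
        if "--" in value:
            # XML 1.0 forbids "--" inside a comment.
            raise ValueError("comment content must not contain '--'")
        self._content = value
        self._dirty = True  # every mutation marks the node for rebuilding

    def _rebuild(self) -> str:
        """Regenerate source text from current properties."""
        return f"<!--{self._content}-->"


node = ToyComment(" old ")
node.content = " new "
assert node._dirty
assert node._rebuild() == "<!-- new -->"
```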
132
+
133
+ ### Adding a New Feature
134
+
135
+ 1. Check whether the behaviour is already specified in [SPEC.md](SPEC.md)
136
+ 2. If not, update the specification first (spec-driven development)
137
+ 3. Implement against the spec
138
+ 4. Verify that all existing round-trip tests still pass
139
+
140
+ ## Build and Packaging
141
+
142
+ - **Build system**: [hatchling](https://hatch.pypa.io/)
143
+ - **Package layout**: `src/xmlcst/` (src layout)
144
+ - **Build**: `python -m build` (produces wheel and sdist in `dist/`)
145
+ - **No compiled extensions** -- the wheel is pure Python and platform-independent
146
+ - **Type annotations** -- the package ships inline types with a `py.typed` marker (PEP 561); type checkers recognize annotations without stubs
147
+
148
+ ## CI
149
+
150
+ GitHub Actions runs the full check suite (format, lint, typecheck, test) on every push to `main` and every pull request, across Python 3.12 and 3.13. See `.github/workflows/ci.yml`.
151
+
152
+ ## Releasing
153
+
154
+ 1. Update the version in `pyproject.toml`
155
+ 2. Commit: `git commit -am "Bump version to X.Y.Z"`
156
+ 3. Tag: `git tag vX.Y.Z`
157
+ 4. Push: `git push && git push --tags`
158
+
159
+ The [`publish`](.github/workflows/publish.yml) workflow runs CI checks, builds the package, and publishes to PyPI automatically via [trusted publishing](https://docs.pypi.org/trusted-publishers/).
160
+
161
+ ## Specification
162
+
163
+ [SPEC.md](SPEC.md) is the source of truth for all behavioural decisions:
164
+
165
+ - Fidelity guarantees (what must be preserved)
166
+ - Token model (every token type and its semantics)
167
+ - Tree model (node types, dirty-tracking, token spans)
168
+ - Mutation semantics (formatting ownership, inferred formatting)
169
+ - Serialization modes (exact, normalized)
170
+ - Error handling
171
+ - Future roadmap
xmlcst-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Richard Cook
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
xmlcst-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,217 @@
1
+ Metadata-Version: 2.4
2
+ Name: xmlcst
3
+ Version: 0.1.0
4
+ Summary: Full-fidelity XML parser with lossless round-trip editing
5
+ Project-URL: Repository, https://github.com/rcook/xmlcst
6
+ Project-URL: Issues, https://github.com/rcook/xmlcst/issues
7
+ Author: Richard Cook
8
+ License-Expression: MIT
9
+ License-File: LICENSE
10
+ Keywords: cst,editing,parser,round-trip,xml
11
+ Classifier: Development Status :: 3 - Alpha
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: Python :: 3.12
15
+ Classifier: Programming Language :: Python :: 3.13
16
+ Classifier: Topic :: Text Processing :: Markup :: XML
17
+ Classifier: Typing :: Typed
18
+ Requires-Python: >=3.12
19
+ Provides-Extra: dev
20
+ Requires-Dist: basedpyright; extra == 'dev'
21
+ Requires-Dist: pytest; extra == 'dev'
22
+ Requires-Dist: ruff; extra == 'dev'
23
+ Description-Content-Type: text/markdown
24
+
25
+ # xmlcst
26
+
27
+ Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.
28
+
29
+ ## Why xmlcst?
30
+
31
+ Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
32
+
33
+ `xmlcst` takes a different approach. It treats XML as **source text first** and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.
34
+
35
+ This makes `xmlcst` ideal for programmatic editing of XML configuration files (Maven POMs, `.csproj` files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.
36
+
37
+ ### How xmlcst compares
38
+
39
+ | Feature | ElementTree | lxml | minidom | xmlcst |
40
+ |---|---|---|---|---|
41
+ | Attribute order | Partial | Partial | Partial | Preserved |
42
+ | Quote style (`'` vs `"`) | No | No | No | Preserved |
43
+ | Whitespace / indentation | No | No | No | Preserved |
44
+ | Comments | No | Yes | Yes | Preserved |
45
+ | Entity reference form | No | No | No | Preserved |
46
+ | CDATA vs escaped text | No | Yes | Yes | Preserved |
47
+ | Empty-element syntax (`<x/>` vs `<x />`) | No | No | No | Preserved |
48
+ | Byte-identical round-trip | No | No | No | Yes |
49
+
50
+ The closest conceptual analogue is [ruamel.yaml](https://yaml.readthedocs.io/) -- a round-trip-capable YAML library -- applied to XML.
51
+
52
+ ## Installation
53
+
54
+ ```
55
+ pip install xmlcst
56
+ ```
57
+
58
+ Requires Python 3.12+. Pure Python -- no compiled dependencies. Ships [PEP 561](https://peps.python.org/pep-0561/) type annotations for full mypy / pyright support.
59
+
60
+ ## Quick Start
61
+
62
+ ### Parse and round-trip
63
+
64
+ ```python
65
+ import xmlcst
66
+
67
+ source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n <version>1.0</version>\n</project>'
68
+
69
+ doc = xmlcst.parse(source)
70
+ assert doc.to_string() == source # byte-identical round-trip
71
+ ```
72
+
73
+ ### Edit an attribute (minimal diff)
74
+
75
+ ```python
76
+ doc = xmlcst.parse('<root version="1.0" author="alice"/>')
77
+ doc.root.attributes["version"] = "2.0"
78
+ print(doc.to_string())
79
+ # <root version="2.0" author="alice"/>
80
+ # Only the value changed -- quotes, whitespace, other attributes untouched
81
+ ```
82
+
83
+ ### Navigate the tree
84
+
85
+ ```python
+ doc = xmlcst.parse("""\
+ <project>
+   <dependencies>
+     <dependency>
+       <groupId>junit</groupId>
+     </dependency>
+   </dependencies>
+ </project>""")
+
+ deps = doc.root.find("dependencies")
+ dep = deps.find("dependency")
+ group = dep.find("groupId")
+ print(group.children[0].content)  # "junit" (a Text node)
+
+ # Or search recursively
+ dep2 = doc.root.find_recursive("dependency")
+ all_deps = doc.root.findall_recursive("dependency")
+ ```
104
+
105
+ ### Add and remove elements
106
+
107
+ ```python
+ doc = xmlcst.parse("<root>\n <a/>\n <b/>\n</root>")
+ doc.root.append(xmlcst.Element("c"))
+ print(doc.to_string())
+ # <root>
+ #  <a/>
+ #  <b/>
+ #  <c/>
+ # </root>
+ ```
117
+
118
+ ### Access formatting metadata
119
+
120
+ ```python
121
+ doc = xmlcst.parse('<root id = "1" name=\'foo\'/>')
122
+ attr = doc.root.attributes["id"]
123
+ print(attr.raw_value) # "1"
124
+ print(attr.quote) # '"'
125
+ print(attr.leading_whitespace) # " "
126
+ print(attr.eq_whitespace) # (" ", " ")
127
+ ```
128
+
129
+ ### Work with entity references
130
+
131
+ ```python
132
+ doc = xmlcst.parse("<root>a &amp; b</root>")
133
+ text = doc.root.children[0]
134
+ print(text.content) # "a &amp; b" (raw, as in the source)
135
+ print(text.decoded_content()) # "a & b" (entities resolved)
136
+
137
+ text.set_content("x < y") # auto-escapes
138
+ print(text.content) # "x &lt; y"
139
+ ```
140
+
141
+ ## Sample Application
142
+
143
+ The [`samples/bump_pom_version/`](samples/bump_pom_version/) directory contains a complete example: a Maven POM version bumper that reads a `pom.xml`, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.
144
+
145
+ ```bash
146
+ python samples/bump_pom_version/bump_pom_version.py
147
+ # 1.2.3 -> 1.2.4
148
+ ```
149
+
150
+ The script accepts an optional path argument to operate on any POM file:
151
+
152
+ ```bash
153
+ python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
154
+ ```
155
+
156
+ ## API Overview
157
+
158
+ ### Parsing
159
+
160
+ | Function | Input | Returns |
161
+ |---|---|---|
162
+ | `xmlcst.parse(text)` | `str` | `Document` |
163
+ | `xmlcst.parse_bytes(data)` | `bytes` | `Document` |
164
+ | `xmlcst.parse_file(path)` | `str \| Path` | `Document` |
165
+
166
+ All parse functions raise `xmlcst.ParseError` on malformed input. The error includes `message`, `line`, `column`, and `offset` attributes.
167
+
168
+ ### Node Types
169
+
170
+ | Type | Description |
171
+ |---|---|
172
+ | `Document` | Root container; holds all top-level nodes |
173
+ | `Element` | An XML element with tag, attributes, and children |
174
+ | `Attribute` | Name-value pair with formatting metadata (quote style, whitespace) |
175
+ | `AttributeList` | Ordered collection with dict-like access by name |
176
+ | `Text` | Character data (entity references preserved in raw form) |
177
+ | `Whitespace` | Whitespace-only character data between markup |
178
+ | `Comment` | `<!-- ... -->` |
179
+ | `ProcessingInstruction` | `<?target data?>` |
180
+ | `CData` | `<![CDATA[...]]>` |
181
+ | `Doctype` | `<!DOCTYPE ...>` (preserved verbatim) |
182
+ | `XmlDeclaration` | `<?xml version="1.0" ...?>` |
183
+
184
+ ### Serialization
185
+
186
+ | Method | Description |
187
+ |---|---|
188
+ | `doc.to_string()` | Exact round-trip serialization (default) |
189
+ | `doc.to_string(mode="normalized")` | Pretty-printed with consistent formatting |
190
+ | `doc.to_bytes()` | UTF-8 encoded; BOM preserved if present in input |
191
+ | `doc.write(path)` | Write to file |
192
+
193
+ ## Design
194
+
195
+ `xmlcst` uses a dual-layer architecture:
196
+
197
+ 1. **Token stream** (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: `"".join(t.text for t in tokens) == source`.
198
+ 2. **Tree API** (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
199
+
200
+ Unmodified nodes serialize by replaying their original tokens (byte-identical). Modified nodes rebuild from their current properties. As a result, an edit changes only the text of the nodes that were modified, which keeps diffs minimal.
201
+
202
+ See [SPEC.md](SPEC.md) for the full specification.
203
+
204
+ ## Limitations (v1)
205
+
206
+ - UTF-8 encoding only
207
+ - XML 1.0 well-formed documents only (no error recovery)
208
+ - No DTD validation or schema support
209
+ - No XPath query engine
210
+ - No streaming / SAX-style parsing
211
+ - Pure Python (no compiled acceleration)
212
+
213
+ See the [future roadmap](SPEC.md#15-future-roadmap) in the specification for planned enhancements.
214
+
215
+ ## License
216
+
217
+ MIT
xmlcst-0.1.0/README.md ADDED
@@ -0,0 +1,193 @@
1
+ # xmlcst
2
+
3
+ Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.
4
+
5
+ ## Why xmlcst?
6
+
7
+ Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output differs from the input even if you changed nothing.
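The loss described above is easy to reproduce with the standard library alone. The snippet below is a small illustration using `xml.etree.ElementTree` with its default parser; the sample document is made up for the purpose.

```python
import xml.etree.ElementTree as ET

# Extra space after the tag name, mixed quote styles, a comment, and <x/>.
source = "<root  b='2' a=\"1\"><!-- note --><x/></root>"
output = ET.tostring(ET.fromstring(source), encoding="unicode")

print(output)
assert output != source  # not byte-identical, although nothing was edited
assert "<!--" not in output  # the default parser drops the comment
assert "'" not in output  # single-quoted attribute values were rewritten
```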
8
+
9
+ `xmlcst` takes a different approach. It treats XML as **source text first** and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.
10
+
11
+ This makes `xmlcst` ideal for programmatic editing of XML configuration files (Maven POMs, `.csproj` files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.
12
+
13
+ ### How xmlcst compares
14
+
15
+ | Feature | ElementTree | lxml | minidom | xmlcst |
16
+ |---|---|---|---|---|
17
+ | Attribute order | Partial | Partial | Partial | Preserved |
18
+ | Quote style (`'` vs `"`) | No | No | No | Preserved |
19
+ | Whitespace / indentation | No | No | No | Preserved |
20
+ | Comments | No | Yes | Yes | Preserved |
21
+ | Entity reference form | No | No | No | Preserved |
22
+ | CDATA vs escaped text | No | Yes | Yes | Preserved |
23
+ | Empty-element syntax (`<x/>` vs `<x />`) | No | No | No | Preserved |
24
+ | Byte-identical round-trip | No | No | No | Yes |
25
+
26
+ The closest conceptual analogue is [ruamel.yaml](https://yaml.readthedocs.io/) -- a round-trip-capable YAML library -- applied to XML.
27
+
28
+ ## Installation
29
+
30
+ ```
31
+ pip install xmlcst
32
+ ```
33
+
34
+ Requires Python 3.12+. Pure Python -- no compiled dependencies. Ships [PEP 561](https://peps.python.org/pep-0561/) type annotations for full mypy / pyright support.
35
+
36
+ ## Quick Start
37
+
38
+ ### Parse and round-trip
39
+
40
+ ```python
41
+ import xmlcst
42
+
43
+ source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n <version>1.0</version>\n</project>'
44
+
45
+ doc = xmlcst.parse(source)
46
+ assert doc.to_string() == source # byte-identical round-trip
47
+ ```
48
+
49
+ ### Edit an attribute (minimal diff)
50
+
51
+ ```python
52
+ doc = xmlcst.parse('<root version="1.0" author="alice"/>')
53
+ doc.root.attributes["version"] = "2.0"
54
+ print(doc.to_string())
55
+ # <root version="2.0" author="alice"/>
56
+ # Only the value changed -- quotes, whitespace, other attributes untouched
57
+ ```
58
+
59
+ ### Navigate the tree
60
+
61
+ ```python
+ doc = xmlcst.parse("""\
+ <project>
+   <dependencies>
+     <dependency>
+       <groupId>junit</groupId>
+     </dependency>
+   </dependencies>
+ </project>""")
+
+ deps = doc.root.find("dependencies")
+ dep = deps.find("dependency")
+ group = dep.find("groupId")
+ print(group.children[0].content)  # "junit" (a Text node)
+
+ # Or search recursively
+ dep2 = doc.root.find_recursive("dependency")
+ all_deps = doc.root.findall_recursive("dependency")
+ ```
80
+
81
+ ### Add and remove elements
82
+
83
+ ```python
+ doc = xmlcst.parse("<root>\n <a/>\n <b/>\n</root>")
+ doc.root.append(xmlcst.Element("c"))
+ print(doc.to_string())
+ # <root>
+ #  <a/>
+ #  <b/>
+ #  <c/>
+ # </root>
+ ```
93
+
94
+ ### Access formatting metadata
95
+
96
+ ```python
97
+ doc = xmlcst.parse('<root id = "1" name=\'foo\'/>')
98
+ attr = doc.root.attributes["id"]
99
+ print(attr.raw_value) # "1"
100
+ print(attr.quote) # '"'
101
+ print(attr.leading_whitespace) # " "
102
+ print(attr.eq_whitespace) # (" ", " ")
103
+ ```
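To see what kind of information such metadata has to carry, here is a standalone sketch that pulls the same pieces out of a start tag with a regular expression. The pattern and group names are hypothetical and deliberately loose (it does not validate XML names); it is not how `xmlcst` tokenizes attributes.

```python
import re

# Hypothetical pattern for one attribute, including the whitespace around it.
ATTR_RE = re.compile(
    r"(?P<leading>\s+)"
    r"(?P<name>[^\s=]+)"
    r"(?P<before_eq>\s*)=(?P<after_eq>\s*)"
    r"(?P<quote>[\"'])(?P<raw_value>.*?)(?P=quote)",
    re.DOTALL,
)

match = ATTR_RE.search("<root id = \"1\" name='foo'/>")
assert match is not None
assert match["name"] == "id"
assert match["quote"] == '"'
assert (match["before_eq"], match["after_eq"]) == (" ", " ")

# Re-joining the captured pieces reproduces the original slice exactly,
# which is what lets an edited value keep its surrounding formatting.
rebuilt = (
    match["leading"]
    + match["name"]
    + match["before_eq"]
    + "="
    + match["after_eq"]
    + match["quote"]
    + match["raw_value"]
    + match["quote"]
)
assert rebuilt == match.group(0) == ' id = "1"'
```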
104
+
105
+ ### Work with entity references
106
+
107
+ ```python
108
+ doc = xmlcst.parse("<root>a &amp; b</root>")
109
+ text = doc.root.children[0]
110
+ print(text.content) # "a &amp; b" (raw, as in the source)
111
+ print(text.decoded_content()) # "a & b" (entities resolved)
112
+
113
+ text.set_content("x < y") # auto-escapes
114
+ print(text.content) # "x &lt; y"
115
+ ```
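The raw/decoded distinction can be tried out without `xmlcst` using `xml.sax.saxutils` from the standard library. This is only an analogy for the behaviour shown above: by default these helpers handle `&amp;`, `&lt;`, and `&gt;` only, not numeric character references or custom entities, and nothing here implies `xmlcst` uses them internally.

```python
from xml.sax.saxutils import escape, unescape

raw = "a &amp; b"
assert unescape(raw) == "a & b"  # decoded view of the raw text

assert escape("x < y") == "x &lt; y"  # escaping on the way in

# Escaping then unescaping returns the original text.
plain = "1 < 2 & 3 > 2"
assert unescape(escape(plain)) == plain
```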
116
+
117
+ ## Sample Application
118
+
119
+ The [`samples/bump_pom_version/`](samples/bump_pom_version/) directory contains a complete example: a Maven POM version bumper that reads a `pom.xml`, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.
120
+
121
+ ```bash
122
+ python samples/bump_pom_version/bump_pom_version.py
123
+ # 1.2.3 -> 1.2.4
124
+ ```
125
+
126
+ The script accepts an optional path argument to operate on any POM file:
127
+
128
+ ```bash
129
+ python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
130
+ ```
131
+
132
+ ## API Overview
133
+
134
+ ### Parsing
135
+
136
+ | Function | Input | Returns |
137
+ |---|---|---|
138
+ | `xmlcst.parse(text)` | `str` | `Document` |
139
+ | `xmlcst.parse_bytes(data)` | `bytes` | `Document` |
140
+ | `xmlcst.parse_file(path)` | `str \| Path` | `Document` |
141
+
142
+ All parse functions raise `xmlcst.ParseError` on malformed input. The error includes `message`, `line`, `column`, and `offset` attributes.
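The relationship between an offset and a line/column pair can be sketched as follows. The helper `line_and_column` is hypothetical, and the 1-based numbering and character (not byte) offsets are assumptions; SPEC.md is the place to check how `ParseError` actually counts.

```python
def line_and_column(source: str, offset: int) -> tuple[int, int]:
    """Return an assumed 1-based (line, column) for a character offset."""
    line = source.count("\n", 0, offset) + 1
    last_newline = source.rfind("\n", 0, offset)
    # rfind returns -1 on the first line, which makes column 1-based there too.
    column = offset - last_newline
    return line, column


source = "<root>\n  <a>\n</root>"
offset = source.index("<a>")
assert line_and_column(source, offset) == (2, 3)
assert line_and_column(source, 0) == (1, 1)
```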
143
+
144
+ ### Node Types
145
+
146
+ | Type | Description |
147
+ |---|---|
148
+ | `Document` | Root container; holds all top-level nodes |
149
+ | `Element` | An XML element with tag, attributes, and children |
150
+ | `Attribute` | Name-value pair with formatting metadata (quote style, whitespace) |
151
+ | `AttributeList` | Ordered collection with dict-like access by name |
152
+ | `Text` | Character data (entity references preserved in raw form) |
153
+ | `Whitespace` | Whitespace-only character data between markup |
154
+ | `Comment` | `<!-- ... -->` |
155
+ | `ProcessingInstruction` | `<?target data?>` |
156
+ | `CData` | `<![CDATA[...]]>` |
157
+ | `Doctype` | `<!DOCTYPE ...>` (preserved verbatim) |
158
+ | `XmlDeclaration` | `<?xml version="1.0" ...?>` |
159
+
160
+ ### Serialization
161
+
162
+ | Method | Description |
163
+ |---|---|
164
+ | `doc.to_string()` | Exact round-trip serialization (default) |
165
+ | `doc.to_string(mode="normalized")` | Pretty-printed with consistent formatting |
166
+ | `doc.to_bytes()` | UTF-8 encoded; BOM preserved if present in input |
167
+ | `doc.write(path)` | Write to file |
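BOM preservation amounts to remembering one boolean across the decode/encode round trip. The sketch below shows the idea with the standard library; `decode_utf8` and `encode_utf8` are hypothetical helper names, not part of the `xmlcst` API.

```python
import codecs


def decode_utf8(data: bytes) -> tuple[str, bool]:
    """Decode UTF-8 bytes, reporting whether a BOM was present."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8) :] if has_bom else data
    return body.decode("utf-8"), has_bom


def encode_utf8(text: str, has_bom: bool) -> bytes:
    """Encode text as UTF-8, restoring the BOM if the input had one."""
    prefix = codecs.BOM_UTF8 if has_bom else b""
    return prefix + text.encode("utf-8")


original = codecs.BOM_UTF8 + "<root>é</root>".encode("utf-8")
text, has_bom = decode_utf8(original)
assert has_bom and text == "<root>é</root>"
assert encode_utf8(text, has_bom) == original
```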
168
+
169
+ ## Design
170
+
171
+ `xmlcst` uses a dual-layer architecture:
172
+
173
+ 1. **Token stream** (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: `"".join(t.text for t in tokens) == source`.
174
+ 2. **Tree API** (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
175
+
176
+ Unmodified nodes serialize by replaying their original tokens (byte-identical). Modified nodes rebuild from their current properties. As a result, an edit changes only the text of the nodes that were modified, which keeps diffs minimal.
177
+
178
+ See [SPEC.md](SPEC.md) for the full specification.
179
+
180
+ ## Limitations (v1)
181
+
182
+ - UTF-8 encoding only
183
+ - XML 1.0 well-formed documents only (no error recovery)
184
+ - No DTD validation or schema support
185
+ - No XPath query engine
186
+ - No streaming / SAX-style parsing
187
+ - Pure Python (no compiled acceleration)
188
+
189
+ See the [future roadmap](SPEC.md#15-future-roadmap) in the specification for planned enhancements.
190
+
191
+ ## License
192
+
193
+ MIT