xml-data-extractor 0.1.0__py3-none-any.whl

@@ -0,0 +1,408 @@
1
+ Metadata-Version: 2.4
2
+ Name: xml-data-extractor
3
+ Version: 0.1.0
4
+ Summary: A flexible, configurable Python tool for extracting fields from XML files and converting them to rectangular data formats like CSV or Excel.
5
+ Author-email: Karl Krägelin <mail@karlkraegelin.de>
6
+ Requires-Python: >=3.14
7
+ Description-Content-Type: text/markdown
8
+ Requires-Dist: click>=8.3.1
9
+ Requires-Dist: loguru>=0.7.3
10
+ Requires-Dist: lxml>=6.0.2
11
+ Requires-Dist: openpyxl>=3.1.5
12
+ Requires-Dist: pandas>=2.3.3
13
+ Requires-Dist: pyarrow>=22.0.0
14
+ Requires-Dist: pyyaml>=6.0.3
15
+ Requires-Dist: rich>=14.2.0
16
+ Requires-Dist: textual>=8.2.7
17
+
18
+ # XML Field Extractor
19
+
20
+ A flexible, configurable Python tool for extracting fields from XML files and converting them to rectangular data formats like CSV or Excel. Designed to handle complex XML formats like **METS**, **LIDO**, **PREMIS**, and more.
21
+
22
+ ## Interactive Config Builder
23
+
24
+ `xml-data-extractor` ships with a terminal UI for building configs interactively, with no YAML editing needed.
25
+
26
+ ```bash
27
+ xml-extractor build example.xml
28
+ ```
29
+
30
+ **Phase 1 - Select root element:** Navigate the XML tree and pick the element that represents one record.
31
+
32
+ ![Phase 1: select root element](assets/screenshot_phase1.png)
33
+
34
+ **Phase 2 - Map fields:** Navigate to any node and press Enter to map it as a column. The XPath is generated automatically.
35
+
36
+ ![Phase 2: map fields](assets/screenshot_phase2.png)
37
+
38
+ The builder saves a ready-to-use `config.yaml` and can immediately run the extraction.
39
+
40
+ ## Disclaimer
41
+
42
+ 🚨 This project was developed entirely with the assistance of GitHub Copilot.
43
+ It is currently in a prototype stage; use at your own risk. Contributions, reviews, and testing are highly appreciated.
44
+
45
+ ## ✨ Features
46
+
47
+ - 🔍 **Flexible XPath-based extraction** - Extract any field using XPath expressions
48
+ - 🏷️ **Automatic namespace detection** - No need to manually define namespaces (but you can if needed)
49
+ - 🎯 **Advanced filtering** - Filter extracted values using regex, startswith, or contains patterns
50
+ - 🔬 **Record filtering** - Pre-filter records before extraction based on any field criteria (dates, licenses, etc.)
51
+ - 📊 **Multiple value handling** - Concatenate multiple values with custom separators
52
+ - 🚀 **Batch processing** - Process entire directories of XML files
53
+ - 📈 **Progress tracking** - Progress bars and statistics rendered with Rich
54
+ - 📝 **Comprehensive logging** - Detailed logs with Loguru
55
+ - 🛡️ **Error resilience** - Skip invalid XML files and continue processing
56
+ - 🎨 **Modern CLI** - Built with Click for an intuitive command-line interface
57
+
58
+ ## 📦 Installation
59
+
60
+ This project uses [uv](https://github.com/astral-sh/uv) for fast, reliable package management.
61
+
62
+ ```bash
63
+ # Clone the repository
64
+ git clone <your-repo-url>
65
+ cd xml_extractor
66
+
67
+ # Install dependencies with uv
68
+ uv sync
69
+ ```
70
+
71
+ ## 🚀 Quick Start
72
+
73
+ 1. **Prepare your XML files** - Place them in a directory (e.g., `./example_data`)
74
+
75
+ 2. **Configure extraction** - Edit `config.yaml` to define:
76
+    - Root XPath for record elements
77
+    - Field mappings with XPath expressions
78
+    - Namespaces (optional - auto-detected if omitted)
79
+    - Filters for specific values
80
+
81
+ 3. **Run extraction**:
82
+ ```bash
83
+ uv run python xml_extractor.py
84
+ ```
85
+
86
+ ## 📖 Configuration
87
+
88
+ ### Basic Configuration Structure
89
+
90
+ ```yaml
91
+ # Input/Output
92
+ input_directory: "./example_data"
93
+ output_file: "output.csv"
94
+
95
+ # XPath to root element (each match = one CSV row)
96
+ root_xpath: ".//oai_dc:dc"
97
+
98
+ # Namespace definitions (optional)
99
+ namespaces:
100
+   oai_dc: "http://www.openarchives.org/OAI/2.0/oai_dc/"
101
+   dc: "http://purl.org/dc/elements/1.1/"
102
+
103
+ # Field mappings
104
+ fields:
105
+   - column: "Title"
106
+     xpath: ".//dc:title/text()"
107
+
108
+   - column: "Creators"
109
+     xpath: ".//dc:creator/text()"
110
+     separator: " | " # For multiple values
111
+
112
+   - column: "URN"
113
+     xpath: ".//dc:identifier/text()"
114
+     filter:
115
+       type: "startswith"
116
+       pattern: "urn:nbn"
117
+ ```
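
To make the relationship between `root_xpath`, `fields`, and output rows concrete, here is a minimal sketch of the idea. It is not the tool's actual code: it uses the standard library's `xml.etree.ElementTree` instead of lxml, so the field paths end at the element rather than at `/text()`, and the sample XML plus the helper names `extract_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; the namespace URIs come from the config above.
SAMPLE_XML = """<records xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <oai_dc:dc>
    <dc:title>First</dc:title>
    <dc:creator>A</dc:creator>
    <dc:creator>B</dc:creator>
  </oai_dc:dc>
  <oai_dc:dc>
    <dc:title>Second</dc:title>
  </oai_dc:dc>
</records>"""

NAMESPACES = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# ElementTree has no text() step, so these paths stop at the element.
FIELDS = [
    {"column": "Title", "xpath": ".//dc:title"},
    {"column": "Creators", "xpath": ".//dc:creator", "separator": " | "},
]


def extract_rows(xml_text, root_xpath, fields, namespaces):
    """Return one dict per element matched by root_xpath."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall(root_xpath, namespaces):
        row = {}
        for field in fields:
            values = [
                (element.text or "").strip()
                for element in record.findall(field["xpath"], namespaces)
            ]
            values = [value for value in values if value]
            # Multiple matches are joined with the field's separator.
            row[field["column"]] = field.get("separator", " | ").join(values)
        rows.append(row)
    return rows


def rows_to_csv(rows, fields):
    """Serialize the rows as CSV text, one column per configured field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f["column"] for f in fields])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_XML, ".//oai_dc:dc", FIELDS, NAMESPACES)
print(rows_to_csv(rows, FIELDS))
```

Each match of the root path becomes one row, and a field with no match yields an empty cell in this sketch; how the real tool reports missing values may differ.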
118
+
119
+ ### Field Configuration Options
120
+
121
+ Each field can have the following properties:
122
+
123
+ | Property | Required | Description |
124
+ |-------------|----------|---------------------------------------------------|
125
+ | `column`    | ✅ Yes   | Name of the CSV column                            |
125
+ | `xpath`     | ✅ Yes   | XPath expression (relative to root_xpath)         |
126
+ | `separator` | ❌ No    | Separator for multiple values (default: `" \| "`) |
127
+ | `filter`    | ❌ No    | Filter configuration for value selection          |
129
+
130
+ ### Filter Types
131
+
132
+ Apply filters to select specific values when an XPath matches multiple elements:
133
+
134
+ **Regex Filter:**
135
+ ```yaml
136
+ filter:
137
+   type: "regex"
138
+   pattern: "^urn:nbn" # Matches URNs starting with "urn:nbn"
139
+ ```
140
+
141
+ **StartsWith Filter:**
142
+ ```yaml
143
+ filter:
144
+   type: "startswith"
145
+   pattern: "10." # Matches DOIs starting with "10."
146
+ ```
147
+
148
+ **Contains Filter:**
149
+ ```yaml
150
+ filter:
151
+   type: "contains"
152
+   pattern: "miami.uni-muenster.de" # Matches URLs containing this domain
153
+ ```
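
The three filter types can be pictured as simple predicates over the list of values an XPath returned. The sketch below is an illustration under assumptions, not the tool's implementation: the function name `apply_filter` is hypothetical, and using `re.search` for the regex type is a guess about the matching mode. The second and third sample identifiers are made up.

```python
import re


def apply_filter(values, filter_config):
    """Keep only the values that satisfy the filter; no filter keeps everything."""
    if not filter_config:
        return list(values)
    kind = filter_config["type"]
    pattern = filter_config["pattern"]
    if kind == "regex":
        # Assumption: the pattern may match anywhere unless anchored with ^.
        return [v for v in values if re.search(pattern, v)]
    if kind == "startswith":
        return [v for v in values if v.startswith(pattern)]
    if kind == "contains":
        return [v for v in values if pattern in v]
    raise ValueError(f"Unknown filter type: {kind!r}")


identifiers = [
    "urn:nbn:de:hbz:6-87159515859",
    "10.1234/example",
    "https://miami.uni-muenster.de/Record/abc",
]

print(apply_filter(identifiers, {"type": "startswith", "pattern": "10."}))
```

Filtering before joining means a multi-valued field such as `dc:identifier` can feed several columns, each keeping a different kind of identifier.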
154
+
155
+ ### Transform
156
+
157
+ Extract specific parts from text using regex patterns with capture groups:
158
+
159
+ ```yaml
160
+ # Extract date from MARC 008 field and reformat: "180830e20180830||..." → "2018-08-30"
161
+ - column: "Publication_Date"
162
+   xpath: ".//marcxml:controlfield[@tag='008']/text()"
163
+   transform:
164
+     regex: "\\d{6}e(\\d{4})(\\d{2})(\\d{2})"
165
+     format: "{0}-{1}-{2}" # Reformat using capture groups
166
+
167
+ # Extract language code: "...ger||||||" → "ger"
168
+ - column: "Language_Code"
169
+   xpath: ".//marcxml:controlfield[@tag='008']/text()"
170
+   transform:
171
+     regex: "([a-z]{3})\\|{6}$"
172
+     group: 1
173
+ ```
174
+
175
+ **Transform options:**
176
+ - `regex`: Regular expression pattern (use `\\` for backslashes in YAML)
177
+ - `group`: Which capture group to extract (default: 0 = full match, 1+ = capture groups)
178
+ - `format`: Optional format string to reformat the matched data
179
+   - Use `{0}`, `{1}`, `{2}` to reference the first, second, and third capture groups
180
+   - Example: `"{0}-{1}-{2}"` turns the captured `2018`, `08`, `30` into `2018-08-30`
181
+
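
The interplay of `regex`, `group`, and `format` can be sketched in a few lines of Python. This is an illustration rather than the tool's code: the function name `apply_transform` is hypothetical, and returning `None` when the pattern does not match is an assumption. Note the two numbering schemes: `group` counts like Python's `match.group()` (0 is the full match), while `{0}` in `format` refers to the first capture group.

```python
import re


def apply_transform(value, transform):
    """Return the transformed value, or None when the regex does not match."""
    match = re.search(transform["regex"], value)
    if match is None:
        return None
    if "format" in transform:
        # {0}, {1}, ... index into the capture groups, so {0} is group 1.
        return transform["format"].format(*match.groups())
    return match.group(transform.get("group", 0))


date_transform = {
    "regex": r"\d{6}e(\d{4})(\d{2})(\d{2})",
    "format": "{0}-{1}-{2}",
}
language_transform = {"regex": r"([a-z]{3})\|{6}$", "group": 1}

print(apply_transform("180830e20180830||", date_transform))
print(apply_transform("0000ger" + "|" * 6, language_transform))
```

In Python source a raw string needs only single backslashes; the doubled backslashes in the YAML examples are needed because the patterns sit inside double-quoted YAML strings.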
182
+ ### URL Prefix
183
+
184
+ Automatically prepend a URL to extracted values:
185
+
186
+ ```yaml
187
+ - column: "DOI"
188
+   xpath: ".//datafield[@tag='024']/subfield[@code='a']/text()"
189
+   url_prefix: "https://doi.org/" # Converts "10.1234/..." to "https://doi.org/10.1234/..."
190
+ ```
191
+
205
+ ### Record Filters
206
+
207
+ Pre-filter records **before** extraction to only process records matching specific criteria:
208
+
209
+ ```yaml
210
+ # Only extract records from the 1920s with Public Domain licenses
211
+ record_filters:
212
+   - xpath: ".//mods:dateCreated/text()"
213
+     condition: "matches"
214
+     value: "192\\d"
215
+
216
+   - xpath: ".//mods:accessCondition/text()"
217
+     condition: "contains"
218
+     value: "publicdomain"
219
+ ```
220
+
221
+ **Available filter conditions:** `exists`, `not_exists`, `equals`, `not_equals`, `contains`, `not_contains`, `matches`, `not_matches`, `date_after`, `date_before`, `in`, `not_in`
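
As a rough mental model, a record filter checks the values its XPath returns against a condition, and a record is kept only when every filter passes. The sketch below covers a subset of the conditions and rests on several assumptions that the README does not confirm: filters combine with AND, a condition holds if any value satisfies it, and each `not_` condition is the plain negation of its positive form. `date_after` and `date_before` are left out, and all function names are hypothetical.

```python
import re

# Positive checks over the list of values one XPath returned for a record.
CHECKS = {
    "exists": lambda values, expected: len(values) > 0,
    "equals": lambda values, expected: any(v == expected for v in values),
    "contains": lambda values, expected: any(expected in v for v in values),
    "matches": lambda values, expected: any(re.search(expected, v) for v in values),
    "in": lambda values, expected: any(v in expected for v in values),
}


def check_condition(values, condition, expected=None):
    """Evaluate one condition; a "not_" prefix negates the positive check."""
    negate = condition.startswith("not_")
    name = condition[4:] if negate else condition
    result = CHECKS[name](values, expected)
    return not result if negate else result


def record_passes(record_values, record_filters):
    """record_values maps each filter's xpath to the values it matched."""
    return all(
        check_condition(record_values.get(f["xpath"], []), f["condition"], f.get("value"))
        for f in record_filters
    )


filters = [
    {"xpath": ".//mods:dateCreated/text()", "condition": "matches", "value": r"192\d"},
    {"xpath": ".//mods:accessCondition/text()", "condition": "contains", "value": "publicdomain"},
]
record = {
    ".//mods:dateCreated/text()": ["1925"],
    ".//mods:accessCondition/text()": ["https://creativecommons.org/publicdomain/mark/1.0/"],
}
print(record_passes(record, filters))
```

See RECORD_FILTERS.md for the authoritative semantics of each condition.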
222
+
223
+ 📖 **See [RECORD_FILTERS.md](RECORD_FILTERS.md) for complete documentation and examples**
224
+
225
+ ## 🎯 Usage Examples
226
+
227
+ ### Basic Usage
228
+ ```bash
229
+ # Use default config.yaml
230
+ uv run python xml_extractor.py
231
+
232
+ # Specify custom config
233
+ uv run python xml_extractor.py --config my_config.yaml
234
+
235
+ # Override input/output
236
+ uv run python xml_extractor.py --input ./my_xmls --output results.csv
237
+
238
+ # Enable debug mode
239
+ uv run python xml_extractor.py --debug
240
+ ```
241
+
242
+ ### Command-Line Options
243
+
244
+ ```
245
+ Options:
246
+   -c, --config PATH    Path to configuration YAML file  [default: config.yaml]
247
+   -i, --input PATH     Input directory containing XML files (overrides config)
248
+   -o, --output PATH    Output CSV file path (overrides config)
249
+   --debug              Enable debug mode with verbose logging
250
+   --log-file PATH      Log file path (overrides config)
251
+   --help               Show this message and exit
252
+ ```
253
+
254
+ ## 📂 Example: Dublin Core OAI-PMH Records
255
+
256
+ ### Input XML Structure
257
+
258
+ Your XML files can have different structures:
259
+
260
+ **Single record per file:**
261
+ ```xml
262
+ <OAI-PMH xmlns="...">
263
+   <GetRecord>
264
+     <record>
265
+       <metadata>
266
+         <oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
267
+           <dc:title>My Title</dc:title>
268
+           <dc:creator>Author Name</dc:creator>
269
+           <!-- more fields -->
270
+         </oai_dc:dc>
271
+       </metadata>
272
+     </record>
273
+   </GetRecord>
274
+ </OAI-PMH>
275
+ ```
276
+
277
+ **Multiple records per file:**
278
+ ```xml
279
+ <records>
280
+   <record>
281
+     <metadata>
282
+       <oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
283
+         <!-- fields -->
284
+       </oai_dc:dc>
285
+     </metadata>
286
+   </record>
287
+   <record>
288
+     <metadata>
289
+       <oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
290
+         <!-- fields -->
291
+       </oai_dc:dc>
292
+     </metadata>
293
+   </record>
294
+ </records>
295
+ ```
296
+
297
+ Both formats work seamlessly - the tool finds all `oai_dc:dc` elements across all files.
298
+
299
+ ### Output CSV
300
+
301
+ ```csv
302
+ Title,Creators,Date,URN,DocType
303
+ "Religiöse Traditionen in...","Dam, P. (Peter) van",2018-08-30,urn:nbn:de:hbz:6-87159515859,doc-type:article
304
+ "Urinary Dickkopf 3...","Jehn, U. (Ulrich) | Altuner, U. (Ugur)",2024-08-21,urn:nbn:de:hbz:6-05978733071,doc-type:article
305
+ ```
306
+
307
+ ## 🔧 Advanced Examples
308
+
309
+ ### METS Configuration
310
+
311
+ ```yaml
312
+ root_xpath: ".//mets:mets"
313
+
314
+ namespaces:
315
+   mets: "http://www.loc.gov/METS/"
316
+   mods: "http://www.loc.gov/mods/v3"
317
+   xlink: "http://www.w3.org/1999/xlink"
318
+
319
+ fields:
320
+   - column: "Title"
321
+     xpath: ".//mets:dmdSec/mets:mdWrap/mets:xmlData/mods:mods/mods:titleInfo/mods:title/text()"
322
+
323
+   - column: "FileID"
324
+     xpath: ".//mets:file/@ID"
325
+ ```
326
+
327
+ ### LIDO Configuration
328
+
329
+ ```yaml
330
+ root_xpath: ".//lido:lido"
331
+
332
+ namespaces:
333
+   lido: "http://www.lido-schema.org"
334
+
335
+ fields:
336
+   - column: "ObjectTitle"
337
+     xpath: ".//lido:titleWrap/lido:titleSet/lido:appellationValue/text()"
338
+
339
+   - column: "ObjectType"
340
+     xpath: ".//lido:objectWorkType/lido:term/text()"
341
+ ```
342
+
343
+ ## 📊 Statistics & Logging
344
+
345
+ After extraction, you'll see:
346
+
347
+ - ✅ Files processed
348
+ - ⚠️ Files skipped (due to errors)
349
+ - 📝 Total records extracted
350
+ - ⚠️ Fields with missing data
351
+ - ❌ Errors encountered
352
+
353
+ All details are logged to `extraction.log` (configurable).
354
+
355
+ ## 🛠️ Development
356
+
357
+ ### Project Structure
358
+
359
+ ```
360
+ xml_extractor/
361
+ ├── xml_extractor.py      # Main application
362
+ ├── config.yaml           # Configuration file
363
+ ├── pyproject.toml        # Project dependencies
364
+ ├── README.md             # This file
365
+ └── extraction.log        # Generated log file
366
+ ```
367
+
368
+ ### Dependencies
369
+
370
+ - **lxml** - Fast XML processing with XPath support
371
+ - **rich** - Beautiful terminal output
372
+ - **click** - CLI framework
373
+ - **loguru** - Simple, powerful logging
374
+ - **pyyaml** - YAML configuration parsing
375
+
376
+ ## 🤝 Contributing
377
+
378
+ Contributions are welcome! Please feel free to submit issues or pull requests.
379
+
380
+ ## 📄 License
381
+
382
+ MIT License - feel free to use this tool for your projects!
383
+
384
+ ## 💡 Tips
385
+
386
+ 1. **Start with debug mode** (`--debug`) when creating a new configuration to see what's happening
387
+ 2. **Test with a small subset** of files first before processing large batches
388
+ 3. **Use filters** to extract specific identifiers (URNs, DOIs, etc.) from multi-value fields
389
+ 4. **Check the log file** (`extraction.log`) for warnings about missing XPath matches
390
+ 5. **Namespaces are auto-detected** - you only need to define them manually if auto-detection fails
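
When auto-detection fails and you need to write the `namespaces` block by hand, it helps to list the declarations a file actually contains. The sketch below shows one way to do that with the standard library; it is not how the tool itself detects namespaces, and the function name `detect_namespaces` and the `default` prefix are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def detect_namespaces(xml_source):
    """Collect prefix -> URI declarations from an XML file path or file object."""
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(xml_source, events=("start-ns",)):
        # The default namespace has an empty prefix, which XPath cannot use;
        # "default" is a hypothetical stand-in name. The first declaration
        # of each prefix wins.
        namespaces.setdefault(prefix or "default", uri)
    return namespaces


sample = io.BytesIO(
    b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    b'<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/"/>'
    b"</OAI-PMH>"
)
found = detect_namespaces(sample)
for prefix, uri in found.items():
    print(f'{prefix}: "{uri}"')
```

The printed lines can be pasted under `namespaces:` in `config.yaml` after adding indentation and renaming `default` to a prefix of your choice.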
391
+
392
+ ## 🆘 Troubleshooting
393
+
394
+ ### "No records found with root_xpath"
395
+ - Check your `root_xpath` expression
396
+ - Verify namespace prefixes match your XML
397
+ - Try without namespace prefix: `.//dc` instead of `.//oai_dc:dc`
398
+
399
+ ### "Invalid XPath expression"
400
+ - Ensure XPath syntax is correct
401
+ - Check that namespace prefixes are defined
402
+ - Use `.//` for descendant search, `/` for direct children
403
+
404
+ ### "XML syntax error"
405
+ - Enable `skip_invalid_xml: true` in config to skip bad files
406
+ - Check XML file encoding (should be UTF-8)
407
+ - Validate XML structure with an XML validator
408
+
@@ -0,0 +1,7 @@
1
+ xml_config_builder.py,sha256=A-JMPqAdn59JKeNl1J_geulsMoVahqtT-y3RT2OiZ1I,20012
2
+ xml_extractor.py,sha256=8KUFlde-UEMrquFWdfzhWkFHqfY6yvSc1feI_PnzN3Y,29563
3
+ xml_data_extractor-0.1.0.dist-info/METADATA,sha256=7NSfoJPybAoApdlWvwZn-OSNcGLLfe4MzQt8A1a7in4,11811
4
+ xml_data_extractor-0.1.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
5
+ xml_data_extractor-0.1.0.dist-info/entry_points.txt,sha256=7FGh1dnmdK0UGIg-x_tF0ID5_-oyDyLFugfdVtufBro,52
6
+ xml_data_extractor-0.1.0.dist-info/top_level.txt,sha256=mqWt9lami_mshTPvLSk3pQKFzHtVarrlNJUsCvePjJU,33
7
+ xml_data_extractor-0.1.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (82.0.1)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ xml-extractor = xml_extractor:cli
@@ -0,0 +1,2 @@
1
+ xml_config_builder
2
+ xml_extractor