xml-data-extractor 0.1.0__py3-none-any.whl
- xml_config_builder.py +617 -0
- xml_data_extractor-0.1.0.dist-info/METADATA +408 -0
- xml_data_extractor-0.1.0.dist-info/RECORD +7 -0
- xml_data_extractor-0.1.0.dist-info/WHEEL +5 -0
- xml_data_extractor-0.1.0.dist-info/entry_points.txt +2 -0
- xml_data_extractor-0.1.0.dist-info/top_level.txt +2 -0
- xml_extractor.py +778 -0
@@ -0,0 +1,408 @@
Metadata-Version: 2.4
Name: xml-data-extractor
Version: 0.1.0
Summary: A flexible, configurable Python tool for extracting fields from XML files and converting them to rectangular data formats like CSV or Excel.
Author-email: Karl Krägelin <mail@karlkraegelin.de>
Requires-Python: >=3.14
Description-Content-Type: text/markdown
Requires-Dist: click>=8.3.1
Requires-Dist: loguru>=0.7.3
Requires-Dist: lxml>=6.0.2
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas>=2.3.3
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pyyaml>=6.0.3
Requires-Dist: rich>=14.2.0
Requires-Dist: textual>=8.2.7

# XML Field Extractor

A flexible, configurable Python tool for extracting fields from XML files and converting them to rectangular data formats like CSV or Excel. Designed to handle complex XML formats like **METS**, **LIDO**, **PREMIS**, and more.

## Interactive Config Builder

`xml-data-extractor` ships with a terminal UI for building configs interactively, so no YAML editing is needed.

```bash
xml-extractor build example.xml
```

**Phase 1 - Select root element:** Navigate the XML tree and pick the element that represents one record.

**Phase 2 - Map fields:** Navigate to any node and press Enter to map it as a column. The XPath is generated automatically.

The builder saves a ready-to-use `config.yaml` and can immediately run the extraction.

## Disclaimer

This project was completely developed with the assistance of GitHub Copilot.
It is currently in a prototype stage; use at your own risk. Contributions, reviews, and testing are highly appreciated.

## Features

- **Flexible XPath-based extraction** - Extract any field using XPath expressions
- **Automatic namespace detection** - No need to manually define namespaces (but you can if needed)
- **Advanced filtering** - Filter extracted values using regex, startswith, or contains patterns
- **Record filtering** - Pre-filter records before extraction based on any field criteria (dates, licenses, etc.)
- **Multiple value handling** - Concatenate multiple values with custom separators
- **Batch processing** - Process entire directories of XML files
- **Progress tracking** - Beautiful progress bars and statistics with Rich
- **Comprehensive logging** - Detailed logs with Loguru
- **Error resilience** - Skip invalid XML files and continue processing
- **Modern CLI** - Built with Click for an intuitive command-line interface

## Installation

This project uses [uv](https://github.com/astral-sh/uv) for fast, reliable package management.

```bash
# Clone the repository
git clone <your-repo-url>
cd xml_extractor

# Install dependencies with uv
uv sync
```

## Quick Start

1. **Prepare your XML files** - Place them in a directory (e.g., `./example_data`)

2. **Configure extraction** - Edit `config.yaml` to define:
   - Root XPath for record elements
   - Field mappings with XPath expressions
   - Namespaces (optional - auto-detected if omitted)
   - Filters for specific values

3. **Run extraction**:
   ```bash
   uv run python xml_extractor.py
   ```

## Configuration

### Basic Configuration Structure

```yaml
# Input/Output
input_directory: "./example_data"
output_file: "output.csv"

# XPath to root element (each match = one CSV row)
root_xpath: ".//oai_dc:dc"

# Namespace definitions (optional)
namespaces:
  oai_dc: "http://www.openarchives.org/OAI/2.0/oai_dc/"
  dc: "http://purl.org/dc/elements/1.1/"

# Field mappings
fields:
  - column: "Title"
    xpath: ".//dc:title/text()"

  - column: "Creators"
    xpath: ".//dc:creator/text()"
    separator: " | "  # For multiple values

  - column: "URN"
    xpath: ".//dc:identifier/text()"
    filter:
      type: "startswith"
      pattern: "urn:nbn"
```
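
To make the relationship between `root_xpath`, `fields`, and `separator` concrete, here is a minimal sketch of the row-building logic. It is not the code shipped in `xml_extractor.py`: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without dependencies, and because ElementTree has no `text()` step, the sketch strips that suffix and reads `.text` instead. The function name `extract_rows` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Mirrors part of the YAML above as a plain dict.
CONFIG = {
    "root_xpath": ".//oai_dc:dc",
    "fields": [
        {"column": "Title", "xpath": ".//dc:title/text()"},
        {"column": "Creators", "xpath": ".//dc:creator/text()", "separator": " | "},
    ],
}

# Hypothetical sample document, not taken from the project.
SAMPLE = """<records xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <oai_dc:dc>
    <dc:title>My Title</dc:title>
    <dc:creator>Author One</dc:creator>
    <dc:creator>Author Two</dc:creator>
  </oai_dc:dc>
</records>"""


def extract_rows(xml_text, config, namespaces):
    """Return one dict per element matched by root_xpath (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall(config["root_xpath"], namespaces):
        row = {}
        for field in config["fields"]:
            # ElementTree has no text() step, so drop it and read .text instead.
            path = field["xpath"].removesuffix("/text()")
            values = [el.text.strip() for el in record.findall(path, namespaces) if el.text]
            # Multiple matches are joined with the field's separator.
            row[field["column"]] = field.get("separator", " | ").join(values)
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE, CONFIG, NAMESPACES)
print(rows)
```

Each match of `root_xpath` becomes one row, and each field's XPath is evaluated relative to that match, which is why the field expressions start with `.//`.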

### Field Configuration Options

Each field can have the following properties:

| Property    | Required | Description                                       |
|-------------|----------|---------------------------------------------------|
| `column`    | Yes      | Name of the CSV column                            |
| `xpath`     | Yes      | XPath expression (relative to `root_xpath`)       |
| `separator` | No       | Separator for multiple values (default: `" \| "`) |
| `filter`    | No       | Filter configuration for value selection          |

### Filter Types

Apply filters to select specific values when an XPath matches multiple elements:

**Regex Filter:**
```yaml
filter:
  type: "regex"
  pattern: "^urn:nbn"  # Matches URNs starting with "urn:nbn"
```

**StartsWith Filter:**
```yaml
filter:
  type: "startswith"
  pattern: "10."  # Matches DOIs starting with "10."
```

**Contains Filter:**
```yaml
filter:
  type: "contains"
  pattern: "miami.uni-muenster.de"  # Matches URLs containing this domain
```
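
The three filter types can be read as simple predicates over the list of values an XPath returns. The sketch below shows one plausible reading, not the project's actual implementation: `apply_filter` is a hypothetical name, the use of `re.search` for the regex type is an assumption, and the sample identifiers are made up.

```python
import re


def apply_filter(values, filter_cfg):
    """Keep only the values that satisfy the filter (hypothetical helper)."""
    kind = filter_cfg["type"]
    pattern = filter_cfg["pattern"]
    if kind == "regex":
        # Assumption: the pattern may match anywhere unless it is anchored.
        return [v for v in values if re.search(pattern, v)]
    if kind == "startswith":
        return [v for v in values if v.startswith(pattern)]
    if kind == "contains":
        return [v for v in values if pattern in v]
    raise ValueError(f"unknown filter type: {kind}")


# Made-up values, as a dc:identifier field with several entries might return.
identifiers = [
    "urn:nbn:de:hbz:6-87159515859",
    "10.1234/example",
    "https://miami.uni-muenster.de/Record/abc",
]

print(apply_filter(identifiers, {"type": "regex", "pattern": "^urn:nbn"}))
print(apply_filter(identifiers, {"type": "startswith", "pattern": "10."}))
print(apply_filter(identifiers, {"type": "contains", "pattern": "miami.uni-muenster.de"}))
```

Note that `startswith` and `contains` compare literal text, so the `.` in `"10."` is a plain dot, whereas in a `regex` pattern it would match any character.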

### Transform

Extract specific parts from text using regex patterns with capture groups:

```yaml
# Extract date from MARC 008 field and reformat: "180830e20180830||..." → "2018-08-30"
- column: "Publication_Date"
  xpath: ".//marcxml:controlfield[@tag='008']/text()"
  transform:
    regex: "\\d{6}e(\\d{4})(\\d{2})(\\d{2})"
    format: "{0}-{1}-{2}"  # Reformat using capture groups

# Extract language code: "...ger||||||" → "ger"
- column: "Language_Code"
  xpath: ".//marcxml:controlfield[@tag='008']/text()"
  transform:
    regex: "([a-z]{3})\\|{6}$"
    group: 1
```

**Transform options:**
- `regex`: Regular expression pattern (in double-quoted YAML strings, write each backslash as `\\`)
- `group`: Which capture group to extract (default: 0 = full match, 1+ = capture groups)
- `format`: Optional format string to reformat the matched data
  - Use `{0}`, `{1}`, `{2}` to reference capture groups; as the date example shows, these are zero-based, so `{0}` is the first capture group
  - Example: `"{0}-{1}-{2}"` converts `20180830` to `2018-08-30`
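
The interplay of `regex`, `group`, and `format` can be sketched as follows. This is an illustration of the documented behaviour, not the project's code: `apply_transform` is a hypothetical name, returning `None` on a failed match is an assumption, and the sample string is a shortened stand-in for a MARC 008 field.

```python
import re


def apply_transform(value, transform):
    """Apply a regex transform to one extracted value (hypothetical helper)."""
    match = re.search(transform["regex"], value)
    if match is None:
        return None  # Assumption: a value that does not match yields nothing.
    if "format" in transform:
        # {0}, {1}, {2} are filled from the capture groups in order.
        return transform["format"].format(*match.groups())
    # Without a format string, pick one group (0 is the full match).
    return match.group(transform.get("group", 0))


# Shortened, made-up stand-in for a MARC 008 control field.
marc_008 = "180830e20180830xxxger" + "|" * 6

date = apply_transform(
    marc_008,
    {"regex": r"\d{6}e(\d{4})(\d{2})(\d{2})", "format": "{0}-{1}-{2}"},
)
lang = apply_transform(marc_008, {"regex": r"([a-z]{3})\|{6}$", "group": 1})
print(date, lang)
```

The Python raw strings above contain single backslashes; that is what the doubled backslashes in the YAML examples decode to once the file is parsed.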

### URL Prefix

Automatically prepend a URL to extracted values:

```yaml
- column: "DOI"
  xpath: ".//datafield[@tag='024']/subfield[@code='a']/text()"
  url_prefix: "https://doi.org/"  # Converts "10.1234/..." to "https://doi.org/10.1234/..."
```

### Record Filters

Pre-filter records **before** extraction to only process records matching specific criteria:

```yaml
# Only extract records from the 1920s with Public Domain licenses
record_filters:
  - xpath: ".//mods:dateCreated/text()"
    condition: "matches"
    value: "192\\d"

  - xpath: ".//mods:accessCondition/text()"
    condition: "contains"
    value: "publicdomain"
```

**Available filter conditions:** `exists`, `not_exists`, `equals`, `not_equals`, `contains`, `not_contains`, `matches`, `not_matches`, `date_after`, `date_before`, `in`, `not_in`
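
As a rough mental model, a record is kept only when every entry in `record_filters` holds. The sketch below covers a subset of the conditions (`exists`, `equals`, `contains`, `matches`, and their `not_` variants) and leaves out `date_after`, `date_before`, `in`, and `not_in`. The name `record_passes`, the rule that a condition holds when any matched value satisfies it, and the sample values are all assumptions, not the project's documented semantics.

```python
import re

# Positive checks; the not_ variants are derived by negation below.
CHECKS = {
    "equals": lambda value, expected: value == expected,
    "contains": lambda value, expected: expected in value,
    "matches": lambda value, expected: re.search(expected, value) is not None,
}


def record_passes(values_by_xpath, record_filters):
    """Return True when every filter holds for the record (hypothetical helper).

    values_by_xpath maps each filter XPath to the values it matched in the record.
    """
    for rule in record_filters:
        values = values_by_xpath.get(rule["xpath"], [])
        condition = rule["condition"]
        negate = condition.startswith("not_")
        name = condition.removeprefix("not_")
        if name == "exists":
            result = bool(values)
        else:
            # Assumption: one satisfying value is enough for the condition to hold.
            result = any(CHECKS[name](v, rule["value"]) for v in values)
        if result == negate:
            return False
    return True


FILTERS = [
    {"xpath": ".//mods:dateCreated/text()", "condition": "matches", "value": r"192\d"},
    {"xpath": ".//mods:accessCondition/text()", "condition": "contains", "value": "publicdomain"},
]

# Made-up values for a single record.
record_1925 = {
    ".//mods:dateCreated/text()": ["1925"],
    ".//mods:accessCondition/text()": ["https://creativecommons.org/publicdomain/mark/1.0/"],
}
record_1931 = dict(record_1925, **{".//mods:dateCreated/text()": ["1931"]})

print(record_passes(record_1925, FILTERS), record_passes(record_1931, FILTERS))
```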

**See [RECORD_FILTERS.md](RECORD_FILTERS.md) for complete documentation and examples**

## Usage Examples

### Basic Usage
```bash
# Use default config.yaml
uv run python xml_extractor.py

# Specify custom config
uv run python xml_extractor.py --config my_config.yaml

# Override input/output
uv run python xml_extractor.py --input ./my_xmls --output results.csv

# Enable debug mode
uv run python xml_extractor.py --debug
```

### Command-Line Options

```
Options:
  -c, --config PATH    Path to configuration YAML file [default: config.yaml]
  -i, --input PATH     Input directory containing XML files (overrides config)
  -o, --output PATH    Output CSV file path (overrides config)
  --debug              Enable debug mode with verbose logging
  --log-file PATH      Log file path (overrides config)
  --help               Show this message and exit
```

## Example: Dublin Core OAI-PMH Records

### Input XML Structure

Your XML files can have different structures:

**Single record per file:**
```xml
<OAI-PMH xmlns="...">
  <GetRecord>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
          <dc:title>My Title</dc:title>
          <dc:creator>Author Name</dc:creator>
          <!-- more fields -->
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
```

**Multiple records per file:**
```xml
<records>
  <record>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
        <!-- fields -->
      </oai_dc:dc>
    </metadata>
  </record>
  <record>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="..." xmlns:dc="...">
        <!-- fields -->
      </oai_dc:dc>
    </metadata>
  </record>
</records>
```

Both formats work seamlessly - the tool finds all `oai_dc:dc` elements across all files.

### Output CSV

```csv
Title,Creators,Date,URN,DocType
"Religiöse Traditionen in...","Dam, P. (Peter) van",2018-08-30,urn:nbn:de:hbz:6-87159515859,doc-type:article
"Urinary Dickkopf 3...","Jehn, U. (Ulrich) | Altuner, U. (Ugur)",2024-08-21,urn:nbn:de:hbz:6-05978733071,doc-type:article
```

## Advanced Examples

### METS Configuration

```yaml
root_xpath: ".//mets:mets"

namespaces:
  mets: "http://www.loc.gov/METS/"
  mods: "http://www.loc.gov/mods/v3"
  xlink: "http://www.w3.org/1999/xlink"

fields:
  - column: "Title"
    xpath: ".//mets:dmdSec/mets:mdWrap/mets:xmlData/mods:mods/mods:titleInfo/mods:title/text()"

  - column: "FileID"
    xpath: ".//mets:file/@ID"
```

### LIDO Configuration

```yaml
root_xpath: ".//lido:lido"

namespaces:
  lido: "http://www.lido-schema.org"

fields:
  - column: "ObjectTitle"
    xpath: ".//lido:titleWrap/lido:titleSet/lido:appellationValue/text()"

  - column: "ObjectType"
    xpath: ".//lido:objectWorkType/lido:term/text()"
```

## Statistics & Logging

After extraction, you'll see:

- Files processed
- Files skipped (due to errors)
- Total records extracted
- Fields with missing data
- Errors encountered

All details are logged to `extraction.log` (configurable).

## Development

### Project Structure

```
xml_extractor/
├── xml_extractor.py      # Main application
├── config.yaml           # Configuration file
├── pyproject.toml        # Project dependencies
├── README.md             # This file
└── extraction.log        # Generated log file
```

### Dependencies

- **lxml** - Fast XML processing with XPath support
- **rich** - Beautiful terminal output
- **click** - CLI framework
- **loguru** - Simple, powerful logging
- **pyyaml** - YAML configuration parsing

## Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

## License

MIT License - feel free to use this tool for your projects!

## Tips

1. **Start with debug mode** (`--debug`) when creating a new configuration to see what's happening
2. **Test with a small subset** of files first before processing large batches
3. **Use filters** to extract specific identifiers (URNs, DOIs, etc.) from multi-value fields
4. **Check the log file** (`extraction.log`) for warnings about missing XPath matches
5. **Namespaces are auto-detected** - you only need to define them manually if auto-detection fails
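
If auto-detection fails and you need to find out which prefixes a file declares, one way to list them is sketched below. It is not how the tool itself detects namespaces (that is not documented here): it uses the standard library's `iterparse` with `start-ns` events, `detect_namespaces` is a hypothetical name, and mapping the unprefixed default namespace to the key `default` is an arbitrary choice.

```python
import io
import xml.etree.ElementTree as ET


def detect_namespaces(source):
    """Collect prefix -> URI declarations from an XML file (hypothetical helper)."""
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # The default namespace has an empty prefix; give it a usable name.
        # setdefault keeps the first declaration if a prefix is redeclared.
        namespaces.setdefault(prefix or "default", uri)
    return namespaces


# Made-up sample document.
SAMPLE = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>My Title</dc:title>
  </oai_dc:dc>
</OAI-PMH>"""

detected = detect_namespaces(io.BytesIO(SAMPLE))
for prefix, uri in detected.items():
    print(f"{prefix}: {uri}")
```

The printed pairs can be copied into the `namespaces:` block of `config.yaml`.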

## Troubleshooting

### "No records found with root_xpath"
- Check your `root_xpath` expression
- Verify namespace prefixes match your XML
- Try without namespace prefix: `.//dc` instead of `.//oai_dc:dc`

### "Invalid XPath expression"
- Ensure XPath syntax is correct
- Check that namespace prefixes are defined
- Use `.//` to search all descendants and `./` for direct children

### "XML syntax error"
- Enable `skip_invalid_xml: true` in config to skip bad files
- Check XML file encoding (should be UTF-8)
- Validate XML structure with an XML validator

@@ -0,0 +1,7 @@
xml_config_builder.py,sha256=A-JMPqAdn59JKeNl1J_geulsMoVahqtT-y3RT2OiZ1I,20012
xml_extractor.py,sha256=8KUFlde-UEMrquFWdfzhWkFHqfY6yvSc1feI_PnzN3Y,29563
xml_data_extractor-0.1.0.dist-info/METADATA,sha256=7NSfoJPybAoApdlWvwZn-OSNcGLLfe4MzQt8A1a7in4,11811
xml_data_extractor-0.1.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
xml_data_extractor-0.1.0.dist-info/entry_points.txt,sha256=7FGh1dnmdK0UGIg-x_tF0ID5_-oyDyLFugfdVtufBro,52
xml_data_extractor-0.1.0.dist-info/top_level.txt,sha256=mqWt9lami_mshTPvLSk3pQKFzHtVarrlNJUsCvePjJU,33
xml_data_extractor-0.1.0.dist-info/RECORD,,