py-devo 0.2.0__tar.gz → 0.2.2__tar.gz
- py_devo-0.2.2/PKG-INFO +778 -0
- py_devo-0.2.2/README.md +764 -0
- py_devo-0.2.2/py_devo.egg-info/PKG-INFO +778 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/pyproject.toml +2 -2
- py_devo-0.2.0/PKG-INFO +0 -167
- py_devo-0.2.0/README.md +0 -153
- py_devo-0.2.0/py_devo.egg-info/PKG-INFO +0 -167
- {py_devo-0.2.0 → py_devo-0.2.2}/LICENSE +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/__init__.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/_infer.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/_parser.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/_report.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/_schema.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/cli.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/enrich.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/exceptions.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/validate.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/devo/webui.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/py_devo.egg-info/SOURCES.txt +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/py_devo.egg-info/dependency_links.txt +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/py_devo.egg-info/entry_points.txt +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/py_devo.egg-info/requires.txt +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/py_devo.egg-info/top_level.txt +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/setup.cfg +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/tests/test_cli.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/tests/test_enrich.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/tests/test_infer.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/tests/test_parser.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/tests/test_syntax_only.py +0 -0
- {py_devo-0.2.0 → py_devo-0.2.2}/tests/test_validate.py +0 -0
py_devo-0.2.2/PKG-INFO
ADDED
@@ -0,0 +1,778 @@
Metadata-Version: 2.4
Name: py-devo
Version: 0.2.2
Summary: DEVO — CSV to iCSV enrichment and Frictionless validation
License-Expression: MIT
Project-URL: Source, https://github.com/chasenunez/devo
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: frictionless>=4.0.0
Provides-Extra: webui
Requires-Dist: flask>=2.0.0; extra == "webui"
Dynamic: license-file

# DEVO

DEVO takes a plain CSV, infers column types and statistics, and produces three output files:

| Output file | What it is |
|---|---|
| `data.icsv` | Self-documenting [iCSV](https://envidat.github.io/iCSV/) with embedded metadata |
| `data_schema.json` | [Frictionless Table Schema](https://specs.frictionlessdata.io/table-schema/) for data validation |
| `data_DEVO_report.txt` | Human-readable validation report (**start here**) |

Before uploading, confirm:

- [ ] `Valid: YES` in the report
- [ ] All `# types` entries match the real-world meaning of each column
- [ ] `# min` and `# max` values are physically plausible
- [ ] `# missing_count` values match your expectations
- [ ] No `[WARN]` lines in TYPE CONSISTENCY
- [ ] `# description` fields are filled in (if required by your data archive)
- [ ] The `.icsv` file and its `_schema.json` are both included in your upload

---

## Contents

1. [Installation](#1-installation)
2. [The Three Commands](#2-the-three-commands)
3. [Tutorial: From Messy CSV to Upload-Ready iCSV](#3-tutorial-from-messy-csv-to-upload-ready-icsv)
4. [Understanding the Validation Report](#4-understanding-the-validation-report)
5. [Understanding the iCSV Format](#5-understanding-the-icsv-format)
6. [Common Errors and How to Fix Them](#6-common-errors-and-how-to-fix-them)
7. [CLI Reference](#7-cli-reference)
8. [Python API](#8-python-api)

---

## 1. Installation

```bash
pip install py-devo
```

Requires Python 3.9 or later. The `frictionless` package is installed automatically.

To install from a local clone:

```bash
pip install -e .
```

Verify the installation:

```bash
devo --help
```

---

## 2. The Three Commands

```
devo run data.csv          # enrich → validate → report  (most common)
devo enrich data.csv       # CSV → iCSV + schema only    (no validation)
devo validate data.icsv    # validate an iCSV against its schema
```

All three write their outputs to `DEVO_output/` by default. Use `--out MY_DIR` to write elsewhere.

**Exit codes:** `0` = everything passed, `1` = validation found data errors, `2` = usage or file error.
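
When DEVO is called from a script, the exit code is the simplest signal to branch on. The sketch below is one possible wrapper, not part of DEVO: the names `describe_exit` and `run_devo` are hypothetical, and it assumes the `devo` executable is on `PATH`. Only the three exit-code meanings come from this README.

```python
import subprocess

# Exit-code meanings as listed in this README.
EXIT_MEANINGS = {
    0: "everything passed",
    1: "validation found data errors",
    2: "usage or file error",
}


def describe_exit(code: int) -> str:
    """Translate a devo exit code into the README's wording."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_devo(csv_path: str, out_dir: str = "DEVO_output") -> int:
    """Run `devo run` (hypothetical helper; needs py-devo installed)."""
    result = subprocess.run(["devo", "run", csv_path, "--out", out_dir])
    print(f"devo: {describe_exit(result.returncode)}")
    return result.returncode
```

A pipeline step could call `run_devo("data.csv")` and stop when the return value is non-zero.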

---

## 3. Tutorial: From Messy CSV to Upload-Ready iCSV

This tutorial walks through a realistic scenario: environmental sensor data with two common problems. You will enrich the file, read the output to spot the problems, fix the source CSV, and confirm the corrected file is ready for upload.

### Step 1: The raw data (with errors)

Save the following as `sensor_data.csv`:

```csv
station_id,observation_date,temperature_c,humidity_pct,wind_speed_ms
S001,2024-01-15,21.4,65,3.2
S002,2024-01-15,MISSING,72,N/A
S003,2024-01-15,19.8,168,5.1
S004,2024-01-15,23.1,71,2.8
S005,2024-01-16,20.0,71,4.0
```

Two problems are hidden in this file:

- **Data row 2 (station `S002`), `temperature_c`**: the value `MISSING` is a custom nodata sentinel that DEVO does not recognise by default. DEVO will treat it as a real string value, which forces the entire column's inferred type to `string` instead of `number`.
- **Data row 3 (station `S003`), `humidity_pct`**: the value `168` is a data-entry error; relative humidity cannot exceed 100%. DEVO will not catch impossible domain values on its own, but the iCSV will expose the inflated maximum so you can spot it.

(Note: `N/A` in `wind_speed_ms` is fine; it is a recognised nodata sentinel and is handled correctly.)
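
To see why a single unrecognised sentinel changes the type of a whole column, here is a deliberately simplified inference sketch. It is not DEVO's actual code: `infer_type` is a hypothetical helper, it ignores `datetime`, and the sentinel set is a shortened version of the default list given later in this README.

```python
# Shortened, assumed subset of the default nodata sentinels.
DEFAULT_NODATA = {"", "NA", "N/A", "na", "n/a", "NULL", "null", "nan", "NaN"}


def infer_type(values, nodata=DEFAULT_NODATA):
    """Return 'integer', 'number' or 'string' for a column of raw strings."""
    # Sentinels are dropped before inference; everything else must parse.
    present = [v.strip() for v in values if v.strip() not in nodata]

    def all_parse(cast):
        try:
            for v in present:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    return "string"


temperature = ["21.4", "MISSING", "19.8", "23.1", "20.0"]
wind_speed = ["3.2", "N/A", "5.1", "2.8", "4.0"]

print(infer_type(temperature))                                 # string
print(infer_type(wind_speed))                                  # number
print(infer_type(temperature, DEFAULT_NODATA | {"MISSING"}))   # number
```

One non-numeric value that is not in the sentinel set is enough to make every numeric parse fail, so the column falls through to `string`.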

---

### Step 2: First run

```bash
devo run sensor_data.csv
```

Terminal output:

```
[OK] Enriched: DEVO_output/sensor_data.icsv
[OK] Report: DEVO_output/sensor_data_DEVO_report.txt
```

The command exits with code `0` (success) because DEVO describes the data as it finds it; the schema it builds from the data will technically fit the data. Errors only appear in the report when the data contradicts the schema. Reading the outputs is how you find hidden problems.

---

### Step 3: Read the validation report

Open `DEVO_output/sensor_data_DEVO_report.txt`:

```
DEVO Validation Report
======================
File: sensor_data.icsv
Date: 2024-01-20T10:35:22Z
Valid: YES

METADATA
----------------------------------------
[OK] All required metadata present.

TYPE CONSISTENCY
----------------------------------------
[OK] station_id: declared=string, inferred=string
[OK] observation_date: declared=datetime, inferred=datetime
[OK] temperature_c: declared=string, inferred=string
[OK] humidity_pct: declared=integer, inferred=integer
[OK] wind_speed_ms: declared=number, inferred=number

DATA VALIDATION
----------------------------------------
[PASS] No data errors found.
```

**The report says `Valid: YES`.** But look at `temperature_c`: it is declared and inferred as `string`. Temperature readings should be numbers. The report is technically correct (the declared type matches the inferred type), but the inferred type is wrong because DEVO did not know that `MISSING` should be treated as a nodata sentinel.

The report alone is not enough. You also need to read the iCSV.

---

### Step 4: Read the iCSV to spot the problems

Open `DEVO_output/sensor_data.icsv`. The `# [FIELDS]` section is the most important part to review:

```
# [FIELDS]
# fields = station_id|observation_date|temperature_c|humidity_pct|wind_speed_ms
# types = string|datetime|string|integer|number
# min = |2024-01-15T00:00:00||65|2.8
# max = |2024-01-16T00:00:00||168|5.1
# missing_count = 0|0|0|0|1
# description = ||||
```

Scan each column from left to right:

| Column | Type | Min | Max | Missing | Problem? |
|---|---|---|---|---|---|
| `station_id` | string | — | — | 0 | No |
| `observation_date` | datetime | 2024-01-15 | 2024-01-16 | 0 | No |
| `temperature_c` | **string** | — | — | **0** | **Yes: should be number; `MISSING` not recognised** |
| `humidity_pct` | integer | 65 | **168** | 0 | **Yes: max of 168 is physically impossible** |
| `wind_speed_ms` | number | 2.8 | 5.1 | 1 | No |

Two red flags:

1. `temperature_c` type is `string` and `missing_count` is `0`; the column has a nodata value (`MISSING`) that was treated as a real string.
2. `humidity_pct` max is `168`. Relative humidity cannot exceed 100; this is a data-entry error.
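
The left-to-right scan can also be done in code by turning the `# [FIELDS]` lines into one record per column. The following is a minimal sketch, assuming the pipe delimiter shown above; `parse_fields_block` is a hypothetical helper and not part of DEVO's API.

```python
def parse_fields_block(text, delimiter="|"):
    """Map each column name to its per-column metadata from '# [FIELDS]'."""
    lines_by_key = {}
    in_fields = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# ["):
            # A new section header; only '# [FIELDS]' is of interest here.
            in_fields = line == "# [FIELDS]"
            continue
        if in_fields and line.startswith("#") and "=" in line:
            key, _, raw = line[1:].partition("=")
            lines_by_key[key.strip()] = [v.strip() for v in raw.split(delimiter)]
    names = lines_by_key.pop("fields")
    return {
        name: {key: values[i] for key, values in lines_by_key.items()}
        for i, name in enumerate(names)
    }


sample = """# [FIELDS]
# fields = station_id|observation_date|temperature_c|humidity_pct|wind_speed_ms
# types = string|datetime|string|integer|number
# min = |2024-01-15T00:00:00||65|2.8
# max = |2024-01-16T00:00:00||168|5.1
# missing_count = 0|0|0|0|1
# description = ||||
"""

columns = parse_fields_block(sample)
for name, info in columns.items():
    print(name, info["types"], info["min"], info["max"], info["missing_count"])
```

With the records in hand, a check such as "typed `string` but expected numeric" becomes a simple dictionary lookup.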

---

### Step 5: Fix the errors

#### Fix 1: The unrecognised nodata sentinel

The cleanest fix is to replace `MISSING` in the source CSV with a sentinel DEVO already recognises: `N/A`, `NA`, `null`, or an empty cell are all understood automatically.

In data row 2 (station `S002`), change `temperature_c` from `MISSING` to `N/A` (or leave the cell blank).

If you cannot change the source data and `MISSING` will always appear in your files, pass `--nodata MISSING` on the command line. DEVO will then treat `MISSING` the same way it treats `N/A`:

```bash
devo run sensor_data.csv --nodata MISSING
```

#### Fix 2: The impossible humidity value

Data row 3 (station `S003`) has `humidity_pct = 168`. Investigate the source; it is likely a typo for `68`. Correct it in the CSV.

---

### Step 6: Re-run on the corrected file

After making both corrections, `sensor_data.csv` should look like this:

```csv
station_id,observation_date,temperature_c,humidity_pct,wind_speed_ms
S001,2024-01-15,21.4,65,3.2
S002,2024-01-15,N/A,72,N/A
S003,2024-01-15,19.8,68,5.1
S004,2024-01-15,23.1,71,2.8
S005,2024-01-16,20.0,71,4.0
```

Run DEVO again:

```bash
devo run sensor_data.csv
```

Terminal output:

```
[OK] Enriched: DEVO_output/sensor_data.icsv
[OK] Report: DEVO_output/sensor_data_DEVO_report.txt
```

Validation report:

```
DEVO Validation Report
======================
File: sensor_data.icsv
Date: 2024-01-20T10:40:15Z
Valid: YES

METADATA
----------------------------------------
[OK] All required metadata present.

TYPE CONSISTENCY
----------------------------------------
[OK] station_id: declared=string, inferred=string
[OK] observation_date: declared=datetime, inferred=datetime
[OK] temperature_c: declared=number, inferred=number
[OK] humidity_pct: declared=integer, inferred=integer
[OK] wind_speed_ms: declared=number, inferred=number

DATA VALIDATION
----------------------------------------
[PASS] No data errors found.
```

The `# [FIELDS]` section of the iCSV now shows correct types and a plausible maximum for humidity:

```
# [FIELDS]
# fields = station_id|observation_date|temperature_c|humidity_pct|wind_speed_ms
# types = string|datetime|number|integer|number
# min = |2024-01-15T00:00:00|19.8|65|2.8
# max = |2024-01-16T00:00:00|23.1|72|5.1
# missing_count = 0|0|1|0|1
# description = ||||
```

---

### Step 7: How to know the file is ready for upload

A file is ready for upload when all of the following are true:

- [ ] **Report says `Valid: YES`**
- [ ] **All column types in `# types` are correct** for the data: numbers are `integer` or `number`, dates are `datetime`, free text is `string`
- [ ] **`# min` and `# max` values are physically plausible**, with no impossible extremes like `humidity_pct = 168`
- [ ] **`# missing_count` matches your expectation.** If a column should have no gaps and shows `missing_count = 5`, investigate before uploading.
- [ ] **No `[WARN]` lines in TYPE CONSISTENCY.** A warning means the declared type does not match what DEVO sees in the data (see [Common Errors](#6-common-errors-and-how-to-fix-them)).

Once all boxes are checked, submit the `.icsv` file and its accompanying `_schema.json`.
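
The plausibility item on this checklist needs domain knowledge that DEVO does not have. One way to make it repeatable is to keep your own table of physical limits and compare it with the `min`/`max` values from the iCSV. This is a sketch under stated assumptions: `out_of_range` and the limits below are hypothetical and must come from you, not from DEVO.

```python
def out_of_range(observed, limits):
    """Compare observed (min, max) per column with user-supplied limits."""
    problems = []
    for column, (low, high) in limits.items():
        if column not in observed:
            continue
        obs_min, obs_max = observed[column]
        if obs_min < low:
            problems.append(f"{column}: min {obs_min} is below {low}")
        if obs_max > high:
            problems.append(f"{column}: max {obs_max} is above {high}")
    return problems


# Assumed domain limits for this tutorial's sensor data.
LIMITS = {"humidity_pct": (0, 100), "wind_speed_ms": (0, 120)}

first_run = {"humidity_pct": (65, 168), "wind_speed_ms": (2.8, 5.1)}
second_run = {"humidity_pct": (65, 72), "wind_speed_ms": (2.8, 5.1)}

print(out_of_range(first_run, LIMITS))   # ['humidity_pct: max 168 is above 100']
print(out_of_range(second_run, LIMITS))  # []
```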

---

## 4. Understanding the Validation Report

The report has a header followed by three sections: METADATA, TYPE CONSISTENCY, and DATA VALIDATION.

### Report header

```
DEVO Validation Report
======================
File: sensor_data.icsv
Date: 2024-01-20T10:40:15Z
Valid: YES
```

`Valid: YES` means **both** the metadata check and the Frictionless data check passed. Type consistency warnings (`[WARN]`) do not make the file invalid; they are advisory. `Valid: NO` means at least one `[FAIL]` was found in METADATA or DATA VALIDATION.
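
Because the report is plain text with fixed prefixes, its verdict is easy to read from a script. The sketch below is an illustration only: `summarize_report` is a hypothetical helper, and it relies on nothing beyond the `Valid:`, `[WARN]` and `[FAIL]` prefixes shown in this README.

```python
def summarize_report(text):
    """Collect the verdict plus any [WARN] and [FAIL] lines from a report."""
    summary = {"valid": None, "warnings": [], "failures": []}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Valid:"):
            summary["valid"] = stripped.split(":", 1)[1].strip() == "YES"
        elif stripped.startswith("[WARN]"):
            summary["warnings"].append(stripped)
        elif stripped.startswith("[FAIL]"):
            summary["failures"].append(stripped)
    return summary


sample_report = """DEVO Validation Report
======================
File: sensor_data.icsv
Date: 2024-01-20T10:40:15Z
Valid: YES

TYPE CONSISTENCY
----------------------------------------
[WARN] humidity_pct: declared=integer, inferred=number
"""

result = summarize_report(sample_report)
print(result["valid"], len(result["warnings"]), len(result["failures"]))
```

A file with `valid` set to `True` can still carry warnings, which matches the advisory role of `[WARN]` described above.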

---

### METADATA section

Checks that the required iCSV metadata keys are present.

```
METADATA
----------------------------------------
[OK] All required metadata present.
```

Or, if there are problems:

```
METADATA
----------------------------------------
[FAIL] Missing required metadata key: field_delimiter
[WARN] Spatial columns detected but 'geometry' metadata key is missing
[WARN] Spatial columns detected but 'srid' metadata key is missing
```

| Message | Meaning | Effect on `Valid` |
|---|---|---|
| `[OK] All required metadata present.` | Everything is in order | — |
| `[FAIL] Missing required metadata key: field_delimiter` | The `field_delimiter` key is absent from `# [METADATA]` | Sets `Valid: NO` |
| `[WARN] Spatial columns detected but 'geometry' metadata key is missing` | Columns named `lat`/`lon`/`geometry` found but `geometry` key is not declared | Advisory only |
| `[WARN] Spatial columns detected but 'srid' metadata key is missing` | Lat/lon columns found but no coordinate reference system declared | Advisory only |

`[FAIL]` in METADATA sets the overall result to `Valid: NO`. `[WARN]` in METADATA does not.

---

### TYPE CONSISTENCY section

DEVO re-infers each column's type from the actual data rows and compares it to the type declared in `# [FIELDS]`. This catches cases where the declared type was manually edited to be stricter than what the data actually contains.

```
TYPE CONSISTENCY
----------------------------------------
[OK] temperature_c: declared=number, inferred=number
[WARN] humidity_pct: declared=integer, inferred=number
Inferred type is wider than declared. Data may not satisfy 'integer' constraints.
```

| Result | Meaning |
|---|---|
| `[OK]` | Inferred type is equal to or narrower than declared (e.g., inferred `integer` satisfies declared `number`) |
| `[WARN]` | Inferred type is **wider** than declared (e.g., inferred `number` does not satisfy declared `integer`; floats exist but integers are expected) |

Type hierarchy (narrowest to widest): `integer` → `number` → `string`, and `datetime` → `string`.

`[WARN]` in TYPE CONSISTENCY is advisory and does **not** set `Valid: NO`. However, it usually means the data has values that will fail Frictionless validation. Check the DATA VALIDATION section for accompanying `[FAIL]` lines.
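
The `[OK]`/`[WARN]` rule follows directly from the type hierarchy: a declared type is satisfied when it equals the inferred type or sits further along the widening chain. The sketch below encodes only the two chains stated above. The function names are hypothetical, and treating unrelated pairs such as `number` and `datetime` as a warning is an assumption, since this README does not cover that case.

```python
# Each type maps to the next wider type; 'string' is the widest.
NEXT_WIDER = {"integer": "number", "number": "string",
              "datetime": "string", "string": None}


def widening_chain(type_name):
    """Return the type followed by every wider type, narrowest first."""
    chain = [type_name]
    while NEXT_WIDER[chain[-1]] is not None:
        chain.append(NEXT_WIDER[chain[-1]])
    return chain


def consistency(declared, inferred):
    """'OK' if inferred is equal to or narrower than declared, else 'WARN'."""
    return "OK" if declared in widening_chain(inferred) else "WARN"


print(consistency("number", "integer"))   # OK
print(consistency("integer", "number"))   # WARN
print(consistency("number", "string"))    # WARN
```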

---

### DATA VALIDATION section

Frictionless validates the actual data rows against the schema JSON. This catches type mismatches, out-of-range values, and required-field violations.

```
DATA VALIDATION
----------------------------------------
[PASS] No data errors found.
```

Or, when errors are found:

```
DATA VALIDATION
----------------------------------------
[FAIL] 3 error(s) found:
Row 2, Col temperature_c [type-error]: type is "number/default" and value "MISSING" is not valid
Row 3, Col humidity_pct [constraint-error]: constraint "maximum is 100" is not satisfied for value "168"
Row 4, Col station_id [required-error]: constraint "required is True" is not satisfied for value ""
```

Each error line shows:

- **Row number**: the row in the data section (row 1 is the header, so row 2 is the first data row)
- **Column name**: which field failed
- **Error code**: the Frictionless error type (see table below)
- **Message**: the specific constraint that was violated

| Error code | What it means | How to fix |
|---|---|---|
| `type-error` | A value cannot be parsed as the declared type | Correct the value in the source CSV, or adjust the type in the schema if the declaration is wrong |
| `constraint-error` | A value falls outside a `minimum`, `maximum`, or other constraint | Correct the value in the source CSV, or update the schema constraint if it was set too tightly |
| `required-error` | A required field has a blank or missing value | Fill in the missing value, or mark the field as not required in the schema |

If there are more than 50 errors, the report shows only the first 50 and notes the total count. Fix the listed errors first; re-running often reveals whether additional errors exist.
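
When a report lists many errors, grouping them by error code shows which fix to start with. The sketch below assumes the line layout shown in the example above (`Row N, Col NAME [code]: message`); the regular expression and the `parse_errors` helper are illustrative and not part of DEVO.

```python
import re
from collections import Counter

# Matches lines such as: Row 2, Col temperature_c [type-error]: ...
ERROR_LINE = re.compile(r"^Row (\d+), Col (\S+) \[([a-z-]+)\]: (.*)$")


def parse_errors(report_text):
    """Extract row, column, code and message from DATA VALIDATION lines."""
    errors = []
    for line in report_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            row, column, code, message = match.groups()
            errors.append({"row": int(row), "column": column,
                           "code": code, "message": message})
    return errors


sample = """[FAIL] 3 error(s) found:
Row 2, Col temperature_c [type-error]: type is "number/default" and value "MISSING" is not valid
Row 3, Col humidity_pct [constraint-error]: constraint "maximum is 100" is not satisfied for value "168"
Row 4, Col station_id [required-error]: constraint "required is True" is not satisfied for value ""
"""

errors = parse_errors(sample)
by_code = Counter(error["code"] for error in errors)
print(by_code)
```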

---

## 5. Understanding the iCSV Format

An iCSV file is a plain-text CSV with a structured comment header. Comments begin with `#`. There are three named sections.

### `# [METADATA]` section

Key/value pairs describing the file as a whole.

```
# iCSV 1.0 UTF-8
# [METADATA]
# iCSV_version = 1.0
# field_delimiter = |
# rows = 5
# columns = 5
# creation_date = 2024-01-20T10:40:15.123456Z
# nodata = N/A
# generator = DEVO
```

**`field_delimiter`** is the character used to separate values in `# [FIELDS]` lines and in the `# [DATA]` section. DEVO maps commas to `|` (pipe) to avoid ambiguity with the `,` separator in metadata lines. This key is **required**; its absence is a `[FAIL]`.

**`nodata`** is the most commonly seen missing-value sentinel in the data. DEVO detects this automatically from the data; you can override it with `--nodata VALUE`.

**`geometry`** and **`srid`** are written automatically when DEVO detects spatial columns (columns named `lat`/`latitude`, `lon`/`lng`/`longitude`, or `geometry`).
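
For hand-authored files it can help to check the metadata header before running validation. The sketch below reads the `# [METADATA]` key/value pairs and reports missing keys. `parse_metadata` and `missing_keys` are hypothetical helpers; `field_delimiter` is the only key this README names as required, so any further required keys are left to the caller.

```python
def parse_metadata(text):
    """Return the key/value pairs found under '# [METADATA]'."""
    metadata = {}
    in_metadata = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# ["):
            in_metadata = stripped == "# [METADATA]"
            continue
        if in_metadata and stripped.startswith("#") and "=" in stripped:
            key, _, value = stripped[1:].partition("=")
            metadata[key.strip()] = value.strip()
    return metadata


def missing_keys(metadata, required=("field_delimiter",)):
    """List required keys that are absent from the parsed metadata."""
    return [key for key in required if key not in metadata]


header = """# iCSV 1.0 UTF-8
# [METADATA]
# iCSV_version = 1.0
# field_delimiter = |
# rows = 5
# nodata = N/A
# [FIELDS]
# fields = station_id|observation_date
"""

meta = parse_metadata(header)
print(meta["field_delimiter"], missing_keys(meta))
```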

---

### `# [FIELDS]` section

Per-column metadata. Each line is a pipe-delimited list aligned to the column order in `# [DATA]`.

```
# [FIELDS]
# fields = station_id|observation_date|temperature_c|humidity_pct|wind_speed_ms
# types = string|datetime|number|integer|number
# min = |2024-01-15T00:00:00|19.8|65|2.8
# max = |2024-01-16T00:00:00|23.1|72|5.1
# missing_count = 0|0|1|0|1
# description = ||||
```

| Field line | What to look for |
|---|---|
| `types` | Confirm every column has the type you expect. `string` for a column that should be numeric is a red flag. |
| `min` / `max` | Verify the range makes sense for your domain. A humidity maximum of 168 is physically impossible and indicates a data-entry error. String and all-missing columns have no min/max (blank). |
| `missing_count` | A `0` on a column that should have gaps means your nodata sentinel was not recognised. A high count on a column that should be complete is worth investigating. |
| `description` | Blank by default. Fill these in by hand before uploading if your archive requires field descriptions. |

DEVO assigns each column one of four **Frictionless types**: `string`, `integer`, `number`, `datetime`. It infers them in this order of preference: `integer` → `number` → `datetime` → `string`.

---

### `# [DATA]` section

The data rows, written with the `field_delimiter` as the separator. The first row after `# [DATA]` is the column header.

```
# [DATA]
station_id|observation_date|temperature_c|humidity_pct|wind_speed_ms
S001|2024-01-15|21.4|65|3.2
S002|2024-01-15|N/A|72|N/A
...
```

You can edit values in `# [DATA]` directly, but if you do, re-run `devo validate` afterwards to confirm the edited file still passes.

---

## 6. Common Errors and How to Fix Them

### A numeric column is typed as `string`

**Symptom:** `# types` shows `string` for a column that holds measurements or counts. `min` and `max` are blank for that column.

**Cause:** At least one value in the column is not a number and is not a recognised nodata sentinel. Common culprits: custom sentinels like `MISSING`, `ND`, `NM`, `-`, `none`; stray text like `error` or `N/M`; unit suffixes like `21.4°C`.

**Fix options:**

1. Replace the non-numeric values with a standard sentinel (`N/A`, `NA`, `null`, or an empty cell) in the source CSV, then re-run.
2. If you cannot change the source, tell DEVO about the custom sentinel:
   ```bash
   devo run data.csv --nodata MISSING
   ```
3. If the column genuinely has mixed text (e.g., a notes field), `string` may be correct; no action is needed.

---

### `[WARN]` in TYPE CONSISTENCY

**Symptom:**

```
[WARN] temperature_c: declared=number, inferred=string
Inferred type is wider than declared. Data may not satisfy 'number' constraints.
```

**Cause:** The type declared in `# [FIELDS]` (usually set during enrichment or edited manually) is stricter than what the actual data rows contain. The most common cause is editing the iCSV type from `string` to `number` without also fixing the values that caused the original `string` inference.

**Fix:** Look for non-numeric, non-sentinel values in that column's data rows. Either:

- Replace them with a recognised sentinel and re-run `devo run` on the corrected source CSV, or
- Revert the type in `# [FIELDS]` to `string` if the column really contains mixed content.

---

### `[FAIL] type-error` in DATA VALIDATION

**Symptom:**

```
[FAIL] 1 error(s) found:
Row 2, Col temperature_c [type-error]: type is "number/default" and value "MISSING" is not valid
```

**Cause:** A value in the data cannot be parsed as the declared type in the schema JSON. This often occurs together with a TYPE CONSISTENCY `[WARN]` and typically means the schema says one type (e.g., `number`) while the data contains incompatible values (e.g., the string `MISSING`).

**Fix:** Correct the value in the source data and re-run. If the value is a nodata sentinel, use `--nodata VALUE` so it is excluded from type inference and added to the schema's `missingValues` list.
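
For orientation, this is roughly what a Table Schema descriptor with a custom sentinel looks like. `missingValues`, `type`, and `constraints` are standard Frictionless Table Schema properties, but the exact descriptor DEVO writes is not shown in this README, so treat the field selection and values below as an assumed example built from the tutorial data.

```python
import json

# Assumed example descriptor; DEVO's real output may differ in detail.
schema = {
    "fields": [
        {"name": "station_id", "type": "string",
         "constraints": {"required": True}},
        {"name": "temperature_c", "type": "number",
         "constraints": {"minimum": 19.8, "maximum": 23.1}},
        {"name": "humidity_pct", "type": "integer",
         "constraints": {"required": True, "minimum": 65, "maximum": 72}},
    ],
    # Cells equal to any of these strings count as missing, not as data.
    "missingValues": ["", "N/A", "MISSING"],
}

print(json.dumps(schema, indent=2))
```

With `MISSING` in `missingValues`, a validator reads that cell as an empty value instead of trying to parse it as a `number`.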

---

### `[FAIL] constraint-error` in DATA VALIDATION

**Symptom:**

```
[FAIL] 1 error(s) found:
Row 3, Col humidity_pct [constraint-error]: constraint "maximum is 72" is not satisfied for value "168"
```

**Cause:** A value violates a `minimum` or `maximum` constraint in the schema. The schema constraints are derived from the data at enrichment time; if you later add or correct rows that push values outside the original range, validation will fail.

**Fix options:**

1. Correct the outlier in the source CSV (e.g., change `168` to `68`) and re-run `devo run`.
2. If the new range is legitimate, re-run `devo enrich` to rebuild the schema from the updated data, then `devo validate` to confirm.

---

### `[FAIL] required-error` in DATA VALIDATION

**Symptom:**

```
[FAIL] 1 error(s) found:
Row 4, Col station_id [required-error]: constraint "required is True" is not satisfied for value ""
```

**Cause:** A field was declared `required: true` in the schema (because it had no missing values at enrichment time), but a later row has an empty or missing value for that field.

**Fix options:**

1. Fill in the missing value in the source CSV and re-run.
2. If blanks are valid for that column, rebuild the schema after adding a row with a blank value; DEVO will set `required: false` and `missing_count` to a non-zero value.

---

### `[FAIL] Missing required metadata key: field_delimiter`

**Symptom:**

```
METADATA
----------------------------------------
[FAIL] Missing required metadata key: field_delimiter
```

`Valid: NO`

**Cause:** The iCSV's `# [METADATA]` section is missing the `field_delimiter` key. This should not occur in iCSV files generated by DEVO, but can happen in hand-authored files.

**Fix:** Add `# field_delimiter = |` (or your actual delimiter) to the `# [METADATA]` section of the iCSV file.

---

### `[ERROR] Column name(s) contain the iCSV delimiter`

**Symptom (terminal):**

```
[ERROR] Column name(s) contain the iCSV delimiter '|': ['flow|rate'].
Rename the columns or force a different delimiter with --delimiter.
```

**Cause:** A column header in the source CSV contains the pipe character `|`. DEVO uses `|` as the iCSV field delimiter, so a pipe inside a column name is ambiguous.

**Fix options:**

1. Rename the column in the source CSV (e.g., `flow|rate` → `flow_rate`).
2. Force a different delimiter that does not appear in your column names:
   ```bash
   devo run data.csv --delimiter ":"
   ```
   Valid iCSV delimiters are: `, | / \ : ;`
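
Before choosing a value for `--delimiter`, you can check which of the valid delimiters are free of collisions with your column names. A minimal sketch, assuming the delimiter list above; `usable_delimiters` is a hypothetical helper, and it inspects only the header, not the data values.

```python
# The valid iCSV delimiters listed in this README.
VALID_DELIMITERS = [",", "|", "/", "\\", ":", ";"]


def usable_delimiters(column_names):
    """Return the valid delimiters that appear in none of the column names."""
    return [d for d in VALID_DELIMITERS
            if not any(d in name for name in column_names)]


print(usable_delimiters(["station_id", "flow|rate"]))
# [',', '/', '\\', ':', ';']
```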

---

### `[ERROR] No schema provided and none found`

**Symptom (terminal):**

```
[ERROR] No schema provided and none found alongside data.icsv.
Run 'devo enrich' first or pass --schema.
```

**Cause:** `devo validate` expects a schema JSON file in the same directory as the iCSV, named `<stem>_schema.json`. If the schema file is missing or in a different location, validation cannot run.

**Fix options:**

1. Run `devo enrich data.csv` first to generate the schema, then `devo validate`.
2. Point to an existing schema explicitly:
   ```bash
   devo validate data.icsv --schema /path/to/data_schema.json
   ```
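
The `<stem>_schema.json` naming rule is simple to reproduce, which is useful for checking that the schema is in place before calling `devo validate`. The helper below is a hypothetical illustration built on `pathlib`; it mirrors the naming rule stated above and does not call DEVO.

```python
from pathlib import Path


def expected_schema_path(icsv_path):
    """Return the schema path expected alongside the given iCSV file."""
    path = Path(icsv_path)
    return path.with_name(f"{path.stem}_schema.json")


schema_path = expected_schema_path("DEVO_output/data.icsv")
print(schema_path, schema_path.exists())
```

If `exists()` is false, either run `devo enrich` first or pass the real location with `--schema`.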

---

### `[ERROR] data.icsv is already an iCSV file`

**Symptom (terminal):**

```
[ERROR] data.icsv is already an iCSV file.
Use 'devo validate' to validate it, or 'devo run' which handles both.
```

**Cause:** You ran `devo enrich` on a `.icsv` file.

**Fix:** Use `devo validate data.icsv` to validate it, or `devo run data.icsv` (which detects the `.icsv` format and skips enrichment automatically).

---

### Nodata sentinels DEVO recognises automatically

The following values are treated as missing by default; no `--nodata` flag needed:

```
(empty cell)  NA  N/A  na  n/a  NULL  null  nan  NaN  -999  -999.0  -999.000000
```

Any other sentinel, such as `MISSING`, `ND`, `NM`, `none`, `-`, or `9999`, must be declared with `--nodata VALUE`.
|
|
634
|
+
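A quick way to tell whether a placeholder in your data needs `--nodata` is to compare it against the list above. The sketch below is illustrative only: `DEFAULT_NODATA` is transcribed from this section, `needs_nodata_flag` is a hypothetical helper, and exact string comparison is an assumption (how DEVO itself handles surrounding whitespace or letter case is not stated here).

```python
# Sentinels listed above as recognised by default; "" stands for an empty cell.
DEFAULT_NODATA = {
    "", "NA", "N/A", "na", "n/a", "NULL", "null", "nan", "NaN",
    "-999", "-999.0", "-999.000000",
}


def needs_nodata_flag(value):
    """Return True if `value` is not in the default list and must be declared."""
    return value not in DEFAULT_NODATA


for sentinel in ["NA", "-999", "MISSING", "9999"]:
    print(sentinel, needs_nodata_flag(sentinel))
```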

---

## 7. CLI Reference

### `devo run`: enrich then validate (most common)

```bash
devo run INPUT [--out DIR] [--delimiter CHAR] [--nodata VALUE] [--app PROFILE]
```

If `INPUT` is a `.csv`, DEVO enriches it first, then validates. If `INPUT` is already a `.icsv`, enrichment is skipped.

| Flag | Default | Description |
|---|---|---|
| `--out DIR` | `DEVO_output` | Directory for all output files |
| `--delimiter CHAR` | auto-detected | Force a specific input delimiter (CSV files only) |
| `--nodata VALUE` | auto-detected | Declare a custom missing-value sentinel |
| `--app PROFILE` | (none) | Set the `application_profile` metadata key |

---

### `devo enrich`: CSV → iCSV + schema

```bash
devo enrich INPUT.csv [--out DIR] [--delimiter CHAR] [--nodata VALUE] [--app PROFILE]
```

Writes `INPUT.icsv` and `INPUT_schema.json` to `--out DIR`. This command does not validate the result.

---

### `devo validate`: iCSV → validation report

```bash
devo validate INPUT.icsv [--out DIR] [--schema PATH]
```

| Flag | Default | Description |
|---|---|---|
| `--out DIR` | `DEVO_output` | Directory for the report |
| `--schema PATH` | auto-discover | Path to the schema JSON; defaults to `INPUT_schema.json` in the same directory as the iCSV |

---

### Exit codes

| Code | Meaning |
|---|---|
| `0` | Success: validation passed (or enrichment completed without errors) |
| `1` | Validation failed: data errors found; read the report |
| `2` | Usage or runtime error: bad arguments, missing file, etc. |

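When DEVO is called from another script, the exit codes in the table above can drive the next step. The sketch below assumes the `devo` executable is on `PATH`; `describe_exit_code` and `run_devo` are hypothetical wrapper names, not part of the package. For example, `code, meaning = run_devo("data.csv")` runs the full pipeline, and a `code` of `1` means the report should be read before the data is used.

```python
import subprocess

# Meanings transcribed from the exit-code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "validation failed",
    2: "usage or runtime error",
}


def describe_exit_code(code):
    """Map a DEVO exit code to a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_devo(input_path, out_dir="DEVO_output"):
    """Run `devo run INPUT --out DIR` and return (exit_code, description).

    Assumes the `devo` command is installed and available on PATH.
    """
    result = subprocess.run(["devo", "run", str(input_path), "--out", out_dir])
    return result.returncode, describe_exit_code(result.returncode)
```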

---

## 8. Python API

For scripted or batch use cases:

```python
from pathlib import Path

from devo.enrich import ICSVEnricher
from devo.validate import validate_icsv

# Step 1: Enrich CSV → iCSV + schema
icsv_path, schema_path = ICSVEnricher().make_icsv(
    "sensor_data.csv",
    outdir="DEVO_output",
    nodata_override="MISSING",    # optional: custom sentinel
    application_profile="MyApp",  # optional: profile name
)

# Step 2: Validate
report_path, is_valid = validate_icsv(
    icsv_path,
    schema_path=schema_path,
    outdir="DEVO_output",
)

print(f"Valid: {is_valid}")
print(f"Report: {report_path}")

if not is_valid:
    # Read the report for details
    print(Path(report_path).read_text())
```

### Error handling

```python
from devo.enrich import ICSVEnricher
from devo.exceptions import DEVOError, EnrichError, ParseError, ValidationError
from devo.validate import validate_icsv

try:
    icsv_path, schema_path = ICSVEnricher().make_icsv("data.csv", "out")
    report_path, is_valid = validate_icsv(icsv_path, schema_path=schema_path)
except EnrichError as e:
    print(f"Enrichment failed: {e}")
except ParseError as e:
    print(f"Could not parse iCSV header: {e}")
except ValidationError as e:
    print(f"Validation infrastructure error: {e}")
except FileNotFoundError as e:
    print(f"File not found: {e}")
```

| Exception | When it is raised |
|---|---|
| `EnrichError` | Input CSV cannot be read, is already an iCSV, or has column names that contain the output delimiter |
| `ParseError` | An iCSV file is missing its `# [METADATA]` section or cannot be opened |
| `ValidationError` | The `frictionless` package is not installed |
| `FileNotFoundError` | The input file or schema file does not exist |

`EnrichError`, `ParseError`, and `ValidationError` inherit from `DEVOError`, so `except DEVOError` catches any DEVO-specific failure. `FileNotFoundError`, as used in the example, is Python's built-in exception, so it is not covered by `except DEVOError` and is caught separately.

---

### Batch processing example

```python
from pathlib import Path
from devo.enrich import ICSVEnricher
from devo.validate import validate_icsv
from devo.exceptions import DEVOError

enricher = ICSVEnricher()
results = []

for csv_file in Path("incoming").glob("*.csv"):
    try:
        icsv, schema = enricher.make_icsv(str(csv_file), outdir="DEVO_output")
        report, valid = validate_icsv(icsv, schema_path=schema)
        results.append((csv_file.name, valid, report))
    except DEVOError as e:
        results.append((csv_file.name, False, str(e)))

for name, valid, info in results:
    status = "READY" if valid else "NEEDS REVIEW"
    print(f"{status} {name} → {info}")
```

---

## License

MIT. See `LICENSE`.