cgm-format 0.2.2__tar.gz
- cgm_format-0.2.2/LICENSE +21 -0
- cgm_format-0.2.2/PKG-INFO +680 -0
- cgm_format-0.2.2/README.md +642 -0
- cgm_format-0.2.2/cgm_format.egg-info/PKG-INFO +680 -0
- cgm_format-0.2.2/cgm_format.egg-info/SOURCES.txt +19 -0
- cgm_format-0.2.2/cgm_format.egg-info/dependency_links.txt +1 -0
- cgm_format-0.2.2/cgm_format.egg-info/requires.txt +9 -0
- cgm_format-0.2.2/cgm_format.egg-info/top_level.txt +2 -0
- cgm_format-0.2.2/formats/__init__.py +88 -0
- cgm_format-0.2.2/formats/dexcom.py +307 -0
- cgm_format-0.2.2/formats/libre.py +277 -0
- cgm_format-0.2.2/formats/medtronic_WIP.py +531 -0
- cgm_format-0.2.2/formats/unified.py +185 -0
- cgm_format-0.2.2/interface/__init__.py +27 -0
- cgm_format-0.2.2/interface/cgm_interface.py +253 -0
- cgm_format-0.2.2/interface/schema.py +315 -0
- cgm_format-0.2.2/pyproject.toml +24 -0
- cgm_format-0.2.2/setup.cfg +4 -0
- cgm_format-0.2.2/tests/test_format_converter.py +443 -0
- cgm_format-0.2.2/tests/test_format_processor.py +1547 -0
- cgm_format-0.2.2/tests/test_schema.py +367 -0
cgm_format-0.2.2/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 GlucoseDAO
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
cgm_format-0.2.2/PKG-INFO
ADDED
|
@@ -0,0 +1,680 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: cgm-format
|
|
3
|
+
Version: 0.2.2
|
|
4
|
+
Summary: Unified CGM data format converter for ML training and inference
|
|
5
|
+
License: MIT License
|
|
6
|
+
|
|
7
|
+
Copyright (c) 2025 GlucoseDAO
|
|
8
|
+
|
|
9
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
10
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
11
|
+
in the Software without restriction, including without limitation the rights
|
|
12
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
13
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
14
|
+
furnished to do so, subject to the following conditions:
|
|
15
|
+
|
|
16
|
+
The above copyright notice and this permission notice shall be included in all
|
|
17
|
+
copies or substantial portions of the Software.
|
|
18
|
+
|
|
19
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
20
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
21
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
22
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
23
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
24
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
25
|
+
SOFTWARE.
|
|
26
|
+
|
|
27
|
+
Requires-Python: >=3.12
|
|
28
|
+
Description-Content-Type: text/markdown
|
|
29
|
+
License-File: LICENSE
|
|
30
|
+
Requires-Dist: polars>=1.34.0
|
|
31
|
+
Provides-Extra: extra
|
|
32
|
+
Requires-Dist: pandas>=2.3.3; extra == "extra"
|
|
33
|
+
Requires-Dist: pyarrow>=21.0.0; extra == "extra"
|
|
34
|
+
Requires-Dist: frictionless>=5.18.1; extra == "extra"
|
|
35
|
+
Provides-Extra: dev
|
|
36
|
+
Requires-Dist: pytest>=8.0.0; extra == "dev"
|
|
37
|
+
Dynamic: license-file
|
|
38
|
+
|
|
39
|
+
# cgm_format
|
|
40
|
+
|
|
41
|
+
Python library for converting vendor-specific Continuous Glucose Monitoring (CGM) data (Dexcom, Libre) into a standardized unified format for ML training and inference.
|
|
42
|
+
|
|
43
|
+
## Features
|
|
44
|
+
|
|
45
|
+
- **Vendor format detection**: Automatic detection of Dexcom, Libre, and Unified formats
|
|
46
|
+
- **Robust parsing**: Handles byte order marks (BOMs), encoding artifacts, and vendor-specific CSV quirks
|
|
47
|
+
- **Unified schema**: Standardized data format with service columns (metadata) and data columns
|
|
48
|
+
- **Schema validation**: Frictionless Data Table Schema support for validation
|
|
49
|
+
- **Type-safe**: Polars-based with strict type definitions and enum support
|
|
50
|
+
- **Extensible**: Clean abstract interfaces for adding new vendor formats
|
|
51
|
+
|
|
52
|
+
## Installation
|
|
53
|
+
|
|
54
|
+
```bash
|
|
55
|
+
# Using uv (recommended)
|
|
56
|
+
uv pip install -e .
|
|
57
|
+
|
|
58
|
+
# Or using pip
|
|
59
|
+
pip3 install -e .
|
|
60
|
+
|
|
61
|
+
# Optional dependencies
|
|
62
|
+
uv pip install -e ".[extra]" # pandas, pyarrow, frictionless
|
|
63
|
+
uv pip install -e ".[dev]" # pytest
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
## Quick Start
|
|
67
|
+
|
|
68
|
+
### Basic Parsing
|
|
69
|
+
|
|
70
|
+
```python
|
|
71
|
+
from format_converter import FormatParser
|
|
72
|
+
import polars as pl
|
|
73
|
+
|
|
74
|
+
# Parse any supported CGM file (Dexcom, Libre, or Unified)
|
|
75
|
+
unified_df = FormatParser.parse_from_file("data/example.csv")
|
|
76
|
+
|
|
77
|
+
# Access the data
|
|
78
|
+
print(unified_df.head())
|
|
79
|
+
|
|
80
|
+
# Save to unified format
|
|
81
|
+
FormatParser.to_csv_file(unified_df, "output.csv")
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
### Complete Inference Pipeline
|
|
85
|
+
|
|
86
|
+
```python
|
|
87
|
+
from format_converter import FormatParser
|
|
88
|
+
from format_processor import FormatProcessor
|
|
89
|
+
|
|
90
|
+
# Stage 1-3: Parse vendor format to unified
|
|
91
|
+
unified_df = FormatParser.parse_from_file("data/dexcom_export.csv")
|
|
92
|
+
|
|
93
|
+
# Stages 4 and 6: Process for inference
|
|
94
|
+
processor = FormatProcessor(
|
|
95
|
+
    expected_interval_minutes=5,
|
|
96
|
+
    small_gap_max_minutes=15
|
|
97
|
+
)
|
|
98
|
+
|
|
99
|
+
# Fill gaps and create sequences
|
|
100
|
+
processed_df = processor.interpolate_gaps(unified_df)
|
|
101
|
+
|
|
102
|
+
# Prepare final inference data
|
|
103
|
+
inference_df, warnings = processor.prepare_for_inference(
|
|
104
|
+
    processed_df,
|
|
105
|
+
    minimum_duration_minutes=180,
|
|
106
|
+
    maximum_wanted_duration=1440
|
|
107
|
+
)
|
|
108
|
+
|
|
109
|
+
# Feed to ML model
|
|
110
|
+
predictions = your_model.predict(inference_df)
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
**See [USAGE.md](USAGE.md) for complete inference workflows and [usage_example.py](usage_example.py) for runnable examples.**
|
|
114
|
+
|
|
115
|
+
## Unified Format Schema
|
|
116
|
+
|
|
117
|
+
The library converts all vendor formats to a standardized schema with two types of columns:
|
|
118
|
+
|
|
119
|
+
### Service Columns (Metadata)
|
|
120
|
+
|
|
121
|
+
| Column | Type | Description |
|
|
122
|
+
|--------|------|-------------|
|
|
123
|
+
| `sequence_id` | `Int64` | Unique sequence identifier |
|
|
124
|
+
| `event_type` | `Utf8` | Event type (8-char code: EGV_READ, INS_FAST, CARBS_IN, etc.) |
|
|
125
|
+
| `quality` | `Int64` | Data quality (0=GOOD, 1=ILL, 2=SENSOR_CALIBRATION) |
|
|
126
|
+
|
|
127
|
+
### Data Columns
|
|
128
|
+
|
|
129
|
+
| Column | Type | Unit | Description |
|
|
130
|
+
|--------|------|------|-------------|
|
|
131
|
+
| `datetime` | `Datetime` | - | Timestamp (ISO 8601) |
|
|
132
|
+
| `glucose` | `Float64` | mg/dL | Blood glucose reading |
|
|
133
|
+
| `carbs` | `Float64` | g | Carbohydrate intake |
|
|
134
|
+
| `insulin_slow` | `Float64` | u | Long-acting insulin dose |
|
|
135
|
+
| `insulin_fast` | `Float64` | u | Short-acting insulin dose |
|
|
136
|
+
| `exercise` | `Int64` | seconds | Exercise duration |
|
|
137
|
+
|
|
138
|
+
See [`formats/UNIFIED_FORMAT.md`](formats/UNIFIED_FORMAT.md) for complete specification and event type enums.
|
|
139
|
+
|
|
140
|
+
## Processing Pipeline
|
|
141
|
+
|
|
142
|
+
The library implements a 3-stage parsing pipeline defined in the `CGMParser` interface, followed by processing stages 4-6 implemented by `FormatProcessor`:
|
|
143
|
+
|
|
144
|
+
### Stage 1: Preprocess Raw Data
|
|
145
|
+
|
|
146
|
+
Remove byte order marks (BOMs) and encoding artifacts, and normalize the text encoding.
|
|
147
|
+
|
|
148
|
+
```python
|
|
149
|
+
text_data = FormatParser.decode_raw_data(raw_bytes)
|
|
150
|
+
```
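As a rough illustration of what Stage 1 involves, the sketch below strips a UTF-8 byte order mark and normalizes line endings using only the standard library. The function name `strip_bom_and_decode` and the latin-1 fallback are assumptions made for this example; `FormatParser.decode_raw_data()` may behave differently.

```python
import codecs


def strip_bom_and_decode(raw: bytes) -> str:
    """Hypothetical Stage 1 stand-in, not the library's implementation."""
    try:
        # "utf-8-sig" silently drops a leading UTF-8 byte order mark.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumed fallback: latin-1 maps every byte, so decoding cannot fail.
        text = raw.decode("latin-1")
    # Normalize Windows (\r\n) and classic Mac (\r) line endings to \n.
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = codecs.BOM_UTF8 + b"Index,Timestamp\r\n1,2025-01-01T10:00:00\r\n"
print(repr(strip_bom_and_decode(sample)))
```

Decoding before any CSV parsing keeps the BOM from being glued onto the first column name, which would otherwise break header matching in Stage 2.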
|
|
151
|
+
|
|
152
|
+
### Stage 2: Format Detection
|
|
153
|
+
|
|
154
|
+
Automatically detect vendor format from CSV headers.
|
|
155
|
+
|
|
156
|
+
```python
|
|
157
|
+
from interface.cgm_interface import SupportedCGMFormat
|
|
158
|
+
|
|
159
|
+
format_type = FormatParser.detect_format(text_data)
|
|
160
|
+
# Returns: SupportedCGMFormat.DEXCOM, .LIBRE, or .UNIFIED_CGM
|
|
161
|
+
```
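Header-based detection can be pictured as a marker search over the first lines of the file. The sketch below is an assumption-laden illustration: `DetectedFormat`, `HEADER_MARKERS`, and the marker strings are hypothetical, and the patterns the library actually matches may differ. Only the 15-line scan window is taken from this README.

```python
from enum import Enum


class DetectedFormat(Enum):
    """Hypothetical stand-in for SupportedCGMFormat."""
    UNIFIED_CGM = "unified"
    DEXCOM = "dexcom"
    LIBRE = "libre"


# Assumed header markers; the real detection patterns may differ.
HEADER_MARKERS = {
    DetectedFormat.UNIFIED_CGM: ("sequence_id", "event_type", "quality"),
    DetectedFormat.DEXCOM: ("Event Type", "Glucose Value"),
    DetectedFormat.LIBRE: ("Record Type", "Historic Glucose"),
}


def detect_format_sketch(text: str, max_lines: int = 15) -> DetectedFormat:
    """Return the first format whose markers all occur on one early line."""
    head = text.splitlines()[:max_lines]
    for fmt, markers in HEADER_MARKERS.items():
        if any(all(marker in line for marker in markers) for line in head):
            return fmt
    raise ValueError("format cannot be detected")


libre_like = (
    "Glucose Data,Generated on,01-01-2025\n"
    "Device,Serial Number,Device Timestamp,Record Type,Historic Glucose mg/dL\n"
)
print(detect_format_sketch(libre_like))
```

Scanning several lines rather than only the first one matters because some exports place a metadata row above the real header.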
|
|
162
|
+
|
|
163
|
+
### Stage 3: Vendor-Specific Parsing
|
|
164
|
+
|
|
165
|
+
Parse vendor CSV to unified format, handling vendor-specific quirks:
|
|
166
|
+
|
|
167
|
+
- Dexcom: High/Low glucose markers, variable-length rows, metadata rows
|
|
168
|
+
- Libre: Record type filtering, timestamp format variations
|
|
169
|
+
|
|
170
|
+
```python
|
|
171
|
+
unified_df = FormatParser.parse_to_unified(text_data, format_type)
|
|
172
|
+
```
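One of the Dexcom quirks listed above is the textual High/Low marker in the glucose column. How the library represents these markers is not stated here; the sketch below simply assumes they are replaced with range limits of 40 and 400 mg/dL. The function name and both limits are hypothetical.

```python
from typing import Optional


def parse_glucose_cell(cell: str, low: float = 40.0, high: float = 400.0) -> Optional[float]:
    """Convert one glucose cell to mg/dL; the clamp values are assumptions."""
    marker = cell.strip().lower()
    if not marker:
        return None  # rows for other event types may leave the column empty
    if marker == "low":
        return low
    if marker == "high":
        return high
    return float(marker)


print([parse_glucose_cell(c) for c in ["112", "High", "Low", ""]])
```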
|
|
173
|
+
|
|
174
|
+
All stages can be chained with convenience methods:
|
|
175
|
+
|
|
176
|
+
```python
|
|
177
|
+
# Parse from file
|
|
178
|
+
unified_df = FormatParser.parse_from_file("data.csv")
|
|
179
|
+
|
|
180
|
+
# Parse from bytes
|
|
181
|
+
unified_df = FormatParser.parse_from_bytes(raw_data)
|
|
182
|
+
|
|
183
|
+
# Parse from string
|
|
184
|
+
unified_df = FormatParser.parse_from_string(text_data)
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
See [`interface/PIPELINE.md`](interface/PIPELINE.md) for complete pipeline documentation.
|
|
188
|
+
|
|
189
|
+
### Stage 4: Gap Interpolation and Sequence Creation
|
|
190
|
+
|
|
191
|
+
The `FormatProcessor.interpolate_gaps()` method handles data continuity:
|
|
192
|
+
|
|
193
|
+
```python
|
|
194
|
+
from format_processor import FormatProcessor
|
|
195
|
+
|
|
196
|
+
processor = FormatProcessor(
|
|
197
|
+
    expected_interval_minutes=5,  # Normal CGM reading interval
|
|
198
|
+
    small_gap_max_minutes=15      # Max gap size to interpolate
|
|
199
|
+
)
|
|
200
|
+
|
|
201
|
+
# Detect gaps, create sequences, and interpolate missing values
|
|
202
|
+
processed_df = processor.interpolate_gaps(unified_df)
|
|
203
|
+
```
|
|
204
|
+
|
|
205
|
+
**What it does:**
|
|
206
|
+
|
|
207
|
+
1. **Gap Detection**: Identifies gaps in continuous glucose monitoring data
|
|
208
|
+
2. **Sequence Creation**: Splits data at large gaps (>15 min default) into separate sequences
|
|
209
|
+
3. **Small Gap Interpolation**: Fills small gaps (≤15 min) with linearly interpolated glucose values
|
|
210
|
+
4. **Calibration Marking**: Marks 24-hour periods after gaps ≥2h45m as `Quality.SENSOR_CALIBRATION`
|
|
211
|
+
5. **Warning Collection**: Tracks imputation events via `ProcessingWarning.IMPUTATION`
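The first three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Polars implementation in `FormatProcessor`: the function name `split_and_fill`, the tuple layout, and placing interpolated points on a grid anchored at the previous reading are all choices made for this example.

```python
from datetime import datetime, timedelta


def split_and_fill(readings, expected_interval_minutes=5, small_gap_max_minutes=15):
    """readings: time-sorted (datetime, glucose) pairs.

    Returns (sequence_id, datetime, glucose, imputed) tuples.
    """
    step = timedelta(minutes=expected_interval_minutes)
    max_gap = timedelta(minutes=small_gap_max_minutes)
    rows, sequence_id = [], 0
    for i, (ts, glucose) in enumerate(readings):
        if i > 0:
            prev_ts, prev_glucose = readings[i - 1]
            gap = ts - prev_ts
            if gap > max_gap:
                sequence_id += 1  # large gap: start a new sequence
            else:
                # small gap: add linearly interpolated points between the readings
                t = prev_ts + step
                while t < ts:
                    fraction = (t - prev_ts) / gap
                    value = prev_glucose + fraction * (glucose - prev_glucose)
                    rows.append((sequence_id, t, value, True))
                    t += step
        rows.append((sequence_id, ts, glucose, False))
    return rows


start = datetime(2025, 1, 1, 10, 0)
readings = [
    (start, 100.0),
    (start + timedelta(minutes=15), 130.0),  # 15-minute gap: filled
    (start + timedelta(minutes=60), 90.0),   # 45-minute gap: new sequence
]
for row in split_and_fill(readings):
    print(row)
```

The `imputed` flag plays the role that imputation events and `ProcessingWarning.IMPUTATION` play in the library: it lets downstream code tell measured values from filled ones.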
|
|
212
|
+
|
|
213
|
+
**Example - Analyze sequences created:**
|
|
214
|
+
|
|
215
|
+
```python
|
|
216
|
+
# Check sequences
|
|
217
|
+
sequence_count = processed_df['sequence_id'].n_unique()
|
|
218
|
+
print(f"Created {sequence_count} sequences")
|
|
219
|
+
|
|
220
|
+
# Analyze each sequence
|
|
221
|
+
import polars as pl
|
|
222
|
+
sequence_info = processed_df.group_by('sequence_id').agg([
|
|
223
|
+
    pl.col('datetime').min().alias('start_time'),
|
|
224
|
+
    pl.col('datetime').max().alias('end_time'),
|
|
225
|
+
    pl.col('datetime').count().alias('num_points'),
|
|
226
|
+
])
|
|
227
|
+
|
|
228
|
+
for row in sequence_info.iter_rows(named=True):
|
|
229
|
+
    duration_hours = (row['end_time'] - row['start_time']).total_seconds() / 3600
|
|
230
|
+
    print(f"Sequence {row['sequence_id']}: {duration_hours:.1f}h, {row['num_points']} points")
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
### Stage 5: Timestamp Synchronization (Optional)
|
|
234
|
+
|
|
235
|
+
Align timestamps to fixed-frequency intervals for ML models requiring regular time steps:
|
|
236
|
+
|
|
237
|
+
```python
|
|
238
|
+
# After interpolate_gaps(), synchronize to exact intervals
|
|
239
|
+
synchronized_df = processor.synchronize_timestamps(processed_df)
|
|
240
|
+
|
|
241
|
+
# Now all timestamps are at exact 5-minute intervals: 10:00:00, 10:05:00, 10:10:00, etc.
|
|
242
|
+
```
|
|
243
|
+
|
|
244
|
+
**What it does:**
|
|
245
|
+
|
|
246
|
+
1. Rounds timestamps to nearest minute boundary (removes seconds)
|
|
247
|
+
2. Creates fixed-frequency timestamps at `expected_interval_minutes` intervals
|
|
248
|
+
3. Linearly interpolates glucose values between measurements
|
|
249
|
+
4. Shifts discrete events (carbs, insulin, exercise) to nearest timestamp
|
|
250
|
+
5. Preserves sequence boundaries (processes each sequence independently)
|
|
251
|
+
|
|
252
|
+
**When to use:** Time-series models expecting fixed intervals (LSTM, transformers, ARIMA)
|
|
253
|
+
**When to skip:** Models handling irregular timestamps, or when original timing is critical
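The synchronization steps listed above can be sketched for a single sequence as follows. This is a standard-library illustration, not the library's asof-join implementation; the helper names are hypothetical, and anchoring the grid at the first rounded timestamp is an assumption.

```python
from datetime import datetime, timedelta


def round_to_minute(ts):
    """Round to the nearest minute: add 30 s, then drop seconds and microseconds."""
    return (ts + timedelta(seconds=30)).replace(second=0, microsecond=0)


def nearest_grid_time(ts, grid):
    """Pick the grid timestamp closest to a discrete event (carbs, insulin, exercise)."""
    return min(grid, key=lambda g: abs(g - ts))


def synchronize(readings, expected_interval_minutes=5):
    """readings: time-sorted (datetime, glucose) pairs from one sequence."""
    step = timedelta(minutes=expected_interval_minutes)
    points = [(round_to_minute(ts), glucose) for ts, glucose in readings]
    grid, t = [], points[0][0]
    while t <= points[-1][0]:
        grid.append(t)
        t += step
    synced, j = [], 0
    for t in grid:
        # Advance j so that points[j] is the last measurement at or before t.
        while j + 1 < len(points) - 1 and points[j + 1][0] <= t:
            j += 1
        (t0, g0), (t1, g1) = points[j], points[min(j + 1, len(points) - 1)]
        if t1 == t0:
            synced.append((t, g0))
        else:
            fraction = (t - t0) / (t1 - t0)
            synced.append((t, g0 + fraction * (g1 - g0)))
    return synced


raw = [
    (datetime(2025, 1, 1, 10, 0, 12), 100.0),
    (datetime(2025, 1, 1, 10, 5, 47), 112.0),
    (datetime(2025, 1, 1, 10, 10, 3), 120.0),
]
for ts, glucose in synchronize(raw):
    print(ts.isoformat(), round(glucose, 1))
```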
|
|
254
|
+
|
|
255
|
+
### Stage 6: Inference Preparation
|
|
256
|
+
|
|
257
|
+
The `prepare_for_inference()` method performs final quality assurance and data extraction:
|
|
258
|
+
|
|
259
|
+
```python
|
|
260
|
+
# Prepare final inference-ready data
|
|
261
|
+
inference_df, warnings = processor.prepare_for_inference(
|
|
262
|
+
    processed_df,
|
|
263
|
+
    minimum_duration_minutes=180,  # Require 3 hours minimum
|
|
264
|
+
    maximum_wanted_duration=1440   # Truncate to last 24 hours if longer
|
|
265
|
+
)
|
|
266
|
+
|
|
267
|
+
# Check for quality issues
|
|
268
|
+
from interface.cgm_interface import ProcessingWarning
|
|
269
|
+
|
|
270
|
+
if warnings & ProcessingWarning.TOO_SHORT:
|
|
271
|
+
    print("Warning: Sequence shorter than minimum duration")
|
|
272
|
+
if warnings & ProcessingWarning.QUALITY:
|
|
273
|
+
    print("Warning: Data contains quality issues (ILL or SENSOR_CALIBRATION)")
|
|
274
|
+
if warnings & ProcessingWarning.IMPUTATION:
|
|
275
|
+
    print("Warning: Data contains interpolated values")
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
**What it does:**
|
|
279
|
+
|
|
280
|
+
1. **Validation**: Raises `ZeroValidInputError` if no valid glucose data exists
|
|
281
|
+
2. **Sequence Selection**: Keeps only the **latest** sequence (most recent timestamps)
|
|
282
|
+
3. **Duration Checks**: Warns if sequence < `minimum_duration_minutes`
|
|
283
|
+
4. **Quality Checks**: Collects warnings for calibration events and quality flags
|
|
284
|
+
5. **Truncation**: Keeps last N minutes if exceeding `maximum_wanted_duration`
|
|
285
|
+
6. **Column Extraction**: Returns only data columns (removes service metadata)
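Sequence selection, truncation, and column extraction (steps 2, 5, and 6 above) can be sketched with plain dictionaries. The function name `select_latest_window` and the row layout are assumptions for illustration; a plain `ValueError` stands in for `ZeroValidInputError`.

```python
from datetime import datetime, timedelta

DATA_COLUMNS = ("datetime", "glucose", "carbs", "insulin_slow", "insulin_fast", "exercise")


def select_latest_window(rows, maximum_wanted_duration=1440):
    """rows: dicts holding 'sequence_id' plus the unified data columns."""
    valid = [r for r in rows if r.get("glucose") is not None]
    if not valid:
        raise ValueError("no valid glucose data")  # stand-in for ZeroValidInputError
    # The latest sequence is the one that contains the most recent timestamp.
    latest_id = max(valid, key=lambda r: r["datetime"])["sequence_id"]
    sequence = sorted(
        (r for r in rows if r["sequence_id"] == latest_id),
        key=lambda r: r["datetime"],
    )
    # Truncate from the beginning so the most recent data is preserved.
    cutoff = sequence[-1]["datetime"] - timedelta(minutes=maximum_wanted_duration)
    recent = [r for r in sequence if r["datetime"] >= cutoff]
    # Keep data columns only; service columns such as sequence_id are dropped.
    return [{col: r.get(col) for col in DATA_COLUMNS} for r in recent]


rows = [
    {"sequence_id": 0, "datetime": datetime(2025, 1, 1, 10, 0), "glucose": 95.0},
    {"sequence_id": 1, "datetime": datetime(2025, 1, 3, 0, 0), "glucose": 101.0},
    {"sequence_id": 1, "datetime": datetime(2025, 1, 3, 12, 0), "glucose": 110.0},
    {"sequence_id": 1, "datetime": datetime(2025, 1, 4, 6, 0), "glucose": 120.0},
]
window = select_latest_window(rows)
print(len(window), sorted(window[0]))
```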
|
|
286
|
+
|
|
287
|
+
**Output DataFrame:**
|
|
288
|
+
|
|
289
|
+
```python
|
|
290
|
+
# inference_df contains only data columns:
|
|
291
|
+
# ['datetime', 'glucose', 'carbs', 'insulin_slow', 'insulin_fast', 'exercise']
|
|
292
|
+
|
|
293
|
+
# Feed directly to ML model
|
|
294
|
+
predictions = your_model.predict(inference_df)
|
|
295
|
+
```
|
|
296
|
+
|
|
297
|
+
### Complete Processor Configuration
|
|
298
|
+
|
|
299
|
+
```python
|
|
300
|
+
from format_processor import FormatProcessor
|
|
301
|
+
from interface.cgm_interface import MINIMUM_DURATION_MINUTES, MAXIMUM_WANTED_DURATION_MINUTES
|
|
302
|
+
|
|
303
|
+
# Initialize processor with custom intervals
|
|
304
|
+
processor = FormatProcessor(
|
|
305
|
+
    expected_interval_minutes=5,  # CGM reading interval (5 min for Dexcom, 15 min for Libre)
|
|
306
|
+
    small_gap_max_minutes=15      # Max gap to interpolate (larger gaps create new sequences)
|
|
307
|
+
)
|
|
308
|
+
|
|
309
|
+
# Stage 4: Fill gaps and create sequences
|
|
310
|
+
processed_df = processor.interpolate_gaps(unified_df)
|
|
311
|
+
|
|
312
|
+
# Stage 5 (Optional): Synchronize to fixed intervals
|
|
313
|
+
# synchronized_df = processor.synchronize_timestamps(processed_df)
|
|
314
|
+
|
|
315
|
+
# Stage 6: Prepare for inference
|
|
316
|
+
inference_df, warnings = processor.prepare_for_inference(
|
|
317
|
+
    processed_df,  # or synchronized_df if using Stage 5
|
|
318
|
+
    minimum_duration_minutes=MINIMUM_DURATION_MINUTES,  # Default: 180 (3 hours)
|
|
319
|
+
    maximum_wanted_duration=MAXIMUM_WANTED_DURATION_MINUTES  # Default: 1440 (24 hours)
|
|
320
|
+
)
|
|
321
|
+
|
|
322
|
+
# Check warnings
|
|
323
|
+
if processor.has_warnings():
|
|
324
|
+
    all_warnings = processor.get_warnings()
|
|
325
|
+
    print(f"Processing collected {len(all_warnings)} warnings")
|
|
326
|
+
```
|
|
327
|
+
|
|
328
|
+
## Advanced Usage
|
|
329
|
+
|
|
330
|
+
### Working with Schemas
|
|
331
|
+
|
|
332
|
+
```python
|
|
333
|
+
from formats.unified import CGM_SCHEMA, UnifiedEventType, Quality
|
|
334
|
+
|
|
335
|
+
# Get Polars schema
|
|
336
|
+
polars_schema = CGM_SCHEMA.get_polars_schema()
|
|
337
|
+
data_only_schema = CGM_SCHEMA.get_polars_schema(data_only=True)
|
|
338
|
+
|
|
339
|
+
# Get column names
|
|
340
|
+
all_columns = CGM_SCHEMA.get_column_names()
|
|
341
|
+
data_columns = CGM_SCHEMA.get_column_names(data_only=True)
|
|
342
|
+
|
|
343
|
+
# Get cast expressions for Polars
|
|
344
|
+
cast_exprs = CGM_SCHEMA.get_cast_expressions()
|
|
345
|
+
df = df.with_columns(cast_exprs)
|
|
346
|
+
|
|
347
|
+
# Use enums
|
|
348
|
+
event = UnifiedEventType.GLUCOSE # "EGV_READ"
|
|
349
|
+
quality = Quality.GOOD # 0
|
|
350
|
+
```
|
|
351
|
+
|
|
352
|
+
### Batch Processing with Inference Preparation
|
|
353
|
+
|
|
354
|
+
```python
|
|
355
|
+
from pathlib import Path
|
|
356
|
+
from format_converter import FormatParser
|
|
357
|
+
from format_processor import FormatProcessor
|
|
358
|
+
import polars as pl
|
|
359
|
+
|
|
360
|
+
data_dir = Path("data")
|
|
361
|
+
output_dir = Path("data/inference_ready")
|
|
362
|
+
output_dir.mkdir(exist_ok=True)
|
|
363
|
+
|
|
364
|
+
processor = FormatProcessor()
|
|
365
|
+
results = []
|
|
366
|
+
|
|
367
|
+
for csv_file in data_dir.glob("*.csv"):
|
|
368
|
+
    try:
|
|
369
|
+
        # Parse to unified format
|
|
370
|
+
        unified_df = FormatParser.parse_from_file(csv_file)
|
|
371
|
+
|
|
372
|
+
        # Process for inference
|
|
373
|
+
        processed_df = processor.interpolate_gaps(unified_df)
|
|
374
|
+
        inference_df, warnings = processor.prepare_for_inference(processed_df)
|
|
375
|
+
|
|
376
|
+
        # Add patient identifier
|
|
377
|
+
        patient_id = csv_file.stem
|
|
378
|
+
        inference_df = inference_df.with_columns([
|
|
379
|
+
            pl.lit(patient_id).alias('patient_id')
|
|
380
|
+
        ])
|
|
381
|
+
|
|
382
|
+
        results.append(inference_df)
|
|
383
|
+
|
|
384
|
+
        # Save individual file
|
|
385
|
+
        output_file = output_dir / f"{patient_id}_inference.csv"
|
|
386
|
+
        FormatParser.to_csv_file(inference_df, str(output_file))
|
|
387
|
+
|
|
388
|
+
        warning_str = f"warnings={warnings.value}" if warnings else "OK"
|
|
389
|
+
        print(f"✓ {csv_file.name}: {len(inference_df)} records, {warning_str}")
|
|
390
|
+
|
|
391
|
+
    except Exception as e:
|
|
392
|
+
        print(f"✗ Failed {csv_file.name}: {e}")
|
|
393
|
+
|
|
394
|
+
# Combine all processed data
|
|
395
|
+
if results:
|
|
396
|
+
    combined_df = pl.concat(results)
|
|
397
|
+
    FormatParser.to_csv_file(combined_df, str(output_dir / "combined_inference.csv"))
|
|
398
|
+
    print(f"\n✓ Combined {len(results)} files into single dataset")
|
|
399
|
+
```
|
|
400
|
+
|
|
401
|
+
### Format Detection and Validation
|
|
402
|
+
|
|
403
|
+
```python
|
|
404
|
+
from example_schema_usage import run_format_detection_and_validation
|
|
405
|
+
from pathlib import Path
|
|
406
|
+
|
|
407
|
+
# Validate all files in data directory
|
|
408
|
+
run_format_detection_and_validation(
|
|
409
|
+
    data_dir=Path("data"),
|
|
410
|
+
    parsed_dir=Path("data/parsed"),
|
|
411
|
+
    output_file=Path("validation_report.txt")
|
|
412
|
+
)
|
|
413
|
+
```
|
|
414
|
+
|
|
415
|
+
This generates a detailed report with:
|
|
416
|
+
|
|
417
|
+
- Format detection statistics
|
|
418
|
+
- Frictionless schema validation results (if library installed)
|
|
419
|
+
- Known vendor quirks automatically suppressed
|
|
420
|
+
|
|
421
|
+
## Supported Formats
|
|
422
|
+
|
|
423
|
+
### Dexcom Clarity Export
|
|
424
|
+
|
|
425
|
+
- CSV with metadata rows (rows 2-11)
|
|
426
|
+
- Variable-length rows (non-EGV events missing trailing columns)
|
|
427
|
+
- High/Low glucose markers for out-of-range values
|
|
428
|
+
- Event types: EGV, Insulin, Carbs, Exercise
|
|
429
|
+
- Multiple timestamp format variants
|
|
430
|
+
|
|
431
|
+
### FreeStyle Libre
|
|
432
|
+
|
|
433
|
+
- CSV with metadata row 1, header row 2
|
|
434
|
+
- Record type filtering (0=glucose, 4=insulin, 5=food)
|
|
435
|
+
- Multiple timestamp format variants
|
|
436
|
+
- Separate rapid/long insulin columns
|
|
437
|
+
|
|
438
|
+
### Unified Format
|
|
439
|
+
|
|
440
|
+
- Standardized CSV with header row 1
|
|
441
|
+
- ISO 8601 timestamps
|
|
442
|
+
- Service columns + data columns
|
|
443
|
+
- Validates existing unified format files
|
|
444
|
+
|
|
445
|
+
## Project Structure
|
|
446
|
+
|
|
447
|
+
```text
|
|
448
|
+
cgm_format/
|
|
449
|
+
├── interface/ # Abstract interfaces and schema infrastructure
|
|
450
|
+
│   ├── cgm_interface.py # CGMParser and CGMProcessor interfaces
|
|
451
|
+
│   ├── schema.py # Base schema definition system
|
|
452
|
+
│   └── PIPELINE.md # Pipeline documentation
|
|
453
|
+
├── formats/ # Format-specific schemas and definitions
|
|
454
|
+
│   ├── unified.py # Unified format schema and enums
|
|
455
|
+
│   ├── unified.json # Frictionless schema export
|
|
456
|
+
│   ├── dexcom.py # Dexcom format schema and constants
|
|
457
|
+
│   ├── dexcom.json # Frictionless schema for Dexcom
|
|
458
|
+
│   ├── libre.py # Libre format schema and constants
|
|
459
|
+
│   ├── libre.json # Frictionless schema for Libre
|
|
460
|
+
│   └── UNIFIED_FORMAT.md # Unified format specification
|
|
461
|
+
├── format_converter.py # FormatParser implementation (Stages 1-3)
|
|
462
|
+
├── format_processor.py # FormatProcessor implementation (Stages 4-6)
|
|
463
|
+
├── USAGE.md # Complete usage guide for inference
|
|
464
|
+
├── usage_example.py # Runnable usage examples
|
|
465
|
+
├── example_schema_usage.py # Format detection & validation examples
|
|
466
|
+
├── tests/ # Pytest test suite
|
|
467
|
+
│   ├── test_format_converter.py # Parsing and conversion tests
|
|
468
|
+
│   └── test_schema.py # Schema validation tests
|
|
469
|
+
└── data/ # Test data and parsed outputs
|
|
470
|
+
    └── parsed/ # Converted unified format files
|
|
471
|
+
```
|
|
472
|
+
|
|
473
|
+
## Architecture
|
|
474
|
+
|
|
475
|
+
### Two-Layer Interface Design
|
|
476
|
+
|
|
477
|
+
**CGMParser** (Stages 1-3): Vendor-specific parsing to unified format
|
|
478
|
+
|
|
479
|
+
- `decode_raw_data()` - Encoding cleanup
|
|
480
|
+
- `detect_format()` - Format detection
|
|
481
|
+
- `parse_to_unified()` - Vendor CSV → UnifiedFormat
|
|
482
|
+
|
|
483
|
+
**CGMProcessor** (Stages 4-6): Vendor-agnostic operations on unified data
|
|
484
|
+
|
|
485
|
+
- `synchronize_timestamps()` - Timestamp alignment to fixed intervals
|
|
486
|
+
- `interpolate_gaps()` - Gap detection, sequence creation, and interpolation
|
|
487
|
+
- `prepare_for_inference()` - ML preparation with quality checks and truncation
|
|
488
|
+
|
|
489
|
+
The current implementation:
|
|
490
|
+
- `FormatParser` implements the `CGMParser` interface (Stages 1-3)
|
|
491
|
+
- `FormatProcessor` implements the `CGMProcessor` interface (Stages 4-6)
|
|
492
|
+
|
|
493
|
+
### Processing Stages Implementation
|
|
494
|
+
|
|
495
|
+
**Stage 1-3 (FormatParser):**
|
|
496
|
+
- BOM removal and encoding normalization
|
|
497
|
+
- Pattern-based format detection (first 15 lines)
|
|
498
|
+
- Vendor-specific CSV parsing with quirk handling
|
|
499
|
+
- Column mapping to unified schema
|
|
500
|
+
- Service field population (sequence_id, event_type, quality)
|
|
501
|
+
|
|
502
|
+
**Stage 4 (FormatProcessor.interpolate_gaps):**
|
|
503
|
+
- Time difference calculation between consecutive readings
|
|
504
|
+
- Sequence boundary detection (gaps > `small_gap_max_minutes`)
|
|
505
|
+
- Linear interpolation for small gaps (≤ `small_gap_max_minutes`)
|
|
506
|
+
- Imputation event creation with `event_type='IMPUTATN'`
|
|
507
|
+
- Calibration period marking (24h after gaps ≥ 2h45m)
|
|
508
|
+
- Warning collection for imputed data
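The calibration period marking in the list above can be sketched as a single pass over sorted timestamps. The 9900-second threshold and the 24-hour period come from this README; the function name and the choice to start the period at the first reading after the gap (with an exclusive end) are assumptions.

```python
from datetime import datetime, timedelta

CALIBRATION_GAP = timedelta(seconds=9900)  # 2h45m, the CALIBRATION_GAP_THRESHOLD value
CALIBRATION_PERIOD = timedelta(hours=24)


def calibration_flags(timestamps):
    """Return one bool per time-sorted timestamp: True inside a calibration period."""
    flags, period_end = [], None
    for i, ts in enumerate(timestamps):
        if i > 0 and ts - timestamps[i - 1] >= CALIBRATION_GAP:
            # Assumed: the period starts at the first reading after the gap.
            period_end = ts + CALIBRATION_PERIOD
        flags.append(period_end is not None and ts < period_end)
    return flags


t0 = datetime(2025, 1, 1)
# One reading, a 3-hour gap, then a reading every 2 hours for 26 hours.
timestamps = [t0] + [t0 + timedelta(hours=3 + 2 * k) for k in range(14)]
print(calibration_flags(timestamps))
```

In the library these rows would carry `Quality.SENSOR_CALIBRATION` rather than a boolean; the boolean list keeps the sketch independent of the schema.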
|
|
509
|
+
|
|
510
|
+
**Stage 5 (FormatProcessor.synchronize_timestamps):**
|
|
511
|
+
- Timestamp rounding to minute boundaries
|
|
512
|
+
- Fixed-frequency grid generation at `expected_interval_minutes`
|
|
513
|
+
- Asof join (backward/forward) for value alignment
|
|
514
|
+
- Linear glucose interpolation between grid points
|
|
515
|
+
- Discrete event shifting to nearest timestamp
|
|
516
|
+
|
|
517
|
+
**Stage 6 (FormatProcessor.prepare_for_inference):**
|
|
518
|
+
- Zero-data validation (raises `ZeroValidInputError`)
|
|
519
|
+
- Latest sequence selection (max timestamp)
|
|
520
|
+
- Duration verification with `TOO_SHORT` warning
|
|
521
|
+
- Quality flag detection (`ILL`, `SENSOR_CALIBRATION`)
|
|
522
|
+
- Sequence truncation from beginning (preserves most recent data)
|
|
523
|
+
- Service column removal (data columns only)
|
|
524
|
+
- Warning flag aggregation and return
|
|
525
|
+
|
|
526
|
+
### Processing Configuration Parameters
|
|
527
|
+
|
|
528
|
+
**FormatProcessor initialization:**
|
|
529
|
+
|
|
530
|
+
| Parameter | Default | Description | Effect |
|
|
531
|
+
|-----------|---------|-------------|--------|
|
|
532
|
+
| `expected_interval_minutes` | 5 | Normal reading interval | Grid spacing for synchronization; gap detection baseline |
|
|
533
|
+
| `small_gap_max_minutes` | 15 | Max gap to interpolate | Gaps > this create new sequences; gaps ≤ this are filled |
|
|
534
|
+
|
|
535
|
+
**Common configurations:**
|
|
536
|
+
|
|
537
|
+
```python
|
|
538
|
+
# Dexcom G6/G7 (5-minute readings)
|
|
539
|
+
processor = FormatProcessor(expected_interval_minutes=5, small_gap_max_minutes=15)
|
|
540
|
+
|
|
541
|
+
# FreeStyle Libre (manual scans, typically 15 min)
|
|
542
|
+
processor = FormatProcessor(expected_interval_minutes=15, small_gap_max_minutes=45)
|
|
543
|
+
|
|
544
|
+
# Strict quality (minimal imputation)
|
|
545
|
+
processor = FormatProcessor(expected_interval_minutes=5, small_gap_max_minutes=10)
|
|
546
|
+
|
|
547
|
+
# Lenient (more gap filling for sparse data)
|
|
548
|
+
processor = FormatProcessor(expected_interval_minutes=5, small_gap_max_minutes=30)
|
|
549
|
+
```
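To make the effect of the two parameters concrete, the sketch below classifies a single gap and counts how many points would be filled. `classify_gap` is a hypothetical helper written for this README, and it assumes that filled points lie strictly between the two readings at multiples of `expected_interval_minutes`; the library may treat timing jitter around the expected interval differently.

```python
import math


def classify_gap(gap_minutes, expected_interval_minutes=5, small_gap_max_minutes=15):
    """Return (action, points_to_fill) for the gap between two readings."""
    if gap_minutes <= expected_interval_minutes:
        return ("none", 0)  # readings arrived on schedule
    if gap_minutes > small_gap_max_minutes:
        return ("new_sequence", 0)  # too large to interpolate
    # Grid steps that fall strictly inside the gap.
    missing = math.ceil(gap_minutes / expected_interval_minutes) - 1
    return ("interpolate", missing)


for gap in (5, 10, 15, 20):
    print(gap, classify_gap(gap))

# Libre-style configuration from the examples above
print(30, classify_gap(30, expected_interval_minutes=15, small_gap_max_minutes=45))
```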
|
|
550
|
+
|
|
551
|
+
**prepare_for_inference parameters:**
|
|
552
|
+
|
|
553
|
+
| Parameter | Default | Description |
|
|
554
|
+
|-----------|---------|-------------|
|
|
555
|
+
| `minimum_duration_minutes` | 180 | Minimum sequence duration required (warns if shorter) |
|
|
556
|
+
| `maximum_wanted_duration` | 1440 | Maximum duration to keep (truncates from beginning) |
|
|
557
|
+
|
|
558
|
+
**Constants from interface:**
|
|
559
|
+
|
|
560
|
+
```python
|
|
561
|
+
from interface.cgm_interface import (
|
|
562
|
+
    MINIMUM_DURATION_MINUTES,         # 180 (3 hours)
|
|
563
|
+
    MAXIMUM_WANTED_DURATION_MINUTES,  # 1440 (24 hours)
|
|
564
|
+
    CALIBRATION_GAP_THRESHOLD,        # 9900 seconds (2h45m)
|
|
565
|
+
)
|
|
566
|
+
```
|
|
567
|
+
|
|
568
|
+
### Schema System
|
|
569
|
+
|
|
570
|
+
Schemas are defined using `CGMSchemaDefinition` from `interface/schema.py`:
|
|
571
|
+
|
|
572
|
+
- **Type-safe**: Polars dtypes with constraints
|
|
573
|
+
- **Vendor-specific**: Each format has its own schema with quirks documented
|
|
574
|
+
- **Frictionless export**: Auto-generate validation schemas
|
|
575
|
+
- **Dialect support**: CSV parsing hints (header rows, comment rows, etc.)
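The idea of a declarative schema with service and data columns can be sketched without Polars. Everything below is hypothetical (`ColumnSpec`, `UNIFIED_COLUMNS`, `column_names`, `cast_row`) and uses plain Python types in place of Polars dtypes; it is not the `CGMSchemaDefinition` API. The column names and units follow the tables in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: type
    unit: str = ""
    service: bool = False  # True for metadata (service) columns


UNIFIED_COLUMNS = (
    ColumnSpec("sequence_id", int, service=True),
    ColumnSpec("event_type", str, service=True),
    ColumnSpec("quality", int, service=True),
    ColumnSpec("datetime", str),  # kept as an ISO 8601 string in this sketch
    ColumnSpec("glucose", float, "mg/dL"),
    ColumnSpec("carbs", float, "g"),
    ColumnSpec("insulin_slow", float, "u"),
    ColumnSpec("insulin_fast", float, "u"),
    ColumnSpec("exercise", int, "seconds"),
)


def column_names(data_only: bool = False):
    """List column names, optionally leaving out the service columns."""
    return [c.name for c in UNIFIED_COLUMNS if not (data_only and c.service)]


def cast_row(raw: dict) -> dict:
    """Cast string cells to the declared types; missing or empty cells become None."""
    return {
        c.name: c.dtype(raw[c.name]) if raw.get(c.name) not in (None, "") else None
        for c in UNIFIED_COLUMNS
    }


row = cast_row({"sequence_id": "0", "event_type": "EGV_READ", "quality": "0",
                "datetime": "2025-01-01T10:00:00", "glucose": "112"})
print(column_names(data_only=True))
print(row)
```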
|
|
576
|
+
|
|
577
|
+
## Error Handling
|
|
578
|
+
|
|
579
|
+
### Exceptions
|
|
580
|
+
|
|
581
|
+
| Exception | Base | Description |
|
|
582
|
+
|-----------|------|-------------|
|
|
583
|
+
| `UnknownFormatError` | `ValueError` | Format cannot be detected |
|
|
584
|
+
| `MalformedDataError` | `ValueError` | CSV parsing or conversion failed |
|
|
585
|
+
| `ZeroValidInputError` | `ValueError` | No valid data points found |
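Because all three exceptions derive from `ValueError`, a caller can handle them individually or with a single handler. The classes below are local re-declarations that mirror the table for illustration only; they are not imported from the library, and `describe_failure` is a hypothetical helper.

```python
class UnknownFormatError(ValueError):
    """Format cannot be detected."""


class MalformedDataError(ValueError):
    """CSV parsing or conversion failed."""


class ZeroValidInputError(ValueError):
    """No valid data points found."""


def describe_failure(exc: Exception) -> str:
    """Map an exception to a short action; the wording is illustrative."""
    if isinstance(exc, UnknownFormatError):
        return "skip: unrecognized format"
    if isinstance(exc, MalformedDataError):
        return "skip: malformed CSV"
    if isinstance(exc, ZeroValidInputError):
        return "skip: no valid readings"
    return "unexpected error"


try:
    raise ZeroValidInputError("no glucose values")
except ValueError as exc:  # one handler covers all three subclasses
    print(describe_failure(exc))
```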
|
|
586
|
+
|
|
587
|
+
### Processing Warnings
|
|
588
|
+
|
|
589
|
+
The `FormatProcessor` collects quality warnings during processing:
|
|
590
|
+
|
|
591
|
+
| Warning Flag | Description | Triggered By |
|
|
592
|
+
|--------------|-------------|--------------|
|
|
593
|
+
| `ProcessingWarning.TOO_SHORT` | Sequence duration < minimum_duration_minutes | `prepare_for_inference()` |
|
|
594
|
+
| `ProcessingWarning.QUALITY` | Data contains ILL or SENSOR_CALIBRATION quality flags | `prepare_for_inference()` |
|
|
595
|
+
| `ProcessingWarning.IMPUTATION` | Data contains interpolated values | `interpolate_gaps()` |
|
|
596
|
+
| `ProcessingWarning.CALIBRATION` | Data contains calibration events | `prepare_for_inference()` |
|
|
597
|
+
|
|
598
|
+
**Usage:**
|
|
599
|
+
|
|
600
|
+
```python
|
|
601
|
+
from format_processor import FormatProcessor
from interface.cgm_interface import ProcessingWarning

processor = FormatProcessor()
|
|
602
|
+
processed_df = processor.interpolate_gaps(unified_df)
|
|
603
|
+
inference_df, warnings = processor.prepare_for_inference(processed_df)
|
|
604
|
+
|
|
605
|
+
# Check individual warnings
|
|
606
|
+
if warnings & ProcessingWarning.QUALITY:
|
|
607
|
+
    print("Quality issues detected")
|
|
608
|
+
|
|
609
|
+
# Get all warnings as list
|
|
610
|
+
all_warnings = processor.get_warnings()
|
|
611
|
+
print(f"Collected {len(all_warnings)} warnings")
|
|
612
|
+
|
|
613
|
+
# Check if any warnings exist
|
|
614
|
+
if processor.has_warnings():
|
|
615
|
+
    print("Processing completed with warnings")
|
|
616
|
+
```
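The `&` checks above are consistent with a bit-flag enumeration. The sketch below shows how such flags combine, assuming `ProcessingWarning` behaves like Python's `enum.Flag`; `WarningFlag` and `collect_warnings` are hypothetical stand-ins, not library names.

```python
from enum import Flag, auto


class WarningFlag(Flag):
    """Hypothetical stand-in for ProcessingWarning."""
    NONE = 0
    TOO_SHORT = auto()
    QUALITY = auto()
    IMPUTATION = auto()
    CALIBRATION = auto()


def collect_warnings(duration_minutes, qualities, has_imputed, minimum_duration_minutes=180):
    """Combine independent checks into one flag value with bitwise OR."""
    flags = WarningFlag.NONE
    if duration_minutes < minimum_duration_minutes:
        flags |= WarningFlag.TOO_SHORT
    if any(q != 0 for q in qualities):  # 0 is GOOD in the quality column
        flags |= WarningFlag.QUALITY
    if has_imputed:
        flags |= WarningFlag.IMPUTATION
    return flags


flags = collect_warnings(90, [0, 0, 2], has_imputed=False)
print(bool(flags & WarningFlag.TOO_SHORT), bool(flags & WarningFlag.IMPUTATION))
```

A single flag value is cheap to return alongside the DataFrame, and an empty value is falsy, which is what makes a check like `if warnings:` work.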
|
|
617
|
+
|
|
618
|
+
## Testing
|
|
619
|
+
|
|
620
|
+
```bash
|
|
621
|
+
# Run all tests
|
|
622
|
+
pytest tests/
|
|
623
|
+
|
|
624
|
+
# Run specific test
|
|
625
|
+
pytest tests/test_format_converter.py -v
|
|
626
|
+
|
|
627
|
+
# Generate validation report
|
|
628
|
+
python3 example_schema_usage.py
|
|
629
|
+
|
|
630
|
+
# Run usage examples with real data
|
|
631
|
+
uv run python usage_example.py
|
|
632
|
+
```
|
|
633
|
+
|
|
634
|
+
## Development
|
|
635
|
+
|
|
636
|
+
### Regenerating Schema JSON Files
|
|
637
|
+
|
|
638
|
+
After modifying schema definitions:
|
|
639
|
+
|
|
640
|
+
```bash
|
|
641
|
+
# Regenerate unified.json
|
|
642
|
+
python3 -c "from formats.unified import regenerate_schema_json; regenerate_schema_json()"
|
|
643
|
+
|
|
644
|
+
# Regenerate dexcom.json
|
|
645
|
+
python3 -c "from formats.dexcom import regenerate_schema_json; regenerate_schema_json()"
|
|
646
|
+
|
|
647
|
+
# Regenerate libre.json
|
|
648
|
+
python3 -c "from formats.libre import regenerate_schema_json; regenerate_schema_json()"
|
|
649
|
+
```
|
|
650
|
+
|
|
651
|
+
### Adding New Vendor Formats
|
|
652
|
+
|
|
653
|
+
1. Create schema in `formats/your_vendor.py` using `CGMSchemaDefinition`
|
|
654
|
+
2. Add format to `SupportedCGMFormat` enum in `interface/cgm_interface.py`
|
|
655
|
+
3. Add detection patterns and implement parsing in `format_converter.py`
|
|
656
|
+
4. Add tests in `tests/test_format_converter.py`
|
|
657
|
+
|
|
658
|
+
## Requirements
|
|
659
|
+
|
|
660
|
+
- Python 3.12+
|
|
661
|
+
- polars 1.34.0+
|
|
662
|
+
|
|
663
|
+
Optional:
|
|
664
|
+
|
|
665
|
+
- pandas 2.3.3+ (compatibility layer)
|
|
666
|
+
- pyarrow 21.0.0+ (pandas conversion)
|
|
667
|
+
- frictionless 5.18.1+ (schema validation)
|
|
668
|
+
- pytest 8.0.0+ (testing)
|
|
669
|
+
|
|
670
|
+
## Documentation
|
|
671
|
+
|
|
672
|
+
- **[USAGE.md](USAGE.md)** - Complete usage guide for inference workflows
|
|
673
|
+
- **[usage_example.py](usage_example.py)** - Runnable examples with real data
|
|
674
|
+
- **[interface/PIPELINE.md](interface/PIPELINE.md)** - Detailed pipeline architecture
|
|
675
|
+
- **[formats/UNIFIED_FORMAT.md](formats/UNIFIED_FORMAT.md)** - Unified schema specification
|
|
676
|
+
- **[example_schema_usage.py](example_schema_usage.py)** - Schema validation examples
|
|
677
|
+
|
|
678
|
+
## License
|
|
679
|
+
|
|
680
|
+
See [LICENSE](LICENSE) file.
|