cgm-format 0.2.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 GlucoseDAO

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,680 @@
Metadata-Version: 2.4
Name: cgm-format
Version: 0.2.2
Summary: Unified CGM data format converter for ML training and inference
License: MIT License

Copyright (c) 2025 GlucoseDAO

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: polars>=1.34.0
Provides-Extra: extra
Requires-Dist: pandas>=2.3.3; extra == "extra"
Requires-Dist: pyarrow>=21.0.0; extra == "extra"
Requires-Dist: frictionless>=5.18.1; extra == "extra"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

# cgm_format

Python library for converting vendor-specific Continuous Glucose Monitoring (CGM) data (Dexcom, Libre) into a standardized unified format for ML training and inference.

## Features

- **Vendor format detection**: Automatic detection of Dexcom, Libre, and Unified formats
- **Robust parsing**: Handles BOM marks, encoding artifacts, and vendor-specific CSV quirks
- **Unified schema**: Standardized data format with service columns (metadata) and data columns
- **Schema validation**: Frictionless Data Table Schema support for validation
- **Type-safe**: Polars-based with strict type definitions and enum support
- **Extensible**: Clean abstract interfaces for adding new vendor formats

## Installation

```bash
# Using uv (recommended)
uv pip install -e .

# Or using pip
pip3 install -e .

# Optional dependencies
uv pip install -e ".[extra]"  # pandas, pyarrow, frictionless
uv pip install -e ".[dev]"    # pytest
```

## Quick Start

### Basic Parsing

```python
from format_converter import FormatParser
import polars as pl

# Parse any supported CGM file (Dexcom, Libre, or Unified)
unified_df = FormatParser.parse_from_file("data/example.csv")

# Access the data
print(unified_df.head())

# Save to unified format
FormatParser.to_csv_file(unified_df, "output.csv")
```

### Complete Inference Pipeline

```python
from format_converter import FormatParser
from format_processor import FormatProcessor

# Stages 1-3: Parse vendor format to unified
unified_df = FormatParser.parse_from_file("data/dexcom_export.csv")

# Stages 4 and 6: Process for inference (Stage 5 synchronization is optional)
processor = FormatProcessor(
    expected_interval_minutes=5,
    small_gap_max_minutes=15
)

# Fill gaps and create sequences
processed_df = processor.interpolate_gaps(unified_df)

# Prepare final inference data
inference_df, warnings = processor.prepare_for_inference(
    processed_df,
    minimum_duration_minutes=180,
    maximum_wanted_duration=1440
)

# Feed to ML model
predictions = your_model.predict(inference_df)
```

**See [USAGE.md](USAGE.md) for complete inference workflows and [usage_example.py](usage_example.py) for runnable examples.**

## Unified Format Schema

The library converts all vendor formats to a standardized schema with two types of columns:

### Service Columns (Metadata)

| Column | Type | Description |
|--------|------|-------------|
| `sequence_id` | `Int64` | Unique sequence identifier |
| `event_type` | `Utf8` | Event type (8-char code: EGV_READ, INS_FAST, CARBS_IN, etc.) |
| `quality` | `Int64` | Data quality (0=GOOD, 1=ILL, 2=SENSOR_CALIBRATION) |

### Data Columns

| Column | Type | Unit | Description |
|--------|------|------|-------------|
| `datetime` | `Datetime` | - | Timestamp (ISO 8601) |
| `glucose` | `Float64` | mg/dL | Blood glucose reading |
| `carbs` | `Float64` | g | Carbohydrate intake |
| `insulin_slow` | `Float64` | u | Long-acting insulin dose |
| `insulin_fast` | `Float64` | u | Short-acting insulin dose |
| `exercise` | `Int64` | seconds | Exercise duration |

See [`formats/UNIFIED_FORMAT.md`](formats/UNIFIED_FORMAT.md) for complete specification and event type enums.

## Processing Pipeline

The library implements a 3-stage parsing pipeline defined in the `CGMParser` interface:

### Stage 1: Preprocess Raw Data

Remove BOM marks, encoding artifacts, and normalize text encoding.

```python
text_data = FormatParser.decode_raw_data(raw_bytes)
```

### Stage 2: Format Detection

Automatically detect vendor format from CSV headers.

```python
from interface.cgm_interface import SupportedCGMFormat

format_type = FormatParser.detect_format(text_data)
# Returns: SupportedCGMFormat.DEXCOM, .LIBRE, or .UNIFIED_CGM
```

### Stage 3: Vendor-Specific Parsing

Parse vendor CSV to unified format, handling vendor-specific quirks:

- Dexcom: High/Low glucose markers, variable-length rows, metadata rows
- Libre: Record type filtering, timestamp format variations

```python
unified_df = FormatParser.parse_to_unified(text_data, format_type)
```

All stages can be chained with convenience methods:

```python
# Parse from file
unified_df = FormatParser.parse_from_file("data.csv")

# Parse from bytes
unified_df = FormatParser.parse_from_bytes(raw_data)

# Parse from string
unified_df = FormatParser.parse_from_string(text_data)
```

See [`interface/PIPELINE.md`](interface/PIPELINE.md) for complete pipeline documentation.
188
+
189
+ ### Stage 4: Gap Interpolation and Sequence Creation
190
+
191
+ The `FormatProcessor.interpolate_gaps()` method handles data continuity:
192
+
193
+ ```python
194
+ from format_processor import FormatProcessor
195
+
196
+ processor = FormatProcessor(
197
+ expected_interval_minutes=5, # Normal CGM reading interval
198
+ small_gap_max_minutes=15 # Max gap size to interpolate
199
+ )
200
+
201
+ # Detect gaps, create sequences, and interpolate missing values
202
+ processed_df = processor.interpolate_gaps(unified_df)
203
+ ```
204
+
205
+ **What it does:**
206
+
207
+ 1. **Gap Detection**: Identifies gaps in continuous glucose monitoring data
208
+ 2. **Sequence Creation**: Splits data at large gaps (>15 min default) into separate sequences
209
+ 3. **Small Gap Interpolation**: Fills small gaps (≤15 min) with linearly interpolated glucose values
210
+ 4. **Calibration Marking**: Marks 24-hour periods after gaps ≥2h45m as `Quality.SENSOR_CALIBRATION`
211
+ 5. **Warning Collection**: Tracks imputation events via `ProcessingWarning.IMPUTATION`

**Example - Analyze sequences created:**

```python
import polars as pl

# Check sequences
sequence_count = processed_df['sequence_id'].n_unique()
print(f"Created {sequence_count} sequences")

# Analyze each sequence
sequence_info = processed_df.group_by('sequence_id').agg([
    pl.col('datetime').min().alias('start_time'),
    pl.col('datetime').max().alias('end_time'),
    pl.col('datetime').count().alias('num_points'),
])

for row in sequence_info.iter_rows(named=True):
    duration_hours = (row['end_time'] - row['start_time']).total_seconds() / 3600
    print(f"Sequence {row['sequence_id']}: {duration_hours:.1f}h, {row['num_points']} points")
```

### Stage 5: Timestamp Synchronization (Optional)

Align timestamps to fixed-frequency intervals for ML models requiring regular time steps:

```python
# After interpolate_gaps(), synchronize to exact intervals
synchronized_df = processor.synchronize_timestamps(processed_df)

# Now all timestamps are at exact 5-minute intervals: 10:00:00, 10:05:00, 10:10:00, etc.
```

**What it does:**

1. Rounds timestamps to nearest minute boundary (removes seconds)
2. Creates fixed-frequency timestamps at `expected_interval_minutes` intervals
3. Linearly interpolates glucose values between measurements
4. Shifts discrete events (carbs, insulin, exercise) to nearest timestamp
5. Preserves sequence boundaries (processes each sequence independently)

- **When to use:** Time-series models expecting fixed intervals (LSTM, transformers, ARIMA)
- **When to skip:** Models handling irregular timestamps, or when original timing is critical

### Stage 6: Inference Preparation

The `prepare_for_inference()` method performs final quality assurance and data extraction:

```python
# Prepare final inference-ready data
inference_df, warnings = processor.prepare_for_inference(
    processed_df,
    minimum_duration_minutes=180,  # Require 3 hours minimum
    maximum_wanted_duration=1440   # Truncate to last 24 hours if longer
)

# Check for quality issues
from interface.cgm_interface import ProcessingWarning

if warnings & ProcessingWarning.TOO_SHORT:
    print("Warning: Sequence shorter than minimum duration")
if warnings & ProcessingWarning.QUALITY:
    print("Warning: Data contains quality issues (ILL or SENSOR_CALIBRATION)")
if warnings & ProcessingWarning.IMPUTATION:
    print("Warning: Data contains interpolated values")
```

**What it does:**

1. **Validation**: Raises `ZeroValidInputError` if no valid glucose data exists
2. **Sequence Selection**: Keeps only the **latest** sequence (most recent timestamps)
3. **Duration Checks**: Warns if sequence < `minimum_duration_minutes`
4. **Quality Checks**: Collects warnings for calibration events and quality flags
5. **Truncation**: Keeps last N minutes if exceeding `maximum_wanted_duration`
6. **Column Extraction**: Returns only data columns (removes service metadata)
286
+
287
+ **Output DataFrame:**
288
+
289
+ ```python
290
+ # inference_df contains only data columns:
291
+ # ['datetime', 'glucose', 'carbs', 'insulin_slow', 'insulin_fast', 'exercise']
292
+
293
+ # Feed directly to ML model
294
+ predictions = your_model.predict(inference_df)
295
+ ```
296
+
297
+ ### Complete Processor Configuration
298
+
299
+ ```python
300
+ from format_processor import FormatProcessor
301
+ from interface.cgm_interface import MINIMUM_DURATION_MINUTES, MAXIMUM_WANTED_DURATION_MINUTES
302
+
303
+ # Initialize processor with custom intervals
304
+ processor = FormatProcessor(
305
+ expected_interval_minutes=5, # CGM reading interval (5 min for Dexcom, 15 min for Libre)
306
+ small_gap_max_minutes=15 # Max gap to interpolate (larger gaps create new sequences)
307
+ )
308
+
309
+ # Stage 4: Fill gaps and create sequences
310
+ processed_df = processor.interpolate_gaps(unified_df)
311
+
312
+ # Stage 5 (Optional): Synchronize to fixed intervals
313
+ # synchronized_df = processor.synchronize_timestamps(processed_df)
314
+
315
+ # Stage 6: Prepare for inference
316
+ inference_df, warnings = processor.prepare_for_inference(
317
+ processed_df, # or synchronized_df if using Stage 5
318
+ minimum_duration_minutes=MINIMUM_DURATION_MINUTES, # Default: 180 (3 hours)
319
+ maximum_wanted_duration=MAXIMUM_WANTED_DURATION_MINUTES # Default: 1440 (24 hours)
320
+ )
321
+
322
+ # Check warnings
323
+ if processor.has_warnings():
324
+ all_warnings = processor.get_warnings()
325
+ print(f"Processing collected {len(all_warnings)} warnings")
326
+ ```

## Advanced Usage

### Working with Schemas

```python
from formats.unified import CGM_SCHEMA, UnifiedEventType, Quality

# Get Polars schema
polars_schema = CGM_SCHEMA.get_polars_schema()
data_only_schema = CGM_SCHEMA.get_polars_schema(data_only=True)

# Get column names
all_columns = CGM_SCHEMA.get_column_names()
data_columns = CGM_SCHEMA.get_column_names(data_only=True)

# Get cast expressions for Polars
cast_exprs = CGM_SCHEMA.get_cast_expressions()
df = df.with_columns(cast_exprs)

# Use enums
event = UnifiedEventType.GLUCOSE  # "EGV_READ"
quality = Quality.GOOD            # 0
```

### Batch Processing with Inference Preparation

```python
from pathlib import Path
from format_converter import FormatParser
from format_processor import FormatProcessor
import polars as pl

data_dir = Path("data")
output_dir = Path("data/inference_ready")
output_dir.mkdir(exist_ok=True)

processor = FormatProcessor()
results = []

for csv_file in data_dir.glob("*.csv"):
    try:
        # Parse to unified format
        unified_df = FormatParser.parse_from_file(csv_file)

        # Process for inference
        processed_df = processor.interpolate_gaps(unified_df)
        inference_df, warnings = processor.prepare_for_inference(processed_df)

        # Add patient identifier
        patient_id = csv_file.stem
        inference_df = inference_df.with_columns([
            pl.lit(patient_id).alias('patient_id')
        ])

        results.append(inference_df)

        # Save individual file
        output_file = output_dir / f"{patient_id}_inference.csv"
        FormatParser.to_csv_file(inference_df, str(output_file))

        warning_str = f"warnings={warnings.value}" if warnings else "OK"
        print(f"✓ {csv_file.name}: {len(inference_df)} records, {warning_str}")

    except Exception as e:
        print(f"✗ Failed {csv_file.name}: {e}")

# Combine all processed data
if results:
    combined_df = pl.concat(results)
    FormatParser.to_csv_file(combined_df, str(output_dir / "combined_inference.csv"))
    print(f"\n✓ Combined {len(results)} files into single dataset")
```

### Format Detection and Validation

```python
from pathlib import Path
from example_schema_usage import run_format_detection_and_validation

# Validate all files in the data directory
run_format_detection_and_validation(
    data_dir=Path("data"),
    parsed_dir=Path("data/parsed"),
    output_file=Path("validation_report.txt")
)
```

This generates a detailed report with:

- Format detection statistics
- Frictionless schema validation results (if the library is installed)
- Known vendor quirks automatically suppressed

## Supported Formats

### Dexcom Clarity Export

- CSV with metadata rows (rows 2-11)
- Variable-length rows (non-EGV events missing trailing columns)
- High/Low glucose markers for out-of-range values
- Event types: EGV, Insulin, Carbs, Exercise
- Multiple timestamp format variants

### FreeStyle Libre

- CSV with metadata row 1, header row 2
- Record type filtering (0=glucose, 4=insulin, 5=food)
- Multiple timestamp format variants
- Separate rapid/long insulin columns

### Unified Format

- Standardized CSV with header row 1
- ISO 8601 timestamps
- Service columns + data columns
- Validates existing unified format files

## Project Structure

```text
cgm_format/
├── interface/                   # Abstract interfaces and schema infrastructure
│   ├── cgm_interface.py         # CGMParser and CGMProcessor interfaces
│   ├── schema.py                # Base schema definition system
│   └── PIPELINE.md              # Pipeline documentation
├── formats/                     # Format-specific schemas and definitions
│   ├── unified.py               # Unified format schema and enums
│   ├── unified.json             # Frictionless schema export
│   ├── dexcom.py                # Dexcom format schema and constants
│   ├── dexcom.json              # Frictionless schema for Dexcom
│   ├── libre.py                 # Libre format schema and constants
│   ├── libre.json               # Frictionless schema for Libre
│   └── UNIFIED_FORMAT.md        # Unified format specification
├── format_converter.py          # FormatParser implementation (Stages 1-3)
├── format_processor.py          # FormatProcessor implementation (Stages 4-6)
├── USAGE.md                     # Complete usage guide for inference
├── usage_example.py             # Runnable usage examples
├── example_schema_usage.py      # Format detection & validation examples
├── tests/                       # Pytest test suite
│   ├── test_format_converter.py # Parsing and conversion tests
│   └── test_schema.py           # Schema validation tests
└── data/                        # Test data and parsed outputs
    └── parsed/                  # Converted unified format files
```

## Architecture

### Two-Layer Interface Design

**CGMParser** (Stages 1-3): Vendor-specific parsing to unified format

- `decode_raw_data()` - Encoding cleanup
- `detect_format()` - Format detection
- `parse_to_unified()` - Vendor CSV → UnifiedFormat

**CGMProcessor** (Stages 4-6): Vendor-agnostic operations on unified data

- `interpolate_gaps()` - Gap detection, sequence creation, and interpolation
- `synchronize_timestamps()` - Timestamp alignment to fixed intervals
- `prepare_for_inference()` - ML preparation with quality checks and truncation

The current implementation:

- `FormatParser` implements the `CGMParser` interface (Stages 1-3)
- `FormatProcessor` implements the `CGMProcessor` interface (Stages 4-6)

### Processing Stages Implementation

**Stages 1-3 (FormatParser):**

- BOM removal and encoding normalization
- Pattern-based format detection (first 15 lines)
- Vendor-specific CSV parsing with quirk handling
- Column mapping to unified schema
- Service field population (sequence_id, event_type, quality)

**Stage 4 (FormatProcessor.interpolate_gaps):**

- Time difference calculation between consecutive readings
- Sequence boundary detection (gaps > `small_gap_max_minutes`)
- Linear interpolation for small gaps (≤ `small_gap_max_minutes`)
- Imputation event creation with `event_type='IMPUTATN'`
- Calibration period marking (24h after gaps ≥ 2h45m)
- Warning collection for imputed data

**Stage 5 (FormatProcessor.synchronize_timestamps):**

- Timestamp rounding to minute boundaries
- Fixed-frequency grid generation at `expected_interval_minutes`
- Asof join (backward/forward) for value alignment
- Linear glucose interpolation between grid points
- Discrete event shifting to nearest timestamp

**Stage 6 (FormatProcessor.prepare_for_inference):**

- Zero-data validation (raises `ZeroValidInputError`)
- Latest sequence selection (max timestamp)
- Duration verification with `TOO_SHORT` warning
- Quality flag detection (`ILL`, `SENSOR_CALIBRATION`)
- Sequence truncation from beginning (preserves most recent data)
- Service column removal (data columns only)
- Warning flag aggregation and return

### Processing Configuration Parameters

**FormatProcessor initialization:**

| Parameter | Default | Description | Effect |
|-----------|---------|-------------|--------|
| `expected_interval_minutes` | 5 | Normal reading interval | Grid spacing for synchronization; gap detection baseline |
| `small_gap_max_minutes` | 15 | Max gap to interpolate | Gaps > this create new sequences; gaps ≤ this are filled |

**Common configurations:**

```python
# Dexcom G6/G7 (5-minute readings)
processor = FormatProcessor(expected_interval_minutes=5, small_gap_max_minutes=15)

# FreeStyle Libre (manual scans, typically 15 min)
processor = FormatProcessor(expected_interval_minutes=15, small_gap_max_minutes=45)

# Strict quality (minimal imputation)
processor = FormatProcessor(expected_interval_minutes=5, small_gap_max_minutes=10)

# Lenient (more gap filling for sparse data)
processor = FormatProcessor(expected_interval_minutes=5, small_gap_max_minutes=30)
```

**prepare_for_inference parameters:**

| Parameter | Default | Description |
|-----------|---------|-------------|
| `minimum_duration_minutes` | 180 | Minimum sequence duration required (warns if shorter) |
| `maximum_wanted_duration` | 1440 | Maximum duration to keep (truncates from beginning) |

**Constants from interface:**

```python
from interface.cgm_interface import (
    MINIMUM_DURATION_MINUTES,         # 180 (3 hours)
    MAXIMUM_WANTED_DURATION_MINUTES,  # 1440 (24 hours)
    CALIBRATION_GAP_THRESHOLD,        # 9900 seconds (2h45m)
)
```

### Schema System

Schemas are defined using `CGMSchemaDefinition` from `interface/schema.py`:

- **Type-safe**: Polars dtypes with constraints
- **Vendor-specific**: Each format has its own schema with quirks documented
- **Frictionless export**: Auto-generate validation schemas
- **Dialect support**: CSV parsing hints (header rows, comment rows, etc.)

## Error Handling

### Exceptions

| Exception | Base | Description |
|-----------|------|-------------|
| `UnknownFormatError` | `ValueError` | Format cannot be detected |
| `MalformedDataError` | `ValueError` | CSV parsing or conversion failed |
| `ZeroValidInputError` | `ValueError` | No valid data points found |
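
All three exceptions derive from `ValueError`, so a single `except ValueError` works as a catch-all. A standalone sketch of the hierarchy (in the library these are importable from its own modules; the `detect` helper below is hypothetical):

```python
class UnknownFormatError(ValueError):
    """Format cannot be detected."""

class MalformedDataError(ValueError):
    """CSV parsing or conversion failed."""

class ZeroValidInputError(ValueError):
    """No valid data points found."""

def detect(header: str) -> str:
    # Hypothetical stand-in for FormatParser.detect_format()
    if "Glucose" not in header:
        raise UnknownFormatError(f"unrecognized header: {header!r}")
    return "UNIFIED_CGM"

try:
    detect("Timestamp,Steps")
except ValueError as exc:  # catches any of the three library errors
    print(f"{type(exc).__name__}: {exc}")
```

Catch the specific subclass when you want different handling per failure mode, and `ValueError` when any parsing failure should be treated the same.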

### Processing Warnings

The `FormatProcessor` collects quality warnings during processing:

| Warning Flag | Description | Triggered By |
|--------------|-------------|--------------|
| `ProcessingWarning.TOO_SHORT` | Sequence duration < minimum_duration_minutes | `prepare_for_inference()` |
| `ProcessingWarning.QUALITY` | Data contains ILL or SENSOR_CALIBRATION quality flags | `prepare_for_inference()` |
| `ProcessingWarning.IMPUTATION` | Data contains interpolated values | `interpolate_gaps()` |
| `ProcessingWarning.CALIBRATION` | Data contains calibration events | `prepare_for_inference()` |

**Usage:**

```python
processor = FormatProcessor()
processed_df = processor.interpolate_gaps(unified_df)
inference_df, warnings = processor.prepare_for_inference(processed_df)

# Check individual warnings
if warnings & ProcessingWarning.QUALITY:
    print("Quality issues detected")

# Get all warnings as list
all_warnings = processor.get_warnings()
print(f"Collected {len(all_warnings)} warnings")

# Check if any warnings exist
if processor.has_warnings():
    print("Processing completed with warnings")
```
617
+
618
+ ## Testing
619
+
620
+ ```bash
621
+ # Run all tests
622
+ pytest tests/
623
+
624
+ # Run specific test
625
+ pytest tests/test_format_converter.py -v
626
+
627
+ # Generate validation report
628
+ python3 example_schema_usage.py
629
+
630
+ # Run usage examples with real data
631
+ uv run python usage_example.py
632
+ ```
633
+
634
+ ## Development
635
+
636
+ ### Regenerating Schema JSON Files
637
+
638
+ After modifying schema definitions:
639
+
640
+ ```bash
641
+ # Regenerate unified.json
642
+ python3 -c "from formats.unified import regenerate_schema_json; regenerate_schema_json()"
643
+
644
+ # Regenerate dexcom.json
645
+ python3 -c "from formats.dexcom import regenerate_schema_json; regenerate_schema_json()"
646
+
647
+ # Regenerate libre.json
648
+ python3 -c "from formats.libre import regenerate_schema_json; regenerate_schema_json()"
649
+ ```
650
+
651
+ ### Adding New Vendor Formats
652
+
653
+ 1. Create schema in `formats/your_vendor.py` using `CGMSchemaDefinition`
654
+ 2. Add format to `SupportedCGMFormat` enum in `interface/cgm_interface.py`
655
+ 3. Add detection patterns and implement parsing in `format_converter.py`
656
+ 4. Add tests in `tests/test_format_converter.py`
657
+
658
+ ## Requirements
659
+
660
+ - Python 3.12+
661
+ - polars 1.34.0+
662
+
663
+ Optional:
664
+
665
+ - pandas 2.3.3+ (compatibility layer)
666
+ - pyarrow 21.0.0+ (pandas conversion)
667
+ - frictionless 5.18.1+ (schema validation)
668
+ - pytest 8.0.0+ (testing)
669
+
670
+ ## Documentation
671
+
672
+ - **[USAGE.md](USAGE.md)** - Complete usage guide for inference workflows
673
+ - **[usage_example.py](usage_example.py)** - Runnable examples with real data
674
+ - **[interface/PIPELINE.md](interface/PIPELINE.md)** - Detailed pipeline architecture
675
+ - **[formats/UNIFIED_FORMAT.md](formats/UNIFIED_FORMAT.md)** - Unified schema specification
676
+ - **[example_schema_usage.py](example_schema_usage.py)** - Schema validation examples
677
+
678
+ ## License
679
+
680
+ See [LICENSE](LICENSE) file.