typedframes 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,901 @@
1
+ Metadata-Version: 2.4
2
+ Name: typedframes
3
+ Version: 0.1.0
4
+ Requires-Dist: mypy ; extra == 'mypy'
5
+ Requires-Dist: pandas>=2.3.3 ; extra == 'pandas'
6
+ Requires-Dist: pandera>=0.22.0 ; extra == 'pandera'
7
+ Requires-Dist: polars>=1.37.1 ; extra == 'polars'
8
+ Provides-Extra: mypy
9
+ Provides-Extra: pandas
10
+ Provides-Extra: pandera
11
+ Provides-Extra: polars
12
+ Summary: Static analysis for pandas and polars DataFrames. Catch column errors at lint-time, not runtime.
13
+ Home-Page: https://github.com/w-martin/typedframes
14
+ Author: William Martin
15
+ License: MIT
16
+ Requires-Python: >=3.11
17
+ Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
18
+
19
+ # typedframes
20
+
21
+ > ⚠️ **Project Status: Proof of Concept**
22
+ >
23
+ > `typedframes` (v0.1.0) is currently an experimental proof-of-concept. The core static analysis and mypy/Rust
24
+ > integrations work, but expect rough edges. The codebase prioritizes demonstrating the viability of static DataFrame
25
+ > schema validation over production-grade stability.
26
+ >
27
+ **Static analysis for pandas and polars DataFrames. Catch column errors at lint-time, not runtime.**
28
+
29
+ ```python
30
+ from typedframes import BaseSchema, Column
31
+ from typedframes.pandas import PandasFrame
32
+
33
+
34
+ class UserData(BaseSchema):
+     user_id = Column(type=int)
+     email = Column(type=str)
+     signup_date = Column(type=str)
+
+
+ def process(df: PandasFrame[UserData]) -> None:
+     df[UserData.user_id]  # ✓ Schema descriptor — autocomplete, refactor-safe
+     df['user_id']  # ✓ String access — also validated by checker
+     df['username']  # ✗ Error: Column 'username' not in UserData
44
+ ```
45
+
46
+ **Why descriptors?** Rename a column in ONE place (`user_id = Column(type=int)`), and all references — `Schema.user_id`,
47
+ `str(Schema.user_id)`, `Schema.user_id.col` — update automatically. No find-and-replace across string literals.
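The mechanics behind this are ordinary Python descriptors. As a minimal stdlib-only sketch (illustrative, not typedframes' actual implementation), a `Column` can learn its own attribute name via `__set_name__`, so `str(Schema.user_id)` always tracks whatever the attribute is renamed to:

```python
class Column:
    """Sketch: a descriptor that records the attribute name it is bound to."""

    def __init__(self, type):
        self.type = type
        self.name = None  # filled in by __set_name__

    def __set_name__(self, owner, name):
        # Called automatically at class-creation time; renaming the
        # attribute renames the column everywhere it is referenced.
        self.name = name

    def __str__(self):
        return self.name


class UserData:
    user_id = Column(type=int)
    email = Column(type=str)


print(str(UserData.user_id))  # -> user_id
```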
48
+
49
+ ---
50
+
51
+ ## Table of Contents
52
+
53
+ - [Why typedframes?](#why-typedframes)
54
+ - [Installation](#installation)
55
+ - [Quick Start](#quick-start)
56
+ - [Static Analysis](#static-analysis)
57
+ - [Static Analysis Performance](#static-analysis-performance)
58
+ - [Comparison](#comparison)
59
+ - [Type Safety With Multiple Backends](#type-safety-with-multiple-backends)
60
+ - [Features](#features)
61
+ - [Advanced Usage](#advanced-usage)
62
+ - [Pandera Integration](#pandera-integration)
63
+ - [Examples](#examples)
64
+ - [Philosophy](#philosophy)
65
+ - [FAQ](#faq)
66
+
67
+ ---
68
+
69
+ ## Why typedframes?
70
+
71
+ **The problem:** Many pandas and polars bugs are column mismatches. You access a column that doesn't exist, pass the wrong schema to a function, or make a typo. These errors only surface at runtime, often in production.
72
+
73
+ **The solution:** Define your DataFrame schemas as Python classes. Get static type checking that catches column errors before you even run your code.
74
+
75
+ **What you get:**
76
+
77
+ - ✅ **Static analysis** - Catch column errors at lint-time with mypy or the standalone checker
78
+ - ✅ **Beautiful runtime UX** - `df[Schema.column_group].mean()` (pandas) instead of ugly column lists
79
+ - ✅ **Works with pandas AND polars** - Same schema API, explicit backend types
80
+ - ✅ **Dynamic column matching** - Regex-based ColumnSets for time-series data
81
+ - ✅ **Zero runtime overhead** - No validation, no slowdown
82
+ - ✅ **Type-safe backends** - Type checker knows pandas vs polars methods
83
+
84
+ ---
85
+
86
+ ## Installation
87
+
88
+ ```shell
89
+ pip install typedframes
90
+ ```
91
+ or
92
+ ```shell
93
+ uv add typedframes
94
+ ```
95
+
96
+ The Rust-based checker is included — no separate install needed.
97
+
98
+ ---
99
+
100
+ ## Quick Start
101
+
102
+ ### Define Your Schema (Once)
103
+
104
+ ```python
105
+ from typedframes import BaseSchema, Column, ColumnSet
106
+
107
+
108
+ class SalesData(BaseSchema):
+     date = Column(type=str)
+     revenue = Column(type=float)
+     customer_id = Column(type=int)
+
+     # Dynamic columns with regex
+     metrics = ColumnSet(type=float, members=r"metric_\d+", regex=True)
115
+ ```
116
+
117
+ ### Use With Pandas
118
+
119
+ ```python
120
+ from typedframes.pandas import PandasFrame
121
+
122
+ # Load data with schema — one line
123
+ df = PandasFrame.read_csv("sales.csv", SalesData)
124
+
125
+ # Access columns via schema descriptors
126
+ print(df[SalesData.revenue].sum())
127
+ print(df[SalesData.metrics].mean()) # All metric_* columns
128
+
129
+
130
+ # Type-safe pandas operations
131
+ def analyze(data: PandasFrame[SalesData]) -> float:
+     data[SalesData.revenue]  # ✓ Validated by type checker
+     data['profit']  # ✗ Error at lint-time: 'profit' not in SalesData
+     return data[SalesData.revenue].mean()
135
+
136
+
137
+ # Standard pandas access still works
138
+ filtered = df[df[SalesData.revenue] > 1000]
139
+ grouped = df.groupby(SalesData.customer_id)[str(SalesData.revenue)].sum()
140
+ ```
141
+
142
+ ### Use With Polars
143
+
144
+ ```python
145
+ from typedframes.polars import PolarsFrame
146
+ import polars as pl
147
+
148
+ # Load data with schema — one line
149
+ df = PolarsFrame.read_csv("sales.csv", SalesData)
150
+
151
+ # Use schema column references for type-safe expressions
152
+ print(df.select(SalesData.revenue.col).sum())
153
+
154
+
155
+ # Type-safe polars operations
156
+ def analyze_polars(data: PolarsFrame[SalesData]) -> pl.DataFrame:
+     data.select(SalesData.revenue.col)  # ✓ OK
+     data.select(['profit'])  # ✗ Error at lint-time: 'profit' not in SalesData
+     return data.select(SalesData.revenue.col).mean()
160
+
161
+
162
+ # Polars methods work as expected
163
+ filtered = df.filter(SalesData.revenue.col > 1000)
164
+ grouped = df.group_by('customer_id').agg(SalesData.revenue.col.sum())
165
+ ```
166
+
167
+ ---
168
+
169
+ ## Static Analysis
170
+
171
+ typedframes provides **two ways** to check your code:
172
+
173
+ ### Option 1: Standalone Checker (Fast)
174
+
175
+ ```shell
176
+ # Blazing fast Rust-based checker
177
+ typedframes check src/
178
+
179
+ # Output:
180
+ # ✓ Checked 47 files in 0.0s
181
+ # ✗ src/analysis.py:23 - Column 'profit' not in SalesData
182
+ # ✗ src/pipeline.py:56 - Column 'user_name' not in UserData
183
+ ```
184
+
185
+ **Features:**
186
+ - Catches column name errors
187
+ - Validates schema mismatches between functions
188
+ - Checks both pandas and polars code
189
+ - Orders of magnitude faster than mypy (see [benchmarks](#static-analysis-performance))
190
+
191
+ **Use this for:**
192
+ - Fast feedback during development
193
+ - CI/CD pipelines
194
+ - Pre-commit hooks
195
+
196
+ **Configuration:**
197
+ ```shell
198
+ # Check specific files
199
+ typedframes check src/pipeline.py
200
+
201
+ # Check directory
202
+ typedframes check src/
203
+
204
+ # Fail on any error (for CI)
205
+ typedframes check src/ --strict
206
+
207
+ # JSON output
208
+ typedframes check src/ --json
209
+ ```
210
+
211
+ ### Option 2: Mypy Plugin (Comprehensive)
212
+
213
+ ```toml
+ # pyproject.toml
+ [tool.mypy]
+ plugins = ["typedframes.mypy"]
+ ```
+
+ ```ini
+ # Or mypy.ini
+ [mypy]
+ plugins = typedframes.mypy
+ ```
+
+ ```shell
+ # Run mypy
+ mypy src/
+ ```
225
+
226
+ **Features:**
227
+ - Full type checking across your codebase
228
+ - Catches column errors AND regular type errors
229
+ - IDE integration (VSCode, PyCharm)
230
+ - Works with existing mypy configuration
231
+
232
+ **Use this for:**
233
+ - Comprehensive type checking
234
+ - Integration with existing mypy setup
235
+ - IDE error highlighting
236
+
237
+ ---
238
+
239
+ ## Static Analysis Performance
240
+
241
+ Fast feedback reduces development time. The typedframes Rust binary provides near-instant column checking.
242
+
243
+ **Benchmark results** (10 runs, 3 warmup, caches cleared between runs):
244
+
245
+ | Tool | Version | What it does | typedframes (11 files) | great_expectations (490 files) |
246
+ |--------------------|---------|-------------------------------|------------------------|--------------------------------|
247
+ | typedframes | 0.1.0 | DataFrame column checker | 961µs ±56µs | 930µs ±89µs |
248
+ | ruff | 0.15.0 | Linter (no type checking) | 39ms ±12ms | 360ms ±18ms |
249
+ | ty | 0.0.16 | Type checker | 146ms ±13ms | 1.65s ±26ms |
250
+ | pyrefly | 0.52.0 | Type checker | 152ms ±7ms | 693ms ±33ms |
251
+ | mypy | 1.19.1 | Type checker (no plugin) | 9.15s ±218ms | 12.13s ±400ms |
252
+ | mypy + typedframes | 1.19.1 | Type checker + column checker | 9.34s ±331ms | 13.89s ±491ms |
253
+ | pyright | 1.1.408 | Type checker | 2.34s ±335ms | 8.37s ±253ms |
254
+
255
+ *Run `uv run python benchmarks/benchmark_checkers.py` to reproduce.*
256
+
257
+ The typedframes binary performs lexical column name resolution within a single file. It does not perform cross-file type
258
+ inference. Full type checkers (mypy, pyright, ty) analyze all Python types across your entire codebase. Use both: the
259
+ binary for fast iteration, mypy for comprehensive checking.
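To make "lexical column name resolution" concrete, here is a simplified Python sketch using the stdlib `ast` module (the real checker is Rust and resolves the schema from type annotations; this sketch hardcodes the schema name for illustration):

```python
import ast

SOURCE = '''
class SalesData:
    revenue = None
    date = None

def analyze(df):
    df["revenue"]
    df["profit"]
'''

tree = ast.parse(SOURCE)

# 1. Collect column names from class bodies (stand-in for schema discovery).
schemas = {
    node.name: {t.targets[0].id for t in node.body if isinstance(t, ast.Assign)}
    for node in ast.walk(tree)
    if isinstance(node, ast.ClassDef)
}

# 2. Flag df["..."] subscripts whose string is not a known column.
errors = []
for node in ast.walk(tree):
    if (isinstance(node, ast.Subscript)
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
            and node.slice.value not in schemas["SalesData"]):
        errors.append(node.slice.value)

print(errors)  # -> ['profit']
```

Because this kind of check needs only a single parse and no cross-file type inference, it can run in microseconds.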
260
+
261
+ The standalone checker is built with [`ruff_python_parser`](https://github.com/astral-sh/ruff) for Python AST
262
+ parsing.
263
+
264
+ **Note:** ty (Astral) does not currently support mypy plugins, so use the standalone binary for column checking with ty.
265
+
266
+ ---
267
+
268
+ ## Comparison
269
+
270
+ ### Feature Matrix (Static Analysis Focus)
271
+
272
+ Comprehensive comparison of pandas/DataFrame typing and validation tools. **typedframes focuses on static
+ analysis** — catching errors at lint-time, before your code runs.
274
+
275
+ | Feature | typedframes | Pandera | Great Expectations | strictly_typed_pandas | pandas-stubs | dataenforce | pandas-type-checks | StaticFrame | narwhals |
276
+ |---------------------------------|------------------------|-------------|--------------------|-----------------------|--------------|-------------|--------------------|------------------|----------|
277
+ | **Version tested** | 0.1.0 | 0.29.0 | 1.4.3 | 0.3.6 | 3.0.0 | 0.1.2 | 1.1.3 | 3.7.0 | 2.16.0 |
278
+ | **Analysis Type** |
279
+ | When errors are caught | **Static (lint-time)** | Runtime | Runtime | Static + Runtime | Static | Runtime | Runtime | Static + Runtime | Runtime |
280
+ | **Static Analysis (our focus)** |
281
+ | Mypy plugin | ✅ Yes | ⚠️ Limited | ❌ No | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ⚠️ Basic | ❌ No |
282
+ | Standalone checker | ✅ Rust (~1ms) | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
283
+ | Column name checking | ✅ Yes | ⚠️ Limited | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
284
+ | Column type checking | ✅ Yes | ⚠️ Limited | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
285
+ | Typo suggestions | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
286
+ | **Runtime Validation** |
287
+ | Data validation | ❌ No | ✅ Excellent | ✅ Excellent | ✅ typeguard | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
288
+ | Value constraints | ❌ No | ✅ Yes | ✅ Excellent | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
289
+ | **Schema Features** |
290
+ | Column grouping | ✅ ColumnGroup | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
291
+ | Regex column matching | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
292
+ | **Backend Support** |
293
+ | Pandas | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ❌ Own | ✅ Yes |
294
+ | Polars | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ Own | ✅ Yes |
295
+ | DuckDB, cuDF, etc. | ❌ No | ❌ No | ✅ Spark, SQL | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Yes |
296
+ | **Project Status (Feb 2026)** |
297
+ | Active development | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ Low | ✅ Yes | ❌ Inactive | ⚠️ Low | ✅ Yes | ✅ Yes |
298
+
299
+ **Legend:** ✅ Full support | ⚠️ Limited/Partial | ❌ Not supported
300
+
301
+ ### Tool Descriptions
302
+
303
+ - **[Pandera](https://pandera.readthedocs.io/)** (v0.29.0): Excellent runtime validation. Static analysis support exists
304
+ but has limitations—column access via `df["column"]` is not validated, and schema mismatches between functions may not
305
+ be caught.
306
+
307
+ - **[strictly_typed_pandas](https://strictly-typed-pandas.readthedocs.io/)** (v0.3.6): Provides `DataSet[Schema]` type
308
+ hints with mypy support. No standalone checker. No polars support. Runtime validation via typeguard.
309
+
310
+ - **[pandas-stubs](https://github.com/pandas-dev/pandas-stubs)** (v3.0.0): Official pandas type stubs. Provides
311
+ API-level types but no column-level checking.
312
+
313
+ - **[dataenforce](https://github.com/CedricFR/dataenforce)** (v0.1.2): Runtime validation via decorator. Marked as
314
+ experimental/not production-ready. Appears inactive.
315
+
316
+ - **[pandas-type-checks](https://pypi.org/project/pandas-type-checks/)** (v1.1.3): Runtime validation decorator. No
317
+ static analysis.
318
+
319
+ - **[StaticFrame](https://github.com/static-frame/static-frame)** (v3.7.0): Alternative immutable DataFrame library with
320
+ built-in static typing. Not compatible with pandas/polars—requires using StaticFrame's own DataFrame implementation.
321
+
322
+ - **[narwhals](https://narwhals-dev.github.io/narwhals/)** (v2.16.0): Compatibility layer that provides a unified API
323
+ across pandas, polars, DuckDB, cuDF, and more. Solves a different problem—write-once-run-anywhere portability, not
324
+ type safety. See [Why Abstraction Layers Don't Solve Type Safety](#why-abstraction-layers-dont-solve-type-safety)
325
+ below.
326
+
327
+ - **[Great Expectations](https://greatexpectations.io/)** (v1.4.3): Comprehensive data quality framework. Defines
328
+ "expectations" (assertions) about data values, distributions, and schema properties. Excellent for runtime
329
+ validation, data documentation, and data quality monitoring. No static analysis or column-level type checking in
330
+ code. Supports pandas, Spark, and SQL backends.
331
+
332
+ ### Type Checkers (Not DataFrame-Specific)
333
+
334
+ These are general Python type checkers. They don't validate DataFrame column names, but they can be used alongside
335
+ typedframes for comprehensive type checking:
336
+
337
+ - **[mypy](https://mypy-lang.org/)** (v1.19.1): The original Python type checker. typedframes provides a mypy plugin for
338
+ column checking. See [performance benchmarks](#static-analysis-performance).
339
+
340
+ - **[ty](https://github.com/astral-sh/ty)** (v0.0.16, Astral): New Rust-based type checker, 10-60x faster than mypy on
341
+ large codebases. Does not support mypy plugins—use typedframes standalone checker.
342
+
343
+ - **[pyrefly](https://pyrefly.org/)** (v0.52.0, Meta): Rust-based type checker from Meta, replacement for Pyre. Fast,
344
+ but no DataFrame column checking.
345
+
346
+ - **[pyright](https://github.com/microsoft/pyright)** (v1.1.408, Microsoft): Type checker powering Pylance/VSCode. No
347
+ mypy plugin support—use typedframes standalone checker.
348
+
349
+ ### Not Directly Comparable
350
+
351
+ These tools serve different purposes:
352
+
353
+ - **[pandas_lint](https://github.com/Jean-EstevezT/pandas_lint)**: Lints pandas code patterns (performance, best
354
+ practices). Does not check column names/types.
355
+ - **[pandas-vet](https://github.com/deppen8/pandas-vet)**: Flake8 plugin for pandas best practices. Does not check
356
+ column names/types.
357
+
358
+ ### When to Use What
359
+
360
+ | Use Case | Recommended Tool |
361
+ |------------------------------------------------------|-------------------------------------|
362
+ | Static column checking (existing pandas/polars) | **typedframes** |
363
+ | Runtime data validation | Pandera |
364
+ | Both static + runtime | typedframes + `to_pandera_schema()` |
365
+ | Cross-library portability (write once, run anywhere) | narwhals |
366
+ | Data quality monitoring / pipeline validation | Great Expectations |
367
+ | Immutable DataFrames from scratch | StaticFrame |
368
+ | Pandas API type hints only | pandas-stubs |
369
+
370
+ ---
371
+
372
+ ## Type Safety With Multiple Backends
373
+
374
+ typedframes uses **explicit backend types** to ensure complete type safety:
375
+
376
+ ```python
377
+ from typedframes import BaseSchema, Column
378
+ from typedframes.pandas import PandasFrame
379
+ from typedframes.polars import PolarsFrame
380
+
381
+
382
+ class UserData(BaseSchema):
+     user_id = Column(type=int)
+     email = Column(type=str)
+
+
+ # Pandas pipeline - type checker knows pandas methods
+ def pandas_analyze(df: PandasFrame[UserData]) -> PandasFrame[UserData]:
+     return df[df[UserData.user_id] > 100]  # ✓ Pandas syntax
+
+
+ # Polars pipeline - type checker knows polars methods
+ def polars_analyze(df: PolarsFrame[UserData]) -> PolarsFrame[UserData]:
+     return df.filter(UserData.user_id.col > 100)  # ✓ Polars syntax
395
+
396
+
397
+ # Type checker prevents mixing backends
398
+ df_pandas = PandasFrame.read_csv("data.csv", UserData)
399
+ df_polars = PolarsFrame.read_csv("data.csv", UserData)
400
+
401
+ pandas_analyze(df_pandas) # ✓ OK
402
+ polars_analyze(df_polars) # ✓ OK
403
+ pandas_analyze(df_polars) # ✗ Type error: Expected PandasFrame, got PolarsFrame
404
+ ```
405
+
406
+ ---
407
+
408
+ ## Features
409
+
410
+ ### Clean Schema Definition
411
+
412
+ ```python
413
+ from typedframes import BaseSchema, Column, ColumnSet, ColumnGroup
414
+
415
+
416
+ class TimeSeriesData(BaseSchema):
+     timestamp = Column(type=str)
+     temperature = ColumnSet(type=float, members=r"temp_sensor_\d+", regex=True)
+     pressure = ColumnSet(type=float, members=r"pressure_\d+", regex=True)
+
+     # Logical grouping
+     sensors = ColumnGroup(members=[temperature, pressure])
423
+ ```
424
+
425
+ ### Beautiful Runtime API
426
+
427
+ ```python
428
+ from typedframes.pandas import PandasFrame
429
+
430
+ df = PandasFrame.read_csv("sensors.csv", TimeSeriesData)
431
+
432
+ # Access column groups as DataFrames
433
+ temps = df[TimeSeriesData.temperature] # All temp_sensor_* columns
434
+ all_sensors = df[TimeSeriesData.sensors] # All sensor columns
435
+
436
+ # Clean operations
437
+ avg_temp = df[TimeSeriesData.temperature].mean()
438
+ max_pressure = df[TimeSeriesData.pressure].max()
439
+
440
+ # Standard pandas access still works
441
+ df['timestamp'] # Single column
442
+ df[['timestamp', 'temp_sensor_1']] # Multi-column select
443
+ ```
444
+
445
+ ### Column-Level Static Checking
446
+
447
+ ```python
448
+ import pandas as pd
+
+ from typedframes.pandas import PandasFrame
+
+
+ def daily_summary(data: PandasFrame[TimeSeriesData]) -> pd.Series:
+     # Type checker validates column access
+     data['timestamp']  # ✓ OK - column exists
+     data['date']  # ✗ Error: Column 'date' not in TimeSeriesData
+
+     # Type checker validates ColumnSet access
+     temps = data[TimeSeriesData.temperature]  # ✓ OK - ColumnSet exists
+     summary = temps.mean()  # per-column means across all temp_sensor_* columns
+     return summary
460
+ ```
461
+
462
+ ### Dynamic Column Matching
463
+
464
+ Perfect for time-series data where column counts change. Regex ColumnSets match the actual DataFrame columns at load
+ time (`read_csv()` / `from_schema()`) — the matched columns are stored and used when you access `df[Schema.sensors]`.
+ Static analysis validates that the ColumnSet *name* exists in the schema, but cannot verify which columns the regex
+ will match at runtime.
468
+
469
+ ```python
470
+ class SensorReadings(BaseSchema):
+     timestamp = Column(type=str)
+     # Matches: sensor_1, sensor_2, ..., sensor_N
+     sensors = ColumnSet(type=float, members=r"sensor_\d+", regex=True)
474
+
475
+ # Works regardless of how many sensor columns exist
476
+ df = PandasFrame.read_csv("readings_2024_01.csv", SensorReadings) # 50 sensors
477
+ df[SensorReadings.sensors].mean() # All sensor columns
478
+
479
+ df = PandasFrame.read_csv("readings_2024_02.csv", SensorReadings) # 75 sensors
480
+ df[SensorReadings.sensors].mean() # All sensor columns (different count, same code)
481
+ ```
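The matching step itself amounts to one pass over the actual column names. A stdlib-only sketch of that resolution (`match_columns` is a hypothetical helper, not typedframes' internal API):

```python
import re


def match_columns(pattern: str, columns: list[str]) -> list[str]:
    """Return the columns a regex ColumnSet would resolve to."""
    rx = re.compile(pattern)
    return [c for c in columns if rx.fullmatch(c)]


# January file has 3 sensors, February has 4: same pattern, same code.
jan = ["timestamp", "sensor_1", "sensor_2", "sensor_3"]
feb = ["timestamp", "sensor_1", "sensor_2", "sensor_3", "sensor_4"]

print(match_columns(r"sensor_\d+", jan))  # -> ['sensor_1', 'sensor_2', 'sensor_3']
print(len(match_columns(r"sensor_\d+", feb)))  # -> 4
```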
482
+
483
+ ---
484
+
485
+ ## Advanced Usage
486
+
487
+ ### Merges, Joins, and Filters
488
+
489
+ Schema-typed DataFrames preserve their type through common operations:
490
+
491
+ **Pandas:**
492
+
493
+ ```python
494
+ from typedframes import BaseSchema, Column
495
+ from typedframes.pandas import PandasFrame
496
+ import pandas as pd
497
+
498
+
499
+ class UserSchema(BaseSchema):
+     user_id = Column(type=int)
+     email = Column(type=str)
+
+
+ class OrderSchema(BaseSchema):
+     order_id = Column(type=int)
+     user_id = Column(type=int)
+     total = Column(type=float)
+
+
+ # Schema preserved through filtering
+ def get_active_users(df: PandasFrame[UserSchema]) -> PandasFrame[UserSchema]:
+     return df[df[UserSchema.user_id] > 100]  # ✓ Still PandasFrame[UserSchema]
513
+
514
+
515
+ # Schema preserved through merges
516
+ users: PandasFrame[UserSchema] = ...
517
+ orders: PandasFrame[OrderSchema] = ...
518
+ merged = users.merge(orders, on=str(UserSchema.user_id))
519
+ ```
520
+
521
+ **Polars:**
522
+ ```python
523
+ from typedframes.polars import PolarsFrame
524
+ import polars as pl
525
+
526
+
527
+ # Schema columns work in filter expressions
+ def filter_users(df: PolarsFrame[UserSchema]) -> pl.DataFrame:
+     return df.filter(UserSchema.user_id.col > 100)
+
+
+ # Schema columns work in join expressions
+ def join_data(
+     users: PolarsFrame[UserSchema],
+     orders: PolarsFrame[OrderSchema]
+ ) -> pl.DataFrame:
+     return users.join(
+         orders,
+         left_on=UserSchema.user_id.col,
+         right_on=OrderSchema.user_id.col
+     )
+
+
+ # Schema columns work in select expressions
+ def select_columns(df: PolarsFrame[UserSchema]) -> pl.DataFrame:
+     return df.select([UserSchema.user_id.col, UserSchema.email.col])
547
+ ```
548
+
549
+ ### Schema Composition
550
+
551
+ Compose upward — build bigger schemas from smaller ones via inheritance. Type checkers see all columns natively.
552
+
553
+ ```python
554
+ from typedframes import BaseSchema, Column
555
+ from typedframes.pandas import PandasFrame
556
+
557
+
558
+ # Start with the smallest useful schema
559
+ class UserPublic(BaseSchema):
+     user_id = Column(type=int)
+     email = Column(type=str)
+     name = Column(type=str)
+
+
+ # Extend it — never strip down
+ class UserFull(UserPublic):
+     password_hash = Column(type=str)
+
+
+ class Orders(BaseSchema):
+     order_id = Column(type=int)
+     user_id = Column(type=int)
+     total = Column(type=float)
+
+
+ # Combine via multiple inheritance
+ class UserOrders(UserPublic, Orders):
+     """Type checkers see all columns from both parents."""
580
+
581
+
582
+ # Or use the + operator
583
+ UserOrdersDynamic = UserPublic + Orders
584
+
585
+ users: PandasFrame[UserPublic] = ...
586
+ orders: PandasFrame[Orders] = ...
587
+ merged: PandasFrame[UserOrders] = PandasFrame.from_schema(
+     users.merge(orders, on=str(UserPublic.user_id)), UserOrders
+ )
590
+ ```
591
+
592
+ Overlapping columns with the same type are allowed (common after merges). Conflicting types raise `SchemaConflictError`.
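The combination rule can be pictured as a dict merge with a type-equality check. A minimal stdlib sketch (the `SchemaConflictError` name comes from the text above; the merge logic here is illustrative, not typedframes' implementation):

```python
class SchemaConflictError(Exception):
    pass


def combine(a: dict[str, type], b: dict[str, type]) -> dict[str, type]:
    """Merge two {column: type} mappings, rejecting type conflicts."""
    merged = dict(a)
    for name, typ in b.items():
        if name in merged and merged[name] is not typ:
            raise SchemaConflictError(
                f"Column {name!r}: {merged[name].__name__} vs {typ.__name__}"
            )
        merged[name] = typ
    return merged


user_public = {"user_id": int, "email": str, "name": str}
orders = {"order_id": int, "user_id": int, "total": float}  # user_id overlaps with the same type: allowed

print(sorted(combine(user_public, orders)))
# -> ['email', 'name', 'order_id', 'total', 'user_id']
```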
593
+
594
+ See [`examples/schema_algebra_example.py`](examples/schema_algebra_example.py) for a complete walkthrough.
595
+
596
+ ---
597
+
598
+ ## Pandera Integration
599
+
600
+ Convert typedframes schemas to [Pandera](https://pandera.readthedocs.io/) schemas for runtime validation. Define your
601
+ schema once, get both static and runtime checking.
602
+
603
+ ```shell
604
+ pip install typedframes[pandera]
605
+ ```
606
+
607
+ ```python
608
+ from typedframes import BaseSchema, Column
609
+ from typedframes.pandera import to_pandera_schema
610
+ import pandas as pd
611
+
612
+
613
+ class UserData(BaseSchema):
+     user_id = Column(type=int)
+     email = Column(type=str)
+     age = Column(type=int, nullable=True)
617
+
618
+
619
+ # Convert to pandera schema
620
+ pandera_schema = to_pandera_schema(UserData)
621
+
622
+ # Validate data at runtime
623
+ df = pd.read_csv("users.csv")
624
+ validated_df = pandera_schema.validate(df) # Raises SchemaError on failure
625
+ ```
626
+
627
+ The conversion maps:
628
+
629
+ - `Column` type/nullable/alias to `pa.Column` dtype/nullable/name
630
+ - `ColumnSet` with explicit members to individual `pa.Column` entries
631
+ - `ColumnSet` with regex to `pa.Column(regex=True)`
632
+ - `allow_extra_columns` to pandera's `strict` mode
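As a rough picture of that mapping, here is a stdlib-only sketch that emits plain dicts in place of `pa.Column` objects (the real conversion builds a pandera `DataFrameSchema`; field names below mirror the bullets above, not the actual internals):

```python
def to_schema_spec(columns: dict[str, dict]) -> dict:
    """Sketch of the Column -> pa.Column field mapping, using plain dicts."""
    spec = {}
    for attr, opts in columns.items():
        spec[opts.get("alias", attr)] = {        # alias -> pa.Column name
            "dtype": opts["type"],               # type -> pa.Column dtype
            "nullable": opts.get("nullable", False),
            "regex": opts.get("regex", False),   # regex ColumnSet -> pa.Column(regex=True)
        }
    return spec


user_data = {
    "user_id": {"type": int},
    "email": {"type": str},
    "age": {"type": int, "nullable": True},
}
print(to_schema_spec(user_data)["age"]["nullable"])  # -> True
```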
633
+
634
+ ---
635
+
636
+ ## Examples
637
+
638
+ ### Basic CSV Processing
639
+
640
+ ```python
641
+ from typedframes import BaseSchema, Column
642
+ from typedframes.pandas import PandasFrame
643
+
644
+
645
+ class Orders(BaseSchema):
+     order_id = Column(type=int)
+     customer_id = Column(type=int)
+     total = Column(type=float)
+     date = Column(type=str)
+
+
+ def calculate_revenue(orders: PandasFrame[Orders]) -> float:
+     return orders[Orders.total].sum()
654
+
655
+
656
+ df = PandasFrame.read_csv("orders.csv", Orders)
657
+ revenue = calculate_revenue(df)
658
+ ```
659
+
660
+ ### Time Series Analysis
661
+
662
+ ```python
663
+ from typedframes import BaseSchema, Column, ColumnSet, ColumnGroup
664
+ from typedframes.pandas import PandasFrame
665
+
666
+
667
+ class SensorData(BaseSchema):
+     timestamp = Column(type=str)
+     temperature = ColumnSet(type=float, members=r"temp_\d+", regex=True)
+     humidity = ColumnSet(type=float, members=r"humidity_\d+", regex=True)
+
+     all_sensors = ColumnGroup(members=[temperature, humidity])
673
+
674
+
675
+ df = PandasFrame.read_csv("sensors.csv", SensorData)
676
+
677
+ # Clean, type-safe operations
678
+ avg_temp_per_row = df[SensorData.temperature].mean(axis=1)
679
+ all_readings_stats = df[SensorData.all_sensors].describe()
680
+ ```
681
+
682
+ ### Multi-Step Pipeline
683
+
684
+ ```python
685
+ from typedframes import BaseSchema, Column
686
+ from typedframes.pandas import PandasFrame
687
+
688
+
689
+ class RawSales(BaseSchema):
+     date = Column(type=str)
+     product_id = Column(type=int)
+     quantity = Column(type=int)
+     price = Column(type=float)
+
+
+ class AggregatedSales(BaseSchema):
+     date = Column(type=str)
+     total_revenue = Column(type=float)
+     total_quantity = Column(type=int)
+
+
+ def aggregate_daily(df: PandasFrame[RawSales]) -> PandasFrame[AggregatedSales]:
+     result = df.groupby(RawSales.date).agg({
+         str(RawSales.price): 'sum',
+         str(RawSales.quantity): 'sum',
+     }).reset_index()
+
+     result.columns = ['date', 'total_revenue', 'total_quantity']
+     return PandasFrame.from_schema(result, AggregatedSales)
710
+
711
+
712
+ # Type-safe pipeline
713
+ raw = PandasFrame.read_csv("sales.csv", RawSales)
714
+ aggregated = aggregate_daily(raw)
715
+
716
+
717
+ # Type checker validates schema transformations
718
+ def analyze(df: PandasFrame[AggregatedSales]) -> float:
+     df[AggregatedSales.total_revenue]  # ✓ OK
+     df['price']  # ✗ Error: 'price' not in AggregatedSales
+     return df[AggregatedSales.total_revenue].mean()
722
+ ```
723
+
724
+ ### Polars Performance Pipeline
725
+
726
+ ```python
727
+ from typedframes import BaseSchema, Column
728
+ from typedframes.polars import PolarsFrame
729
+ import polars as pl
730
+
731
+
732
+ class LargeDataset(BaseSchema):
+     id = Column(type=int)
+     value = Column(type=float)
+     category = Column(type=str)
+
+
+ def efficient_aggregation(df: PolarsFrame[LargeDataset]) -> pl.DataFrame:
+     return (
+         df.filter(LargeDataset.value.col > 100)
+         .group_by('category')
+         .agg(LargeDataset.value.col.mean())
+     )
744
+
745
+
746
+ # Polars handles large files efficiently
747
+ df = PolarsFrame.read_csv("huge_file.csv", LargeDataset)
748
+ result = efficient_aggregation(df)
749
+ ```
750
+
751
+ ---
752
+
753
+ ## Philosophy
754
+
755
+ ### Type Safety Over Validation
756
+
757
+ We believe static analysis catches bugs earlier and cheaper than runtime validation.
758
+
759
+ **typedframes focuses on:**
760
+ - ✅ Catching errors at lint-time
761
+ - ✅ Zero runtime overhead
762
+ - ✅ Developer experience
763
+
764
+ **We explicitly don't focus on:**
765
+ - ❌ Runtime data validation (use Pandera)
766
+ - ❌ Statistical checks (use Pandera)
767
+ - ❌ Data quality monitoring (use Great Expectations)
768
+
769
+ **Important:** `PandasFrame.from_schema()` is a *trust assertion*, not a validation step. It tells the type checker
+ "this DataFrame conforms to this schema" without verifying the actual data. The linter catches mistakes in your code
+ (wrong column names, schema mismatches between functions), but it cannot verify that a CSV file contains the expected
+ columns. For runtime validation of external data, use [`to_pandera_schema()`](#pandera-integration) to convert your
+ typedframes schemas to Pandera schemas.
775
+
776
+ ### Explicit Backend Types
777
+
778
+ We use explicit `PandasFrame` and `PolarsFrame` types because:
779
+ - Pandas and polars have different APIs
780
+ - Type safety requires knowing which methods are available
781
+ - Being explicit prevents bugs
782
+
783
+ **Trade-offs we avoid:**
784
+ - ❌ "Universal DataFrame" abstractions (you lose features)
785
+ - ❌ Implicit backend detection (runtime errors)
786
+ - ❌ Lowest-common-denominator APIs
787
+
788
+ The reason to choose polars over pandas is its lazy evaluation, native parallelism, and expressive query syntax.
789
+ Abstraction layers often must expose a lowest-common-denominator API. By using explicit backend types instead,
790
+ typedframes lets you use each library's full, native API while still getting schema-level type safety.
791
+
792
+ ### Why Abstraction Layers Don't Solve Type Safety
793
+
794
+ Tools like [narwhals](https://narwhals-dev.github.io/narwhals/) solve a different problem: writing portable code that runs on pandas, polars, DuckDB, cuDF, and other backends. This is useful for library authors who want to support multiple backends without maintaining separate codebases.
795
+
796
+ However, abstraction layers don't provide column-level type safety:
797
+
798
+ ```python
799
+ import narwhals as nw
800
+
801
+ def process(df: nw.DataFrame) -> nw.DataFrame:
+     # No static checking - "revenue" typo won't be caught until runtime
+     return df.filter(nw.col("revnue") > 100)  # Typo: "revnue" vs "revenue"
804
+ ```
805
+
806
+ **The fundamental issue:** Abstraction layers abstract over *which library* you're using, not *what columns* your data has. They can't know at lint-time whether "revenue" is a valid column in your DataFrame.
807
+
808
+ typedframes solves the orthogonal problem of schema safety:
809
+
810
+ ```python
811
+ from typedframes import BaseSchema, Column
+ from typedframes.polars import PolarsFrame
+
+
+ class SalesData(BaseSchema):
+     revenue = Column(type=float)
+
+
+ def process(df: PolarsFrame[SalesData]) -> PolarsFrame[SalesData]:
+     return df.filter(df['revnue'] > 100)  # ✗ Error at lint-time: 'revnue' not in SalesData
818
+ ```
819
+
820
+ **Use narwhals when:** You're writing a library that needs to work with multiple DataFrame backends.
821
+
822
+ **Use typedframes when:** You want to catch column name/type errors before your code runs.
823
+
824
+ ### Why No Built-in Validation?
825
+
826
+ Ideally, validation happens at the point of data ingestion rather than in Python application code. If you're validating
827
+ DataFrames in Python, consider whether your data pipeline could enforce constraints earlier. Use Pandera for cases where
828
+ runtime validation is genuinely necessary.
829
+
830
+ ---
831
+
832
+ ## License
833
+
834
+ MIT License - see [LICENSE](LICENSE)
835
+
836
+ ---
837
+
838
+ ## Roadmap
839
+
840
+ **Shipped:**
841
+ - [x] Schema definition API
842
+ - [x] Pandas support
843
+ - [x] Polars support
844
+ - [x] Mypy plugin
845
+ - [x] Standalone checker (Rust)
846
+ - [x] Explicit backend types
847
+ - [x] Merge/join schema preservation
848
+ - [x] Schema Composition (multiple inheritance, `SchemaA + SchemaB`)
849
+ - [x] Column name collision warnings
850
+ - [x] Pandera integration (`to_pandera_schema()`)
851
+
852
+ **Planned:**
853
+
854
+ - [ ] **Opt-in data loading constraints** - `Field` class with constraints (`gt`, `ge`, `lt`, `le`), strictly isolated
855
+ to `from_schema()` ingestion boundaries
856
+
857
+ ---
858
+
859
+ ## FAQ
860
+
861
+ **Q: Do I need to choose between pandas and polars?**
862
+ A: No. Define your schema once, use it with both. Just use the appropriate type (`PandasFrame` or `PolarsFrame`) in your function signatures.
863
+
864
+ **Q: Does this replace Pandera?**
865
+ A: No, it complements it. Use typedframes for static analysis, and `to_pandera_schema()` to convert your schemas to
866
+ Pandera for runtime validation. See [Pandera Integration](#pandera-integration).
867
+
868
+ **Q: Is the standalone checker required?**
869
+ A: No. You can use just the mypy plugin, just the standalone checker, or both. They catch the same errors.
870
+
871
+ **Q: What works without any plugin?**
872
+ A: The `__getitem__` overloads on `PandasFrame` mean that any type checker (mypy, pyright, ty) understands
873
+ `df[Schema.column]` returns `pd.Series` and `df[Schema.column_set]` returns `pd.DataFrame` — no plugin or stubs needed.
874
+ Column *name* validation (catching typos like `df["revnue"]` in string-based access) still requires the standalone
875
+ checker or mypy plugin.
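Conceptually, those overloads look like the following typing-only sketch (placeholder `Series`/`DataFrame` classes stand in for the pandas types; this is not the shipped stub code):

```python
from typing import overload


class Series: ...        # stand-in for pd.Series
class DataFrame: ...     # stand-in for pd.DataFrame
class Column: ...
class ColumnSet: ...


class PandasFrame:
    @overload
    def __getitem__(self, key: Column) -> Series: ...
    @overload
    def __getitem__(self, key: ColumnSet) -> DataFrame: ...
    def __getitem__(self, key):
        # Runtime dispatch mirroring the static overloads above.
        return DataFrame() if isinstance(key, ColumnSet) else Series()


df = PandasFrame()
print(type(df[Column()]).__name__)     # -> Series
print(type(df[ColumnSet()]).__name__)  # -> DataFrame
```

Because the overloads live on the class itself, any PEP 484-compliant checker resolves them without plugins.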
876
+
877
+ **Q: What about pyright/pylance users?**
878
+ A: The mypy plugin doesn't work with pyright. Use the standalone checker (`typedframes check`) for column name
879
+ validation. Schema descriptor access (`df[Schema.column]`) works natively in pyright without any plugin.
880
+
881
+ **Q: Does this work with existing pandas/polars code?**
882
+ A: Yes. You can gradually adopt typedframes by adding schemas to new code. Existing code continues to work.
883
+
884
+ **Q: What if my column name conflicts with a pandas/polars method?**
885
+ A: No problem. Since column access uses bracket syntax with schema descriptors (`df[Schema.mean]`), there is no conflict
886
+ with DataFrame methods (`df.mean()`). Both work independently.
887
+
888
+ ---
889
+
890
+ ## Credits
891
+
892
+ Built by developers who believe DataFrame bugs should be caught at lint-time, not in production.
893
+
894
+ Inspired by the needs of ML/data science teams working with complex data pipelines.
895
+
896
+ ---
897
+
898
+ **Questions? Issues? Ideas?** [Open an issue](https://github.com/w-martin/typedframes/issues)
899
+
900
+ **Ready to catch DataFrame bugs before runtime?** `pip install typedframes`
901
+