typedframes 0.1.0__tar.gz
- typedframes-0.1.0/PKG-INFO +901 -0
- typedframes-0.1.0/README.md +882 -0
- typedframes-0.1.0/pyproject.toml +169 -0
- typedframes-0.1.0/rust/Cargo.lock +1324 -0
- typedframes-0.1.0/rust/Cargo.toml +50 -0
- typedframes-0.1.0/rust/README.md +31 -0
- typedframes-0.1.0/rust/benches/parser_bench.rs +21 -0
- typedframes-0.1.0/rust/src/lib.rs +1240 -0
- typedframes-0.1.0/rust/src/main.rs +31 -0
- typedframes-0.1.0/rust/tests/integration_test.rs +335 -0
- typedframes-0.1.0/src/typedframes/__init__.py +67 -0
- typedframes-0.1.0/src/typedframes/base_schema.py +256 -0
- typedframes-0.1.0/src/typedframes/cli.py +94 -0
- typedframes-0.1.0/src/typedframes/column.py +70 -0
- typedframes-0.1.0/src/typedframes/column_group.py +99 -0
- typedframes-0.1.0/src/typedframes/column_group_error.py +42 -0
- typedframes-0.1.0/src/typedframes/column_set.py +84 -0
- typedframes-0.1.0/src/typedframes/missing_dependency_error.py +17 -0
- typedframes-0.1.0/src/typedframes/mypy.py +120 -0
- typedframes-0.1.0/src/typedframes/pandas.py +263 -0
- typedframes-0.1.0/src/typedframes/pandera.py +81 -0
- typedframes-0.1.0/src/typedframes/polars.py +185 -0
- typedframes-0.1.0/src/typedframes/py.typed +0 -0
- typedframes-0.1.0/src/typedframes/schema_algebra.py +101 -0
@@ -0,0 +1,901 @@
Metadata-Version: 2.4
Name: typedframes
Version: 0.1.0
Requires-Dist: mypy ; extra == 'mypy'
Requires-Dist: pandas>=2.3.3 ; extra == 'pandas'
Requires-Dist: pandera>=0.22.0 ; extra == 'pandera'
Requires-Dist: polars>=1.37.1 ; extra == 'polars'
Provides-Extra: mypy
Provides-Extra: pandas
Provides-Extra: pandera
Provides-Extra: polars
Summary: Static analysis for pandas and polars DataFrames. Catch column errors at lint-time, not runtime.
Home-Page: https://github.com/w-martin/typedframes
Author: William Martin
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# typedframes

> ⚠️ **Project Status: Proof of Concept**
>
> `typedframes` (v0.1.0) is currently an experimental proof-of-concept. The core static analysis and mypy/Rust
> integrations work, but expect rough edges. The codebase prioritizes demonstrating the viability of static DataFrame
> schema validation over production-grade stability.

**Static analysis for pandas and polars DataFrames. Catch column errors at lint-time, not runtime.**

```python
from typedframes import BaseSchema, Column
from typedframes.pandas import PandasFrame


class UserData(BaseSchema):
    user_id = Column(type=int)
    email = Column(type=str)
    signup_date = Column(type=str)


def process(df: PandasFrame[UserData]) -> None:
    df[UserData.user_id]  # ✓ Schema descriptor — autocomplete, refactor-safe
    df['user_id']         # ✓ String access — also validated by checker
    df['username']        # ✗ Error: Column 'username' not in UserData
```

**Why descriptors?** Rename a column in ONE place (`user_id = Column(type=int)`), and all references — `Schema.user_id`,
`str(Schema.user_id)`, `Schema.user_id.col` — update automatically. No find-and-replace across string literals.
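The mechanism is the standard Python descriptor protocol. A toy sketch (hypothetical `ToyColumn`/`ToySchema` names, not the real typedframes classes) shows why renaming the attribute renames the column everywhere:

```python
class ToyColumn:
    """Toy column descriptor: the attribute name IS the column name."""

    def __set_name__(self, owner, name):
        # Called once at class-creation time with the attribute name.
        self.name = name

    def __get__(self, instance, owner):
        return self

    def __str__(self):
        return self.name


class ToySchema:
    user_id = ToyColumn()


# Renaming the `user_id` attribute would change this everywhere it is used.
print(str(ToySchema.user_id))  # -> "user_id"
```

Because the name is captured by `__set_name__`, there is exactly one source of truth for it.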

---

## Table of Contents

- [Why typedframes?](#why-typedframes)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Static Analysis](#static-analysis)
- [Static Analysis Performance](#static-analysis-performance)
- [Comparison](#comparison)
- [Type Safety With Multiple Backends](#type-safety-with-multiple-backends)
- [Features](#features)
- [Advanced Usage](#advanced-usage)
- [Pandera Integration](#pandera-integration)
- [Examples](#examples)
- [Philosophy](#philosophy)
- [FAQ](#faq)

---

## Why typedframes?

**The problem:** Many pandas bugs are column mismatches. You access a column that doesn't exist, pass the wrong schema to a function, or make a typo. These errors only surface at runtime, often in production.

**The solution:** Define your DataFrame schemas as Python classes. Get static type checking that catches column errors before you even run your code.

**What you get:**

- ✅ **Static analysis** - Catch column errors at lint-time with mypy or the standalone checker
- ✅ **Beautiful runtime UX** - `df[Schema.column_group].mean()` (pandas) instead of ugly column lists
- ✅ **Works with pandas AND polars** - Same schema API, explicit backend types
- ✅ **Dynamic column matching** - Regex-based ColumnSets for time-series data
- ✅ **Zero runtime overhead** - No validation, no slowdown
- ✅ **Type-safe backends** - Type checker knows pandas vs polars methods

---

## Installation

```shell
pip install typedframes
```

or

```shell
uv add typedframes
```

The Rust-based checker is included — no separate install needed.

---

## Quick Start

### Define Your Schema (Once)

```python
from typedframes import BaseSchema, Column, ColumnSet


class SalesData(BaseSchema):
    date = Column(type=str)
    revenue = Column(type=float)
    customer_id = Column(type=int)

    # Dynamic columns with regex
    metrics = ColumnSet(type=float, members=r"metric_\d+", regex=True)
```

### Use With Pandas

```python
from typedframes.pandas import PandasFrame

# Load data with schema — one line
df = PandasFrame.read_csv("sales.csv", SalesData)

# Access columns via schema descriptors
print(df[SalesData.revenue].sum())
print(df[SalesData.metrics].mean())  # All metric_* columns


# Type-safe pandas operations
def analyze(data: PandasFrame[SalesData]) -> float:
    data[SalesData.revenue]  # ✓ Validated by type checker
    data['profit']           # ✗ Error at lint-time: 'profit' not in SalesData
    return data[SalesData.revenue].mean()


# Standard pandas access still works
filtered = df[df[SalesData.revenue] > 1000]
grouped = df.groupby(SalesData.customer_id)[str(SalesData.revenue)].sum()
```

### Use With Polars

```python
from typedframes.polars import PolarsFrame
import polars as pl

# Load data with schema — one line
df = PolarsFrame.read_csv("sales.csv", SalesData)

# Use schema column references for type-safe expressions
print(df.select(SalesData.revenue.col).sum())


# Type-safe polars operations
def analyze_polars(data: PolarsFrame[SalesData]) -> pl.DataFrame:
    data.select(SalesData.revenue.col)  # ✓ OK
    data.select(['profit'])             # ✗ Error at lint-time: 'profit' not in SalesData
    return data.select(SalesData.revenue.col).mean()


# Polars methods work as expected
filtered = df.filter(SalesData.revenue.col > 1000)
grouped = df.group_by('customer_id').agg(SalesData.revenue.col.sum())
```

---

## Static Analysis

typedframes provides **two ways** to check your code:

### Option 1: Standalone Checker (Fast)

```shell
# Blazing fast Rust-based checker
typedframes check src/

# Output:
# ✓ Checked 47 files in 0.0s
# ✗ src/analysis.py:23 - Column 'profit' not in SalesData
# ✗ src/pipeline.py:56 - Column 'user_name' not in UserData
```

**Features:**

- Catches column name errors
- Validates schema mismatches between functions
- Checks both pandas and polars code
- 10-100x faster than mypy

**Use this for:**

- Fast feedback during development
- CI/CD pipelines
- Pre-commit hooks

**Configuration:**

```shell
# Check specific files
typedframes check src/pipeline.py

# Check directory
typedframes check src/

# Fail on any error (for CI)
typedframes check src/ --strict

# JSON output
typedframes check src/ --json
```

### Option 2: Mypy Plugin (Comprehensive)

Add to `pyproject.toml`:

```toml
[tool.mypy]
plugins = ["typedframes.mypy"]
```

Or to `mypy.ini`:

```ini
[mypy]
plugins = typedframes.mypy
```

Then run mypy:

```shell
mypy src/
```

**Features:**

- Full type checking across your codebase
- Catches column errors AND regular type errors
- IDE integration (VSCode, PyCharm)
- Works with existing mypy configuration

**Use this for:**

- Comprehensive type checking
- Integration with existing mypy setup
- IDE error highlighting

---

## Static Analysis Performance

Fast feedback reduces development time. The typedframes Rust binary provides near-instant column checking.

**Benchmark results** (10 runs, 3 warmup, caches cleared between runs):

| Tool               | Version | What it does                  | typedframes (11 files) | great_expectations (490 files) |
|--------------------|---------|-------------------------------|------------------------|--------------------------------|
| typedframes        | 0.1.0   | DataFrame column checker      | 961µs ±56µs            | 930µs ±89µs                    |
| ruff               | 0.15.0  | Linter (no type checking)     | 39ms ±12ms             | 360ms ±18ms                    |
| ty                 | 0.0.16  | Type checker                  | 146ms ±13ms            | 1.65s ±26ms                    |
| pyrefly            | 0.52.0  | Type checker                  | 152ms ±7ms             | 693ms ±33ms                    |
| mypy               | 1.19.1  | Type checker (no plugin)      | 9.15s ±218ms           | 12.13s ±400ms                  |
| mypy + typedframes | 1.19.1  | Type checker + column checker | 9.34s ±331ms           | 13.89s ±491ms                  |
| pyright            | 1.1.408 | Type checker                  | 2.34s ±335ms           | 8.37s ±253ms                   |

*Run `uv run python benchmarks/benchmark_checkers.py` to reproduce.*

The typedframes binary performs lexical column name resolution within a single file. It does not perform cross-file type
inference. Full type checkers (mypy, pyright, ty) analyze all Python types across your entire codebase. Use both: the
binary for fast iteration, mypy for comprehensive checking.
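As an illustration of what "lexical column name resolution" means, a few lines of stdlib `ast` can find unknown string subscripts in a single file. This is a conceptual sketch only; the real checker is Rust code built on `ruff_python_parser`, and the schema set here is hard-coded rather than parsed from a `BaseSchema` class:

```python
import ast

# In the real tool this set would come from the schema definition.
SCHEMA_COLUMNS = {"user_id", "email", "signup_date"}


def find_unknown_columns(source: str) -> list[str]:
    """Report string subscripts like df['col'] whose name is not in the schema."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
            and node.slice.value not in SCHEMA_COLUMNS
        ):
            errors.append(node.slice.value)
    return errors


print(find_unknown_columns("df['email']; df['username']"))  # -> ['username']
```

Because no cross-file type inference is involved, a check like this stays proportional to the size of the file being scanned.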

The standalone checker is built with [`ruff_python_parser`](https://github.com/astral-sh/ruff) for Python AST
parsing.

**Note:** ty (Astral) does not currently support mypy plugins, so use the standalone binary for column checking with ty.

---

## Comparison

### Feature Matrix (Static Analysis Focus)

Comprehensive comparison of pandas/DataFrame typing and validation tools. **typedframes focuses on static analysis** —
catching errors at lint-time before your code runs.

| Feature | typedframes | Pandera | Great Expectations | strictly_typed_pandas | pandas-stubs | dataenforce | pandas-type-checks | StaticFrame | narwhals |
|---------------------------------|------------------------|-------------|--------------------|-----------------------|--------------|-------------|--------------------|------------------|----------|
| **Version tested** | 0.1.0 | 0.29.0 | 1.4.3 | 0.3.6 | 3.0.0 | 0.1.2 | 1.1.3 | 3.7.0 | 2.16.0 |
| **Analysis Type** |
| When errors are caught | **Static (lint-time)** | Runtime | Runtime | Static + Runtime | Static | Runtime | Runtime | Static + Runtime | Runtime |
| **Static Analysis (our focus)** |
| Mypy plugin | ✅ Yes | ⚠️ Limited | ❌ No | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ⚠️ Basic | ❌ No |
| Standalone checker | ✅ Rust (~1ms) | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Column name checking | ✅ Yes | ⚠️ Limited | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
| Column type checking | ✅ Yes | ⚠️ Limited | ❌ No | ✅ Yes | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
| Typo suggestions | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| **Runtime Validation** |
| Data validation | ❌ No | ✅ Excellent | ✅ Excellent | ✅ typeguard | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No |
| Value constraints | ❌ No | ✅ Yes | ✅ Excellent | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Yes | ❌ No |
| **Schema Features** |
| Column grouping | ✅ ColumnGroup | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| Regex column matching | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No |
| **Backend Support** |
| Pandas | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes | ❌ Own | ✅ Yes |
| Polars | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ❌ Own | ✅ Yes |
| DuckDB, cuDF, etc. | ❌ No | ❌ No | ✅ Spark, SQL | ❌ No | ❌ No | ❌ No | ❌ No | ❌ No | ✅ Yes |
| **Project Status (Feb 2026)** |
| Active development | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ Low | ✅ Yes | ❌ Inactive | ⚠️ Low | ✅ Yes | ✅ Yes |

**Legend:** ✅ Full support | ⚠️ Limited/Partial | ❌ Not supported

### Tool Descriptions

- **[Pandera](https://pandera.readthedocs.io/)** (v0.29.0): Excellent runtime validation. Static analysis support exists
  but has limitations—column access via `df["column"]` is not validated, and schema mismatches between functions may not
  be caught.

- **[strictly_typed_pandas](https://strictly-typed-pandas.readthedocs.io/)** (v0.3.6): Provides `DataSet[Schema]` type
  hints with mypy support. No standalone checker. No polars support. Runtime validation via typeguard.

- **[pandas-stubs](https://github.com/pandas-dev/pandas-stubs)** (v3.0.0): Official pandas type stubs. Provides
  API-level types but no column-level checking.

- **[dataenforce](https://github.com/CedricFR/dataenforce)** (v0.1.2): Runtime validation via decorator. Marked as
  experimental/not production-ready. Appears inactive.

- **[pandas-type-checks](https://pypi.org/project/pandas-type-checks/)** (v1.1.3): Runtime validation decorator. No
  static analysis.

- **[StaticFrame](https://github.com/static-frame/static-frame)** (v3.7.0): Alternative immutable DataFrame library with
  built-in static typing. Not compatible with pandas/polars—requires using StaticFrame's own DataFrame implementation.

- **[narwhals](https://narwhals-dev.github.io/narwhals/)** (v2.16.0): Compatibility layer that provides a unified API
  across pandas, polars, DuckDB, cuDF, and more. Solves a different problem—write-once-run-anywhere portability, not
  type safety. See [Why Abstraction Layers Don't Solve Type Safety](#why-abstraction-layers-dont-solve-type-safety)
  below.

- **[Great Expectations](https://greatexpectations.io/)** (v1.4.3): Comprehensive data quality framework. Defines
  "expectations" (assertions) about data values, distributions, and schema properties. Excellent for runtime
  validation, data documentation, and data quality monitoring. No static analysis or column-level type checking in
  code. Supports pandas, Spark, and SQL backends.

### Type Checkers (Not DataFrame-Specific)

These are general Python type checkers. They don't validate DataFrame column names, but they can be used alongside
typedframes for comprehensive type checking:

- **[mypy](https://mypy-lang.org/)** (v1.19.1): The original Python type checker. typedframes provides a mypy plugin for
  column checking. See [performance benchmarks](#static-analysis-performance).

- **[ty](https://github.com/astral-sh/ty)** (v0.0.16, Astral): New Rust-based type checker, 10-60x faster than mypy on
  large codebases. Does not support mypy plugins—use the typedframes standalone checker.

- **[pyrefly](https://pyrefly.org/)** (v0.52.0, Meta): Rust-based type checker from Meta, replacement for Pyre. Fast,
  but no DataFrame column checking.

- **[pyright](https://github.com/microsoft/pyright)** (v1.1.408, Microsoft): Type checker powering Pylance/VSCode. No
  mypy plugin support—use the typedframes standalone checker.

### Not Directly Comparable

These tools serve different purposes:

- **[pandas_lint](https://github.com/Jean-EstevezT/pandas_lint)**: Lints pandas code patterns (performance, best
  practices). Does not check column names/types.
- **[pandas-vet](https://github.com/deppen8/pandas-vet)**: Flake8 plugin for pandas best practices. Does not check
  column names/types.

### When to Use What

| Use Case | Recommended Tool |
|------------------------------------------------------|-------------------------------------|
| Static column checking (existing pandas/polars) | **typedframes** |
| Runtime data validation | Pandera |
| Both static + runtime | typedframes + `to_pandera_schema()` |
| Cross-library portability (write once, run anywhere) | narwhals |
| Data quality monitoring / pipeline validation | Great Expectations |
| Immutable DataFrames from scratch | StaticFrame |
| Pandas API type hints only | pandas-stubs |

---

## Type Safety With Multiple Backends

typedframes uses **explicit backend types** to ensure complete type safety:

```python
from typedframes import BaseSchema, Column
from typedframes.pandas import PandasFrame
from typedframes.polars import PolarsFrame


class UserData(BaseSchema):
    user_id = Column(type=int)
    email = Column(type=str)


# Pandas pipeline - type checker knows pandas methods
def pandas_analyze(df: PandasFrame[UserData]) -> PandasFrame[UserData]:
    return df[df[UserData.user_id] > 100]  # ✓ Pandas syntax


# Polars pipeline - type checker knows polars methods
def polars_analyze(df: PolarsFrame[UserData]) -> PolarsFrame[UserData]:
    return df.filter(UserData.user_id.col > 100)  # ✓ Polars syntax


# Type checker prevents mixing backends
df_pandas = PandasFrame.read_csv("data.csv", UserData)
df_polars = PolarsFrame.read_csv("data.csv", UserData)

pandas_analyze(df_pandas)  # ✓ OK
polars_analyze(df_polars)  # ✓ OK
pandas_analyze(df_polars)  # ✗ Type error: Expected PandasFrame, got PolarsFrame
```

---

## Features

### Clean Schema Definition

```python
from typedframes import BaseSchema, Column, ColumnSet, ColumnGroup


class TimeSeriesData(BaseSchema):
    timestamp = Column(type=str)
    temperature = ColumnSet(type=float, members=r"temp_sensor_\d+", regex=True)
    pressure = ColumnSet(type=float, members=r"pressure_\d+", regex=True)

    # Logical grouping
    sensors = ColumnGroup(members=[temperature, pressure])
```

### Beautiful Runtime API

```python
from typedframes.pandas import PandasFrame

df = PandasFrame.read_csv("sensors.csv", TimeSeriesData)

# Access column groups as DataFrames
temps = df[TimeSeriesData.temperature]    # All temp_sensor_* columns
all_sensors = df[TimeSeriesData.sensors]  # All sensor columns

# Clean operations
avg_temp = df[TimeSeriesData.temperature].mean()
max_pressure = df[TimeSeriesData.pressure].max()

# Standard pandas access still works
df['timestamp']                     # Single column
df[['timestamp', 'temp_sensor_1']]  # Multi-column select
```

### Column-Level Static Checking

```python
from typedframes.pandas import PandasFrame


def daily_summary(data: PandasFrame[TimeSeriesData]) -> PandasFrame[DailySummary]:
    # Type checker validates column access
    data['timestamp']  # ✓ OK - column exists
    data['date']       # ✗ Error: Column 'date' not in TimeSeriesData

    # Type checker validates ColumnSet access
    temps = data[TimeSeriesData.temperature]  # ✓ OK - ColumnSet exists
    summary = temps.mean()
    return summary
```

### Dynamic Column Matching

Perfect for time-series data where column counts change. Regex ColumnSets match actual DataFrame columns at
`from_schema()` time — the matched columns are stored and used when you access `df[Schema.sensors]`. Static analysis
validates that the ColumnSet *name* exists in the schema, but cannot verify which columns the regex will match at
runtime.

```python
class SensorReadings(BaseSchema):
    timestamp = Column(type=str)
    # Matches: sensor_1, sensor_2, ..., sensor_N
    sensors = ColumnSet(type=float, members=r"sensor_\d+", regex=True)


# Works regardless of how many sensor columns exist
df = PandasFrame.read_csv("readings_2024_01.csv", SensorReadings)  # 50 sensors
df[SensorReadings.sensors].mean()  # All sensor columns

df = PandasFrame.read_csv("readings_2024_02.csv", SensorReadings)  # 75 sensors
df[SensorReadings.sensors].mean()  # All sensor columns (different count, same code)
```
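Under the hood, resolving a regex ColumnSet amounts to filtering the actual column names against the pattern. A plain-Python sketch of the concept (hypothetical helper, not the typedframes internals):

```python
import re


def match_columns(pattern: str, actual_columns: list[str]) -> list[str]:
    """Resolve a regex ColumnSet against the columns present in a loaded file."""
    # fullmatch so that e.g. "sensor_notes" does not match r"sensor_\d+"
    return [c for c in actual_columns if re.fullmatch(pattern, c)]


cols = ["timestamp", "sensor_1", "sensor_2", "sensor_notes"]
print(match_columns(r"sensor_\d+", cols))  # -> ['sensor_1', 'sensor_2']
```

This is why the same pipeline code works whether a file has 50 or 75 sensor columns: the set of matched names is recomputed per file.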

---

## Advanced Usage

### Merges, Joins, and Filters

Schema-typed DataFrames preserve their type through common operations:

**Pandas:**

```python
from typedframes import BaseSchema, Column
from typedframes.pandas import PandasFrame
import pandas as pd


class UserSchema(BaseSchema):
    user_id = Column(type=int)
    email = Column(type=str)


class OrderSchema(BaseSchema):
    order_id = Column(type=int)
    user_id = Column(type=int)
    total = Column(type=float)


# Schema preserved through filtering
def get_active_users(df: PandasFrame[UserSchema]) -> PandasFrame[UserSchema]:
    return df[df[UserSchema.user_id] > 100]  # ✓ Still PandasFrame[UserSchema]


# Schema preserved through merges
users: PandasFrame[UserSchema] = ...
orders: PandasFrame[OrderSchema] = ...
merged = users.merge(orders, on=str(UserSchema.user_id))
```

**Polars:**

```python
from typedframes.polars import PolarsFrame
import polars as pl


# Schema columns work in filter expressions
def filter_users(df: PolarsFrame[UserSchema]) -> pl.DataFrame:
    return df.filter(UserSchema.user_id.col > 100)


# Schema columns work in join expressions
def join_data(
    users: PolarsFrame[UserSchema],
    orders: PolarsFrame[OrderSchema]
) -> pl.DataFrame:
    return users.join(
        orders,
        left_on=UserSchema.user_id.col,
        right_on=OrderSchema.user_id.col
    )


# Schema columns work in select expressions
def select_columns(df: PolarsFrame[UserSchema]) -> pl.DataFrame:
    return df.select([UserSchema.user_id.col, UserSchema.email.col])
```

### Schema Composition

Compose upward — build bigger schemas from smaller ones via inheritance. Type checkers see all columns natively.

```python
from typedframes import BaseSchema, Column
from typedframes.pandas import PandasFrame


# Start with the smallest useful schema
class UserPublic(BaseSchema):
    user_id = Column(type=int)
    email = Column(type=str)
    name = Column(type=str)


# Extend it — never strip down
class UserFull(UserPublic):
    password_hash = Column(type=str)


class Orders(BaseSchema):
    order_id = Column(type=int)
    user_id = Column(type=int)
    total = Column(type=float)


# Combine via multiple inheritance
class UserOrders(UserPublic, Orders):
    """Type checkers see all columns from both parents."""
    ...


# Or use the + operator
UserOrdersDynamic = UserPublic + Orders

users: PandasFrame[UserPublic] = ...
orders: PandasFrame[Orders] = ...
merged: PandasFrame[UserOrders] = PandasFrame.from_schema(
    users.merge(orders, on=str(UserPublic.user_id)), UserOrders
)
```

Overlapping columns with the same type are allowed (common after merges). Conflicting types raise `SchemaConflictError`.
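Conceptually, combining two schemas behaves like a dict union that accepts agreeing duplicates and rejects type conflicts. A toy sketch of that rule (hypothetical `combine`/`SchemaConflict` names standing in for the real library's behaviour):

```python
class SchemaConflict(Exception):
    """Stand-in for typedframes' SchemaConflictError."""


def combine(left: dict[str, type], right: dict[str, type]) -> dict[str, type]:
    """Merge two column->dtype mappings; same name is fine only if types agree."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] is not dtype:
            raise SchemaConflict(
                f"Column {name!r}: {merged[name].__name__} vs {dtype.__name__}"
            )
        merged[name] = dtype
    return merged


users = {"user_id": int, "email": str}
orders = {"order_id": int, "user_id": int, "total": float}
# user_id appears in both with type int, so the union succeeds.
print(sorted(combine(users, orders)))  # -> ['email', 'order_id', 'total', 'user_id']
```

Declaring `user_id` as `str` in one schema and `int` in the other would trip the conflict branch instead.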

See [`examples/schema_algebra_example.py`](examples/schema_algebra_example.py) for a complete walkthrough.

---

## Pandera Integration

Convert typedframes schemas to [Pandera](https://pandera.readthedocs.io/) schemas for runtime validation. Define your
schema once, get both static and runtime checking.

```shell
pip install typedframes[pandera]
```

```python
from typedframes import BaseSchema, Column
from typedframes.pandera import to_pandera_schema
import pandas as pd


class UserData(BaseSchema):
    user_id = Column(type=int)
    email = Column(type=str)
    age = Column(type=int, nullable=True)


# Convert to pandera schema
pandera_schema = to_pandera_schema(UserData)

# Validate data at runtime
df = pd.read_csv("users.csv")
validated_df = pandera_schema.validate(df)  # Raises SchemaError on failure
```

The conversion maps:

- `Column` type/nullable/alias to `pa.Column` dtype/nullable/name
- `ColumnSet` with explicit members to individual `pa.Column` entries
- `ColumnSet` with regex to `pa.Column(regex=True)`
- `allow_extra_columns` to pandera's `strict` mode
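As a rough picture, the conversion walks the declared fields and emits one column spec per field. A toy sketch using plain dicts in place of real `pa.Column` objects (hypothetical helper, not the typedframes implementation; see the pandera docs for the actual constructor arguments):

```python
def to_column_specs(fields: dict[str, dict]) -> dict[str, dict]:
    """Toy version of the mapping: one spec per declared field."""
    specs = {}
    for name, field in fields.items():
        specs[name] = {
            "dtype": field["type"],
            # Defaults mirror the common case: required, non-regex columns.
            "nullable": field.get("nullable", False),
            "regex": field.get("regex", False),
        }
    return specs


fields = {
    "user_id": {"type": int},
    "age": {"type": int, "nullable": True},
    r"metric_\d+": {"type": float, "regex": True},
}
print(to_column_specs(fields)["age"]["nullable"])  # -> True
```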

---
|
|
635
|
+
|
|
636
|
+
## Examples

### Basic CSV Processing

```python
from typedframes import BaseSchema, Column
from typedframes.pandas import PandasFrame


class Orders(BaseSchema):
    order_id = Column(type=int)
    customer_id = Column(type=int)
    total = Column(type=float)
    date = Column(type=str)


def calculate_revenue(orders: PandasFrame[Orders]) -> float:
    return orders[Orders.total].sum()


df = PandasFrame.read_csv("orders.csv", Orders)
revenue = calculate_revenue(df)
```

### Time Series Analysis

```python
from typedframes import BaseSchema, Column, ColumnSet, ColumnGroup
from typedframes.pandas import PandasFrame


class SensorData(BaseSchema):
    timestamp = Column(type=str)
    temperature = ColumnSet(type=float, members=r"temp_\d+", regex=True)
    humidity = ColumnSet(type=float, members=r"humidity_\d+", regex=True)

    all_sensors = ColumnGroup(members=[temperature, humidity])


df = PandasFrame.read_csv("sensors.csv", SensorData)

# Clean, type-safe operations
avg_temp_per_row = df[SensorData.temperature].mean(axis=1)
all_readings_stats = df[SensorData.all_sensors].describe()
```

### Multi-Step Pipeline

```python
from typedframes import BaseSchema, Column
from typedframes.pandas import PandasFrame


class RawSales(BaseSchema):
    date = Column(type=str)
    product_id = Column(type=int)
    quantity = Column(type=int)
    price = Column(type=float)


class AggregatedSales(BaseSchema):
    date = Column(type=str)
    total_revenue = Column(type=float)
    total_quantity = Column(type=int)


def aggregate_daily(df: PandasFrame[RawSales]) -> PandasFrame[AggregatedSales]:
    result = df.groupby(RawSales.date).agg({
        str(RawSales.price): 'sum',
        str(RawSales.quantity): 'sum',
    }).reset_index()

    result.columns = ['date', 'total_revenue', 'total_quantity']
    return PandasFrame.from_schema(result, AggregatedSales)


# Type-safe pipeline
raw = PandasFrame.read_csv("sales.csv", RawSales)
aggregated = aggregate_daily(raw)


# Type checker validates schema transformations
def analyze(df: PandasFrame[AggregatedSales]) -> float:
    df[AggregatedSales.total_revenue]  # ✓ OK
    df['price']  # ✗ Error: 'price' not in AggregatedSales
    return df[AggregatedSales.total_revenue].mean()
```
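
The pipeline above passes `str(RawSales.price)` to pandas' `agg`, which relies on a `Column` stringifying to its column name. A minimal, hypothetical sketch of that descriptor pattern (this is an illustration, not typedframes' actual implementation):

```python
# Hypothetical sketch: a Column-like descriptor whose str() is the column name.
# Not typedframes' real implementation, only an illustration of the pattern.


class Column:
    def __init__(self, type: type) -> None:
        self.type = type
        self.name = None

    def __set_name__(self, owner: type, name: str) -> None:
        # Called by Python while the class body is executed; records the
        # attribute name the descriptor was bound to.
        self.name = name

    def __str__(self) -> str:
        return self.name


class RawSales:
    price = Column(type=float)
    quantity = Column(type=int)


print(str(RawSales.price))     # price
print(str(RawSales.quantity))  # quantity
```

Because the descriptor knows its own name, schema attributes can be handed to any pandas API that expects plain string column labels.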

### Polars Performance Pipeline

```python
from typedframes import BaseSchema, Column
from typedframes.polars import PolarsFrame
import polars as pl


class LargeDataset(BaseSchema):
    id = Column(type=int)
    value = Column(type=float)
    category = Column(type=str)


def efficient_aggregation(df: PolarsFrame[LargeDataset]) -> pl.DataFrame:
    return (
        df.filter(LargeDataset.value.col > 100)
        .group_by('category')
        .agg(LargeDataset.value.col.mean())
    )


# Polars handles large files efficiently
df = PolarsFrame.read_csv("huge_file.csv", LargeDataset)
result = efficient_aggregation(df)
```

---

## Philosophy

### Type Safety Over Validation

We believe static analysis catches bugs earlier and more cheaply than runtime validation.

**typedframes focuses on:**
- ✅ Catching errors at lint-time
- ✅ Zero runtime overhead
- ✅ Developer experience

**We explicitly don't focus on:**
- ❌ Runtime data validation (use Pandera)
- ❌ Statistical checks (use Pandera)
- ❌ Data quality monitoring (use Great Expectations)

**Important:** `PandasFrame.from_schema()` is a *trust assertion*, not a validation step. It tells the type checker "this DataFrame conforms to this schema" without verifying the actual data. The linter catches mistakes in your code (wrong column names, schema mismatches between functions), but it cannot verify that a CSV file contains the expected columns. For runtime validation of external data, use [`to_pandera_schema()`](#pandera-integration) to convert your typedframes schemas to Pandera schemas.

### Explicit Backend Types

We use explicit `PandasFrame` and `PolarsFrame` types because:
- Pandas and polars have different APIs
- Type safety requires knowing which methods are available
- Being explicit prevents bugs

**Trade-offs we avoid:**
- ❌ "Universal DataFrame" abstractions (you lose features)
- ❌ Implicit backend detection (runtime errors)
- ❌ Lowest-common-denominator APIs

The reason to choose polars over pandas is its lazy evaluation, native parallelism, and expressive query syntax. Abstraction layers often must expose a lowest-common-denominator API. By using explicit backend types instead, typedframes lets you use each library's full, native API while still getting schema-level type safety.

### Why Abstraction Layers Don't Solve Type Safety

Tools like [narwhals](https://narwhals-dev.github.io/narwhals/) solve a different problem: writing portable code that runs on pandas, polars, DuckDB, cuDF, and other backends. This is useful for library authors who want to support multiple backends without maintaining separate codebases.

However, abstraction layers don't provide column-level type safety:

```python
import narwhals as nw

def process(df: nw.DataFrame) -> nw.DataFrame:
    # No static checking - the typo won't be caught until runtime
    return df.filter(nw.col("revnue") > 100)  # Typo: "revnue" vs "revenue"
```

**The fundamental issue:** Abstraction layers abstract over *which library* you're using, not *what columns* your data has. They can't know at lint-time whether "revenue" is a valid column in your DataFrame.

typedframes solves the orthogonal problem of schema safety:

```python
from typedframes import BaseSchema, Column
from typedframes.polars import PolarsFrame

class SalesData(BaseSchema):
    revenue = Column(type=float)

def process(df: PolarsFrame[SalesData]) -> PolarsFrame[SalesData]:
    return df.filter(df['revnue'] > 100)  # ✗ Error at lint-time: 'revnue' not in SalesData
```

**Use narwhals when:** You're writing a library that needs to work with multiple DataFrame backends.

**Use typedframes when:** You want to catch column name/type errors before your code runs.

### Why No Built-in Validation?

Ideally, validation happens at the point of data ingestion rather than in Python application code. If you're validating DataFrames in Python, consider whether your data pipeline could enforce constraints earlier. Use Pandera for cases where runtime validation is genuinely necessary.

---

## License

MIT License - see [LICENSE](LICENSE)

---

## Roadmap

**Shipped:**
- [x] Schema definition API
- [x] Pandas support
- [x] Polars support
- [x] Mypy plugin
- [x] Standalone checker (Rust)
- [x] Explicit backend types
- [x] Merge/join schema preservation
- [x] Schema composition (multiple inheritance, `SchemaA + SchemaB`)
- [x] Column name collision warnings
- [x] Pandera integration (`to_pandera_schema()`)
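
The `SchemaA + SchemaB` composition listed above can be sketched with a metaclass that implements `__add__`. This is a hypothetical illustration of how such an operator could merge column sets, not typedframes' actual implementation (`schema_columns`, `SchemaMeta`, and the type-valued placeholders are all made up here):

```python
# Hypothetical sketch of `SchemaA + SchemaB` composition -- not typedframes'
# real code, just one way the operator could work.


def schema_columns(schema: type) -> dict:
    # Collect public class attributes across the MRO (base-most first, so
    # subclasses can override inherited columns).
    out = {}
    for klass in reversed(schema.__mro__):
        for name, value in vars(klass).items():
            if not name.startswith("_"):
                out[name] = value
    return out


class SchemaMeta(type):
    def __add__(cls, other: type) -> type:
        # SchemaA + SchemaB -> a new schema inheriting from both, so its
        # columns are the union of both column sets.
        return SchemaMeta(f"{cls.__name__}_{other.__name__}", (cls, other), {})


class BaseSchema(metaclass=SchemaMeta):
    pass


class Ids(BaseSchema):
    user_id = int  # placeholder standing in for Column(type=int)


class Metrics(BaseSchema):
    revenue = float  # placeholder standing in for Column(type=float)


Combined = Ids + Metrics
print(sorted(schema_columns(Combined)))  # ['revenue', 'user_id']
```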

**Planned:**

- [ ] **Opt-in data loading constraints** - `Field` class with constraints (`gt`, `ge`, `lt`, `le`), strictly isolated to `from_schema()` ingestion boundaries

---

## FAQ

**Q: Do I need to choose between pandas and polars?**
A: No. Define your schema once and use it with both. Just use the appropriate type (`PandasFrame` or `PolarsFrame`) in your function signatures.

**Q: Does this replace Pandera?**
A: No, it complements it. Use typedframes for static analysis, and `to_pandera_schema()` to convert your schemas to Pandera for runtime validation. See [Pandera Integration](#pandera-integration).

**Q: Is the standalone checker required?**
A: No. You can use just the mypy plugin, just the standalone checker, or both. They catch the same errors.

**Q: What works without any plugin?**
A: The `__getitem__` overloads on `PandasFrame` mean that any type checker (mypy, pyright, ty) understands that `df[Schema.column]` returns `pd.Series` and `df[Schema.column_set]` returns `pd.DataFrame` — no plugin or stubs needed. Column *name* validation (catching typos like `df["revnue"]` in string-based access) still requires the standalone checker or mypy plugin.
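
The overload pattern this answer describes can be sketched with `typing.overload`. This is a toy illustration of the technique, not typedframes' actual code; `Frame` and the stand-in `Column`/`ColumnSet`/`Series`/`DataFrame` classes are invented for the example:

```python
from typing import overload

# Toy sketch of the __getitem__ overload pattern: the key's type selects the
# static return type, with no plugin required.


class Column: ...


class ColumnSet: ...


class Series: ...


class DataFrame: ...


class Frame:
    @overload
    def __getitem__(self, key: Column) -> Series: ...
    @overload
    def __getitem__(self, key: ColumnSet) -> DataFrame: ...

    def __getitem__(self, key):
        # Runtime dispatch mirrors the static overloads above.
        return Series() if isinstance(key, Column) else DataFrame()


frame = Frame()
print(type(frame[Column()]).__name__)     # Series
print(type(frame[ColumnSet()]).__name__)  # DataFrame
```

Because the overloads live on the class itself, any standards-conforming type checker resolves the return type from the key's type alone.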

**Q: What about pyright/pylance users?**
A: The mypy plugin doesn't work with pyright. Use the standalone checker (`typedframes check`) for column name validation. Schema descriptor access (`df[Schema.column]`) works natively in pyright without any plugin.

**Q: Does this work with existing pandas/polars code?**
A: Yes. You can gradually adopt typedframes by adding schemas to new code. Existing code continues to work.

**Q: What if my column name conflicts with a pandas/polars method?**
A: No problem. Since column access uses bracket syntax with schema descriptors (`df[Schema.mean]`), there is no conflict with DataFrame methods (`df.mean()`). Both work independently.
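
The point above can be demonstrated with a toy frame: a schema column literally named `mean` coexists with a `mean()` method because one is reached through brackets and the other through attribute access. `TinyFrame`, `Col`, and `Schema` are hypothetical stand-ins, not typedframes code:

```python
# Toy sketch: bracket access via a descriptor never collides with a method of
# the same name. Nothing here is real typedframes code.


class Col:
    def __set_name__(self, owner: type, name: str) -> None:
        self.name = name


class Schema:
    mean = Col()  # a column named "mean"


class TinyFrame:
    def __init__(self, data: dict) -> None:
        self._data = data

    def __getitem__(self, col: Col) -> list:
        return self._data[col.name]  # bracket access goes through the descriptor

    def mean(self) -> dict:
        return {k: sum(v) / len(v) for k, v in self._data.items()}


df = TinyFrame({"mean": [1.0, 2.0, 3.0]})
print(df[Schema.mean])  # [1.0, 2.0, 3.0] -- the column
print(df.mean())        # {'mean': 2.0} -- the method
```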

---

## Credits

Built by developers who believe DataFrame bugs should be caught at lint-time, not in production.

Inspired by the needs of ML/data science teams working with complex data pipelines.

---

**Questions? Issues? Ideas?** [Open an issue](https://github.com/yourusername/typedframes/issues)

**Ready to catch DataFrame bugs before runtime?** `pip install typedframes`