jano-0.3.1.tar.gz

jano-0.3.1/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020, Marcos Manuel Muraro
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
6
+ this software and associated documentation files (the "Software"), to deal in
7
+ the Software without restriction, including without limitation the rights to
8
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9
+ the Software, and to permit persons to whom the Software is furnished to do so,
10
+ subject to the following conditions:
11
+
12
+ 1. The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
jano-0.3.1/PKG-INFO ADDED
@@ -0,0 +1,522 @@
1
+ Metadata-Version: 2.4
2
+ Name: jano
3
+ Version: 0.3.1
4
+ Summary: Temporal partitioning and backtesting utilities for time-correlated datasets.
5
+ Author-email: Marcos Manuel Muraro <mmmuraro@gmail.com>
6
+ Maintainer-email: Marcos Manuel Muraro <mmmuraro@gmail.com>
7
+ License-Expression: MIT
8
+ Project-URL: Homepage, https://github.com/marmurar/jano
9
+ Project-URL: Documentation, https://marmurar.github.io/jano/
10
+ Project-URL: Repository, https://github.com/marmurar/jano
11
+ Project-URL: Issues, https://github.com/marmurar/jano/issues
12
+ Project-URL: Changelog, https://github.com/marmurar/jano/releases
13
+ Keywords: backtesting,time-series,walk-forward-validation,model-selection,temporal-validation,pandas,time-series-validation,simulation
14
+ Classifier: Development Status :: 3 - Alpha
15
+ Classifier: Intended Audience :: Developers
16
+ Classifier: Intended Audience :: Science/Research
17
+ Classifier: Operating System :: OS Independent
18
+ Classifier: Programming Language :: Python :: 3
19
+ Classifier: Programming Language :: Python :: 3 :: Only
20
+ Classifier: Programming Language :: Python :: 3.9
21
+ Classifier: Programming Language :: Python :: 3.10
22
+ Classifier: Programming Language :: Python :: 3.11
23
+ Classifier: Programming Language :: Python :: 3.12
24
+ Classifier: Topic :: Scientific/Engineering
25
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
26
+ Classifier: Topic :: Software Development :: Testing
27
+ Requires-Python: >=3.9
28
+ Description-Content-Type: text/markdown
29
+ License-File: LICENSE.txt
30
+ Requires-Dist: pandas>=2.0
31
+ Requires-Dist: numpy>=1.24
32
+ Provides-Extra: mcp
33
+ Requires-Dist: mcp[cli]<2,>=1.0; python_version >= "3.10" and extra == "mcp"
34
+ Provides-Extra: polars
35
+ Requires-Dist: polars>=1.0; extra == "polars"
36
+ Provides-Extra: dev
37
+ Requires-Dist: build>=1.2; extra == "dev"
38
+ Requires-Dist: mcp[cli]<2,>=1.0; python_version >= "3.10" and extra == "dev"
39
+ Requires-Dist: polars>=1.0; extra == "dev"
40
+ Requires-Dist: pytest>=8.0; extra == "dev"
41
+ Requires-Dist: pytest-cov>=5.0; extra == "dev"
42
+ Requires-Dist: sphinx>=7.4; extra == "dev"
43
+ Requires-Dist: twine>=5.1; extra == "dev"
44
+ Dynamic: license-file
45
+
46
+ # Jano
47
+
48
+ <p align="center">
49
+ <img src="https://raw.githubusercontent.com/marmurar/jano/master/imgs/jano_logo.png" alt="Jano logo" width="260" />
50
+ </p>
51
+
52
+ [![CI](https://github.com/marmurar/jano/actions/workflows/ci.yml/badge.svg)](https://github.com/marmurar/jano/actions/workflows/ci.yml)
53
+ [![Docs](https://github.com/marmurar/jano/actions/workflows/docs.yml/badge.svg)](https://github.com/marmurar/jano/actions/workflows/docs.yml)
54
+ [![codecov](https://codecov.io/gh/marmurar/jano/graph/badge.svg)](https://codecov.io/gh/marmurar/jano)
55
+
56
+ Jano is a Python library for defining temporal partitions and backtesting schemes over time-correlated datasets.
57
+
58
+ The missing layer between ML models and production temporal validation.
59
+
60
+ Documentation: [marmurar.github.io/jano](https://marmurar.github.io/jano/)
61
+
62
+ It is designed for cases where a plain `train_test_split()` is not enough: transactional data, production simulations, repeated retraining, walk-forward validation, model monitoring, rule evaluation, or any experiment where the ordering of time matters.
63
+
64
+ The core accepts `pandas.DataFrame`, `numpy.ndarray` and `polars.DataFrame` inputs. `pandas` remains the internal execution engine, while NumPy and Polars inputs are normalized at the boundary so the split/reporting API stays consistent.
65
+
66
+ The project is named after Janus, the Roman god of beginnings, transitions and thresholds. That framing fits the library well: Jano helps define how a dataset moves from training periods into evaluation periods, fold after fold.
67
+
68
+ ## MCP server
69
+
70
+ Jano also ships an optional local MCP server so AI agents can use the library through a small, explicit tool surface instead of generating Python ad hoc.
71
+
72
+ Current MCP tools:
73
+
74
+ - `preview_local_dataset`
75
+ - `plan_walk_forward_simulation`
76
+ - `run_walk_forward_simulation`
77
+
78
+ Install it in a Python 3.10+ environment:
79
+
80
+ ```bash
81
+ python -m pip install "jano[mcp]"
82
+ ```
83
+
84
+ Run it locally over stdio:
85
+
86
+ ```bash
87
+ jano-mcp
88
+ ```
89
+
90
+ Or use the module entrypoint:
91
+
92
+ ```bash
93
+ python -m jano.mcp_server
94
+ ```
95
+
96
+ Example MCP client configuration:
97
+
98
+ ```json
99
+ {
100
+ "mcpServers": {
101
+ "jano": {
102
+ "command": "jano-mcp"
103
+ }
104
+ }
105
+ }
106
+ ```
107
+
108
+ The MCP layer is intentionally opinionated: it exposes planning and walk-forward simulation first, while the full Python library remains available when you need custom composition.
109
+
110
+ This is meant for MCP-aware coding assistants such as Claude Code, Claude Desktop, Cursor, Codex runtimes with MCP support, and other local agent environments. The server runs locally and reads only the file paths you provide to its tools; Jano does not upload datasets anywhere by itself.
111
+
112
+ ## Why Jano exists
113
+
114
+ Many machine learning datasets are not just tabular; they are structured over time and often across multiple entities such as users, routes, sellers or products. In those settings, a more faithful view of the data is not "a bag of independent rows" but a temporally ordered process.
115
+
116
+ Standard evaluation tooling usually assumes observations are i.i.d. enough that a static split is acceptable. That assumption breaks quickly when time matters: future information leaks into training, performance estimates become optimistic, and offline validation stops reflecting what really happens in production.
117
+
118
+ Most train/test utilities answer a simple question:
119
+
120
+ "How do I split this dataset once?"
121
+
122
+ Jano is meant to answer a richer one:
123
+
124
+ "How would this system have behaved over time if I had trained, retrained and evaluated it under a specific temporal policy?"
125
+
126
+ That difference is the core of the project. Jano treats evaluation as a temporal simulation rather than a static partition. Instead of defining one split, it defines a policy over time: train window, evaluation horizon, shift between iterations and optional leakage-control gaps. Running that policy produces a sequence of causally valid folds rather than one aggregate estimate.
127
+
128
+ That also makes drift easier to see in simulation results, because temporal shifts in behavior, performance or calibration become visible fold after fold.
129
+
130
+ That makes it useful not only for machine learning, but for any workflow where the data is time-dependent:
131
+
132
+ - Backtesting predictive models on transactional data.
133
+ - Simulating daily or weekly retraining in production.
134
+ - Comparing rolling versus expanding windows.
135
+ - Introducing explicit gaps between training and evaluation periods.
136
+ - Defining `train/test` or `train/validation/test` partitions with durations, row counts or percentages.
137
+ - Surfacing drift in simulation outcomes by making temporal changes explicit across folds.
138
+
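The policy described above (train window, evaluation horizon, shift between iterations, optional gap) can be sketched in plain pandas. This is only an illustration of the idea, not Jano's implementation: the helper name `walk_forward_windows`, its parameters and the half-open window convention are all assumptions made for this example.

```python
import pandas as pd


def walk_forward_windows(start, end, train, test, step, gap="0D", strategy="rolling"):
    """Return half-open (train_start, train_end, test_start, test_end) windows.

    `start` is inclusive and `end` is exclusive. Hypothetical helper for
    illustration only; it is not part of Jano's API.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    train, test, step, gap = (pd.Timedelta(v) for v in (train, test, step, gap))
    windows = []
    train_end = start + train
    # Keep adding folds while the whole test window still fits before `end`.
    while train_end + gap + test <= end:
        # Rolling keeps a fixed-length train window; expanding keeps the origin.
        train_start = start if strategy == "expanding" else train_end - train
        test_start = train_end + gap
        windows.append((train_start, train_end, test_start, test_start + test))
        train_end += step
    return windows


rolling = walk_forward_windows("2024-01-01", "2024-03-01", "30D", "1D", "1D")
expanding = walk_forward_windows(
    "2024-01-01", "2024-03-01", "30D", "1D", "1D", strategy="expanding"
)
gapped = walk_forward_windows("2024-01-01", "2024-03-01", "30D", "1D", "1D", gap="2D")
```

Every fold keeps its test window strictly after its train window, which is the causal property the section argues for. The gap variant simply pushes each test window further away from the end of training.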
139
+ ## Project direction
140
+
141
+ Jano is being reshaped as a small, explicit temporal partitioning toolkit with an interface inspired by `sklearn.model_selection`.
142
+
143
+ The design goals are:
144
+
145
+ - Clear, composable temporal partition definitions.
146
+ - Low hidden state and predictable behavior.
147
+ - Compatibility with pandas-first workflows.
148
+ - A splitter-style API that can evolve toward stronger scikit-learn interoperability.
149
+ - Rich split objects for inspection, auditability and simulation.
150
+
151
+ ## Current API
152
+
153
+ The recommended high-level surface is intentionally small:
154
+
155
+ - `WalkForwardPolicy` for production-like walk-forward evaluation,
156
+ - `TrainHistoryPolicy` for fixed-test, growing-train questions,
157
+ - `DriftMonitoringPolicy` for fixed-train, moving-test questions.
158
+
159
+ Those classes sit on top of the lower-level building blocks that remain available:
160
+
161
+ - `TemporalSimulation` for explicit simulation objects,
162
+ - `TemporalBacktestSplitter` for manual fold iteration,
163
+ - `TrainGrowthPolicy` and `PerformanceDecayPolicy` for lower-level temporal hypothesis primitives.
164
+
165
+ The workflow is intentionally compositional:
166
+
167
+ - start simple with predefined layouts and strategies,
168
+ - move to `plan()` when you want to inspect or filter iterations before running them,
169
+ - use the small policy surface when the question is already encapsulated,
170
+ - and fall back to manual fold iteration when you want to compose everything yourself: partitions, gaps, feature history and model training logic.
171
+
179
+ It supports:
180
+
181
+ - `single`, `rolling` and `expanding` strategies.
182
+ - `train_test` and `train_val_test` layouts.
183
+ - Segment sizes defined as durations like `"30D"`, row counts like `5000`, or fractions like `0.7`.
184
+ - Calendar-aligned duration windows with `calendar_frequency="D"` when you want complete days instead of elapsed-time windows anchored at the first timestamp.
185
+ - Optional gaps before validation or test segments.
186
+ - Plain index output through `split()`.
187
+ - Rich fold objects through `iter_splits()`.
188
+ - Simulation summaries, HTML timeline reports and plot-ready chart data through `describe_simulation()`.
189
+ - An adaptive partition engine that keeps pandas, NumPy and Polars inputs native for planning when it is safe, and falls back to pandas when stability is more important.
190
+
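The difference between an elapsed-time window and a calendar-aligned one can be shown with plain pandas timestamps. The sketch below assumes that alignment means flooring the anchor to the start of its day; how Jano actually aligns windows when `calendar_frequency="D"` is set is not specified here, so treat the flooring as an assumption.

```python
import pandas as pd

first_timestamp = pd.Timestamp("2024-01-01 14:30")
train_size = pd.Timedelta("2D")

# Elapsed-time window: anchored at the first observed timestamp.
elapsed_window = (first_timestamp, first_timestamp + train_size)

# Calendar-aligned window: anchored at midnight of the first day
# (flooring is an assumption made for this illustration).
aligned_start = first_timestamp.floor("D")
calendar_window = (aligned_start, aligned_start + train_size)

print(elapsed_window)
print(calendar_window)
```

With the elapsed-time anchor, every boundary falls at 14:30 and each day is cut in two. With the aligned anchor, the window covers two complete calendar days.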
191
+ ## Example: run a full simulation without manual iteration
192
+
193
+ ```python
194
+ import pandas as pd
195
+
196
+ from jano import TemporalPartitionSpec, WalkForwardPolicy
197
+
198
+ frame = pd.DataFrame(
199
+ {
200
+ "timestamp": pd.date_range("2024-01-01", periods=60, freq="D"),
201
+ "feature": range(60),
202
+ "target": range(100, 160),
203
+ }
204
+ )
205
+
206
+ policy = WalkForwardPolicy(
207
+ time_col="timestamp",
208
+ partition=TemporalPartitionSpec(
209
+ layout="train_test",
210
+ train_size="30D",
211
+ test_size="1D",
212
+ ),
213
+ step="1D",
214
+ strategy="rolling",
215
+ )
216
+
217
+ result = policy.run(frame, title="One month in production")
218
+
219
+ print(result.total_folds)
220
+ print(result.engine_metadata.to_dict())
221
+ print(result.summary.to_frame().head())
222
+ print(result.chart_data.segment_stats)
223
+ ```
224
+
225
+ By default, `engine="auto"` lets Jano choose the safest fast path for partitioning:
226
+ pandas inputs stay pandas, Polars inputs use Polars column extraction, and NumPy arrays
227
+ use array indexing. You can force a path with `engine="pandas"`, `engine="polars"` or
228
+ `engine="numpy"` when you need deterministic behavior for a pipeline.
229
+
230
+ If you want to inspect the full simulation geometry before materializing folds, plan it first:
231
+
232
+ ```python
233
+ plan = policy.plan(frame, title="One month in production")
234
+ print(plan.total_folds)
235
+ print(plan.to_frame().head())
236
+
237
+ filtered = plan.exclude_windows(
238
+ train=[("2025-12-20", "2026-01-05")],
239
+ ).select_from_iteration(5)
240
+
241
+ result = filtered.materialize()
242
+ ```
243
+
244
+ That plan frame includes the explicit iteration index, segment boundaries and row counts for each fold.
245
+
246
+ You can also anchor a simulation to a specific date and limit how many folds are materialized:
247
+
248
+ ```python
249
+ policy = WalkForwardPolicy(
250
+ time_col="timestamp",
251
+ partition=TemporalPartitionSpec(
252
+ layout="train_test",
253
+ train_size="15D",
254
+ test_size="4D",
255
+ ),
256
+ step="1D",
257
+ strategy="rolling",
258
+ start_at="2025-09-01",
259
+ max_folds=15,
260
+ )
261
+
262
+ result = policy.run(frame, title="15 daily retraining iterations")
263
+ ```
264
+
265
+ The recommended walk-forward surface also supports `end_at` when you want to constrain the simulation to a bounded time window before folds are generated.
266
+
267
+ When a single timestamp is not enough, `WalkForwardPolicy`, `TemporalSimulation` and `TemporalBacktestSplitter` can also receive a `TemporalSemanticsSpec`. That lets you keep one column as the reported timeline while using different timestamp columns to decide whether `train`, `validation` or `test` rows are actually eligible. This is useful for production-style leakage control, for example when a target only becomes available at `arrived_at` even if the operational timeline is anchored on `departured_at`.
268
+
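The leakage problem that motivates separate eligibility columns can be reproduced with plain pandas. The sketch below does not use `TemporalSemanticsSpec`; the data, the `delay_minutes` column and the window boundaries are made up for illustration, and only the column names `departured_at` and `arrived_at` come from the paragraph above.

```python
import pandas as pd

# Made-up trips: the target (delay) is only known once the trip has arrived.
trips = pd.DataFrame(
    {
        "departured_at": pd.to_datetime(
            ["2025-09-01 08:00", "2025-09-01 22:00", "2025-09-02 09:00"]
        ),
        "arrived_at": pd.to_datetime(
            ["2025-09-01 12:00", "2025-09-02 03:00", "2025-09-02 13:00"]
        ),
        "delay_minutes": [5, 42, 0],
    }
)

train_start = pd.Timestamp("2025-09-01")
train_end = pd.Timestamp("2025-09-02")  # exclusive training cutoff

# Rows that fall in the train window on the reported timeline.
in_timeline = trips["departured_at"].between(train_start, train_end, inclusive="left")
# Rows whose target was already observable at the cutoff.
label_known = trips["arrived_at"] < train_end

naive_train = trips[in_timeline]
eligible_train = trips[in_timeline & label_known]
```

The trip that departs at 22:00 sits inside the train window on the operational timeline, but its target only exists after the cutoff, so a production system could not have trained on it.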
269
+ For `numpy.ndarray` inputs, use integer column references:
270
+
271
+ ```python
272
+ import numpy as np
273
+
274
+ values = np.array(
275
+ [
276
+ ["2025-09-01", 1.2, 10],
277
+ ["2025-09-02", 1.5, 11],
278
+ ["2025-09-03", 1.1, 12],
279
+ ],
280
+ dtype=object,
281
+ )
282
+
283
+ splitter = TemporalBacktestSplitter(
284
+ time_col=0,
285
+ partition=TemporalPartitionSpec(
286
+ layout="train_test",
287
+ train_size="2D",
288
+ test_size="1D",
289
+ ),
290
+ step="1D",
291
+ strategy="single",
292
+ )
293
+ ```
294
+
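Integer column references follow naturally from how pandas labels columns when a 2-D array is wrapped in a `DataFrame`. The sketch below shows that conversion in plain pandas; it illustrates the idea of normalizing at the boundary and is not a copy of Jano's internal code.

```python
import numpy as np
import pandas as pd

values = np.array(
    [
        ["2025-09-01", 1.2, 10],
        ["2025-09-02", 1.5, 11],
        ["2025-09-03", 1.1, 12],
    ],
    dtype=object,
)

# Without explicit names, pandas labels the columns 0, 1, 2.
normalized = pd.DataFrame(values)
# The time column is therefore addressed by position.
normalized[0] = pd.to_datetime(normalized[0])

print(list(normalized.columns))
print(normalized.dtypes)
```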
295
+ ## Example: manual control with the low-level splitter
296
+
297
+ ```python
298
+ from jano import TemporalBacktestSplitter, TemporalPartitionSpec
299
+
300
+ splitter = TemporalBacktestSplitter(
301
+ time_col="timestamp",
302
+ partition=TemporalPartitionSpec(
303
+ layout="train_val_test",
304
+ train_size=0.6,
305
+ validation_size=0.2,
306
+ test_size=0.2,
307
+ ),
308
+ step=0.2,
309
+ strategy="single",
310
+ )
311
+
312
+ for split in splitter.iter_splits(frame):
313
+ print(split.summary())
314
+ ```
315
+
316
+ ## Example: keep the same test window and grow train backward
317
+
318
+ This is a special use case. It is useful when you want to study whether more training history really improves the same test slice.
319
+
320
+ ```python
321
+ from jano import TrainHistoryPolicy
322
+
323
+ policy = TrainHistoryPolicy(
324
+ "timestamp",
325
+ cutoff="2025-09-15",
326
+ train_sizes=["7D", "14D", "21D", "28D"],
327
+ test_size="4D",
328
+ )
329
+
330
+ result = policy.evaluate(
331
+ frame,
332
+ model=model,
333
+ target_col="target",
334
+ feature_cols=["feature_1", "feature_2"],
335
+ metrics=["mae", "rmse"],
336
+ )
337
+
338
+ print(result.to_frame()[["train_size", "rmse"]])
339
+ print(result.find_optimal_train_size(metric="rmse", tolerance=0.01))
340
+ ```
341
+
342
+ That pattern keeps `test` fixed while `train` expands toward the past. It is a practical way to study data efficiency or to estimate how much history is actually needed.
343
+
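One plausible reading of "how much history is actually needed" is: the shortest training window whose error is within a tolerance of the best error observed. The sketch below implements that reading in plain pandas on made-up numbers. The helper name, the relative interpretation of `tolerance` and the result values are assumptions; the actual behavior of `find_optimal_train_size` may differ.

```python
import pandas as pd

# Made-up results table with the same columns as the printed example above.
results = pd.DataFrame(
    {
        "train_size": ["7D", "14D", "21D", "28D"],
        "rmse": [2.40, 2.05, 2.04, 2.03],
    }
)


def smallest_size_within_tolerance(results, metric, tolerance):
    """Return the shortest train size whose metric is within a relative
    tolerance of the best (lowest) value. Hypothetical helper."""
    best = results[metric].min()
    acceptable = results[results[metric] <= best * (1 + tolerance)]
    # Compare sizes as durations rather than as strings.
    durations = acceptable["train_size"].map(pd.Timedelta)
    return acceptable.loc[durations.idxmin(), "train_size"]


chosen = smallest_size_within_tolerance(results, metric="rmse", tolerance=0.01)
print(chosen)
```

On these numbers, 28 days gives the lowest error, but 14 days is already within 1% of it, so the shorter window is preferred.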
344
+ The opposite special case is also common: keep `train` fixed and move `test` forward day by day to estimate how long a model or rule keeps its performance without retraining. The two patterns answer different questions:
345
+
346
+ - fixed `test` + growing `train`: how much history do I actually need?
347
+ - fixed `train` + moving `test`: for how long does performance hold after deployment?
348
+
349
+ Example of the second pattern:
350
+
351
+ ```python
352
+ from jano import DriftMonitoringPolicy
353
+
354
+ policy = DriftMonitoringPolicy(
355
+ "timestamp",
356
+ cutoff="2025-09-15",
357
+ train_size="30D",
358
+ test_size="3D",
359
+ step="1D",
360
+ max_windows=10,
361
+ )
362
+
363
+ result = policy.evaluate(
364
+ frame,
365
+ model=model,
366
+ target_col="target",
367
+ feature_cols=["feature_1", "feature_2"],
368
+ metrics=["mae", "rmse"],
369
+ )
370
+
371
+ print(result.to_frame()[["window", "test_start", "rmse"]])
372
+ print(result.find_drift_onset(metric="rmse", threshold=0.15, baseline="first"))
373
+ ```
374
+
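The idea of a drift onset can be illustrated without Jano: take the first test window as the baseline and report the first later window whose error exceeds it by more than a relative threshold. This is a sketch under that assumption, on made-up scores; the helper name is hypothetical and the exact rule used by `find_drift_onset` is not documented in this section.

```python
import pandas as pd

# Made-up RMSE per test window, indexed by window number.
rmse_by_window = pd.Series([1.00, 1.05, 1.10, 1.20, 1.30])


def first_relative_degradation(scores, threshold):
    """Return the index of the first score that exceeds the first score
    by more than `threshold` (relative), or None if none does."""
    baseline = scores.iloc[0]
    degraded = scores > baseline * (1 + threshold)
    # idxmax on a boolean Series returns the label of the first True value.
    return degraded.idxmax() if degraded.any() else None


onset = first_relative_degradation(rmse_by_window, threshold=0.15)
print(onset)
```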
375
+ ## Example: optimize training history inside each walk-forward iteration
376
+
377
+ This is the next-level composed question: if each outer test window is allowed to choose its own optimal training history, how much history is needed on average?
378
+
379
+ ```python
380
+ from jano import RollingTrainHistoryPolicy, TemporalPartitionSpec
381
+
382
+ policy = RollingTrainHistoryPolicy(
383
+ "timestamp",
384
+ partition=TemporalPartitionSpec(
385
+ layout="train_test",
386
+ train_size="30D",
387
+ test_size="1D",
388
+ ),
389
+ step="1D",
390
+ strategy="rolling",
391
+ max_folds=10,
392
+ train_sizes=["5D", "10D", "15D", "30D"],
393
+ )
394
+
395
+ result = policy.evaluate(
396
+ frame,
397
+ model=model,
398
+ target_col="target",
399
+ feature_cols=["feature_1", "feature_2"],
400
+ metrics="rmse",
401
+ metric="rmse",
402
+ tolerance=0.01,
403
+ )
404
+
405
+ print(result.to_frame().head())
406
+ print(result.summary())
407
+ ```
408
+
409
+ ## Example: different feature groups can require different history depths
410
+
411
+ The supervised fold can stay fixed while feature engineering still asks for different
412
+ lookback windows per feature group.
413
+
414
+ ```python
415
+ from jano import FeatureLookbackSpec
416
+
417
+ split = next(splitter.iter_splits(frame))
418
+ lookbacks = FeatureLookbackSpec(
419
+ default_lookback="15D",
420
+ group_lookbacks={"lag_features": "65D"},
421
+ feature_groups={"lag_features": ["lag_30", "lag_60"]},
422
+ )
423
+
424
+ history = split.slice_feature_history(
425
+ frame,
426
+ lookbacks,
427
+ time_col="timestamp",
428
+ segment_name="train",
429
+ )
430
+
431
+ recent_context = history["__default__"]
432
+ lag_context = history["lag_features"]
433
+ ```
434
+
435
+ This is useful when recent features only need a short window while lagged or seasonal
436
+ features need much deeper historical context for the same model.
437
+
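The same idea can be sketched in plain pandas: one reference timestamp, several lookback depths, one history slice per feature group. The reference point used here (a hypothetical `reference_time`) and the half-open slicing are assumptions for illustration; which boundary `slice_feature_history` measures lookbacks from is not stated in this section.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=120, freq="D"),
        "value": range(120),
    }
)

# Hypothetical reference point from which each lookback is measured.
reference_time = pd.Timestamp("2024-04-01")
lookbacks = {"__default__": "15D", "lag_features": "65D"}

history = {
    name: frame[
        (frame["timestamp"] >= reference_time - pd.Timedelta(size))
        & (frame["timestamp"] < reference_time)
    ]
    for name, size in lookbacks.items()
}

for name, rows in history.items():
    print(name, len(rows), rows["timestamp"].min().date())
```

Each group gets only as much history as it needs, and no slice reaches past the reference timestamp.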
438
+ ## Example: describe a simulation as HTML
439
+
440
+ ```python
441
+ summary = splitter.describe_simulation(frame, title="Walk-forward simulation")
442
+ html = splitter.describe_simulation(frame, output="html")
443
+ chart_data = splitter.describe_simulation(frame, output="chart_data")
444
+
445
+ print(summary.total_folds)
446
+ print(summary.to_frame().head())
447
+ print(chart_data.segment_stats)
448
+ ```
449
+
450
+ That gives you three ways to consume the same simulation:
451
+
452
+ - `summary` for tabular metadata and export helpers,
453
+ - `html` for a standalone visual report,
454
+ - `chart_data` for direct Python plotting without reparsing HTML.
455
+
456
+ The generated report shows each fold across the dataset timeline, with richer summary cards, clearer segment labels and row counts per partition.
457
+
458
+ ## Installation
459
+
460
+ Install the package from PyPI with:
461
+
462
+ ```bash
463
+ python -m pip install jano
464
+ ```
465
+
466
+ To use Polars inputs directly:
467
+
468
+ ```bash
469
+ python -m pip install "jano[polars]"
470
+ ```
471
+
472
+ For local development:
473
+
474
+ ```bash
475
+ python -m pip install -e ".[dev]"
476
+ python -m pytest --cov=jano --cov-report=term-missing
477
+ python -m sphinx -b html docs docs/_build/html
478
+ ```
479
+
480
+ Jano also exposes its runtime version through `jano.__version__`.
481
+
482
+ ## Release flow
483
+
484
+ The repository includes a dedicated GitHub Actions workflow for PyPI publication through trusted publishing.
485
+
486
+ The release path is:
487
+
488
+ 1. Update `jano/_version.py`.
489
+ 2. Run `python -m pytest -q`.
490
+ 3. Run `python -m build` and `python -m twine check dist/*`.
491
+ 4. Push a tag like `v0.3.0`.
492
+
493
+ That tag triggers the `Publish` workflow, which builds the wheel and source distribution and publishes them to PyPI.
494
+
495
+ In parallel, the repository also includes a `GitHub Release` workflow that can create a GitHub Release and attach the built wheel and source distribution for any `v*` tag. That gives the project a distribution channel even while PyPI access is still being recovered.
496
+
497
+ ## Continuous integration and coverage
498
+
499
+ The repository includes:
500
+
501
+ - GitHub Actions for tests across multiple Python versions.
502
+ - GitHub Pages publication for Sphinx documentation.
503
+ - Coverage reporting with `pytest-cov`.
504
+ - Codecov upload and status tracking.
505
+
506
+ ## Status
507
+
508
+ Jano is currently in an early redesign phase. The public API is stabilizing around temporal partition specs, reusable splitters and rich split objects.
509
+
510
+ That means the project is already usable for experimentation, but it is still a good moment to refine naming, ergonomics and compatibility guarantees before publishing broadly.
511
+
512
+ ## Authors
513
+
514
+ - Marcos Manuel Muraro
515
+
516
+ ## Contributing
517
+
518
+ Feedback and design discussion are especially valuable right now. If you are using temporal backtesting for ML, analytics, operations or experimentation, that context can help shape the API in the right direction.
519
+
520
+ ## Star history
521
+
522
+ [![Star History Chart](https://api.star-history.com/svg?repos=marmurar/jano&type=Date)](https://star-history.com/#marmurar/jano&Date)