PyPI - rustystats - Versions diffs - 0.1.5__cp313-cp313-manylinux_2_34_x86_64.whl - Mend

rustystats 0.1.5__cp313-cp313-manylinux_2_34_x86_64.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14) hide show

rustystats/__init__.py +151 -0
rustystats/_rustystats.cpython-313-x86_64-linux-gnu.so +0 -0
rustystats/diagnostics.py +2471 -0
rustystats/families.py +423 -0
rustystats/formula.py +1074 -0
rustystats/glm.py +249 -0
rustystats/interactions.py +1246 -0
rustystats/links.py +221 -0
rustystats/splines.py +367 -0
rustystats/target_encoding.py +375 -0
rustystats-0.1.5.dist-info/METADATA +476 -0
rustystats-0.1.5.dist-info/RECORD +14 -0
rustystats-0.1.5.dist-info/WHEEL +4 -0
rustystats-0.1.5.dist-info/licenses/LICENSE +21 -0

rustystats-0.1.5.dist-info/METADATA ADDED Viewed

@@ -0,0 +1,476 @@
+Metadata-Version: 2.4
+Name: rustystats
+Version: 0.1.5
+Classifier: Programming Language :: Rust
+Classifier: Programming Language :: Python :: Implementation :: CPython
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Mathematics
+Classifier: Intended Audience :: Financial and Insurance Industry
+Classifier: Intended Audience :: Science/Research
+Requires-Dist: numpy>=1.20
+Requires-Dist: polars>=1.0
+Requires-Dist: pytest>=7.0 ; extra == 'dev'
+Requires-Dist: statsmodels>=0.14 ; extra == 'dev'
+Requires-Dist: maturin>=1.4 ; extra == 'dev'
+Requires-Dist: jupyter>=1.0 ; extra == 'dev'
+Requires-Dist: pyarrow>=14.0 ; extra == 'dev'
+Requires-Dist: psutil>=5.9 ; extra == 'dev'
+Provides-Extra: dev
+License-File: LICENSE
+Summary: Fast Generalized Linear Models with a Rust backend - statsmodels compatible
+Keywords: statistics,glm,actuarial,regression,modeling
+License: MIT
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: Repository, https://github.com/PricingFrontier/rustystats
+Project-URL: Documentation, https://github.com/PricingFrontier/rustystats#readme
+# RustyStats 🦀📊
+**High-performance Generalized Linear Models with a Rust backend and Python API**
+**Codebase Documentation**: [pricingfrontier.github.io/rustystats/](https://pricingfrontier.github.io/rustystats/)
+## Performance Benchmarks
+**RustyStats vs Statsmodels** — Synthetic data, 101 features (10 continuous + 10 categorical with 10 levels each).
+| Family | 10K rows | 250K rows | 500K rows |
+|--------|----------|-----------|-----------|
+| Gaussian | **15.6x** | **5.7x** | **4.3x** |
+| Poisson | **16.3x** | **6.2x** | **4.2x** |
+| Binomial | **19.5x** | **6.8x** | **4.4x** |
+| Gamma | **33.7x** | **13.4x** | **8.4x** |
+| NegBinomial | **26.7x** | **6.7x** | **5.0x** |
+**Average speedup: 10.5x** (range: 4.2x – 33.7x)
+### Memory Usage
+RustyStats uses significantly less RAM by reusing buffers and avoiding Python object overhead:
+| Rows | RustyStats | Statsmodels | Reduction |
+|------|------------|-------------|-----------|
+| 10K | 38 MB | 72 MB | **1.9x** |
+| 250K | 460 MB | 1,796 MB | **3.9x** |
+| 500K | 836 MB | 3,590 MB | **4.3x** |
+*Memory advantage grows with data size — at 500K rows, RustyStats uses ~4x less RAM.*
+<details>
+<summary>Full benchmark details</summary>
+| Family | Rows | RustyStats | Statsmodels | Speedup |
+|--------|------|------------|-------------|---------|
+| Gaussian | 10,000 | 0.100s | 1.559s | **15.6x** |
+| Gaussian | 250,000 | 1.991s | 11.363s | **5.7x** |
+| Gaussian | 500,000 | 4.023s | 17.386s | **4.3x** |
+| Poisson | 10,000 | 0.165s | 2.692s | **16.3x** |
+| Poisson | 250,000 | 2.429s | 15.072s | **6.2x** |
+| Poisson | 500,000 | 5.668s | 23.693s | **4.2x** |
+| Binomial | 10,000 | 0.112s | 2.189s | **19.5x** |
+| Binomial | 250,000 | 1.946s | 13.155s | **6.8x** |
+| Binomial | 500,000 | 4.708s | 20.862s | **4.4x** |
+| Gamma | 10,000 | 0.129s | 4.353s | **33.7x** |
+| Gamma | 250,000 | 2.385s | 31.885s | **13.4x** |
+| Gamma | 500,000 | 5.499s | 46.167s | **8.4x** |
+| NegBinomial | 10,000 | 0.119s | 3.177s | **26.7x** |
+| NegBinomial | 250,000 | 2.281s | 15.278s | **6.7x** |
+| NegBinomial | 500,000 | 4.821s | 24.331s | **5.0x** |
+*Times are median of 3 runs. Benchmark scripts in `benchmarks/`.*
+</details>
+---
+## Features
+- **Fast** - Parallel Rust backend, 4-30x faster than statsmodels
+- **Memory Efficient** - 4x less RAM than statsmodels at scale
+- **Stable** - Step-halving IRLS, warm starts for robust convergence
+- **Splines** - B-splines `bs()` and natural splines `ns()` in formulas
+- **Regularisation** - Ridge, Lasso, and Elastic Net via coordinate descent
+- **Complete** - 7 families, robust SEs, full diagnostics
+- **Minimal** - Only `numpy` and `polars` required
+## Installation
+```bash
+uv add rustystats
+```
+## Quick Start
+```python
+import rustystats as rs
+import polars as pl
+# Load data
+data = pl.read_parquet("insurance.parquet")
+# Fit a Poisson GLM for claim frequency
+result = rs.glm(
+    "ClaimCount ~ VehAge + VehPower + C(Area) + C(Region)",
+    data=data,
+    family="poisson",
+    offset="Exposure"
+).fit()
+# View results
+print(result.summary())
+```
+---
+## Families & Links
+| Family | Default Link | Use Case |
+|--------|--------------|----------|
+| `gaussian` | identity | Linear regression |
+| `poisson` | log | Claim frequency |
+| `binomial` | logit | Binary outcomes |
+| `gamma` | log | Claim severity |
+| `tweedie` | log | Pure premium (var_power=1.5) |
+| `quasipoisson` | log | Overdispersed counts |
+| `quasibinomial` | logit | Overdispersed binary |
+| `negbinomial` | log | Overdispersed counts (proper distribution) |
+---
+## Formula Syntax
+```python
+# Main effects
+"y ~ x1 + x2 + C(category)"
+# Interactions
+"y ~ x1*x2"              # x1 + x2 + x1:x2
+"y ~ C(area):age"        # Area-specific age effects
+"y ~ C(area)*C(brand)"   # Categorical × categorical
+# Splines (non-linear effects)
+"y ~ bs(age, df=5)"      # B-spline basis
+"y ~ ns(income, df=4)"   # Natural spline (better extrapolation)
+# Target encoding (high-cardinality categoricals)
+"y ~ TE(brand) + TE(model)"
+# Combined
+"y ~ bs(age, df=5) + C(region)*income + ns(vehicle_age, df=3) + TE(brand)"
+```
+---
+## Results Methods
+```python
+# Coefficients & Inference
+result.params              # Coefficients
+result.fittedvalues        # Predicted means
+result.deviance            # Model deviance
+result.bse()               # Standard errors
+result.tvalues()           # z-statistics
+result.pvalues()           # P-values
+result.conf_int(alpha)     # Confidence intervals
+# Robust Standard Errors (sandwich estimators)
+result.bse_robust("HC1")   # Robust SE (HC0, HC1, HC2, HC3)
+result.tvalues_robust()    # z-stats with robust SE
+result.pvalues_robust()    # P-values with robust SE
+result.conf_int_robust()   # Confidence intervals with robust SE
+result.cov_robust()        # Full robust covariance matrix
+# Diagnostics (statsmodels-compatible)
+result.resid_response()    # Raw residuals (y - μ)
+result.resid_pearson()     # Pearson residuals
+result.resid_deviance()    # Deviance residuals
+result.resid_working()     # Working residuals
+result.llf()               # Log-likelihood
+result.aic()               # Akaike Information Criterion
+result.bic()               # Bayesian Information Criterion
+result.null_deviance()     # Null model deviance
+result.pearson_chi2()      # Pearson chi-squared
+result.scale()             # Dispersion (deviance-based)
+result.scale_pearson()     # Dispersion (Pearson-based)
+result.family              # Family name
+```
+---
+## Regularization
+```python
+# Ridge (L2) - shrinks coefficients, keeps all variables
+result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="gaussian").fit(
+    alpha=0.1, l1_ratio=0.0
+)
+# Lasso (L1) - variable selection, zeros out weak predictors
+result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="poisson").fit(
+    alpha=0.1, l1_ratio=1.0
+)
+print(f"Selected {result.n_nonzero()} variables")
+print(f"Features: {result.selected_features()}")
+# Elastic Net - mix of L1 and L2
+result = rs.glm("y ~ x1 + x2 + C(cat)", data, family="gaussian").fit(
+    alpha=0.1, l1_ratio=0.5
+)
+```
+---
+## Interaction Terms
+```python
+# Continuous × Continuous interaction (main effects + interaction)
+result = rs.glm(
+    "ClaimNb ~ Age*VehPower",  # Equivalent to Age + VehPower + Age:VehPower
+    data, family="poisson", offset="Exposure"
+).fit()
+# Categorical × Continuous interaction
+result = rs.glm(
+    "ClaimNb ~ C(Area)*Age",  # Each area level has different age effect
+    data, family="poisson", offset="Exposure"
+).fit()
+# Categorical × Categorical interaction
+result = rs.glm(
+    "ClaimNb ~ C(Area)*C(VehBrand)",
+    data, family="poisson", offset="Exposure"
+).fit()
+# Pure interaction (no main effects added)
+result = rs.glm(
+    "ClaimNb ~ Age + C(Area):VehPower",  # Area-specific VehPower slopes
+    data, family="poisson", offset="Exposure"
+).fit()
+```
+---
+## Spline Basis Functions
+```python
+# Use splines in formulas - automatic parsing
+result = rs.glm(
+    "ClaimNb ~ bs(Age, df=5) + ns(VehPower, df=4) + C(Region)",
+    data=data,
+    family="poisson",
+    offset="Exposure"
+).fit()
+# Combine splines with interactions
+result = rs.glm(
+    "y ~ bs(age, df=4)*C(gender) + ns(income, df=3)",
+    data=data,
+    family="gaussian"
+).fit()
+# Direct basis computation for custom use
+import numpy as np
+x = np.linspace(0, 10, 100)
+basis = rs.bs(x, df=5)  # 5 degrees of freedom (4 basis columns)
+basis_ns = rs.ns(x, df=5)  # Natural splines - linear extrapolation at boundaries
+```
+**When to use each spline type:**
+- **B-splines (`bs`)**: Standard choice, more flexible at boundaries
+- **Natural splines (`ns`)**: Better extrapolation, linear beyond boundaries (recommended for actuarial work)
+---
+## Quasi-Families for Overdispersion
+```python
+# Fit a standard Poisson model first
+result_poisson = rs.glm("ClaimNb ~ Age + C(Region)", data, family="poisson", offset="Exposure").fit()
+# Check for overdispersion: Pearson χ² / df >> 1 indicates overdispersion
+dispersion_ratio = result_poisson.pearson_chi2() / result_poisson.df_resid
+print(f"Dispersion ratio: {dispersion_ratio:.2f}")  # If >> 1, use quasi-family
+# Fit QuasiPoisson if overdispersed
+result_quasi = rs.glm("ClaimNb ~ Age + C(Region)", data, family="quasipoisson", offset="Exposure").fit()
+# Coefficients are IDENTICAL to Poisson, but standard errors are inflated by √φ
+print(f"Estimated dispersion (φ): {result_quasi.scale():.3f}")
+# For binary data with overdispersion
+result_qb = rs.glm("Binary ~ x1 + x2", data, family="quasibinomial").fit()
+```
+**Key properties of quasi-families:**
+- **Point estimates**: Identical to base family (Poisson/Binomial)
+- **Standard errors**: Inflated by √φ where φ = Pearson χ²/(n-p)
+- **P-values**: More conservative (larger), accounting for extra variance
+---
+## Negative Binomial for Overdispersed Counts
+```python
+# Automatic θ estimation (default when theta not supplied)
+result = rs.glm("ClaimNb ~ Age + C(Region)", data, family="negbinomial", offset="Exposure").fit()
+print(result.family)  # "NegativeBinomial(theta=2.1234)"
+# Fixed θ value
+result = rs.glm("ClaimNb ~ Age + C(Region)", data, family="negbinomial", theta=1.0, offset="Exposure").fit()
+# θ controls overdispersion: Var(Y) = μ + μ²/θ
+# - θ=0.5: Strong overdispersion (variance = μ + 2μ²)
+# - θ=1.0: Moderate overdispersion (variance = μ + μ²)
+# - θ→∞: Approaches Poisson (variance = μ)
+```
+**NegativeBinomial vs QuasiPoisson:**
+| Aspect | QuasiPoisson | NegativeBinomial |
+|--------|--------------|------------------|
+| **Variance** | φ × μ | μ + μ²/θ |
+| **True distribution** | No (quasi) | Yes |
+| **AIC/BIC valid** | Questionable | Yes |
+| **Prediction intervals** | Not principled | Proper |
+---
+## Target Encoding for High-Cardinality Categoricals
+```python
+# Formula API - TE() in formulas
+result = rs.glm(
+    "ClaimNb ~ TE(Brand) + TE(Model) + Age + C(Region)",
+    data=data,
+    family="poisson",
+    offset="Exposure"
+).fit()
+# With options
+result = rs.glm(
+    "y ~ TE(brand, prior_weight=2.0, n_permutations=8) + age",
+    data=data,
+    family="gaussian"
+).fit()
+# Sklearn-style API
+encoder = rs.TargetEncoder(prior_weight=1.0, n_permutations=4)
+train_encoded = encoder.fit_transform(train_categories, train_target)
+test_encoded = encoder.transform(test_categories)
+```
+**Key benefits:**
+- **No target leakage**: Ordered target statistics
+- **Regularization**: Prior weight controls shrinkage toward global mean
+- **High-cardinality**: Single column instead of thousands of dummies
+---
+## Model Diagnostics
+```python
+# Compute all diagnostics at once
+diagnostics = result.diagnostics(
+    data=data,
+    categorical_factors=["Region", "VehBrand", "Area"],  # Including non-fitted
+    continuous_factors=["Age", "Income", "VehPower"],    # Including non-fitted
+)
+# Export as compact JSON (optimized for LLM consumption)
+json_str = diagnostics.to_json()
+# Pre-fit data exploration (no model needed)
+exploration = rs.explore_data(
+    data=data,
+    response="ClaimNb",
+    categorical_factors=["Region", "VehBrand", "Area"],
+    continuous_factors=["Age", "VehPower", "Income"],
+    exposure="Exposure",
+    family="poisson",
+    detect_interactions=True,
+)
+```
+**Diagnostic Features:**
+- **Calibration**: Overall A/E ratio, calibration by decile with CIs, Hosmer-Lemeshow test
+- **Discrimination**: Gini coefficient, AUC, KS statistic, lift metrics
+- **Factor Diagnostics**: A/E by level/bin for ALL factors (fitted and non-fitted)
+- **Interaction Detection**: Greedy residual-based detection of potential interactions
+- **Warnings**: Auto-generated alerts for high dispersion, poor calibration, missing factors
+---
+## RustyStats vs Statsmodels
+| Feature | RustyStats | Statsmodels |
+|---------|------------|-------------|
+| **Parallel IRLS Solver** | ✅ Multi-threaded via Rayon | ❌ Single-threaded only |
+| **Native Polars Support** | ✅ Formula API works with Polars DataFrames | ❌ Pandas only |
+| **Built-in Lasso/Elastic Net for GLMs** | ✅ Fast coordinate descent with all families | ⚠️ Limited |
+| **Relativities Table** | ✅ `result.relativities()` for pricing | ❌ Must compute manually |
+| **Robust Standard Errors** | ✅ HC0, HC1, HC2, HC3 sandwich estimators | ✅ HC0-HC3 |
+---
+## Project Structure
+```
+rustystats/
+├── Cargo.toml                    # Workspace config
+├── pyproject.toml                # Python package config
+│
+├── crates/
+│   ├── rustystats-core/          # Pure Rust GLM library
+│   │   └── src/
+│   │       ├── families/         # Gaussian, Poisson, Binomial, Gamma, Tweedie, Quasi, NegativeBinomial
+│   │       ├── links/            # Identity, Log, Logit
+│   │       ├── solvers/          # IRLS, coordinate descent
+│   │       ├── inference/        # P-values, CIs, robust SE (HC0-HC3)
+│   │       ├── interactions/     # Lazy interaction term computation
+│   │       ├── splines/          # B-spline and natural spline basis functions
+│   │       ├── design_matrix/    # Categorical encoding, interaction matrices
+│   │       ├── formula/          # R-style formula parsing
+│   │       ├── target_encoding/  # Ordered target statistics
+│   │       └── diagnostics/      # Residuals, dispersion, AIC/BIC, calibration, loss
+│   │
+│   └── rustystats/               # Python bindings (PyO3)
+│       └── src/lib.rs
+│
+├── python/rustystats/            # Python package
+│   ├── __init__.py               # Main exports
+│   ├── formula.py                # Formula API with DataFrame support
+│   ├── splines.py                # bs() and ns() spline basis functions
+│   ├── target_encoding.py        # Target encoding
+│   ├── diagnostics.py            # Model diagnostics with JSON export
+│   └── families.py               # Family wrappers
+│
+├── examples/
+│   └── frequency.ipynb           # Claim frequency example
+│
+└── tests/python/                 # Python test suite
+```
+---
+## Dependencies
+### Rust
+- `ndarray`, `nalgebra` - Linear algebra
+- `rayon` - Parallel iterators (multi-threading)
+- `statrs` - Statistical distributions
+- `pyo3` - Python bindings
+### Python
+- `numpy` - Array operations (required)
+- `polars` - DataFrame support (required)
+---
+## License
+MIT

rustystats-0.1.5.dist-info/RECORD ADDED Viewed

@@ -0,0 +1,14 @@
+rustystats-0.1.5.dist-info/METADATA,sha256=gcMU-FCDylYCfffKFDlg5l5U7IHcjpRvJIU7etxVcGM,15761
+rustystats-0.1.5.dist-info/WHEEL,sha256=jmh_XJXNl6S4nP-PSN0xl3LQkj9yKhrwCYo0d4TS4ew,109
+rustystats-0.1.5.dist-info/licenses/LICENSE,sha256=MMgZDMsAZZqE7jrwmPD2K_pwsdY6Cw4OBVmp5pJYOKg,1072
+rustystats/__init__.py,sha256=XIfx7I1nVPLnp3iwJy297klDJpb2_2LOwErLhNPkoyk,4052
+rustystats/_rustystats.cpython-313-x86_64-linux-gnu.so,sha256=u8-h8i0aZuJw-SF6Xu2cyF7A-0-fYWx4E4-I0EXQ5wU,1896704
+rustystats/diagnostics.py,sha256=CNZTRibvf00PATHXcyQlinrLgrPxIrypRVUruYdWm58,90498
+rustystats/families.py,sha256=PNgaXPd09C4T72sNBCswnGD2bIec81hjTq8zxzsjaUc,14510
+rustystats/formula.py,sha256=U2nETIndcmjSvkEnpU6Qd-8WRNSU4H0nodTwkVqhXh4,36398
+rustystats/glm.py,sha256=N2DKV3J1X-2x_LUYjNXHmoogjVGLqqePlNfAMfAEF-Q,8133
+rustystats/interactions.py,sha256=3objuZgqXBG2XTuBXNY--flXoM2cqrpIRvicO8MuaV4,47405
+rustystats/links.py,sha256=qvlSV5IJQPtDy92PNKkxQUEsni3Jfsw55lFzp9RXlVI,6754
+rustystats/splines.py,sha256=y6nRxVEmzpXVgeT92UA7Fs9LkHutQ2qtdHY0xaaLsZU,11137
+rustystats/target_encoding.py,sha256=M9KFs1AILqvv-yZr3MVGUw541GU9-DyOuQW5p0YzOC4,11563
+rustystats-0.1.5.dist-info/RECORD,,

rustystats-0.1.5.dist-info/WHEEL ADDED Viewed

@@ -0,0 +1,4 @@
+Wheel-Version: 1.0
+Generator: maturin (1.10.2)
+Root-Is-Purelib: false
+Tag: cp313-cp313-manylinux_2_34_x86_64

rustystats-0.1.5.dist-info/licenses/LICENSE ADDED Viewed

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2024 PricingFrontier
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.