PyPI - coxstream - Versions diffs - 0.1.0__tar.gz - Mend

coxstream 0.1.0__tar.gz

Files changed (17)

coxstream-0.1.0/LICENSE +21 -0
coxstream-0.1.0/PKG-INFO +164 -0
coxstream-0.1.0/README.md +139 -0
coxstream-0.1.0/pyproject.toml +45 -0
coxstream-0.1.0/setup.cfg +4 -0
coxstream-0.1.0/setup.py +38 -0
coxstream-0.1.0/src/coxstream/__init__.py +18 -0
coxstream-0.1.0/src/coxstream/_kernel.c +30747 -0
coxstream-0.1.0/src/coxstream/_kernel.pyi +45 -0
coxstream-0.1.0/src/coxstream/_kernel.pyx +192 -0
coxstream-0.1.0/src/coxstream/coxstream.py +298 -0
coxstream-0.1.0/src/coxstream.egg-info/PKG-INFO +164 -0
coxstream-0.1.0/src/coxstream.egg-info/SOURCES.txt +15 -0
coxstream-0.1.0/src/coxstream.egg-info/dependency_links.txt +1 -0
coxstream-0.1.0/src/coxstream.egg-info/requires.txt +7 -0
coxstream-0.1.0/src/coxstream.egg-info/top_level.txt +1 -0
coxstream-0.1.0/tests/test_coxstream.py +176 -0

coxstream-0.1.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 Tommy Carstensen
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

coxstream-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,164 @@
+Metadata-Version: 2.4
+Name: coxstream
+Version: 0.1.0
+Summary: Exact out-of-core Cox proportional hazards regression via streaming Newton-Raphson
+Author-email: Tommy Carstensen <zenodo@tommycarstensen.com>
+License-Expression: MIT
+Project-URL: Homepage, https://github.com/tommycarstensen/coxstream
+Project-URL: Repository, https://github.com/tommycarstensen/coxstream
+Project-URL: Issues, https://github.com/tommycarstensen/coxstream/issues
+Keywords: survival analysis,cox proportional hazards,out-of-core,streaming,epidemiology,statistics
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Science/Research
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: numpy>=1.24
+Provides-Extra: parquet
+Requires-Dist: pyarrow>=14; extra == "parquet"
+Provides-Extra: test
+Requires-Dist: pytest>=7; extra == "test"
+Dynamic: license-file
+# coxstream
+**Exact out-of-core Cox proportional hazards regression via streaming
+Newton-Raphson.**
+[![PyPI](https://img.shields.io/pypi/v/coxstream.svg)](https://pypi.org/project/coxstream/)
+<!-- DOI badge: uncomment once the Zenodo record exists.
+[![DOI](https://zenodo.org/badge/DOI/TODO.svg)](https://doi.org/TODO)
+-->
+Standard CoxPH solvers (`lifelines`, `scikit-survival`, R `survival`) load the
+full cohort into memory before fitting, so on registry-scale data they exhaust
+RAM long before the computation is hard. `coxstream` computes the **exact** Efron
+partial-likelihood estimate by streaming a single time-sorted pass over the data
+per Newton-Raphson iteration, holding only `O(p^2)` state for `p` covariates.
+Working memory is therefore **independent of the number of observations `n`**:
+the model fits on a workstation even when the cohort is far larger than RAM.
+The streamed estimate *is* the in-memory maximum-likelihood estimate, and the
+Efron tie correction is carried across chunk boundaries, so heavily tied data
+are handled exactly.
+![coxstream holds peak RAM flat as the cohort grows, while in-memory solvers (lifelines, R survival::coxph) scale with n; coefficients agree to machine precision.](https://raw.githubusercontent.com/tommycarstensen/coxstream/main/docs/benchmark.png)
+*Memory vs. speed against `lifelines` and R `survival::coxph`: coxstream's peak
+RAM stays flat in the number of rows while in-memory solvers grow with the
+cohort, at matching coefficients. See the accompanying paper for the full
+methodology.*
+## Install
+```bash
+pip install coxstream             # core (numpy only)
+pip install 'coxstream[parquet]'  # + out-of-core fit_parquet (pyarrow)
+```
+The package builds a small Cython kernel, so a C compiler is required.
+## Usage
+In memory:
+```python
+import numpy as np
+from coxstream import CoxStream
+model = CoxStream().fit(durations, events, X, feature_names=names)
+print(model.coef_, model.n_iter_)
+```
+Out of core, from a Parquet file **pre-sorted by descending event time** (never
+materialises the cohort):
+```python
+from coxstream import CoxStream
+# The file must already be sorted by duration DESC. `fit_parquet` verifies this
+# from the Parquet footer statistics alone (no full pass) and rejects a file
+# that is out of order; pass assume_sorted=True to skip the check.
+#
+# Sort it once with an out-of-core sorter -- both spill to disk, so they handle
+# a cohort larger than RAM (a sort-engine benchmark found these the fastest):
+#   duckdb:  COPY (SELECT * FROM 'cohort.parquet' ORDER BY duration DESC)
+#            TO 'cohort_desc.parquet' (FORMAT PARQUET);
+#   polars:  (pl.scan_parquet("cohort.parquet")
+#              .sort("duration", descending=True)
+#              .sink_parquet("cohort_desc.parquet"))
+#   R:       duckdb via its R client runs the same COPY ... ORDER BY DESC.
+# If the cohort fits in RAM, skip the file and call .fit, which sorts for you.
+model = CoxStream().fit_parquet(
+    "cohort_desc.parquet",
+    duration_col="duration",
+    event_col="event",
+    covariate_cols=["age_std", "sex", "treatment"],
+)
+print(model.coef_)
+```
+To validate a file's order ahead of time -- a dry run, e.g. a CI or pipeline
+gate right after you sort and before a long fit -- call `check_sorted`, which
+runs the same footer-only check without fitting and raises on a file that is
+provably out of order:
+```python
+from coxstream import check_sorted
+check_sorted("cohort_desc.parquet", duration_col="duration")  # raises if unsorted
+```
+The same call doubles as a shell gate: the uncaught exception makes `python -c`
+exit non-zero on an out-of-order file, so a pipeline step can fail fast without
+a bespoke CLI:
+```bash
+python -c "import coxstream; coxstream.check_sorted('cohort_desc.parquet', 'duration')"
+```
+## Validation
+`coxstream` is verified against `lifelines` and R `survival::coxph`:
+- It reproduces the in-memory maximum-likelihood estimate to **machine
+  precision** on synthetic data.
+- On the heavily tied Synthea 100K cohort (51 % of event times tied) it matches
+  `lifelines` to ~`1e-6`.
+- Peak resident memory is flat in `n` while in-memory solvers grow with the
+  cohort and eventually exhaust RAM.
+The package's own test suite is dependency-free: it checks exactness against a
+self-contained plain-numpy Cox Newton-Raphson reference. The cross-checks
+against `lifelines` and R `survival::coxph` above live in the accompanying
+benchmark and paper.
+The methodology and full results are in the accompanying paper (see
+[Citation](#citation)).
+## Scope
+`coxstream` implements the exact Efron partial likelihood for large-`n`,
+modest-`p` tabular survival data. It is a focused estimator, not a full survival
+suite: it does not provide baseline-hazard estimation, time-varying covariates,
+or proportional-hazards diagnostics.
+## Testing
+```bash
+pip install -e '.[test]'           # core suite (numpy only)
+pip install -e '.[test,parquet]'   # + the out-of-core fit_parquet test
+pytest
+```
+## Citation
+If you use `coxstream`, please cite it via the metadata in
+[`CITATION.cff`](CITATION.cff).
+## License
+MIT. See [LICENSE](LICENSE).
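
Reviewer note (not part of the package): the `Requires-Dist` and `Provides-Extra` headers above are what make the core install numpy-only, with `pyarrow` and `pytest` behind extras. As a minimal sketch, the header block can be read with the standard-library email parser, since PKG-INFO uses RFC 822-style headers. The `PKG_INFO` string is an abridged copy of the headers shown above, and the variable names are illustrative.

```python
from email import message_from_string

# Abridged copy of the PKG-INFO headers shown in the diff above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: coxstream
Version: 0.1.0
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24
Provides-Extra: parquet
Requires-Dist: pyarrow>=14; extra == "parquet"
Provides-Extra: test
Requires-Dist: pytest>=7; extra == "test"
"""

msg = message_from_string(PKG_INFO)

# Repeated headers come back as a list, in file order.
requires = msg.get_all("Requires-Dist")
extras = msg.get_all("Provides-Extra")

# Requirements without an `extra ==` marker are installed unconditionally.
core = [r for r in requires if "extra ==" not in r]
optional = [r for r in requires if "extra ==" in r]

print("core:", core)
print("extras:", extras)
print("optional:", optional)
```

The substring test on `extra ==` is a shortcut for this small file; a full environment-marker parser would be the more robust choice for arbitrary metadata.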

coxstream-0.1.0/README.md ADDED

@@ -0,0 +1,139 @@
+# coxstream
+**Exact out-of-core Cox proportional hazards regression via streaming
+Newton-Raphson.**
+[![PyPI](https://img.shields.io/pypi/v/coxstream.svg)](https://pypi.org/project/coxstream/)
+<!-- DOI badge: uncomment once the Zenodo record exists.
+[![DOI](https://zenodo.org/badge/DOI/TODO.svg)](https://doi.org/TODO)
+-->
+Standard CoxPH solvers (`lifelines`, `scikit-survival`, R `survival`) load the
+full cohort into memory before fitting, so on registry-scale data they exhaust
+RAM long before the computation is hard. `coxstream` computes the **exact** Efron
+partial-likelihood estimate by streaming a single time-sorted pass over the data
+per Newton-Raphson iteration, holding only `O(p^2)` state for `p` covariates.
+Working memory is therefore **independent of the number of observations `n`**:
+the model fits on a workstation even when the cohort is far larger than RAM.
+The streamed estimate *is* the in-memory maximum-likelihood estimate, and the
+Efron tie correction is carried across chunk boundaries, so heavily tied data
+are handled exactly.
+![coxstream holds peak RAM flat as the cohort grows, while in-memory solvers (lifelines, R survival::coxph) scale with n; coefficients agree to machine precision.](https://raw.githubusercontent.com/tommycarstensen/coxstream/main/docs/benchmark.png)
+*Memory vs. speed against `lifelines` and R `survival::coxph`: coxstream's peak
+RAM stays flat in the number of rows while in-memory solvers grow with the
+cohort, at matching coefficients. See the accompanying paper for the full
+methodology.*
+## Install
+```bash
+pip install coxstream             # core (numpy only)
+pip install 'coxstream[parquet]'  # + out-of-core fit_parquet (pyarrow)
+```
+The package builds a small Cython kernel, so a C compiler is required.
+## Usage
+In memory:
+```python
+import numpy as np
+from coxstream import CoxStream
+model = CoxStream().fit(durations, events, X, feature_names=names)
+print(model.coef_, model.n_iter_)
+```
+Out of core, from a Parquet file **pre-sorted by descending event time** (never
+materialises the cohort):
+```python
+from coxstream import CoxStream
+# The file must already be sorted by duration DESC. `fit_parquet` verifies this
+# from the Parquet footer statistics alone (no full pass) and rejects a file
+# that is out of order; pass assume_sorted=True to skip the check.
+#
+# Sort it once with an out-of-core sorter -- both spill to disk, so they handle
+# a cohort larger than RAM (a sort-engine benchmark found these the fastest):
+#   duckdb:  COPY (SELECT * FROM 'cohort.parquet' ORDER BY duration DESC)
+#            TO 'cohort_desc.parquet' (FORMAT PARQUET);
+#   polars:  (pl.scan_parquet("cohort.parquet")
+#              .sort("duration", descending=True)
+#              .sink_parquet("cohort_desc.parquet"))
+#   R:       duckdb via its R client runs the same COPY ... ORDER BY DESC.
+# If the cohort fits in RAM, skip the file and call .fit, which sorts for you.
+model = CoxStream().fit_parquet(
+    "cohort_desc.parquet",
+    duration_col="duration",
+    event_col="event",
+    covariate_cols=["age_std", "sex", "treatment"],
+)
+print(model.coef_)
+```
+To validate a file's order ahead of time -- a dry run, e.g. a CI or pipeline
+gate right after you sort and before a long fit -- call `check_sorted`, which
+runs the same footer-only check without fitting and raises on a file that is
+provably out of order:
+```python
+from coxstream import check_sorted
+check_sorted("cohort_desc.parquet", duration_col="duration")  # raises if unsorted
+```
+The same call doubles as a shell gate: the uncaught exception makes `python -c`
+exit non-zero on an out-of-order file, so a pipeline step can fail fast without
+a bespoke CLI:
+```bash
+python -c "import coxstream; coxstream.check_sorted('cohort_desc.parquet', 'duration')"
+```
+## Validation
+`coxstream` is verified against `lifelines` and R `survival::coxph`:
+- It reproduces the in-memory maximum-likelihood estimate to **machine
+  precision** on synthetic data.
+- On the heavily tied Synthea 100K cohort (51 % of event times tied) it matches
+  `lifelines` to ~`1e-6`.
+- Peak resident memory is flat in `n` while in-memory solvers grow with the
+  cohort and eventually exhaust RAM.
+The package's own test suite is dependency-free: it checks exactness against a
+self-contained plain-numpy Cox Newton-Raphson reference. The cross-checks
+against `lifelines` and R `survival::coxph` above live in the accompanying
+benchmark and paper.
+The methodology and full results are in the accompanying paper (see
+[Citation](#citation)).
+## Scope
+`coxstream` implements the exact Efron partial likelihood for large-`n`,
+modest-`p` tabular survival data. It is a focused estimator, not a full survival
+suite: it does not provide baseline-hazard estimation, time-varying covariates,
+or proportional-hazards diagnostics.
+## Testing
+```bash
+pip install -e '.[test]'           # core suite (numpy only)
+pip install -e '.[test,parquet]'   # + the out-of-core fit_parquet test
+pytest
+```
+## Citation
+If you use `coxstream`, please cite it via the metadata in
+[`CITATION.cff`](CITATION.cff).
+## License
+MIT. See [LICENSE](LICENSE).
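
Reviewer note (not part of the package): the README describes one descending-time pass per Newton-Raphson iteration that keeps only `O(p^2)` state and carries the Efron tie correction across chunk boundaries. The sketch below illustrates that idea in plain numpy under stated assumptions. It is not the package's Cython kernel, and the names `EfronPass` and `fit_streaming` are hypothetical. Rows are assumed to arrive sorted by descending duration; the open tie group is held in the accumulator, so a chunk boundary inside a run of tied times does not change the result.

```python
import numpy as np


class EfronPass:
    """One descending-time sweep: log-likelihood, score and information."""

    def __init__(self, beta):
        self.beta = np.asarray(beta, dtype=float)
        p = self.beta.size
        # Risk-set sums over every row seen so far (duration >= current time).
        self.s0, self.s1, self.s2 = 0.0, np.zeros(p), np.zeros((p, p))
        self.loglik, self.score, self.info = 0.0, np.zeros(p), np.zeros((p, p))
        self.t = None  # duration of the tie group currently open
        self._open_group()

    def _open_group(self):
        p = self.beta.size
        self.d = 0  # number of events in the open tie group
        self.d0, self.d1, self.d2 = 0.0, np.zeros(p), np.zeros((p, p))
        self.dx = np.zeros(p)  # covariate sum over the tied events

    def _close_group(self):
        # Efron: the k-th tied event sees the risk set minus k/d of the ties.
        for k in range(self.d):
            f = k / self.d
            phi = self.s0 - f * self.d0
            z = (self.s1 - f * self.d1) / phi
            self.loglik -= np.log(phi)
            self.score -= z
            self.info += (self.s2 - f * self.d2) / phi - np.outer(z, z)
        self.loglik += self.dx @ self.beta
        self.score += self.dx
        self._open_group()

    def update(self, durations, events, X):
        for t, e, x in zip(durations, events, np.asarray(X, dtype=float)):
            if self.t is not None and t != self.t:
                if t > self.t:
                    raise ValueError("rows must be sorted by descending duration")
                self._close_group()
            self.t = t
            w = np.exp(x @ self.beta)
            self.s0 += w
            self.s1 += w * x
            self.s2 += w * np.outer(x, x)
            if e:
                self.d += 1
                self.d0 += w
                self.d1 += w * x
                self.d2 += w * np.outer(x, x)
                self.dx += x

    def finish(self):
        self._close_group()
        return self.loglik, self.score, self.info


def fit_streaming(chunks, p, n_iter=25, tol=1e-10):
    """`chunks()` must return a fresh iterator of (durations, events, X)."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        sweep = EfronPass(beta)
        for durations, events, X in chunks():
            sweep.update(durations, events, X)
        _, score, info = sweep.finish()
        step = np.linalg.solve(info, score)  # Newton-Raphson ascent step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic, heavily tied data (coarse integer durations), sorted descending.
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
durations = rng.integers(1, 15, size=n).astype(float)
events = rng.random(n) < 0.7
order = np.argsort(-durations, kind="stable")
durations, events, X = durations[order], events[order], X[order]


def in_chunks(size):
    return lambda: ((durations[i:i + size], events[i:i + size], X[i:i + size])
                    for i in range(0, n, size))


beta_whole = fit_streaming(in_chunks(n), p)    # one chunk holding every row
beta_chunked = fit_streaming(in_chunks(7), p)  # chunk edges fall inside tie groups
```

Because the accumulator processes rows in the same order whatever the chunk size, `beta_whole` and `beta_chunked` agree. The per-row Python loop is for readability only; a production kernel would vectorise or compile it.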

coxstream-0.1.0/pyproject.toml ADDED

@@ -0,0 +1,45 @@
+[build-system]
+requires      = ["setuptools>=64", "Cython>=3.0", "numpy>=1.24"]
+build-backend = "setuptools.build_meta"
+[project]
+name        = "coxstream"
+version     = "0.1.0"
+description = "Exact out-of-core Cox proportional hazards regression via streaming Newton-Raphson"
+readme      = "README.md"
+license     = "MIT"
+authors     = [{ name = "Tommy Carstensen", email = "zenodo@tommycarstensen.com" }]
+requires-python = ">=3.10"
+dependencies = [
+    "numpy>=1.24",
+]
+keywords = [
+    "survival analysis",
+    "cox proportional hazards",
+    "out-of-core",
+    "streaming",
+    "epidemiology",
+    "statistics",
+]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Science/Research",
+    "Programming Language :: Python :: 3",
+    "Topic :: Scientific/Engineering :: Bio-Informatics",
+    "Topic :: Scientific/Engineering :: Medical Science Apps.",
+]
+[project.optional-dependencies]
+parquet = ["pyarrow>=14"]
+test    = ["pytest>=7"]
+[project.urls]
+Homepage   = "https://github.com/tommycarstensen/coxstream"
+Repository = "https://github.com/tommycarstensen/coxstream"
+Issues     = "https://github.com/tommycarstensen/coxstream/issues"
+[tool.setuptools.packages.find]
+where = ["src"]
+[tool.setuptools.package-data]
+coxstream = ["*.pyx"]

coxstream-0.1.0/setup.cfg ADDED

@@ -0,0 +1,4 @@
+[egg_info]
+tag_build =
+tag_date = 0

coxstream-0.1.0/setup.py ADDED

@@ -0,0 +1,38 @@
+"""Build the vendored Cython Efron streaming kernel (coxstream._kernel).
+The project metadata lives in pyproject.toml; this file only declares the
+compiled extension. Build in place for development with:
+    pip install -e .
+"""
+import platform
+import numpy as np
+from Cython.Build import cythonize
+from setuptools import Extension, setup
+# -O3 -ffast-math lets the compiler auto-vectorise the O(p^2) inner loops.
+# -march=native is added only on Linux: macOS Pythons are universal2 and
+# -march=native breaks the cross-arch build.
+_flags = ["-O3", "-ffast-math"]
+if platform.system() == "Linux":
+    _flags.append("-march=native")
+setup(
+    ext_modules=cythonize(
+        [
+            Extension(
+                "coxstream._kernel",
+                sources=["src/coxstream/_kernel.pyx"],
+                include_dirs=[np.get_include()],
+                extra_compile_args=_flags,
+            )
+        ],
+        compiler_directives={
+            "language_level": "3",
+            "boundscheck": False,
+            "wraparound": False,
+            "cdivision": True,
+        },
+    ),
+)
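
Reviewer note (not part of the package): `setup.py` passes GCC/Clang-style flags on every platform and adds `-march=native` only on Linux. `-march=native` ties the compiled extension to the CPU of the build machine, which matters if a built artifact is copied to other hosts. Below is a minimal sketch of a flag selector that makes those choices explicit. The `compile_flags` helper, its `portable` switch and the Windows branch (which assumes the MSVC compiler) are assumptions for illustration, not something the package contains.

```python
import platform


def compile_flags(system=None, portable=False):
    """Pick extra_compile_args for a given platform name.

    `system` defaults to platform.system(); `portable=True` drops
    -march=native so the binary does not depend on the build host's CPU.
    """
    system = system or platform.system()
    if system == "Windows":
        # Assumption: Windows builds use MSVC, which takes /-style options.
        return ["/O2", "/fp:fast"]
    flags = ["-O3", "-ffast-math"]
    if system == "Linux" and not portable:
        flags.append("-march=native")
    return flags


print(compile_flags("Linux"))
print(compile_flags("Linux", portable=True))
print(compile_flags("Darwin"))
```

The result would be passed as `extra_compile_args` to `Extension`, in place of the `_flags` list built in the file above.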

coxstream-0.1.0/src/coxstream/__init__.py ADDED

@@ -0,0 +1,18 @@
+"""coxstream: exact out-of-core Cox proportional hazards via streaming NR.
+Public API
+----------
+CoxStream
+    Exact Efron Cox proportional hazards estimator. Computes the score and
+    observed information in a single descending-time pass per Newton-Raphson
+    iteration, with O(p^2) working memory independent of the cohort size.
+    ``fit`` takes in-memory arrays; ``fit_parquet`` streams out-of-core.
+check_sorted
+    Dry run for the ``fit_parquet`` precondition: validate that a Parquet file
+    is descending-time sorted, from footer statistics alone (no full pass), so a
+    sort mistake fails fast instead of yielding a silently wrong fit.
+"""
+from coxstream.coxstream import CoxStream, check_sorted
+__all__ = ["CoxStream", "check_sorted"]
+__version__ = "0.1.0"
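
Reviewer note (not part of the package): the docstring says `check_sorted` validates descending order from footer statistics alone. The body of `coxstream.py` is not shown in this excerpt, so the following is only a sketch of how such a check can work, not the package's implementation. It assumes each row group's minimum and maximum duration are available as `(min, max)` pairs in file order; with pyarrow these can be read from `ParquetFile(path).metadata.row_group(i).column(j).statistics`. The function name `first_order_violation` is hypothetical.

```python
def first_order_violation(row_group_stats):
    """Return the index of the first row group that proves the file is not
    sorted by descending duration, or None if no violation is provable.

    row_group_stats: iterable of (min_duration, max_duration), in file order.
    """
    prev_min = None
    for i, (lo, hi) in enumerate(row_group_stats):
        if lo > hi:
            raise ValueError(f"row group {i}: min {lo} exceeds max {hi}")
        # In a descending file, nothing in this group may exceed the
        # smallest value of the group before it (ties are allowed).
        if prev_min is not None and hi > prev_min:
            return i
        prev_min = lo
    return None


sorted_stats = [(90.0, 100.0), (50.0, 90.0), (10.0, 50.0)]
unsorted_stats = [(90.0, 100.0), (95.0, 99.0), (10.0, 50.0)]

print(first_order_violation(sorted_stats))
print(first_order_violation(unsorted_stats))
```

A footer-only check of this kind is a necessary condition: it cannot see disorder inside a single row group, which matches the docstring's framing of catching a sort mistake early rather than certifying the order of every row.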