lstar-sc 0.1.0__tar.gz

Files changed (60)
  1. lstar_sc-0.1.0/LICENSE +21 -0
  2. lstar_sc-0.1.0/MANIFEST.in +6 -0
  3. lstar_sc-0.1.0/PKG-INFO +260 -0
  4. lstar_sc-0.1.0/README.md +222 -0
  5. lstar_sc-0.1.0/core/include/lstar/lstar.hpp +1121 -0
  6. lstar_sc-0.1.0/core/include/nlohmann/json.hpp +24765 -0
  7. lstar_sc-0.1.0/pyproject.toml +69 -0
  8. lstar_sc-0.1.0/python/src/lstar/__init__.py +25 -0
  9. lstar_sc-0.1.0/python/src/lstar/__main__.py +7 -0
  10. lstar_sc-0.1.0/python/src/lstar/_accel.cpp +162 -0
  11. lstar_sc-0.1.0/python/src/lstar/_engine.py +58 -0
  12. lstar_sc-0.1.0/python/src/lstar/_native_check.py +161 -0
  13. lstar_sc-0.1.0/python/src/lstar/cli.py +548 -0
  14. lstar_sc-0.1.0/python/src/lstar/collection.py +110 -0
  15. lstar_sc-0.1.0/python/src/lstar/de.py +174 -0
  16. lstar_sc-0.1.0/python/src/lstar/kernels.py +34 -0
  17. lstar_sc-0.1.0/python/src/lstar/lazy.py +271 -0
  18. lstar_sc-0.1.0/python/src/lstar/model.py +259 -0
  19. lstar_sc-0.1.0/python/src/lstar/passthrough.py +98 -0
  20. lstar_sc-0.1.0/python/src/lstar/profiles/__init__.py +1 -0
  21. lstar_sc-0.1.0/python/src/lstar/profiles/anndata.py +861 -0
  22. lstar_sc-0.1.0/python/src/lstar/profiles/anndata_direct.py +360 -0
  23. lstar_sc-0.1.0/python/src/lstar/profiles/mudata.py +224 -0
  24. lstar_sc-0.1.0/python/src/lstar/py.typed +0 -0
  25. lstar_sc-0.1.0/python/src/lstar/validate.py +125 -0
  26. lstar_sc-0.1.0/python/src/lstar/zarr_io.py +293 -0
  27. lstar_sc-0.1.0/python/src/lstar_sc.egg-info/PKG-INFO +260 -0
  28. lstar_sc-0.1.0/python/src/lstar_sc.egg-info/SOURCES.txt +58 -0
  29. lstar_sc-0.1.0/python/src/lstar_sc.egg-info/dependency_links.txt +1 -0
  30. lstar_sc-0.1.0/python/src/lstar_sc.egg-info/entry_points.txt +2 -0
  31. lstar_sc-0.1.0/python/src/lstar_sc.egg-info/requires.txt +23 -0
  32. lstar_sc-0.1.0/python/src/lstar_sc.egg-info/top_level.txt +1 -0
  33. lstar_sc-0.1.0/python/tests/corpus.py +212 -0
  34. lstar_sc-0.1.0/python/tests/synth.py +258 -0
  35. lstar_sc-0.1.0/python/tests/test_accel.py +52 -0
  36. lstar_sc-0.1.0/python/tests/test_anndata_profile.py +89 -0
  37. lstar_sc-0.1.0/python/tests/test_arity3.py +69 -0
  38. lstar_sc-0.1.0/python/tests/test_aux.py +111 -0
  39. lstar_sc-0.1.0/python/tests/test_categorical.py +70 -0
  40. lstar_sc-0.1.0/python/tests/test_collection_reduce.py +97 -0
  41. lstar_sc-0.1.0/python/tests/test_crossimpl.py +57 -0
  42. lstar_sc-0.1.0/python/tests/test_de.py +148 -0
  43. lstar_sc-0.1.0/python/tests/test_determinism.py +48 -0
  44. lstar_sc-0.1.0/python/tests/test_fuzz.py +79 -0
  45. lstar_sc-0.1.0/python/tests/test_induce.py +156 -0
  46. lstar_sc-0.1.0/python/tests/test_lazy.py +97 -0
  47. lstar_sc-0.1.0/python/tests/test_legacy_format.py +56 -0
  48. lstar_sc-0.1.0/python/tests/test_mudata.py +243 -0
  49. lstar_sc-0.1.0/python/tests/test_nullable.py +94 -0
  50. lstar_sc-0.1.0/python/tests/test_partial.py +73 -0
  51. lstar_sc-0.1.0/python/tests/test_real_atlas.py +74 -0
  52. lstar_sc-0.1.0/python/tests/test_roundtrip.py +74 -0
  53. lstar_sc-0.1.0/python/tests/test_spatial.py +62 -0
  54. lstar_sc-0.1.0/python/tests/test_stream_write.py +132 -0
  55. lstar_sc-0.1.0/python/tests/test_synth_faithful.py +112 -0
  56. lstar_sc-0.1.0/python/tests/test_tier1_promote.py +92 -0
  57. lstar_sc-0.1.0/python/tests/test_validate.py +59 -0
  58. lstar_sc-0.1.0/python/tests/test_versions.py +51 -0
  59. lstar_sc-0.1.0/setup.cfg +4 -0
  60. lstar_sc-0.1.0/setup.py +80 -0
lstar_sc-0.1.0/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Peter Kharchenko
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
lstar_sc-0.1.0/MANIFEST.in ADDED
@@ -0,0 +1,6 @@
1
+ # Carry the header-only C++ core into the sdist so wheels build from a clean unpack.
2
+ graft core/include
3
+ include python/src/lstar/_accel.cpp
4
+ include README.md LICENSE
5
+ recursive-include python/tests *.py
6
+ include python/src/lstar/py.typed
lstar_sc-0.1.0/PKG-INFO ADDED
@@ -0,0 +1,260 @@
1
+ Metadata-Version: 2.4
2
+ Name: lstar-sc
3
+ Version: 0.1.0
4
+ Summary: L* model and Zarr interchange for single-cell/spatial omics, with a fast C++ core
5
+ Author-email: Peter Kharchenko <pk.restricted@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/kharchenkolab/lstar
8
+ Project-URL: Source, https://github.com/kharchenkolab/lstar
9
+ Keywords: single-cell,omics,zarr,anndata,interchange,bioinformatics
10
+ Classifier: Development Status :: 3 - Alpha
11
+ Classifier: Intended Audience :: Science/Research
12
+ Classifier: License :: OSI Approved :: MIT License
13
+ Classifier: Programming Language :: Python :: 3
14
+ Classifier: Programming Language :: C++
15
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
16
+ Requires-Python: >=3.8
17
+ Description-Content-Type: text/markdown
18
+ License-File: LICENSE
19
+ Requires-Dist: numpy
20
+ Requires-Dist: scipy
21
+ Requires-Dist: zarr<3
22
+ Provides-Extra: anndata
23
+ Requires-Dist: anndata; extra == "anndata"
24
+ Provides-Extra: mudata
25
+ Requires-Dist: mudata; extra == "mudata"
26
+ Provides-Extra: direct
27
+ Requires-Dist: h5py; extra == "direct"
28
+ Provides-Extra: all
29
+ Requires-Dist: anndata; extra == "all"
30
+ Requires-Dist: mudata; extra == "all"
31
+ Requires-Dist: h5py; extra == "all"
32
+ Provides-Extra: test
33
+ Requires-Dist: anndata; extra == "test"
34
+ Requires-Dist: mudata; extra == "test"
35
+ Requires-Dist: h5py; extra == "test"
36
+ Requires-Dist: numcodecs; extra == "test"
37
+ Dynamic: license-file
38
+
39
+ # L★
40
+
41
+ **A general model for single-cell omics data — built from *axes* and *fields* — and the
42
+ lightweight glue that moves data losslessly between AnnData, Seurat, SingleCellExperiment, and
43
+ pagoda/conos, including their disk-backed forms (backed AnnData, Seurat v5/BPCells, SCE/HDF5Array) — so
44
+ even datasets too large for memory convert in bounded memory.**
45
+
46
+ L★ represents a dataset as **axes** (the entities you index by — cells, genes, samples, clusters) and
47
+ **fields** (typed data over them — counts, embeddings, graphs, labels, designs). Because everything is
48
+ just axes and fields, one small model spans the diversity of real single-cell work that a fixed
49
+ `cells × genes` container strains on — for example a multi-sample (even cross-species) integration kept
50
+ as a *collection* of heterogeneous samples rather than one concatenated matrix; a CITE-seq object with
51
+ a second, protein feature axis; or a case-control cohort carrying a statistical *design* over its
52
+ samples. The routine count-matrix-plus-a-clustering case stays just as simple, while the harder cases
53
+ use the same vocabulary instead of an opaque `uns`/`misc` blob (see [Why lstar?](#why-lstar)).
54
+
55
+ In the short term, the most immediately useful thing this buys you is **[moving data between the formats
56
+ people already use](SUPPORT.md)**. Each existing container — AnnData (Python), Seurat and SingleCellExperiment (R),
57
+ pagoda/conos — fixes a few named slots; routing a dataset through L★ converts one to another while
58
+ preserving the *meaning* of each piece and **reporting** anything a target can't hold instead of
59
+ dropping it silently.
60
+
61
+ lstar is available in **Python, R, and C++** (sharing one fast C++ core), reads and writes a portable
62
+ [Zarr](https://zarr.dev)-based format, and is built to scale. Everything heavy can be **streamed in
63
+ bounded memory** — convert a multi-gigabyte dataset, write a store, or compute per-gene statistics
64
+ without ever loading the whole matrix, so work that needs a big machine today runs on a laptop (see
65
+ [Large data: lazy reads and streaming](#large-data-lazy-reads-and-streaming)). You can also open a
66
+ million-cell dataset over the network and read just the parts you need.
67
+
68
+ > **Status:** early development, not yet released. Working today: read/write the same store from
69
+ > Python, C++, and R; profiles for AnnData, Seurat (legacy v2 → v5), SingleCellExperiment, and Conos; the
70
+ > collection model; lazy/streaming reads; a browser/WebAssembly data layer.
71
+
72
+ ---
73
+
74
+ ## Why lstar?
75
+
76
+ Three things are hard with today's fixed-schema containers, and L★ is designed around them:
77
+
78
+ 1. **Conversion is lossy and pairwise.** Every container hard-codes a few named slots; what fits the
79
+ slots converts, and the rest is lost. Routing every format through *one shared model with a shared
80
+ vocabulary* makes conversion lossless on the common core and **explicit** about the remainder.
81
+ 2. **The interesting results have no home.** A gene-regulatory network, a cell–cell communication
82
+ tensor, RNA-velocity graphs, a fitted model — none of these fit a `cells × genes` slot, so they end
83
+ up as opaque blobs in `uns`/`misc`. In L★ they are ordinary, typed, queryable *fields*.
84
+ 3. **A study is many samples, not one matrix.** Different donors, conditions, even species and gene
85
+ sets cannot be honestly concatenated into a single matrix. L★ keeps a multi-sample study as a
86
+ *collection* of heterogeneous parts joined by a graph.
87
+
88
+ If you only ever need to move data between AnnData, Seurat, and SCE, point 1 is reason enough to use
89
+ lstar. Points 2 and 3 are why the model is shaped the way it is.
90
+
91
+ ## Converting between formats (the common case)
92
+
93
+ One command — `lstar convert` detects each format from its path, routes through the L★ store (in-process
94
+ for Python formats, an `Rscript` bridge for Seurat/SCE), and reports what crossed:
95
+
96
+ ```bash
97
+ lstar convert pbmc.h5ad pbmc.rds # AnnData (Python) -> Seurat (R), bridged automatically
98
+ lstar convert atlas.h5ad atlas.lstar.zarr # -> a portable L* store (--to sce for SingleCellExperiment)
99
+ lstar convert pbmc.rds pbmc.h5ad --report # + a fidelity report (every field, and what was `dropped`)
100
+ ```
101
+
102
+ Three things make it more than a one-liner:
103
+
104
+ - a **fidelity report** (`--report` / `--report-json`) lists every axis and field with its role, state,
105
+ and `provenance`, and — crucially — **`dropped`**: what the target couldn't represent, made visible
106
+ rather than silently lost.
107
+ - a **native-acceptance check** (`--check`, on by default; `--strict` to gate the exit code) opens the
108
+ result in its *own* library and runs a canonical-ops smoke (scanpy / Seurat / scran), so you know the
109
+ native analysis tools will accept it — not just that the bytes round-tripped.
110
+ - a **package-free fallback** (`--backend auto|native|direct`): each conversion uses the format's native
111
+ package when it's installed, else lstar's own codec — so you don't *need* the domain packages for the
112
+ common cases. What works **without** the native packages:
113
+
114
+ | convert (no native package) | needs only |
115
+ |---|---|
116
+ | `.h5ad` ↔ store — read **and** write | `lstar` + `h5py` |
117
+ | Seurat `.rds` ↔ store — read **and** write | `lstar` + base R (no SeuratObject) |
118
+ | SCE `.rds` → store — **read** | `lstar` + base R (no SingleCellExperiment) |
119
+ | store → SCE `.rds` (write) · `.h5mu` ↔ store | **native-only** — needs `SingleCellExperiment` / `mudata` |
120
+
121
+ At a wall (an unknown on-disk version, a `BPCells`-backed matrix) it stops and names exactly what to
122
+ install. The heavy *analysis* packages (scanpy / full Seurat / scran) are **never** needed to convert —
123
+ only for the optional `--check`. Details: [docs/conversions.md](docs/conversions.md).
124
+
125
+ Under the hood it is just `write_Y(read_X(...))` with the on-disk L★ store as the bridge between the two
126
+ languages, which you can also drive directly:
127
+
128
+ ```bash
129
+ python3 -c 'import anndata as ad, lstar; from lstar.profiles.anndata import read_anndata
130
+ lstar.write(read_anndata(ad.read_h5ad("pbmc.h5ad")), "pbmc.lstar.zarr")' # AnnData -> L* store
131
+ Rscript -e 'library(lstar); saveRDS(write_seurat(lstar_read("pbmc.lstar.zarr")), "pbmc.rds")' # -> Seurat
132
+ ```
133
+
134
+ The shared-vocabulary core — raw counts, normalized/scaled expression, PCA (scores **and** gene
135
+ loadings), UMAP/t-SNE, clusterings, cell/gene metadata — survives. Whatever the target can't hold (e.g.
136
+ neighbor graphs through Seurat) is listed in the dataset's `dropped` manifest, so nothing vanishes
137
+ unannounced. A runnable, commented version is
138
+ [`examples/convert_h5ad_to_seurat.sh`](examples/convert_h5ad_to_seurat.sh).
139
+
140
+ See **[docs/conversions.md](docs/conversions.md)** for the full glue guide (every reader/writer, the
141
+ conversion matrix, what is preserved vs. recorded as dropped, version detection) and
142
+ **[docs/mapping.md](docs/mapping.md)** for the deterministic role→slot contract — what lands where in
143
+ each target, and the native-acceptance check that verifies the native tools won't choke.
144
+
145
+ ## Building a dataset directly
146
+
147
+ If you want to author or inspect L★ data, the model is just *axes* (the things you index by) and
148
+ *fields* (typed data over them):
149
+
150
+ ```python
151
+ import scipy.sparse as sp, lstar
152
+
153
+ ds = lstar.Dataset(kind="sample")
154
+ ds.add_axis("cells", [f"cell{i}" for i in range(100)])
155
+ ds.add_axis("genes", [f"g{i}" for i in range(50)])
156
+ # A field declares what it IS (a `measure` over cells × genes) — no fixed "X" slot.
157
+ ds.add_field("counts", sp.random(100, 50, density=0.1, format="csc"),
158
+ role="measure", span=["cells", "genes"], state="raw")
159
+
160
+ lstar.write(ds, "sample.lstar.zarr")
161
+ ds2 = lstar.read("sample.lstar.zarr") # also readable from R and C++
162
+ ```
163
+
164
+ A field's `role` (`measure`, `embedding`, `loading`, `relation`, `label`, …) says what kind of object
165
+ it is. A new kind of result is a new field with a role — never a change to the format. See
166
+ [docs/model.md](docs/model.md).
167
+
168
+ ## Two design choices worth knowing
169
+
170
+ **Collections, not one big matrix.** A multi-sample study is stored as a `samples` axis plus
171
+ *per-sample* `cells.{s}`/`genes.{s}` axes and measures (samples may differ in cells *and* genes), with a
172
+ *union* `cells` axis for the joint analysis (embedding, clusters, and the integration graph as a
173
+ `relation`). The R package ingests a **Conos** object (`write_conos`) and a split **Seurat v5** assay
174
+ this way — see [`examples/conos_collection_demo.R`](examples/conos_collection_demo.R).
175
+
176
+ **Versions are recognized, not assumed.** Formats change shape across releases, so the readers detect
177
+ the variant and adapt — even a legacy **v2** `seurat` object (the pre-`Assay` S4 class, read via its raw
178
+ slots) through v3/v4 `Assay` vs. v5 `Assay5` (with a fallback for SeuratObject < 5),
179
+ pagoda2's `getRawCounts()` accessor vs. the legacy `$counts` slot, AnnData's `.raw` slot. The detected
180
+ `<format>@<version>` is recorded, so a downstream reader knows what produced the data.
181
+
182
+ ## Large data: lazy reads and streaming
183
+
184
+ Single-cell stores get big — hundreds of thousands of cells, tens of thousands of genes. lstar is built
185
+ so you never hold a whole dataset in memory to work with it: the heavy operations **stream** the matrix
186
+ in blocks, so peak memory stays bounded and roughly *flat* as the data grows.
187
+
188
+ ![Streaming vs in-memory conversion: peak memory stays flat as the dataset grows, for a modest time cost](docs/img/streaming_scaling.png)
189
+
190
+ <sub>*`h5ad → L*` conversion of the Tabula Muris Senis droplet atlas (subsampled from 25k to 245k cells, up to 502M nonzeros): the in-memory path's peak RAM grows with the matrix (to ~4 GB) while streaming stays ~flat (~0.3 GB, ~13× less at full size), for a small, roughly constant time premium. Reproduce with [`examples/streaming_scaling.py`](examples/streaming_scaling.py).*</sub>
191
+
192
+ - **Convert and write in bounded memory.** `convert_anndata` (`h5ad → L*`) and `convert_to_h5ad`
193
+ (`L* → h5ad`) move data between formats with a backed read + block-by-block write, never materializing
194
+ the matrix; `lstar.write(..., stream=True)` does the same for any lazy/backed source. A multi-gigabyte
195
+ atlas converts in a few hundred MB.
196
+ - **Open without downloading.** `lstar.read(path, lazy=True)` reads only the small manifest; the heavy
197
+ arrays stay on disk (or on the server) until you touch them. Opening a 78-million-nonzero matrix this
198
+ way costs a few megabytes of memory instead of hundreds.
199
+ - **Compute without materializing.** A per-gene statistic (say, finding the most variable genes) is
200
+ computed by *streaming* the matrix in column blocks, so memory stays bounded and the matrix is never
201
+ expanded into a dense array.
202
+
203
+ ```python
204
+ ds = lstar.read("big.lstar.zarr", lazy=True) # opens in MBs, not GBs
205
+ # per-gene mean/variance over log-normalized counts, streamed in bounded memory:
206
+ mean, var, nnz = lstar.stream_col_stats(ds.field("counts").values,
207
+ lognorm=True, # normalize on the fly; the dense matrix is never built
208
+ n_threads=8) # use as many cores as you like
209
+ top_variable_genes = var.argsort()[::-1][:2000]
210
+ ```
211
+
212
+ When you write a store, chunking and compression make these reads cheap (a lazy read fetches only the
213
+ chunks it needs):
214
+
215
+ ```python
216
+ import numcodecs
217
+ lstar.write(ds, "big.lstar.zarr", chunk_elems=1_000_000, compressor=numcodecs.GZip(5))
218
+ ```
219
+
220
+ In practice this is fast and frugal: opening that 40,220 × 20,138 matrix lazily uses ~9 MB instead of
221
+ ~780 MB, per-gene statistics stream in bounded memory, and the heavy reductions run on a shared C++
222
+ core (used automatically when available, ~8× faster on 16 threads, identical results in Python, R, and
223
+ the browser). Measurements and the full picture are in [`misc/plan1.md`](misc/plan1.md) §12.
224
+
225
+ ## Languages and components
226
+
227
+ | | what it is |
228
+ |---|---|
229
+ | **Python** (`python/`) | the `lstar` package on zarr-python, with an optional compiled C++ accelerator |
230
+ | **R** (`R/`) | the `lstar` package; the format profiles (Seurat, SCE, Conos) live here |
231
+ | **C++** (`core/`) | `libstar`, the header-only core: the model, chunked+gzip Zarr IO, and the fast kernels |
232
+ | **Browser/Node** (`js/`) | a TypeScript reader (zarrita) + the kernels compiled to WebAssembly, for viewers |
233
+
234
+ ```
235
+ docs/ principles, the model & format specs, conversions, worked examples
236
+ core/ libstar — the C++ core
237
+ python/ R/ the Python and R packages
238
+ js/ the browser/WASM data layer
239
+ conformance/ the shared round-trip / cross-format / cross-language test suite
240
+ examples/ runnable, commented end-to-end demos
241
+ misc/ the design proposal (Lstar_proposal.md) + plans
242
+ ```
243
+
244
+ ## Documentation
245
+
246
+ - **[docs/principles.md](docs/principles.md)** — the idea and the reasoning. *Start here.*
247
+ - **[docs/conversions.md](docs/conversions.md)** — using lstar as glue between formats (incl. the `lstar convert` CLI).
248
+ - **[docs/mapping.md](docs/mapping.md)** — the deterministic role→slot conversion contract + native-acceptance.
249
+ - **[docs/model.md](docs/model.md)** — the model: axes, fields, roles, collections.
250
+ - **[docs/format.md](docs/format.md)** — the on-disk Zarr layout.
251
+ - **[docs/examples.md](docs/examples.md)** — worked, commented examples (Python, R, C++, browser).
252
+ - **[SUPPORT.md](SUPPORT.md)** — **format & language support matrix**: what converts/reads/writes today,
253
+ per format and per language, with real-vs-synthetic test coverage and the known gaps.
254
+
255
+ The full normative specification (the model, the Zarr schema, and the bidirectional profile rule
256
+ catalog for every format) is the proposal, [`misc/Lstar_proposal.md`](misc/Lstar_proposal.md).
257
+
258
+ ## License
259
+
260
+ MIT.
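Note on the PKG-INFO header above: the `Requires-Dist` and `Provides-Extra` lines determine what `pip install lstar-sc` pulls in by default and what each extra adds. The sketch below is not part of the package. It parses an excerpt of that header with the standard-library `email.parser` (core metadata uses RFC 822-style headers) and splits the requirements into core and per-extra lists. The helper name `split_requirements` and the simple `extra == "..."` marker handling are assumptions made for illustration; a robust tool would use `packaging.requirements.Requirement` instead.

```python
from email.parser import Parser

# Excerpt of the PKG-INFO header shown in the diff above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: lstar-sc
Version: 0.1.0
Requires-Python: >=3.8
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: zarr<3
Provides-Extra: anndata
Requires-Dist: anndata; extra == "anndata"
Provides-Extra: direct
Requires-Dist: h5py; extra == "direct"
"""


def split_requirements(text):
    """Return (core, extras); extras maps an extra's name to its requirements.

    Hypothetical helper: it only understands markers of the exact form
    `extra == "name"`, which is all the excerpt above contains.
    """
    msg = Parser().parsestr(text)
    core = []
    extras = {name: [] for name in msg.get_all("Provides-Extra", [])}
    for line in msg.get_all("Requires-Dist", []):
        req, _, marker = line.partition(";")
        marker = marker.strip()
        if marker.startswith("extra =="):
            extra = marker.split("==", 1)[1].strip().strip("\"'")
            extras.setdefault(extra, []).append(req.strip())
        else:
            core.append(req.strip())
    return core, extras


core, extras = split_requirements(PKG_INFO)
print(core)    # requirements installed unconditionally
print(extras)  # requirements added by each extra
```

Read this way, the header says that only `numpy`, `scipy` and `zarr<3` are installed by default, and that `h5py` arrives only through the `direct`, `all` or `test` extras.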
lstar_sc-0.1.0/README.md ADDED
@@ -0,0 +1,222 @@
1
+ # L★
2
+
3
+ **A general model for single-cell omics data — built from *axes* and *fields* — and the
4
+ lightweight glue that moves data losslessly between AnnData, Seurat, SingleCellExperiment, and
5
+ pagoda/conos, including their disk-backed forms (backed AnnData, Seurat v5/BPCells, SCE/HDF5Array) — so
6
+ even datasets too large for memory convert in bounded memory.**
7
+
8
+ L★ represents a dataset as **axes** (the entities you index by — cells, genes, samples, clusters) and
9
+ **fields** (typed data over them — counts, embeddings, graphs, labels, designs). Because everything is
10
+ just axes and fields, one small model spans the diversity of real single-cell work that a fixed
11
+ `cells × genes` container strains on — for example a multi-sample (even cross-species) integration kept
12
+ as a *collection* of heterogeneous samples rather than one concatenated matrix; a CITE-seq object with
13
+ a second, protein feature axis; or a case-control cohort carrying a statistical *design* over its
14
+ samples. The routine count-matrix-plus-a-clustering case stays just as simple, while the harder cases
15
+ use the same vocabulary instead of an opaque `uns`/`misc` blob (see [Why lstar?](#why-lstar)).
16
+
17
+ In the short term, the most immediately useful thing this buys you is **[moving data between the formats
18
+ people already use](SUPPORT.md)**. Each existing container — AnnData (Python), Seurat and SingleCellExperiment (R),
19
+ pagoda/conos — fixes a few named slots; routing a dataset through L★ converts one to another while
20
+ preserving the *meaning* of each piece and **reporting** anything a target can't hold instead of
21
+ dropping it silently.
22
+
23
+ lstar is available in **Python, R, and C++** (sharing one fast C++ core), reads and writes a portable
24
+ [Zarr](https://zarr.dev)-based format, and is built to scale. Everything heavy can be **streamed in
25
+ bounded memory** — convert a multi-gigabyte dataset, write a store, or compute per-gene statistics
26
+ without ever loading the whole matrix, so work that needs a big machine today runs on a laptop (see
27
+ [Large data: lazy reads and streaming](#large-data-lazy-reads-and-streaming)). You can also open a
28
+ million-cell dataset over the network and read just the parts you need.
29
+
30
+ > **Status:** early development, not yet released. Working today: read/write the same store from
31
+ > Python, C++, and R; profiles for AnnData, Seurat (legacy v2 → v5), SingleCellExperiment, and Conos; the
32
+ > collection model; lazy/streaming reads; a browser/WebAssembly data layer.
33
+
34
+ ---
35
+
36
+ ## Why lstar?
37
+
38
+ Three things are hard with today's fixed-schema containers, and L★ is designed around them:
39
+
40
+ 1. **Conversion is lossy and pairwise.** Every container hard-codes a few named slots; what fits the
41
+ slots converts, and the rest is lost. Routing every format through *one shared model with a shared
42
+ vocabulary* makes conversion lossless on the common core and **explicit** about the remainder.
43
+ 2. **The interesting results have no home.** A gene-regulatory network, a cell–cell communication
44
+ tensor, RNA-velocity graphs, a fitted model — none of these fit a `cells × genes` slot, so they end
45
+ up as opaque blobs in `uns`/`misc`. In L★ they are ordinary, typed, queryable *fields*.
46
+ 3. **A study is many samples, not one matrix.** Different donors, conditions, even species and gene
47
+ sets cannot be honestly concatenated into a single matrix. L★ keeps a multi-sample study as a
48
+ *collection* of heterogeneous parts joined by a graph.
49
+
50
+ If you only ever need to move data between AnnData, Seurat, and SCE, point 1 is reason enough to use
51
+ lstar. Points 2 and 3 are why the model is shaped the way it is.
52
+
53
+ ## Converting between formats (the common case)
54
+
55
+ One command — `lstar convert` detects each format from its path, routes through the L★ store (in-process
56
+ for Python formats, an `Rscript` bridge for Seurat/SCE), and reports what crossed:
57
+
58
+ ```bash
59
+ lstar convert pbmc.h5ad pbmc.rds # AnnData (Python) -> Seurat (R), bridged automatically
60
+ lstar convert atlas.h5ad atlas.lstar.zarr # -> a portable L* store (--to sce for SingleCellExperiment)
61
+ lstar convert pbmc.rds pbmc.h5ad --report # + a fidelity report (every field, and what was `dropped`)
62
+ ```
63
+
64
+ Three things make it more than a one-liner:
65
+
66
+ - a **fidelity report** (`--report` / `--report-json`) lists every axis and field with its role, state,
67
+ and `provenance`, and — crucially — **`dropped`**: what the target couldn't represent, made visible
68
+ rather than silently lost.
69
+ - a **native-acceptance check** (`--check`, on by default; `--strict` to gate the exit code) opens the
70
+ result in its *own* library and runs a canonical-ops smoke (scanpy / Seurat / scran), so you know the
71
+ native analysis tools will accept it — not just that the bytes round-tripped.
72
+ - a **package-free fallback** (`--backend auto|native|direct`): each conversion uses the format's native
73
+ package when it's installed, else lstar's own codec — so you don't *need* the domain packages for the
74
+ common cases. What works **without** the native packages:
75
+
76
+ | convert (no native package) | needs only |
77
+ |---|---|
78
+ | `.h5ad` ↔ store — read **and** write | `lstar` + `h5py` |
79
+ | Seurat `.rds` ↔ store — read **and** write | `lstar` + base R (no SeuratObject) |
80
+ | SCE `.rds` → store — **read** | `lstar` + base R (no SingleCellExperiment) |
81
+ | store → SCE `.rds` (write) · `.h5mu` ↔ store | **native-only** — needs `SingleCellExperiment` / `mudata` |
82
+
83
+ At a wall (an unknown on-disk version, a `BPCells`-backed matrix) it stops and names exactly what to
84
+ install. The heavy *analysis* packages (scanpy / full Seurat / scran) are **never** needed to convert —
85
+ only for the optional `--check`. Details: [docs/conversions.md](docs/conversions.md).
86
+
87
+ Under the hood it is just `write_Y(read_X(...))` with the on-disk L★ store as the bridge between the two
88
+ languages, which you can also drive directly:
89
+
90
+ ```bash
91
+ python3 -c 'import anndata as ad, lstar; from lstar.profiles.anndata import read_anndata
92
+ lstar.write(read_anndata(ad.read_h5ad("pbmc.h5ad")), "pbmc.lstar.zarr")' # AnnData -> L* store
93
+ Rscript -e 'library(lstar); saveRDS(write_seurat(lstar_read("pbmc.lstar.zarr")), "pbmc.rds")' # -> Seurat
94
+ ```
95
+
96
+ The shared-vocabulary core — raw counts, normalized/scaled expression, PCA (scores **and** gene
97
+ loadings), UMAP/t-SNE, clusterings, cell/gene metadata — survives. Whatever the target can't hold (e.g.
98
+ neighbor graphs through Seurat) is listed in the dataset's `dropped` manifest, so nothing vanishes
99
+ unannounced. A runnable, commented version is
100
+ [`examples/convert_h5ad_to_seurat.sh`](examples/convert_h5ad_to_seurat.sh).
101
+
102
+ See **[docs/conversions.md](docs/conversions.md)** for the full glue guide (every reader/writer, the
103
+ conversion matrix, what is preserved vs. recorded as dropped, version detection) and
104
+ **[docs/mapping.md](docs/mapping.md)** for the deterministic role→slot contract — what lands where in
105
+ each target, and the native-acceptance check that verifies the native tools won't choke.
106
+
107
+ ## Building a dataset directly
108
+
109
+ If you want to author or inspect L★ data, the model is just *axes* (the things you index by) and
110
+ *fields* (typed data over them):
111
+
112
+ ```python
113
+ import scipy.sparse as sp, lstar
114
+
115
+ ds = lstar.Dataset(kind="sample")
116
+ ds.add_axis("cells", [f"cell{i}" for i in range(100)])
117
+ ds.add_axis("genes", [f"g{i}" for i in range(50)])
118
+ # A field declares what it IS (a `measure` over cells × genes) — no fixed "X" slot.
119
+ ds.add_field("counts", sp.random(100, 50, density=0.1, format="csc"),
120
+ role="measure", span=["cells", "genes"], state="raw")
121
+
122
+ lstar.write(ds, "sample.lstar.zarr")
123
+ ds2 = lstar.read("sample.lstar.zarr") # also readable from R and C++
124
+ ```
125
+
126
+ A field's `role` (`measure`, `embedding`, `loading`, `relation`, `label`, …) says what kind of object
127
+ it is. A new kind of result is a new field with a role — never a change to the format. See
128
+ [docs/model.md](docs/model.md).
129
+
130
+ ## Two design choices worth knowing
131
+
132
+ **Collections, not one big matrix.** A multi-sample study is stored as a `samples` axis plus
133
+ *per-sample* `cells.{s}`/`genes.{s}` axes and measures (samples may differ in cells *and* genes), with a
134
+ *union* `cells` axis for the joint analysis (embedding, clusters, and the integration graph as a
135
+ `relation`). The R package ingests a **Conos** object (`write_conos`) and a split **Seurat v5** assay
136
+ this way — see [`examples/conos_collection_demo.R`](examples/conos_collection_demo.R).
137
+
138
+ **Versions are recognized, not assumed.** Formats change shape across releases, so the readers detect
139
+ the variant and adapt — even a legacy **v2** `seurat` object (the pre-`Assay` S4 class, read via its raw
140
+ slots) through v3/v4 `Assay` vs. v5 `Assay5` (with a fallback for SeuratObject < 5),
141
+ pagoda2's `getRawCounts()` accessor vs. the legacy `$counts` slot, AnnData's `.raw` slot. The detected
142
+ `<format>@<version>` is recorded, so a downstream reader knows what produced the data.
143
+
144
+ ## Large data: lazy reads and streaming
145
+
146
+ Single-cell stores get big — hundreds of thousands of cells, tens of thousands of genes. lstar is built
147
+ so you never hold a whole dataset in memory to work with it: the heavy operations **stream** the matrix
148
+ in blocks, so peak memory stays bounded and roughly *flat* as the data grows.
149
+
150
+ ![Streaming vs in-memory conversion: peak memory stays flat as the dataset grows, for a modest time cost](docs/img/streaming_scaling.png)
151
+
152
+ <sub>*`h5ad → L*` conversion of the Tabula Muris Senis droplet atlas (subsampled from 25k to 245k cells, up to 502M nonzeros): the in-memory path's peak RAM grows with the matrix (to ~4 GB) while streaming stays ~flat (~0.3 GB, ~13× less at full size), for a small, roughly constant time premium. Reproduce with [`examples/streaming_scaling.py`](examples/streaming_scaling.py).*</sub>
153
+
154
+ - **Convert and write in bounded memory.** `convert_anndata` (`h5ad → L*`) and `convert_to_h5ad`
155
+ (`L* → h5ad`) move data between formats with a backed read + block-by-block write, never materializing
156
+ the matrix; `lstar.write(..., stream=True)` does the same for any lazy/backed source. A multi-gigabyte
157
+ atlas converts in a few hundred MB.
158
+ - **Open without downloading.** `lstar.read(path, lazy=True)` reads only the small manifest; the heavy
159
+ arrays stay on disk (or on the server) until you touch them. Opening a 78-million-nonzero matrix this
160
+ way costs a few megabytes of memory instead of hundreds.
161
+ - **Compute without materializing.** A per-gene statistic (say, finding the most variable genes) is
162
+ computed by *streaming* the matrix in column blocks, so memory stays bounded and the matrix is never
163
+ expanded into a dense array.
164
+
165
+ ```python
166
+ ds = lstar.read("big.lstar.zarr", lazy=True) # opens in MBs, not GBs
167
+ # per-gene mean/variance over log-normalized counts, streamed in bounded memory:
168
+ mean, var, nnz = lstar.stream_col_stats(ds.field("counts").values,
169
+ lognorm=True, # normalize on the fly; the dense matrix is never built
170
+ n_threads=8) # use as many cores as you like
171
+ top_variable_genes = var.argsort()[::-1][:2000]
172
+ ```
173
+
174
+ When you write a store, chunking and compression make these reads cheap (a lazy read fetches only the
175
+ chunks it needs):
176
+
177
+ ```python
178
+ import numcodecs
179
+ lstar.write(ds, "big.lstar.zarr", chunk_elems=1_000_000, compressor=numcodecs.GZip(5))
180
+ ```
181
+
182
+ In practice this is fast and frugal: opening that 40,220 × 20,138 matrix lazily uses ~9 MB instead of
183
+ ~780 MB, per-gene statistics stream in bounded memory, and the heavy reductions run on a shared C++
184
+ core (used automatically when available, ~8× faster on 16 threads, identical results in Python, R, and
185
+ the browser). Measurements and the full picture are in [`misc/plan1.md`](misc/plan1.md) §12.
186
+
187
+ ## Languages and components
188
+
189
+ | | what it is |
190
+ |---|---|
191
+ | **Python** (`python/`) | the `lstar` package on zarr-python, with an optional compiled C++ accelerator |
192
+ | **R** (`R/`) | the `lstar` package; the format profiles (Seurat, SCE, Conos) live here |
193
+ | **C++** (`core/`) | `libstar`, the header-only core: the model, chunked+gzip Zarr IO, and the fast kernels |
194
+ | **Browser/Node** (`js/`) | a TypeScript reader (zarrita) + the kernels compiled to WebAssembly, for viewers |
195
+
196
+ ```
197
+ docs/ principles, the model & format specs, conversions, worked examples
198
+ core/ libstar — the C++ core
199
+ python/ R/ the Python and R packages
200
+ js/ the browser/WASM data layer
201
+ conformance/ the shared round-trip / cross-format / cross-language test suite
202
+ examples/ runnable, commented end-to-end demos
203
+ misc/ the design proposal (Lstar_proposal.md) + plans
204
+ ```
205
+
206
+ ## Documentation
207
+
208
+ - **[docs/principles.md](docs/principles.md)** — the idea and the reasoning. *Start here.*
209
+ - **[docs/conversions.md](docs/conversions.md)** — using lstar as glue between formats (incl. the `lstar convert` CLI).
210
+ - **[docs/mapping.md](docs/mapping.md)** — the deterministic role→slot conversion contract + native-acceptance.
211
+ - **[docs/model.md](docs/model.md)** — the model: axes, fields, roles, collections.
212
+ - **[docs/format.md](docs/format.md)** — the on-disk Zarr layout.
213
+ - **[docs/examples.md](docs/examples.md)** — worked, commented examples (Python, R, C++, browser).
214
+ - **[SUPPORT.md](SUPPORT.md)** — **format & language support matrix**: what converts/reads/writes today,
215
+ per format and per language, with real-vs-synthetic test coverage and the known gaps.
216
+
217
+ The full normative specification (the model, the Zarr schema, and the bidirectional profile rule
218
+ catalog for every format) is the proposal, [`misc/Lstar_proposal.md`](misc/Lstar_proposal.md).
219
+
220
+ ## License
221
+
222
+ MIT.
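Note on the README's "Compute without materializing" bullet: the idea of streaming a sparse matrix in column blocks can be illustrated without any lstar code. The sketch below is not the package's `stream_col_stats` implementation. It is a dependency-free illustration that assumes a CSC layout (`data`, `indptr`), visits only stored values, and accounts for implicit zeros through the row count. The function name `col_stats_streamed`, the `block_cols` parameter and the `transform` hook are hypothetical, and `transform` stands in only loosely for the README's `lognorm=True`, whose exact normalization is not shown in this diff.

```python
import math


def col_stats_streamed(data, indptr, n_rows, block_cols=1024, transform=None):
    """Per-column mean, population variance and nonzero count of a CSC matrix.

    Columns are processed one block at a time, so only the stored values of
    the current block are held. `transform` must map 0 to 0 (math.log1p does)
    so that the implicit zeros remain zero.
    """
    n_cols = len(indptr) - 1
    mean = [0.0] * n_cols
    var = [0.0] * n_cols
    nnz = [0] * n_cols
    for start in range(0, n_cols, block_cols):
        stop = min(start + block_cols, n_cols)
        lo, hi = indptr[start], indptr[stop]
        # With a chunked on-disk store, this slice is the only part read.
        block = data[lo:hi]
        if transform is not None:
            block = [transform(v) for v in block]
        for j in range(start, stop):
            vals = block[indptr[j] - lo:indptr[j + 1] - lo]
            m = sum(vals) / n_rows
            second_moment = sum(v * v for v in vals) / n_rows
            mean[j] = m
            var[j] = max(0.0, second_moment - m * m)
            nnz[j] = len(vals)
    return mean, var, nnz


# A 4 x 3 matrix in CSC form: column 0 holds {1, 3}, column 1 is empty,
# column 2 holds four 2s.
data = [1.0, 3.0, 2.0, 2.0, 2.0, 2.0]
indptr = [0, 2, 2, 6]
mean, var, nnz = col_stats_streamed(data, indptr, n_rows=4, block_cols=2)
print(mean, var, nnz)

# The block size changes peak memory, not the result.
assert col_stats_streamed(data, indptr, 4, block_cols=1) == (mean, var, nnz)

# log1p keeps zeros at zero, so it is a valid on-the-fly transform.
log_mean, _, _ = col_stats_streamed(data, indptr, 4, transform=math.log1p)
```

Peak memory here depends on the largest block rather than on the whole matrix, which is the property the README's scaling figure describes for the real streaming path.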