lstar-sc 0.1.0__tar.gz
- lstar_sc-0.1.0/LICENSE +21 -0
- lstar_sc-0.1.0/MANIFEST.in +6 -0
- lstar_sc-0.1.0/PKG-INFO +260 -0
- lstar_sc-0.1.0/README.md +222 -0
- lstar_sc-0.1.0/core/include/lstar/lstar.hpp +1121 -0
- lstar_sc-0.1.0/core/include/nlohmann/json.hpp +24765 -0
- lstar_sc-0.1.0/pyproject.toml +69 -0
- lstar_sc-0.1.0/python/src/lstar/__init__.py +25 -0
- lstar_sc-0.1.0/python/src/lstar/__main__.py +7 -0
- lstar_sc-0.1.0/python/src/lstar/_accel.cpp +162 -0
- lstar_sc-0.1.0/python/src/lstar/_engine.py +58 -0
- lstar_sc-0.1.0/python/src/lstar/_native_check.py +161 -0
- lstar_sc-0.1.0/python/src/lstar/cli.py +548 -0
- lstar_sc-0.1.0/python/src/lstar/collection.py +110 -0
- lstar_sc-0.1.0/python/src/lstar/de.py +174 -0
- lstar_sc-0.1.0/python/src/lstar/kernels.py +34 -0
- lstar_sc-0.1.0/python/src/lstar/lazy.py +271 -0
- lstar_sc-0.1.0/python/src/lstar/model.py +259 -0
- lstar_sc-0.1.0/python/src/lstar/passthrough.py +98 -0
- lstar_sc-0.1.0/python/src/lstar/profiles/__init__.py +1 -0
- lstar_sc-0.1.0/python/src/lstar/profiles/anndata.py +861 -0
- lstar_sc-0.1.0/python/src/lstar/profiles/anndata_direct.py +360 -0
- lstar_sc-0.1.0/python/src/lstar/profiles/mudata.py +224 -0
- lstar_sc-0.1.0/python/src/lstar/py.typed +0 -0
- lstar_sc-0.1.0/python/src/lstar/validate.py +125 -0
- lstar_sc-0.1.0/python/src/lstar/zarr_io.py +293 -0
- lstar_sc-0.1.0/python/src/lstar_sc.egg-info/PKG-INFO +260 -0
- lstar_sc-0.1.0/python/src/lstar_sc.egg-info/SOURCES.txt +58 -0
- lstar_sc-0.1.0/python/src/lstar_sc.egg-info/dependency_links.txt +1 -0
- lstar_sc-0.1.0/python/src/lstar_sc.egg-info/entry_points.txt +2 -0
- lstar_sc-0.1.0/python/src/lstar_sc.egg-info/requires.txt +23 -0
- lstar_sc-0.1.0/python/src/lstar_sc.egg-info/top_level.txt +1 -0
- lstar_sc-0.1.0/python/tests/corpus.py +212 -0
- lstar_sc-0.1.0/python/tests/synth.py +258 -0
- lstar_sc-0.1.0/python/tests/test_accel.py +52 -0
- lstar_sc-0.1.0/python/tests/test_anndata_profile.py +89 -0
- lstar_sc-0.1.0/python/tests/test_arity3.py +69 -0
- lstar_sc-0.1.0/python/tests/test_aux.py +111 -0
- lstar_sc-0.1.0/python/tests/test_categorical.py +70 -0
- lstar_sc-0.1.0/python/tests/test_collection_reduce.py +97 -0
- lstar_sc-0.1.0/python/tests/test_crossimpl.py +57 -0
- lstar_sc-0.1.0/python/tests/test_de.py +148 -0
- lstar_sc-0.1.0/python/tests/test_determinism.py +48 -0
- lstar_sc-0.1.0/python/tests/test_fuzz.py +79 -0
- lstar_sc-0.1.0/python/tests/test_induce.py +156 -0
- lstar_sc-0.1.0/python/tests/test_lazy.py +97 -0
- lstar_sc-0.1.0/python/tests/test_legacy_format.py +56 -0
- lstar_sc-0.1.0/python/tests/test_mudata.py +243 -0
- lstar_sc-0.1.0/python/tests/test_nullable.py +94 -0
- lstar_sc-0.1.0/python/tests/test_partial.py +73 -0
- lstar_sc-0.1.0/python/tests/test_real_atlas.py +74 -0
- lstar_sc-0.1.0/python/tests/test_roundtrip.py +74 -0
- lstar_sc-0.1.0/python/tests/test_spatial.py +62 -0
- lstar_sc-0.1.0/python/tests/test_stream_write.py +132 -0
- lstar_sc-0.1.0/python/tests/test_synth_faithful.py +112 -0
- lstar_sc-0.1.0/python/tests/test_tier1_promote.py +92 -0
- lstar_sc-0.1.0/python/tests/test_validate.py +59 -0
- lstar_sc-0.1.0/python/tests/test_versions.py +51 -0
- lstar_sc-0.1.0/setup.cfg +4 -0
- lstar_sc-0.1.0/setup.py +80 -0
lstar_sc-0.1.0/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2026 Peter Kharchenko
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
lstar_sc-0.1.0/PKG-INFO
ADDED
|
@@ -0,0 +1,260 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: lstar-sc
|
|
3
|
+
Version: 0.1.0
|
|
4
|
+
Summary: L* model and Zarr interchange for single-cell/spatial omics, with a fast C++ core
|
|
5
|
+
Author-email: Peter Kharchenko <pk.restricted@gmail.com>
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://github.com/kharchenkolab/lstar
|
|
8
|
+
Project-URL: Source, https://github.com/kharchenkolab/lstar
|
|
9
|
+
Keywords: single-cell,omics,zarr,anndata,interchange,bioinformatics
|
|
10
|
+
Classifier: Development Status :: 3 - Alpha
|
|
11
|
+
Classifier: Intended Audience :: Science/Research
|
|
12
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
13
|
+
Classifier: Programming Language :: Python :: 3
|
|
14
|
+
Classifier: Programming Language :: C++
|
|
15
|
+
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
|
|
16
|
+
Requires-Python: >=3.8
|
|
17
|
+
Description-Content-Type: text/markdown
|
|
18
|
+
License-File: LICENSE
|
|
19
|
+
Requires-Dist: numpy
|
|
20
|
+
Requires-Dist: scipy
|
|
21
|
+
Requires-Dist: zarr<3
|
|
22
|
+
Provides-Extra: anndata
|
|
23
|
+
Requires-Dist: anndata; extra == "anndata"
|
|
24
|
+
Provides-Extra: mudata
|
|
25
|
+
Requires-Dist: mudata; extra == "mudata"
|
|
26
|
+
Provides-Extra: direct
|
|
27
|
+
Requires-Dist: h5py; extra == "direct"
|
|
28
|
+
Provides-Extra: all
|
|
29
|
+
Requires-Dist: anndata; extra == "all"
|
|
30
|
+
Requires-Dist: mudata; extra == "all"
|
|
31
|
+
Requires-Dist: h5py; extra == "all"
|
|
32
|
+
Provides-Extra: test
|
|
33
|
+
Requires-Dist: anndata; extra == "test"
|
|
34
|
+
Requires-Dist: mudata; extra == "test"
|
|
35
|
+
Requires-Dist: h5py; extra == "test"
|
|
36
|
+
Requires-Dist: numcodecs; extra == "test"
|
|
37
|
+
Dynamic: license-file
|
|
38
|
+
|
|
39
|
+
# L★
|
|
40
|
+
|
|
41
|
+
**A general model for single-cell omics data — built from *axes* and *fields* — and the
|
|
42
|
+
lightweight glue that moves data losslessly between AnnData, Seurat, SingleCellExperiment, and
|
|
43
|
+
pagoda/conos, including their disk-backed forms (backed AnnData, Seurat v5/BPCells, SCE/HDF5Array) — so
|
|
44
|
+
even datasets too large for memory convert in bounded memory.**
|
|
45
|
+
|
|
46
|
+
L★ represents a dataset as **axes** (the entities you index by — cells, genes, samples, clusters) and
|
|
47
|
+
**fields** (typed data over them — counts, embeddings, graphs, labels, designs). Because everything is
|
|
48
|
+
just axes and fields, one small model spans the diversity of real single-cell work that a fixed
|
|
49
|
+
`cells × genes` container strains on — for example a multi-sample (even cross-species) integration kept
|
|
50
|
+
as a *collection* of heterogeneous samples rather than one concatenated matrix; a CITE-seq object with
|
|
51
|
+
a second, protein feature axis; or a case-control cohort carrying a statistical *design* over its
|
|
52
|
+
samples. The routine count-matrix-plus-a-clustering case stays just as simple, while the harder cases
|
|
53
|
+
use the same vocabulary instead of an opaque `uns`/`misc` blob (see [Why lstar?](#why-lstar)).
|
|
54
|
+
|
|
55
|
+
In the short term, the most immediately useful thing this buys you is **[moving data between the formats
|
|
56
|
+
people already use](SUPPORT.md)**. Each existing container — AnnData (Python), Seurat and SingleCellExperiment (R),
|
|
57
|
+
pagoda/conos — fixes a few named slots; routing a dataset through L★ converts one to another while
|
|
58
|
+
preserving the *meaning* of each piece and **reporting** anything a target can't hold instead of
|
|
59
|
+
dropping it silently.
|
|
60
|
+
|
|
61
|
+
lstar is available in **Python, R, and C++** (sharing one fast C++ core), reads and writes a portable
|
|
62
|
+
[Zarr](https://zarr.dev)-based format, and is built to scale. Everything heavy can be **streamed in
|
|
63
|
+
bounded memory** — convert a multi-gigabyte dataset, write a store, or compute per-gene statistics
|
|
64
|
+
without ever loading the whole matrix, so work that needs a big machine today runs on a laptop (see
|
|
65
|
+
[Large data: lazy reads and streaming](#large-data-lazy-reads-and-streaming)). You can also open a
|
|
66
|
+
million-cell dataset over the network and read just the parts you need.
|
|
67
|
+
|
|
68
|
+
> **Status:** early development, not yet released. Working today: read/write the same store from
|
|
69
|
+
> Python, C++, and R; profiles for AnnData, Seurat (legacy v2 → v5), SingleCellExperiment, and Conos; the
|
|
70
|
+
> collection model; lazy/streaming reads; a browser/WebAssembly data layer.
|
|
71
|
+
|
|
72
|
+
---
|
|
73
|
+
|
|
74
|
+
## Why lstar?
|
|
75
|
+
|
|
76
|
+
Three things are hard with today's fixed-schema containers, and L★ is designed around them:
|
|
77
|
+
|
|
78
|
+
1. **Conversion is lossy and pairwise.** Every container hard-codes a few named slots; what fits the
|
|
79
|
+
slots converts, and the rest is lost. Routing every format through *one shared model with a shared
|
|
80
|
+
vocabulary* makes conversion lossless on the common core and **explicit** about the remainder.
|
|
81
|
+
2. **The interesting results have no home.** A gene-regulatory network, a cell–cell communication
|
|
82
|
+
tensor, RNA-velocity graphs, a fitted model — none of these fit a `cells × genes` slot, so they end
|
|
83
|
+
up as opaque blobs in `uns`/`misc`. In L★ they are ordinary, typed, queryable *fields*.
|
|
84
|
+
3. **A study is many samples, not one matrix.** Different donors, conditions, even species and gene
|
|
85
|
+
sets cannot be honestly concatenated into a single matrix. L★ keeps a multi-sample study as a
|
|
86
|
+
*collection* of heterogeneous parts joined by a graph.
|
|
87
|
+
|
|
88
|
+
If you only ever need to move data between AnnData, Seurat, and SCE, point 1 is reason enough to use
|
|
89
|
+
lstar. Points 2 and 3 are why the model is shaped the way it is.
|
|
90
|
+
|
|
91
|
+
## Converting between formats (the common case)
|
|
92
|
+
|
|
93
|
+
One command — `lstar convert` detects each format from its path, routes through the L★ store (in-process
|
|
94
|
+
for Python formats, an `Rscript` bridge for Seurat/SCE), and reports what crossed:
|
|
95
|
+
|
|
96
|
+
```bash
|
|
97
|
+
lstar convert pbmc.h5ad pbmc.rds # AnnData (Python) -> Seurat (R), bridged automatically
|
|
98
|
+
lstar convert atlas.h5ad atlas.lstar.zarr # -> a portable L* store (--to sce for SingleCellExperiment)
|
|
99
|
+
lstar convert pbmc.rds pbmc.h5ad --report # + a fidelity report (every field, and what was `dropped`)
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
Three things make it more than a one-liner:
|
|
103
|
+
|
|
104
|
+
- a **fidelity report** (`--report` / `--report-json`) lists every axis and field with its role, state,
|
|
105
|
+
and `provenance`, and — crucially — **`dropped`**: what the target couldn't represent, made visible
|
|
106
|
+
rather than silently lost.
|
|
107
|
+
- a **native-acceptance check** (`--check`, on by default; `--strict` to gate the exit code) opens the
|
|
108
|
+
result in its *own* library and runs a canonical-ops smoke (scanpy / Seurat / scran), so you know the
|
|
109
|
+
native analysis tools will accept it — not just that the bytes round-tripped.
|
|
110
|
+
- a **package-free fallback** (`--backend auto|native|direct`): each conversion uses the format's native
|
|
111
|
+
package when it's installed, else lstar's own codec — so you don't *need* the domain packages for the
|
|
112
|
+
common cases. What works **without** the native packages:
|
|
113
|
+
|
|
114
|
+
| convert (no native package) | needs only |
|
|
115
|
+
|---|---|
|
|
116
|
+
| `.h5ad` ↔ store — read **and** write | `lstar` + `h5py` |
|
|
117
|
+
| Seurat `.rds` ↔ store — read **and** write | `lstar` + base R (no SeuratObject) |
|
|
118
|
+
| SCE `.rds` → store — **read** | `lstar` + base R (no SingleCellExperiment) |
|
|
119
|
+
| store → SCE `.rds` (write) · `.h5mu` ↔ store | **native-only** — needs `SingleCellExperiment` / `mudata` |
|
|
120
|
+
|
|
121
|
+
At a wall (an unknown on-disk version, a `BPCells`-backed matrix) it stops and names exactly what to
|
|
122
|
+
install. The heavy *analysis* packages (scanpy / full Seurat / scran) are **never** needed to convert —
|
|
123
|
+
only for the optional `--check`. Details: [docs/conversions.md](docs/conversions.md).
|
|
124
|
+
|
|
125
|
+
Under the hood it is just `write_Y(read_X(...))` with the on-disk L★ store as the bridge between the two
|
|
126
|
+
languages, which you can also drive directly:
|
|
127
|
+
|
|
128
|
+
```bash
|
|
129
|
+
python3 -c 'import anndata as ad, lstar; from lstar.profiles.anndata import read_anndata
|
|
130
|
+
lstar.write(read_anndata(ad.read_h5ad("pbmc.h5ad")), "pbmc.lstar.zarr")' # AnnData -> L* store
|
|
131
|
+
Rscript -e 'library(lstar); saveRDS(write_seurat(lstar_read("pbmc.lstar.zarr")), "pbmc.rds")' # -> Seurat
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
The shared-vocabulary core — raw counts, normalized/scaled expression, PCA (scores **and** gene
|
|
135
|
+
loadings), UMAP/t-SNE, clusterings, cell/gene metadata — survives. Whatever the target can't hold (e.g.
|
|
136
|
+
neighbor graphs through Seurat) is listed in the dataset's `dropped` manifest, so nothing vanishes
|
|
137
|
+
unannounced. A runnable, commented version is
|
|
138
|
+
[`examples/convert_h5ad_to_seurat.sh`](examples/convert_h5ad_to_seurat.sh).
|
|
139
|
+
|
|
140
|
+
See **[docs/conversions.md](docs/conversions.md)** for the full glue guide (every reader/writer, the
|
|
141
|
+
conversion matrix, what is preserved vs. recorded as dropped, version detection) and
|
|
142
|
+
**[docs/mapping.md](docs/mapping.md)** for the deterministic role→slot contract — what lands where in
|
|
143
|
+
each target, and the native-acceptance check that verifies the native tools won't choke.
|
|
144
|
+
|
|
145
|
+
## Building a dataset directly
|
|
146
|
+
|
|
147
|
+
If you want to author or inspect L★ data, the model is just *axes* (the things you index by) and
|
|
148
|
+
*fields* (typed data over them):
|
|
149
|
+
|
|
150
|
+
```python
|
|
151
|
+
import scipy.sparse as sp, lstar
|
|
152
|
+
|
|
153
|
+
ds = lstar.Dataset(kind="sample")
|
|
154
|
+
ds.add_axis("cells", [f"cell{i}" for i in range(100)])
|
|
155
|
+
ds.add_axis("genes", [f"g{i}" for i in range(50)])
|
|
156
|
+
# A field declares what it IS (a `measure` over cells × genes) — no fixed "X" slot.
|
|
157
|
+
ds.add_field("counts", sp.random(100, 50, density=0.1, format="csc"),
|
|
158
|
+
role="measure", span=["cells", "genes"], state="raw")
|
|
159
|
+
|
|
160
|
+
lstar.write(ds, "sample.lstar.zarr")
|
|
161
|
+
ds2 = lstar.read("sample.lstar.zarr") # also readable from R and C++
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
A field's `role` (`measure`, `embedding`, `loading`, `relation`, `label`, …) says what kind of object
|
|
165
|
+
it is. A new kind of result is a new field with a role — never a change to the format. See
|
|
166
|
+
[docs/model.md](docs/model.md).
|
|
167
|
+
|
|
168
|
+
## Two design choices worth knowing
|
|
169
|
+
|
|
170
|
+
**Collections, not one big matrix.** A multi-sample study is stored as a `samples` axis plus
|
|
171
|
+
*per-sample* `cells.{s}`/`genes.{s}` axes and measures (samples may differ in cells *and* genes), with a
|
|
172
|
+
*union* `cells` axis for the joint analysis (embedding, clusters, and the integration graph as a
|
|
173
|
+
`relation`). The R package ingests a **Conos** object (`write_conos`) and a split **Seurat v5** assay
|
|
174
|
+
this way — see [`examples/conos_collection_demo.R`](examples/conos_collection_demo.R).
|
|
175
|
+
|
|
176
|
+
**Versions are recognized, not assumed.** Formats change shape across releases, so the readers detect
|
|
177
|
+
the variant and adapt — even a legacy **v2** `seurat` object (the pre-`Assay` S4 class, read via its raw
|
|
178
|
+
slots) through v3/v4 `Assay` vs. v5 `Assay5` (with a fallback for SeuratObject < 5),
|
|
179
|
+
pagoda2's `getRawCounts()` accessor vs. the legacy `$counts` slot, AnnData's `.raw` slot. The detected
|
|
180
|
+
`<format>@<version>` is recorded, so a downstream reader knows what produced the data.
|
|
181
|
+
|
|
182
|
+
## Large data: lazy reads and streaming
|
|
183
|
+
|
|
184
|
+
Single-cell stores get big — hundreds of thousands of cells, tens of thousands of genes. lstar is built
|
|
185
|
+
so you never hold a whole dataset in memory to work with it: the heavy operations **stream** the matrix
|
|
186
|
+
in blocks, so peak memory stays bounded and roughly *flat* as the data grows.
|
|
187
|
+
|
|
188
|
+

|
|
189
|
+
|
|
190
|
+
<sub>*`h5ad → L*` conversion of the Tabula Muris Senis droplet atlas (subsampled from 25k to 245k cells, up to 502M nonzeros): the in-memory path's peak RAM grows with the matrix (to ~4 GB) while streaming stays ~flat (~0.3 GB, ~13× less at full size), for a small, roughly constant time premium. Reproduce with [`examples/streaming_scaling.py`](examples/streaming_scaling.py).*</sub>
|
|
191
|
+
|
|
192
|
+
- **Convert and write in bounded memory.** `convert_anndata` (`h5ad → L*`) and `convert_to_h5ad`
|
|
193
|
+
(`L* → h5ad`) move data between formats with a backed read + block-by-block write, never materializing
|
|
194
|
+
the matrix; `lstar.write(..., stream=True)` does the same for any lazy/backed source. A multi-gigabyte
|
|
195
|
+
atlas converts in a few hundred MB.
|
|
196
|
+
- **Open without downloading.** `lstar.read(path, lazy=True)` reads only the small manifest; the heavy
|
|
197
|
+
arrays stay on disk (or on the server) until you touch them. Opening a 78-million-nonzero matrix this
|
|
198
|
+
way costs a few megabytes of memory instead of hundreds.
|
|
199
|
+
- **Compute without materializing.** A per-gene statistic (say, finding the most variable genes) is
|
|
200
|
+
computed by *streaming* the matrix in column blocks, so memory stays bounded and the matrix is never
|
|
201
|
+
expanded into a dense array.
|
|
202
|
+
|
|
203
|
+
```python
|
|
204
|
+
ds = lstar.read("big.lstar.zarr", lazy=True) # opens in MBs, not GBs
|
|
205
|
+
# per-gene mean/variance over log-normalized counts, streamed in bounded memory:
|
|
206
|
+
mean, var, nnz = lstar.stream_col_stats(ds.field("counts").values,
|
|
207
|
+
lognorm=True, # normalize on the fly; the dense matrix is never built
|
|
208
|
+
n_threads=8) # use as many cores as you like
|
|
209
|
+
top_variable_genes = var.argsort()[::-1][:2000]
|
|
210
|
+
```
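
The idea of streaming a per-gene statistic in column blocks can be illustrated with a small, pure-Python sketch. This is not lstar's implementation: `stream_col_stats_sketch` and `iter_col_blocks` are hypothetical names, the matrix is assumed to be in CSC form (`indptr` plus `data`), the variance is assumed to be the population variance, and `log1p=True` applies a plain `log1p` rather than whatever normalization `lognorm=True` performs in the library.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def iter_col_blocks(indptr: Sequence[int], data: Sequence[float],
                    block_cols: int) -> Iterator[Tuple[int, List[Sequence[float]]]]:
    """Yield (first_column, stored values per column) one block at a time."""
    n_cols = len(indptr) - 1
    for start in range(0, n_cols, block_cols):
        stop = min(start + block_cols, n_cols)
        # With a lazy/backed `data`, only these slices would be read from disk.
        yield start, [data[indptr[j]:indptr[j + 1]] for j in range(start, stop)]


def stream_col_stats_sketch(indptr, data, n_rows, block_cols=2, log1p=False):
    """Per-column mean, population variance, and nonzero count for a CSC matrix."""
    n_cols = len(indptr) - 1
    mean = [0.0] * n_cols
    var = [0.0] * n_cols
    nnz = [0] * n_cols
    for start, cols in iter_col_blocks(indptr, data, block_cols):
        for offset, values in enumerate(cols):
            j = start + offset
            if log1p:
                values = [math.log1p(v) for v in values]
            # Implicit zeros add nothing to either sum, so the dense column is never built.
            s = sum(values)
            ss = sum(v * v for v in values)
            mean[j] = s / n_rows
            var[j] = ss / n_rows - mean[j] ** 2
            nnz[j] = len(values)
    return mean, var, nnz


# 4 cells x 3 genes: gene 0 = [1, 0, 3, 0], gene 1 = all zeros, gene 2 = [2, 2, 2, 2]
indptr = [0, 2, 2, 6]
data = [1.0, 3.0, 2.0, 2.0, 2.0, 2.0]
mean, var, nnz = stream_col_stats_sketch(indptr, data, n_rows=4)
```

Only one block of columns is held at a time, so peak memory depends on `block_cols` rather than on the number of genes. The `ss / n - mean**2` form is kept for brevity; a production kernel would likely use a numerically safer update.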
|
|
211
|
+
|
|
212
|
+
When you write a store, chunking and compression make these reads cheap (a lazy read fetches only the
|
|
213
|
+
chunks it needs):
|
|
214
|
+
|
|
215
|
+
```python
|
|
216
|
+
import numcodecs
|
|
217
|
+
lstar.write(ds, "big.lstar.zarr", chunk_elems=1_000_000, compressor=numcodecs.GZip(5))
|
|
218
|
+
```
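
Why chunking makes partial reads cheap can be seen from a little index arithmetic. The sketch below assumes, purely for illustration, that `chunk_elems` is the number of elements per chunk of a one-dimensional array; `chunks_for_range` is a hypothetical helper, not part of lstar or zarr.

```python
def chunks_for_range(start: int, stop: int, chunk_elems: int) -> range:
    """Indices of the chunks that overlap the half-open element range [start, stop)."""
    if chunk_elems <= 0:
        raise ValueError("chunk_elems must be positive")
    if stop <= start:
        return range(0)  # empty request touches no chunk
    first = start // chunk_elems
    last = (stop - 1) // chunk_elems
    return range(first, last + 1)


# Reading elements 2,500,000 .. 3,000,000 (inclusive) of a long array:
needed = list(chunks_for_range(2_500_000, 3_000_001, 1_000_000))
```

A reader that fetches only the chunks in `needed` transfers a small, fixed number of compressed blocks regardless of how long the full array is.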
|
|
219
|
+
|
|
220
|
+
In practice this is fast and frugal: opening that 40,220 × 20,138 matrix lazily uses ~9 MB instead of
|
|
221
|
+
~780 MB, per-gene statistics stream in bounded memory, and the heavy reductions run on a shared C++
|
|
222
|
+
core (used automatically when available, ~8× faster on 16 threads, identical results in Python, R, and
|
|
223
|
+
the browser). Measurements and the full picture are in [`misc/plan1.md`](misc/plan1.md) §12.
|
|
224
|
+
|
|
225
|
+
## Languages and components
|
|
226
|
+
|
|
227
|
+
| | what it is |
|
|
228
|
+
|---|---|
|
|
229
|
+
| **Python** (`python/`) | the `lstar` package on zarr-python, with an optional compiled C++ accelerator |
|
|
230
|
+
| **R** (`R/`) | the `lstar` package; the format profiles (Seurat, SCE, Conos) live here |
|
|
231
|
+
| **C++** (`core/`) | `libstar`, the header-only core: the model, chunked+gzip Zarr IO, and the fast kernels |
|
|
232
|
+
| **Browser/Node** (`js/`) | a TypeScript reader (zarrita) + the kernels compiled to WebAssembly, for viewers |
|
|
233
|
+
|
|
234
|
+
```
|
|
235
|
+
docs/ principles, the model & format specs, conversions, worked examples
|
|
236
|
+
core/ libstar — the C++ core
|
|
237
|
+
python/ R/ the Python and R packages
|
|
238
|
+
js/ the browser/WASM data layer
|
|
239
|
+
conformance/ the shared round-trip / cross-format / cross-language test suite
|
|
240
|
+
examples/ runnable, commented end-to-end demos
|
|
241
|
+
misc/ the design proposal (Lstar_proposal.md) + plans
|
|
242
|
+
```
|
|
243
|
+
|
|
244
|
+
## Documentation
|
|
245
|
+
|
|
246
|
+
- **[docs/principles.md](docs/principles.md)** — the idea and the reasoning. *Start here.*
|
|
247
|
+
- **[docs/conversions.md](docs/conversions.md)** — using lstar as glue between formats (incl. the `lstar convert` CLI).
|
|
248
|
+
- **[docs/mapping.md](docs/mapping.md)** — the deterministic role→slot conversion contract + native-acceptance.
|
|
249
|
+
- **[docs/model.md](docs/model.md)** — the model: axes, fields, roles, collections.
|
|
250
|
+
- **[docs/format.md](docs/format.md)** — the on-disk Zarr layout.
|
|
251
|
+
- **[docs/examples.md](docs/examples.md)** — worked, commented examples (Python, R, C++, browser).
|
|
252
|
+
- **[SUPPORT.md](SUPPORT.md)** — **format & language support matrix**: what converts/reads/writes today,
|
|
253
|
+
per format and per language, with real-vs-synthetic test coverage and the known gaps.
|
|
254
|
+
|
|
255
|
+
The full normative specification (the model, the Zarr schema, and the bidirectional profile rule
|
|
256
|
+
catalog for every format) is the proposal, [`misc/Lstar_proposal.md`](misc/Lstar_proposal.md).
|
|
257
|
+
|
|
258
|
+
## License
|
|
259
|
+
|
|
260
|
+
MIT.
|
lstar_sc-0.1.0/README.md
ADDED
|
@@ -0,0 +1,222 @@
|
|
|
1
|
+
# L★
|
|
2
|
+
|
|
3
|
+
**A general model for single-cell omics data — built from *axes* and *fields* — and the
|
|
4
|
+
lightweight glue that moves data losslessly between AnnData, Seurat, SingleCellExperiment, and
|
|
5
|
+
pagoda/conos, including their disk-backed forms (backed AnnData, Seurat v5/BPCells, SCE/HDF5Array) — so
|
|
6
|
+
even datasets too large for memory convert in bounded memory.**
|
|
7
|
+
|
|
8
|
+
L★ represents a dataset as **axes** (the entities you index by — cells, genes, samples, clusters) and
|
|
9
|
+
**fields** (typed data over them — counts, embeddings, graphs, labels, designs). Because everything is
|
|
10
|
+
just axes and fields, one small model spans the diversity of real single-cell work that a fixed
|
|
11
|
+
`cells × genes` container strains on — for example a multi-sample (even cross-species) integration kept
|
|
12
|
+
as a *collection* of heterogeneous samples rather than one concatenated matrix; a CITE-seq object with
|
|
13
|
+
a second, protein feature axis; or a case-control cohort carrying a statistical *design* over its
|
|
14
|
+
samples. The routine count-matrix-plus-a-clustering case stays just as simple, while the harder cases
|
|
15
|
+
use the same vocabulary instead of an opaque `uns`/`misc` blob (see [Why lstar?](#why-lstar)).
|
|
16
|
+
|
|
17
|
+
In the short term, the most immediately useful thing this buys you is **[moving data between the formats
|
|
18
|
+
people already use](SUPPORT.md)**. Each existing container — AnnData (Python), Seurat and SingleCellExperiment (R),
|
|
19
|
+
pagoda/conos — fixes a few named slots; routing a dataset through L★ converts one to another while
|
|
20
|
+
preserving the *meaning* of each piece and **reporting** anything a target can't hold instead of
|
|
21
|
+
dropping it silently.
|
|
22
|
+
|
|
23
|
+
lstar is available in **Python, R, and C++** (sharing one fast C++ core), reads and writes a portable
|
|
24
|
+
[Zarr](https://zarr.dev)-based format, and is built to scale. Everything heavy can be **streamed in
|
|
25
|
+
bounded memory** — convert a multi-gigabyte dataset, write a store, or compute per-gene statistics
|
|
26
|
+
without ever loading the whole matrix, so work that needs a big machine today runs on a laptop (see
|
|
27
|
+
[Large data: lazy reads and streaming](#large-data-lazy-reads-and-streaming)). You can also open a
|
|
28
|
+
million-cell dataset over the network and read just the parts you need.
|
|
29
|
+
|
|
30
|
+
> **Status:** early development, not yet released. Working today: read/write the same store from
|
|
31
|
+
> Python, C++, and R; profiles for AnnData, Seurat (legacy v2 → v5), SingleCellExperiment, and Conos; the
|
|
32
|
+
> collection model; lazy/streaming reads; a browser/WebAssembly data layer.
|
|
33
|
+
|
|
34
|
+
---
|
|
35
|
+
|
|
36
|
+
## Why lstar?
|
|
37
|
+
|
|
38
|
+
Three things are hard with today's fixed-schema containers, and L★ is designed around them:
|
|
39
|
+
|
|
40
|
+
1. **Conversion is lossy and pairwise.** Every container hard-codes a few named slots; what fits the
|
|
41
|
+
slots converts, and the rest is lost. Routing every format through *one shared model with a shared
|
|
42
|
+
vocabulary* makes conversion lossless on the common core and **explicit** about the remainder.
|
|
43
|
+
2. **The interesting results have no home.** A gene-regulatory network, a cell–cell communication
|
|
44
|
+
tensor, RNA-velocity graphs, a fitted model — none of these fit a `cells × genes` slot, so they end
|
|
45
|
+
up as opaque blobs in `uns`/`misc`. In L★ they are ordinary, typed, queryable *fields*.
|
|
46
|
+
3. **A study is many samples, not one matrix.** Different donors, conditions, even species and gene
|
|
47
|
+
sets cannot be honestly concatenated into a single matrix. L★ keeps a multi-sample study as a
|
|
48
|
+
*collection* of heterogeneous parts joined by a graph.
|
|
49
|
+
|
|
50
|
+
If you only ever need to move data between AnnData, Seurat, and SCE, point 1 is reason enough to use
|
|
51
|
+
lstar. Points 2 and 3 are why the model is shaped the way it is.
|
|
52
|
+
|
|
53
|
+
## Converting between formats (the common case)
|
|
54
|
+
|
|
55
|
+
One command — `lstar convert` detects each format from its path, routes through the L★ store (in-process
|
|
56
|
+
for Python formats, an `Rscript` bridge for Seurat/SCE), and reports what crossed:
|
|
57
|
+
|
|
58
|
+
```bash
|
|
59
|
+
lstar convert pbmc.h5ad pbmc.rds # AnnData (Python) -> Seurat (R), bridged automatically
|
|
60
|
+
lstar convert atlas.h5ad atlas.lstar.zarr # -> a portable L* store (--to sce for SingleCellExperiment)
|
|
61
|
+
lstar convert pbmc.rds pbmc.h5ad --report # + a fidelity report (every field, and what was `dropped`)
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
Three things make it more than a one-liner:
|
|
65
|
+
|
|
66
|
+
- a **fidelity report** (`--report` / `--report-json`) lists every axis and field with its role, state,
|
|
67
|
+
and `provenance`, and — crucially — **`dropped`**: what the target couldn't represent, made visible
|
|
68
|
+
rather than silently lost.
|
|
69
|
+
- a **native-acceptance check** (`--check`, on by default; `--strict` to gate the exit code) opens the
|
|
70
|
+
result in its *own* library and runs a canonical-ops smoke (scanpy / Seurat / scran), so you know the
|
|
71
|
+
native analysis tools will accept it — not just that the bytes round-tripped.
|
|
72
|
+
- a **package-free fallback** (`--backend auto|native|direct`): each conversion uses the format's native
|
|
73
|
+
package when it's installed, else lstar's own codec — so you don't *need* the domain packages for the
|
|
74
|
+
common cases. What works **without** the native packages:
|
|
75
|
+
|
|
76
|
+
| convert (no native package) | needs only |
|
|
77
|
+
|---|---|
|
|
78
|
+
| `.h5ad` ↔ store — read **and** write | `lstar` + `h5py` |
|
|
79
|
+
| Seurat `.rds` ↔ store — read **and** write | `lstar` + base R (no SeuratObject) |
|
|
80
|
+
| SCE `.rds` → store — **read** | `lstar` + base R (no SingleCellExperiment) |
|
|
81
|
+
| store → SCE `.rds` (write) · `.h5mu` ↔ store | **native-only** — needs `SingleCellExperiment` / `mudata` |
|
|
82
|
+
|
|
83
|
+
At a wall (an unknown on-disk version, a `BPCells`-backed matrix) it stops and names exactly what to
|
|
84
|
+
install. The heavy *analysis* packages (scanpy / full Seurat / scran) are **never** needed to convert —
|
|
85
|
+
only for the optional `--check`. Details: [docs/conversions.md](docs/conversions.md).
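
The `auto|native|direct` choice described above can be pictured as a small decision function. This is a sketch under assumptions, not the logic in `cli.py`: `pick_backend` is a hypothetical name, and it only checks whether a top-level module is importable using the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def pick_backend(native_module: str, requested: str = "auto") -> str:
    """Return "native" or "direct" for one conversion step.

    native_module: top-level import name of the format's own package.
    requested:     one of "auto", "native", "direct".
    """
    if requested not in ("auto", "native", "direct"):
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "direct":
        return "direct"
    if importlib.util.find_spec(native_module) is not None:
        return "native"
    if requested == "native":
        # Stop and name what to install instead of falling back silently.
        raise RuntimeError(f"backend 'native' requested; install {native_module!r}")
    return "direct"
```

With `requested="auto"` the native package wins when present and the built-in codec is used otherwise, which matches the behaviour the table describes for the package-free cases.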
|
|
86
|
+
|
|
87
|
+
Under the hood it is just `write_Y(read_X(...))` with the on-disk L★ store as the bridge between the two
|
|
88
|
+
languages, which you can also drive directly:
|
|
89
|
+
|
|
90
|
+
```bash
|
|
91
|
+
python3 -c 'import anndata as ad, lstar; from lstar.profiles.anndata import read_anndata
|
|
92
|
+
lstar.write(read_anndata(ad.read_h5ad("pbmc.h5ad")), "pbmc.lstar.zarr")' # AnnData -> L* store
|
|
93
|
+
Rscript -e 'library(lstar); saveRDS(write_seurat(lstar_read("pbmc.lstar.zarr")), "pbmc.rds")' # -> Seurat
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
The shared-vocabulary core — raw counts, normalized/scaled expression, PCA (scores **and** gene
|
|
97
|
+
loadings), UMAP/t-SNE, clusterings, cell/gene metadata — survives. Whatever the target can't hold (e.g.
|
|
98
|
+
neighbor graphs through Seurat) is listed in the dataset's `dropped` manifest, so nothing vanishes
|
|
99
|
+
unannounced. A runnable, commented version is
|
|
100
|
+
[`examples/convert_h5ad_to_seurat.sh`](examples/convert_h5ad_to_seurat.sh).
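
As a rough illustration of how a `dropped` manifest can arise, the sketch below routes fields to slots by role and records everything that has no slot. The role-to-slot table, the manifest keys, and the names `TARGET_SLOTS` and `route_fields` are all invented for this example; the real contract is the one documented in docs/mapping.md.

```python
from typing import Dict, List, Tuple

# Hypothetical role -> slot table for an imaginary target container.
TARGET_SLOTS: Dict[str, str] = {
    "measure": "assay",
    "embedding": "reduction",
    "label": "metadata",
}


def route_fields(fields: Dict[str, str]) -> Tuple[Dict[str, str], List[dict]]:
    """Split fields (name -> role) into placed slots and a `dropped` list."""
    placed: Dict[str, str] = {}
    dropped: List[dict] = []
    for name, role in fields.items():
        slot = TARGET_SLOTS.get(role)
        if slot is None:
            # Nothing is discarded silently: the loss is written down.
            dropped.append({"field": name, "role": role, "reason": "no slot in target"})
        else:
            placed[name] = slot
    return placed, dropped


placed, dropped = route_fields(
    {"counts": "measure", "umap": "embedding", "knn": "relation"}
)
```

Here `knn` ends up in `dropped` because the imaginary target has no slot for a `relation`, while `counts` and `umap` are placed.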
|
|
101
|
+
|
|
102
|
+
See **[docs/conversions.md](docs/conversions.md)** for the full glue guide (every reader/writer, the
|
|
103
|
+
conversion matrix, what is preserved vs. recorded as dropped, version detection) and
|
|
104
|
+
**[docs/mapping.md](docs/mapping.md)** for the deterministic role→slot contract — what lands where in
|
|
105
|
+
each target, and the native-acceptance check that verifies the native tools won't choke.
|
|
106
|
+
|
|
107
|
+
## Building a dataset directly
|
|
108
|
+
|
|
109
|
+
If you want to author or inspect L★ data, the model is just *axes* (the things you index by) and
|
|
110
|
+
*fields* (typed data over them):
|
|
111
|
+
|
|
112
|
+
```python
|
|
113
|
+
import scipy.sparse as sp, lstar
|
|
114
|
+
|
|
115
|
+
ds = lstar.Dataset(kind="sample")
|
|
116
|
+
ds.add_axis("cells", [f"cell{i}" for i in range(100)])
|
|
117
|
+
ds.add_axis("genes", [f"g{i}" for i in range(50)])
|
|
118
|
+
# A field declares what it IS (a `measure` over cells × genes) — no fixed "X" slot.
|
|
119
|
+
ds.add_field("counts", sp.random(100, 50, density=0.1, format="csc"),
|
|
120
|
+
role="measure", span=["cells", "genes"], state="raw")
|
|
121
|
+
|
|
122
|
+
lstar.write(ds, "sample.lstar.zarr")
|
|
123
|
+
ds2 = lstar.read("sample.lstar.zarr") # also readable from R and C++
|
|
124
|
+
```
|
|
125
|
+
|
|
126
|
+
A field's `role` (`measure`, `embedding`, `loading`, `relation`, `label`, …) says what kind of object
|
|
127
|
+
it is. A new kind of result is a new field with a role — never a change to the format. See
|
|
128
|
+
[docs/model.md](docs/model.md).
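
To make the axes-and-fields idea concrete without depending on the package, here is a toy model in plain Python. It is only a sketch: `SketchDataset` and `SketchField` are hypothetical stand-ins, not the classes in `model.py`, and the single rule shown (each dimension of a field must match the length of the axis it spans) is an assumption about what such a model would naturally check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SketchField:
    role: str               # what kind of object this is, e.g. "measure"
    span: List[str]         # names of the axes the data is laid out over
    shape: Tuple[int, ...]  # extent of the data along each spanned axis


@dataclass
class SketchDataset:
    axes: Dict[str, List[str]] = field(default_factory=dict)
    fields: Dict[str, SketchField] = field(default_factory=dict)

    def add_axis(self, name: str, labels: List[str]) -> None:
        self.axes[name] = list(labels)

    def add_field(self, name: str, shape: Tuple[int, ...], role: str,
                  span: List[str]) -> None:
        if len(shape) != len(span):
            raise ValueError("span must name one axis per dimension")
        for dim, axis in zip(shape, span):
            if axis not in self.axes:
                raise KeyError(f"unknown axis {axis!r}")
            if dim != len(self.axes[axis]):
                raise ValueError(f"dimension {dim} does not match axis {axis!r}")
        self.fields[name] = SketchField(role, list(span), tuple(shape))


ds = SketchDataset()
ds.add_axis("cells", ["c0", "c1", "c2"])
ds.add_axis("genes", ["g0", "g1"])
ds.add_field("counts", shape=(3, 2), role="measure", span=["cells", "genes"])
```

A new kind of result then needs no new slot: it is another `add_field` call with a different `role` and `span`.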
|

## Two design choices worth knowing

**Collections, not one big matrix.** A multi-sample study is stored as a `samples` axis plus
*per-sample* `cells.{s}`/`genes.{s}` axes and measures (samples may differ in cells *and* genes), with a
*union* `cells` axis for the joint analysis (embedding, clusters, and the integration graph as a
`relation`). The R package ingests a **Conos** object (`write_conos`) and a split **Seurat v5** assay
this way; see [`examples/conos_collection_demo.R`](examples/conos_collection_demo.R).
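
The collection layout can be pictured with plain dictionaries. This is only an illustrative sketch: the sample names and ids are made up, and the way union cell ids are formed here (prefixing each cell with its sample name) is an assumption, not something this README specifies.

```python
# Per-sample cell and gene ids (hypothetical toy data; samples differ in both).
samples = {
    "s1": {"cells": ["c0", "c1", "c2"], "genes": ["g0", "g1", "g2"]},
    "s2": {"cells": ["c0", "c1"], "genes": ["g1", "g2", "g3", "g4"]},
}

axes = {"samples": sorted(samples)}
for s in axes["samples"]:
    # One pair of axes per sample, following the cells.{s}/genes.{s} naming.
    axes[f"cells.{s}"] = samples[s]["cells"]
    axes[f"genes.{s}"] = samples[s]["genes"]

# Union axis that joint results (embedding, clusters, graph) are indexed by.
# Assumption: ids are made unique by prefixing the sample name.
axes["cells"] = [f"{s}:{c}" for s in axes["samples"] for c in samples[s]["cells"]]

for name, ids in axes.items():
    print(f"{name:10s} {len(ids)} ids")
```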

**Versions are recognized, not assumed.** Formats change shape across releases, so the readers detect
the variant and adapt. For Seurat this ranges from a legacy **v2** `seurat` object (the pre-`Assay` S4
class, read via its raw slots) through v3/v4 `Assay` to v5 `Assay5` (with a fallback for
SeuratObject < 5); for pagoda2, the `getRawCounts()` accessor vs. the legacy `$counts` slot; for
AnnData, the `.raw` slot. The detected `<format>@<version>` is recorded, so a downstream reader knows
what produced the data.
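
A downstream reader could use a recorded tag roughly as follows. This is a sketch under assumptions: the helper name `parse_source_tag` and the example tag strings are illustrative, and this README does not say where in the store the tag is kept or exactly how versions are spelled.

```python
def parse_source_tag(tag: str) -> tuple[str, str]:
    """Split a '<format>@<version>' tag into (format, version)."""
    fmt, sep, version = tag.rpartition("@")
    if not sep or not fmt or not version:
        raise ValueError(f"not a <format>@<version> tag: {tag!r}")
    return fmt, version


for tag in ("seurat@5", "anndata@0.10"):  # made-up example values
    fmt, version = parse_source_tag(tag)
    print(f"{fmt}: version {version}")
```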

## Large data: lazy reads and streaming

Single-cell stores get big (hundreds of thousands of cells, tens of thousands of genes). lstar is built
so you never hold a whole dataset in memory to work with it: the heavy operations **stream** the matrix
in blocks, so peak memory stays bounded and roughly *flat* as the data grows.

<sub>*`h5ad → L*` conversion of the Tabula Muris Senis droplet atlas (subsampled at sizes from 25k to 245k cells, up to 502M nonzeros): the in-memory path's peak RAM grows with the matrix (to ~4 GB) while streaming stays ~flat (~0.3 GB, ~13× less at full size), for a small, roughly constant time premium. Reproduce with [`examples/streaming_scaling.py`](examples/streaming_scaling.py).*</sub>

- **Convert and write in bounded memory.** `convert_anndata` (`h5ad → L*`) and `convert_to_h5ad`
  (`L* → h5ad`) move data between formats with a backed read + block-by-block write, never materializing
  the matrix; `lstar.write(..., stream=True)` does the same for any lazy/backed source. A multi-gigabyte
  atlas converts in a few hundred MB.
- **Open without downloading.** `lstar.read(path, lazy=True)` reads only the small manifest; the heavy
  arrays stay on disk (or on the server) until you touch them. Opening a 78-million-nonzero matrix this
  way costs a few megabytes of memory instead of hundreds.
- **Compute without materializing.** A per-gene statistic (say, finding the most variable genes) is
  computed by *streaming* the matrix in column blocks, so memory stays bounded and the matrix is never
  expanded into a dense array.

```python
ds = lstar.read("big.lstar.zarr", lazy=True)   # opens in MBs, not GBs
# per-gene mean/variance over log-normalized counts, streamed in bounded memory:
mean, var, nnz = lstar.stream_col_stats(ds.field("counts").values,
                                        lognorm=True,   # normalize on the fly; the dense matrix is never built
                                        n_threads=8)    # use as many cores as you like
top_variable_genes = var.argsort()[::-1][:2000]
```
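
The idea behind a streamed per-gene statistic can be sketched in plain Python. This is not lstar's implementation: the function name `stream_col_stats_sketch`, the CSC arrays passed as lists, and the `block_cols` parameter are assumptions for illustration. The log-normalization step is left out, and the variance here is the population variance, which may differ from what lstar reports.

```python
def stream_col_stats_sketch(data, indptr, n_rows, block_cols=2):
    """Per-column mean, variance and nonzero count of a CSC matrix.

    Only one block of columns is looked at per step, and implicit zeros
    are accounted for without ever building a dense array.
    """
    n_cols = len(indptr) - 1
    mean, var, nnz = [], [], []
    for start in range(0, n_cols, block_cols):
        stop = min(start + block_cols, n_cols)
        # In a real store this slice would be the only data read for the block.
        lo, hi = indptr[start], indptr[stop]
        block = data[lo:hi]
        for j in range(start, stop):
            col = block[indptr[j] - lo:indptr[j + 1] - lo]
            s = sum(col)
            ss = sum(v * v for v in col)
            m = s / n_rows                   # zeros count toward the mean
            mean.append(m)
            var.append(ss / n_rows - m * m)  # population variance, zeros included
            nnz.append(len(col))
    return mean, var, nnz


# 4 x 3 matrix with columns [1,0,2,0], [0,0,0,0], [3,3,3,3] in CSC form.
data = [1.0, 2.0, 3.0, 3.0, 3.0, 3.0]
indptr = [0, 2, 2, 6]
mean, var, nnz = stream_col_stats_sketch(data, indptr, n_rows=4)
print(mean, var, nnz)
```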

When you write a store, chunking and compression make these reads cheap (a lazy read fetches only the
chunks it needs):

```python
import numcodecs
lstar.write(ds, "big.lstar.zarr", chunk_elems=1_000_000, compressor=numcodecs.GZip(5))
```
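
Why chunking keeps a lazy read cheap can be seen with a little arithmetic. The sketch below assumes a sparse matrix whose value array is stored as a 1-D array split into chunks of `chunk_elems` elements; that layout and the helper name `chunks_for_columns` are assumptions for illustration, not a description of lstar's on-disk format (see [docs/format.md](docs/format.md) for that).

```python
def chunks_for_columns(indptr, col_start, col_stop, chunk_elems):
    """Indices of the value-array chunks holding columns [col_start, col_stop)."""
    lo, hi = indptr[col_start], indptr[col_stop]
    if lo == hi:
        return range(0)  # the requested columns store no values
    first = lo // chunk_elems
    last = (hi - 1) // chunk_elems
    return range(first, last + 1)


# Toy CSC column pointer: 7 columns, 20 stored values, so 5 chunks of 4 values.
indptr = [0, 3, 3, 7, 9, 10, 16, 20]
needed = chunks_for_columns(indptr, 5, 6, chunk_elems=4)
print(list(needed))  # chunks touched when reading column 5 only
```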

In practice this is fast and frugal: opening that 40,220 × 20,138 matrix lazily uses ~9 MB instead of
~780 MB, per-gene statistics stream in bounded memory, and the heavy reductions run on a shared C++
core (used automatically when available, ~8× faster on 16 threads, identical results in Python, R, and
the browser). Measurements and the full picture are in [`misc/plan1.md`](misc/plan1.md) §12.

## Languages and components

| | what it is |
|---|---|
| **Python** (`python/`) | the `lstar` package on zarr-python, with an optional compiled C++ accelerator |
| **R** (`R/`) | the `lstar` package; the format profiles (Seurat, SCE, Conos) live here |
| **C++** (`core/`) | `libstar`, the header-only core: the model, chunked+gzip Zarr IO, and the fast kernels |
| **Browser/Node** (`js/`) | a TypeScript reader (zarrita) + the kernels compiled to WebAssembly, for viewers |

```
docs/          principles, the model & format specs, conversions, worked examples
core/          libstar, the C++ core
python/  R/    the Python and R packages
js/            the browser/WASM data layer
conformance/   the shared round-trip / cross-format / cross-language test suite
examples/      runnable, commented end-to-end demos
misc/          the design proposal (Lstar_proposal.md) + plans
```

## Documentation

- **[docs/principles.md](docs/principles.md)**: the idea and the reasoning. *Start here.*
- **[docs/conversions.md](docs/conversions.md)**: using lstar as glue between formats (incl. the `lstar convert` CLI).
- **[docs/mapping.md](docs/mapping.md)**: the deterministic role→slot conversion contract + native-acceptance.
- **[docs/model.md](docs/model.md)**: the model (axes, fields, roles, collections).
- **[docs/format.md](docs/format.md)**: the on-disk Zarr layout.
- **[docs/examples.md](docs/examples.md)**: worked, commented examples (Python, R, C++, browser).
- **[SUPPORT.md](SUPPORT.md)**: the **format & language support matrix**, covering what converts/reads/writes today,
  per format and per language, with real-vs-synthetic test coverage and the known gaps.

The full normative specification (the model, the Zarr schema, and the bidirectional profile rule
catalog for every format) is the proposal, [`misc/Lstar_proposal.md`](misc/Lstar_proposal.md).

## License

MIT.