scdataloader 1.0.6__py3-none-any.whl → 1.2.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,299 @@
+ Metadata-Version: 2.3
+ Name: scdataloader
+ Version: 1.2.1
+ Summary: a dataloader for single cell data in lamindb
+ Project-URL: repository, https://github.com/jkobject/scDataLoader
+ Author-email: jkobject <jkobject@gmail.com>
+ License-Expression: MIT
+ License-File: LICENSE
+ Keywords: dataloader,lamindb,pytorch,scPRINT,scRNAseq
+ Requires-Python: <3.11,>=3.10
+ Requires-Dist: anndata>=0.9.0
+ Requires-Dist: biomart>=0.9.0
+ Requires-Dist: cellxgene-census>=0.1.0
+ Requires-Dist: django>=4.0.0
+ Requires-Dist: ipykernel>=6.20.0
+ Requires-Dist: lamindb[bionty]==0.76.12
+ Requires-Dist: leidenalg>=0.8.0
+ Requires-Dist: lightning>=2.0.0
+ Requires-Dist: matplotlib>=3.5.0
+ Requires-Dist: numpy>=1.26.0
+ Requires-Dist: pandas>=2.0.0
+ Requires-Dist: scikit-misc>=0.5.0
+ Requires-Dist: seaborn>=0.11.0
+ Requires-Dist: torch==2.2.0
+ Requires-Dist: torchdata>=0.5.0
+ Provides-Extra: dev
+ Requires-Dist: coverage>=7.3.2; extra == 'dev'
+ Requires-Dist: gitchangelog>=3.0.4; extra == 'dev'
+ Requires-Dist: mkdocs-git-authors-plugin>=0.4.0; extra == 'dev'
+ Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.0.0; extra == 'dev'
+ Requires-Dist: mkdocs-jupyter>=0.2.0; extra == 'dev'
+ Requires-Dist: mkdocs>=1.5.3; extra == 'dev'
+ Requires-Dist: mkdocstrings-python>=0.10.0; extra == 'dev'
+ Requires-Dist: mkdocstrings>=0.22.0; extra == 'dev'
+ Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
+ Requires-Dist: pytest>=7.4.3; extra == 'dev'
+ Requires-Dist: ruff>=0.6.4; extra == 'dev'
+ Description-Content-Type: text/markdown
+
+ # scdataloader
+
+ [![codecov](https://codecov.io/gh/jkobject/scDataLoader/branch/main/graph/badge.svg?token=scDataLoader_token_here)](https://codecov.io/gh/jkobject/scDataLoader)
+ [![CI](https://github.com/jkobject/scDataLoader/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scDataLoader/actions/workflows/main.yml)
+ [![PyPI version](https://badge.fury.io/py/scDataLoader.svg)](https://badge.fury.io/py/scDataLoader)
+ [![Downloads](https://pepy.tech/badge/scDataLoader)](https://pepy.tech/project/scDataLoader)
+ [![Downloads](https://pepy.tech/badge/scDataLoader/month)](https://pepy.tech/project/scDataLoader)
+ [![Downloads](https://pepy.tech/badge/scDataLoader/week)](https://pepy.tech/project/scDataLoader)
+ [![GitHub issues](https://img.shields.io/github/issues/jkobject/scDataLoader)](https://img.shields.io/github/issues/jkobject/scDataLoader)
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
+ [![DOI](https://img.shields.io/badge/DOI-10.1101%2F2024.07.29.605556-blue)](https://doi.org/10.1101/2024.07.29.605556)
+
+ This single-cell PyTorch dataloader / lightning datamodule is designed to be used with:
+
+ - [lamindb](https://lamin.ai/)
+
+ and:
+
+ - [scanpy](https://scanpy.readthedocs.io/en/stable/)
+ - [anndata](https://anndata.readthedocs.io/en/latest/)
+
+ It allows you to:
+
+ 1. load thousands of datasets containing millions of cells in a few seconds.
+ 2. preprocess the data per dataset and download it locally (normalization, filtering, etc.)
+ 3. create a more complex single-cell dataset
+ 4. extend it to your needs
+
+ It is built on top of `lamindb` and the `.mapped()` function by Sergey: https://github.com/Koncopd
+
+ The package was designed together with the [scPRINT paper](https://doi.org/10.1101/2024.07.29.605556) and [model](https://github.com/cantinilab/scPRINT).
+
+ ## More
+
+ I needed to create this data loader for my PhD project. I use it to load and preprocess thousands of datasets containing millions of cells in a few seconds. I believed that people applying AI to single-cell RNA sequencing and other sequencing datasets would want such a tool, which did not exist at the time.
+
+ ![scdataloader.drawio.png](docs/scdataloader.drawio.png)
+
+ ## Install it from PyPI
+
+ ```bash
+ pip install scdataloader
+ # or
+ pip install "scDataLoader[dev]"  # for dev dependencies (quoted so the shell does not expand the brackets)
+
+ lamin init --storage ./testdb --name test --schema bionty
+ ```
+
+ If you are starting out with lamin and had to run `lamin init`, you will also need to populate your ontologies. This is because scPRINT uses ontologies to define its cell types, diseases, sexes, ethnicities, etc.
+
+ You can do it manually or with our function:
+
+ ```python
+ from scdataloader.utils import populate_my_ontology
+
+ populate_my_ontology()  # populate everything (recommended; can take 2-10 min)
+
+ populate_my_ontology(  # the minimum for the tool
+     organisms=["NCBITaxon:10090", "NCBITaxon:9606"],
+     sex=["PATO:0000384", "PATO:0000383"],
+     celltypes=None,
+     ethnicities=None,
+     assays=None,
+     tissues=None,
+     diseases=None,
+     dev_stages=None,
+ )
+ ```
+
+ ### Dev install
+
+ If you want to use the latest version of scDataLoader and work on the code yourself, use `git clone` and `pip install -e` instead of `pip install`.
+
+ ```bash
+ git clone https://github.com/jkobject/scDataLoader.git
+ pip install -e "scDataLoader[dev]"
+ ```
+
+ ## Usage
+
+ ### DataModule usage
+
+ ```python
+ # initialize a local lamin database
+ #! lamin init --storage ./cellxgene --name cellxgene --schema bionty
+ import lamindb as ln
+
+ from scdataloader import utils, Preprocessor, DataModule
+
+ # preprocess datasets (adata is an AnnData object loaded beforehand)
+ preprocessor = Preprocessor(
+     do_postp=False,
+     force_preprocess=True,
+ )
+ adata = preprocessor(adata)
+
+ art = ln.Artifact(adata, description="test")
+ art.save()
+ ln.Collection(art, name="test", description="test").save()
+
+ datamodule = DataModule(
+     collection_name="test",
+     organisms=["NCBITaxon:9606"],  # organism that we will work on
+     how="most expr",  # for the collator (only the most expressed genes will be selected)
+     max_len=1000,  # only the 1000 most expressed
+     batch_size=64,
+     num_workers=1,
+     validation_split=0.1,
+ )
+ ```
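The `how="most expr"` / `max_len` pair means the collator keeps only each cell's most expressed genes. A toy standalone sketch of that selection (hypothetical helper, not the package's implementation):

```python
def most_expressed(expression, max_len):
    """Return the indices of the `max_len` most expressed genes, highest first."""
    # sort gene indices by their expression value, descending, then truncate
    order = sorted(range(len(expression)), key=lambda i: expression[i], reverse=True)
    return order[:max_len]

counts = [0.0, 5.0, 2.0, 9.0, 1.0]
print(most_expressed(counts, 3))  # -> [3, 1, 2]
```

Each cell therefore contributes a (possibly different) subset of gene indices, which is why the batch also carries a `genes` tensor alongside the expression values.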
+
+ ### lightning-free usage (Dataset + Collator + DataLoader)
+
+ ```python
+ # initialize a local lamin database
+ #! lamin init --storage ./cellxgene --name cellxgene --schema bionty
+ from tqdm import tqdm
+
+ from scdataloader import utils, Preprocessor, SimpleAnnDataset, Collator, DataLoader
+
+ # preprocess dataset (adata is an AnnData object loaded beforehand)
+ preprocessor = Preprocessor(
+     do_postp=False,
+     force_preprocess=True,
+ )
+ adata = preprocessor(adata)
+
+ # create dataset
+ adataset = SimpleAnnDataset(
+     adata, obs_to_output=["organism_ontology_term_id"]
+ )
+ # create collator
+ col = Collator(
+     organisms="NCBITaxon:9606",
+     valid_genes=adata.var_names,
+     max_len=2000,  # maximum number of genes to use
+     how="most expr",  # one of "most expr", "random expr", "some"
+     # genelist=[geneA, geneB] if how == "some"
+ )
+ # create dataloader
+ dataloader = DataLoader(
+     adataset,
+     collate_fn=col,
+     batch_size=64,
+     num_workers=4,
+     shuffle=False,
+ )
+
+ # predict (model is a trained model, e.g. scPRINT)
+ for batch in tqdm(dataloader):
+     gene_pos, expression, depth = (
+         batch["genes"],
+         batch["x"],
+         batch["depth"],
+     )
+     model.predict(
+         gene_pos,
+         expression,
+         depth,
+     )
+ ```
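Conceptually, a collator like the one above turns a list of variable-length (gene ids, expression) pairs into fixed-size batch arrays, padding or truncating each cell to `max_len`. A minimal illustration with plain Python lists (hypothetical names and pad values, not the library's code):

```python
def toy_collate(cells, max_len, pad_gene=-1, pad_value=0.0):
    """Pad/truncate each cell's gene ids and expressions to max_len."""
    genes, x = [], []
    for gene_ids, expr in cells:
        gene_ids, expr = list(gene_ids[:max_len]), list(expr[:max_len])
        pad = max_len - len(gene_ids)  # how many slots remain to fill
        genes.append(gene_ids + [pad_gene] * pad)
        x.append(expr + [pad_value] * pad)
    return {"genes": genes, "x": x}

# two cells with different numbers of expressed genes
batch = toy_collate([([7, 3], [2.0, 1.0]), ([5], [4.0])], max_len=3)
print(batch["genes"])  # -> [[7, 3, -1], [5, -1, -1]]
```

The real `Collator` additionally maps gene names to positions, handles multiple organisms, and returns torch tensors, but the batching shape follows this pattern.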
+
+ ### Usage on all of cellxgene
+
+ ```python
+ # initialize a local lamin database
+ #! lamin init --storage ./cellxgene --name cellxgene --schema bionty
+ import lamindb as ln
+ import tqdm
+
+ from scdataloader import utils, DataModule
+ from scdataloader.preprocess import LaminPreprocessor, additional_postprocess, additional_preprocess
+
+ # preprocess datasets
+ DESCRIPTION = "preprocessed by scDataLoader"
+
+ cx_dataset = (
+     ln.Collection.using(instance="laminlabs/cellxgene")
+     .filter(name="cellxgene-census", version="2023-12-15")
+     .one()
+ )
+ print(cx_dataset, len(cx_dataset.artifacts.all()))
+
+ do_preprocess = LaminPreprocessor(
+     additional_postprocess=additional_postprocess,
+     additional_preprocess=additional_preprocess,
+     skip_validate=True,
+     subset_hvg=0,
+ )
+
+ preprocessed_dataset = do_preprocess(
+     cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2"
+ )
+
+ # create dataloaders
+ datamodule = DataModule(
+     collection_name="preprocessed dataset",
+     organisms=["NCBITaxon:9606"],  # organism that we will work on
+     how="most expr",  # for the collator (only the most expressed genes will be selected)
+     max_len=1000,  # only the 1000 most expressed
+     batch_size=64,
+     num_workers=1,
+     validation_split=0.1,
+     test_split=0,
+ )
+
+ for i in tqdm.tqdm(datamodule.train_dataloader()):
+     # pass  # or do something with the batch
+     print(i)
+     break
+
+ # with lightning:
+ # Trainer(model, datamodule)
+ ```
+
+ See the notebooks in the [docs](https://www.jkobject.com/scDataLoader/):
+
+ 1. [load a dataset](https://www.jkobject.com/scDataLoader/notebooks/1_download_and_preprocess/)
+ 2. [create a dataloader](https://www.jkobject.com/scDataLoader/notebooks/2_create_dataloader/)
+
+ ### command line preprocessing
+
+ You can use the command line to preprocess a large database of datasets, as shown here for cellxgene. This allows parallelization and easier usage.
+
+ ```bash
+ scdataloader --instance "laminlabs/cellxgene" --name "cellxgene-census" --version "2023-12-15" --description "preprocessed for scprint" --new_name "scprint main" --start_at 10 >> scdataloader.out
+ ```
+
+ ### command line usage
+
+ The main way to use the dataloader during model training is through the command line, via the lightning CLI.
+
+ > please refer to the [scPRINT documentation](https://www.jkobject.com/scPRINT/) and [lightning documentation](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli_intermediate.html) for more information on command line usage
+
+ ## FAQ
+
+ ### how to update my ontologies?
+
+ ```python
+ import bionty as bt
+ bt.reset_sources()
+
+ # then run via the CLI: lamin load <your instance>
+
+ import lnschema_bionty as lb
+ lb.dev.sync_bionty_source_to_latest()
+ ```
+
+ ### how to load all ontologies?
+
+ ```python
+ from scdataloader import utils
+ utils.populate_ontologies()  # this might take 5-20 min
+ ```
+
+ ## Development
+
+ Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
+
+ ## License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## Acknowledgments
+
+ - [lamin.ai](https://lamin.ai/)
+ - [scanpy](https://scanpy.readthedocs.io/en/stable/)
+ - [anndata](https://anndata.readthedocs.io/en/latest/)
+ - [scprint](https://www.jkobject.com/scPRINT/)
+
+ Awesome single cell dataloader created by @jkobject
@@ -0,0 +1,14 @@
+ scdataloader/VERSION,sha256=bPTghLR_M8mwLveSedFXgzho-PcFFBaadovjU-4yj-o,6
+ scdataloader/__init__.py,sha256=5y9VzRhOAUWeYMn2MrRRRlzgdiMjRFytr7gcn-I6IkE,147
+ scdataloader/__main__.py,sha256=Hu7Bnc7P4UfOzNWyDAVoNZsItgy27hldaw3y8OS3gPM,6387
+ scdataloader/base.py,sha256=M1gD59OffRdLOgS1vHKygOomUoAMuzjpRtAfM3SBKF8,338
+ scdataloader/collator.py,sha256=gzHiuixUwK8JClhAbG12kgWMU_VTKkowibA-tDFpbwo,11341
+ scdataloader/config.py,sha256=rrW2DZxG4J2_pmpDbXXsaKJkpNC57w5dIlItiFbANYw,2905
+ scdataloader/data.py,sha256=3dCp-lIAfOkCi76SH5W3iSqFmAWZslwARkN9v5mylz8,14907
+ scdataloader/datamodule.py,sha256=B-udBevPSPF__hfy0pOz1dGovgE95K2pxPupjB7RblI,16936
+ scdataloader/preprocess.py,sha256=pH4EPrcRqH34o3t5X3A4kETiYdCZngih5SdP_PPfgOo,29178
+ scdataloader/utils.py,sha256=5-6CnI3Utn5XFpqgZiJa0MT6gfvkFNg078SgrE6P4s8,22365
+ scdataloader-1.2.1.dist-info/METADATA,sha256=JeE7j8HkByp_MMGVXp4GOvpdkjIjoyEoByXA-FWISuk,9802
+ scdataloader-1.2.1.dist-info/WHEEL,sha256=1yFddiXMmvYK7QYTqtRNtX66WJ0Mz8PYEiEUoOUUxRY,87
+ scdataloader-1.2.1.dist-info/licenses/LICENSE,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
+ scdataloader-1.2.1.dist-info/RECORD,,
@@ -1,4 +1,4 @@
  Wheel-Version: 1.0
- Generator: poetry-core 1.7.0
+ Generator: hatchling 1.25.0
  Root-Is-Purelib: true
  Tag: py3-none-any