scdataloader 0.0.2__tar.gz → 0.0.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,39 +1,45 @@
  Metadata-Version: 2.1
  Name: scdataloader
- Version: 0.0.2
+ Version: 0.0.4
  Summary: a dataloader for single cell data in lamindb
- Home-page: https://github.com/jkobject/scPrint
+ Home-page: https://github.com/jkobject/scDataLoader
  License: GPL3
  Keywords: scRNAseq,dataloader,pytorch,lamindb,scPrint
  Author: jkobject
- Requires-Python: >=3.10,<4.0
+ Requires-Python: ==3.10.*
  Classifier: License :: Other/Proprietary License
  Classifier: Programming Language :: Python :: 3
  Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
  Requires-Dist: anndata
  Requires-Dist: biomart
+ Requires-Dist: bionty
  Requires-Dist: cellxgene-census
  Requires-Dist: decoupler
  Requires-Dist: django
  Requires-Dist: ipykernel
  Requires-Dist: lamindb
  Requires-Dist: leidenalg
+ Requires-Dist: lightning
+ Requires-Dist: lnschema-bionty
  Requires-Dist: matplotlib
  Requires-Dist: pandas (>=2.0.0)
+ Requires-Dist: scikit-misc
  Requires-Dist: seaborn
  Requires-Dist: torch
  Requires-Dist: torchdata
- Project-URL: Repository, https://github.com/jkobject/scPrint
+ Project-URL: Repository, https://github.com/jkobject/scDataLoader
  Description-Content-Type: text/markdown
 
  # scdataloader
 
  [![codecov](https://codecov.io/gh/jkobject/scDataLoader/branch/main/graph/badge.svg?token=scDataLoader_token_here)](https://codecov.io/gh/jkobject/scDataLoader)
  [![CI](https://github.com/jkobject/scDataLoader/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scDataLoader/actions/workflows/main.yml)
+ [![DOI](https://zenodo.org/badge/731248665.svg)](https://zenodo.org/doi/10.5281/zenodo.10573143)
 
- Awesome single cell dataloader created by @jkobject
+
+ Awesome single-cell dataloader created by @jkobject
+
+ Built on top of `lamindb` and the `.mapped()` function by Sergey (https://github.com/Koncopd).
 
  This data loader is designed to be used with:
 
@@ -51,14 +57,78 @@ It allows you to:
  3. create a more complex single cell dataset
  4. extend it to your needs
 
+ ## About
+
+ The idea is to use it to train models like scGPT / Geneformer (and soon, scPrint ;)). It:
+
+ 1. loads data from lamin
+ 2. does some dataset-specific preprocessing if needed
+ 3. creates a dataset object on top of `.mapped()` (needed for mapping genes, cell labels, etc.)
+ 4. passes it to a dataloader object that can work with it correctly
+
+ Currently, one has to use the preprocess function to make a dataset fit different tools like scGPT / Geneformer. The goal is to enable this through different Collators instead; that part is still a work in progress (please do contribute!). A sketch of what such a collator could look like follows the diagram below.
+
+ ![scdataloader overview](docs/scdataloader.drawio.png)
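+
+ As an illustration only (this is not the package's API), here is a minimal sketch of what such a custom collator could look like, assuming each sample is a dict carrying an `"x"` expression vector; all names in it are hypothetical:
+
+ ```python
+ import torch
+
+ class MostExprCollator:
+     """Hypothetical sketch: keep the `max_len` most expressed genes in a batch."""
+
+     def __init__(self, max_len: int = 1000):
+         self.max_len = max_len
+
+     def __call__(self, batch):
+         # batch: a list of dicts, each with an "x" expression vector (assumed layout)
+         x = torch.stack([torch.as_tensor(el["x"], dtype=torch.float32) for el in batch])
+         # rank genes by total expression across the batch and keep the top max_len
+         top = x.sum(dim=0).topk(min(self.max_len, x.shape[1])).indices
+         return {"x": x[:, top], "genes": top}
+ ```
+
+ A collator like this would be passed to a PyTorch `DataLoader` via its `collate_fn` argument.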
+
  ## Install it from PyPI
 
  ```bash
  pip install scdataloader
  ```
 
+ ### Install it locally and run the notebooks
+
+ ```bash
+ git clone https://github.com/jkobject/scDataLoader.git
+ cd scDataLoader
+ poetry install
+ ```
+
+ Then run the notebooks with the Poetry-installed environment.
+
  ## Usage
 
+ ```python
+ # initialize a local lamin database first, e.g.:
+ # !lamin init --storage ~/scdataloader --schema bionty
+
+ import lamindb as ln
+
+ from scdataloader import utils
+ from scdataloader.preprocess import (
+     LaminPreprocessor,
+     additional_postprocess,
+     additional_preprocess,
+ )
+
+ # preprocess datasets
+ DESCRIPTION = "preprocessed by scDataLoader"
+
+ cx_dataset = (
+     ln.Collection.using(instance="laminlabs/cellxgene")
+     .filter(name="cellxgene-census", version="2023-12-15")
+     .one()
+ )
+ cx_dataset, len(cx_dataset.artifacts.all())  # notebook-style peek at the collection
+
+ do_preprocess = LaminPreprocessor(
+     additional_postprocess=additional_postprocess,
+     additional_preprocess=additional_preprocess,
+     skip_validate=True,
+     subset_hvg=0,
+ )
+
+ preprocessed_dataset = do_preprocess(
+     cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2"
+ )
+
+ # create dataloaders
+ from scdataloader import DataModule
+ import tqdm
+
+ datamodule = DataModule(
+     collection_name="preprocessed dataset",
+     organisms=["NCBITaxon:9606"],  # organisms to work on
+     how="most expr",  # the collator keeps only the most expressed genes
+     max_len=1000,  # only the 1000 most expressed
+     batch_size=64,
+     num_workers=1,
+     validation_split=0.1,
+     test_split=0,
+ )
+
+ for i in tqdm.tqdm(datamodule.train_dataloader()):
+     print(i)  # or just `pass`
+     break
+
+ # with lightning, pass the datamodule to a Trainer (see the sketch below):
+ # Trainer().fit(model, datamodule=datamodule)
+ ```
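+
+ To go one step further than the comment above, a minimal (hypothetical) Lightning setup around the datamodule might look like this; `DemoModel`, its dimensions, and the `"x"`/`"class"` batch keys are illustrative assumptions, not part of scDataLoader:
+
+ ```python
+ import torch
+ import lightning as L
+
+ class DemoModel(L.LightningModule):
+     # toy classifier over the 1000 most expressed genes (assumed batch layout)
+     def __init__(self, n_genes: int = 1000, n_classes: int = 10):
+         super().__init__()
+         self.linear = torch.nn.Linear(n_genes, n_classes)
+
+     def training_step(self, batch, batch_idx):
+         # assumes the collator yields a dict with "x" expressions and "class" labels
+         return torch.nn.functional.cross_entropy(self.linear(batch["x"]), batch["class"])
+
+     def configure_optimizers(self):
+         return torch.optim.Adam(self.parameters(), lr=1e-3)
+
+ trainer = L.Trainer(max_epochs=1)
+ trainer.fit(DemoModel(), datamodule=datamodule)
+ ```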
+
  See the notebooks in [docs](https://jkobject.github.io/scDataLoader/):
 
  1. [load a dataset](https://jkobject.github.io/scDataLoader/notebooks/01_load_dataset.html)
@@ -0,0 +1,107 @@
+ # scdataloader
+
+ [![codecov](https://codecov.io/gh/jkobject/scDataLoader/branch/main/graph/badge.svg?token=scDataLoader_token_here)](https://codecov.io/gh/jkobject/scDataLoader)
+ [![CI](https://github.com/jkobject/scDataLoader/actions/workflows/main.yml/badge.svg)](https://github.com/jkobject/scDataLoader/actions/workflows/main.yml)
+ [![DOI](https://zenodo.org/badge/731248665.svg)](https://zenodo.org/doi/10.5281/zenodo.10573143)
+
+ Awesome single-cell dataloader created by @jkobject
+
+ Built on top of `lamindb` and the `.mapped()` function by Sergey (https://github.com/Koncopd).
+
+ This data loader is designed to be used with:
+
+ - [lamindb](https://lamin.ai/)
+
+ and:
+
+ - [scanpy](https://scanpy.readthedocs.io/en/stable/)
+ - [anndata](https://anndata.readthedocs.io/en/latest/)
+
+ It allows you to:
+
+ 1. load thousands of datasets containing millions of cells in a few seconds.
+ 2. preprocess the data per dataset and download it locally (normalization, filtering, etc.)
+ 3. create a more complex single cell dataset
+ 4. extend it to your needs
+
+ ## About
+
+ The idea is to use it to train models like scGPT / Geneformer (and soon, scPrint ;)). It:
+
+ 1. loads data from lamin
+ 2. does some dataset-specific preprocessing if needed
+ 3. creates a dataset object on top of `.mapped()` (needed for mapping genes, cell labels, etc.)
+ 4. passes it to a dataloader object that can work with it correctly
+
+ Currently, one has to use the preprocess function to make a dataset fit different tools like scGPT / Geneformer. The goal is to enable this through different Collators instead; that part is still a work in progress (please do contribute!).
+
+ ![scdataloader overview](docs/scdataloader.drawio.png)
+
+ ## Install it from PyPI
+
+ ```bash
+ pip install scdataloader
+ ```
+
+ ### Install it locally and run the notebooks
+
+ ```bash
+ git clone https://github.com/jkobject/scDataLoader.git
+ cd scDataLoader
+ poetry install
+ ```
+
+ Then run the notebooks with the Poetry-installed environment.
+
+ ## Usage
+
+ ```python
+ # initialize a local lamin database first, e.g.:
+ # !lamin init --storage ~/scdataloader --schema bionty
+
+ import lamindb as ln
+
+ from scdataloader import utils
+ from scdataloader.preprocess import (
+     LaminPreprocessor,
+     additional_postprocess,
+     additional_preprocess,
+ )
+
+ # preprocess datasets
+ DESCRIPTION = "preprocessed by scDataLoader"
+
+ cx_dataset = (
+     ln.Collection.using(instance="laminlabs/cellxgene")
+     .filter(name="cellxgene-census", version="2023-12-15")
+     .one()
+ )
+ cx_dataset, len(cx_dataset.artifacts.all())  # notebook-style peek at the collection
+
+ do_preprocess = LaminPreprocessor(
+     additional_postprocess=additional_postprocess,
+     additional_preprocess=additional_preprocess,
+     skip_validate=True,
+     subset_hvg=0,
+ )
+
+ preprocessed_dataset = do_preprocess(
+     cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2"
+ )
+
+ # create dataloaders
+ from scdataloader import DataModule
+ import tqdm
+
+ datamodule = DataModule(
+     collection_name="preprocessed dataset",
+     organisms=["NCBITaxon:9606"],  # organisms to work on
+     how="most expr",  # the collator keeps only the most expressed genes
+     max_len=1000,  # only the 1000 most expressed
+     batch_size=64,
+     num_workers=1,
+     validation_split=0.1,
+     test_split=0,
+ )
+
+ for i in tqdm.tqdm(datamodule.train_dataloader()):
+     print(i)  # or just `pass`
+     break
+
+ # with lightning:
+ # Trainer().fit(model, datamodule=datamodule)
+ ```
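+
+ ### Command line
+
+ The package also ships a small CLI for preprocessing a lamindb collection (see `__main__.py`); the example below reproduces the invocation comment from that file and assumes a `scdataloader` console-script entry point:
+
+ ```bash
+ scdataloader --instance="laminlabs/cellxgene" --name="cellxgene-census" \
+   --version="2023-12-15" --description="preprocessed for scprint" \
+   --new_name="scprint main" --start_at=39
+ ```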
+
+ See the notebooks in [docs](https://jkobject.github.io/scDataLoader/):
+
+ 1. [load a dataset](https://jkobject.github.io/scDataLoader/notebooks/01_load_dataset.html)
+ 2. [create a dataset](https://jkobject.github.io/scDataLoader/notebooks/02_create_dataset.html)
+
+ ## Development
+
+ Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
@@ -1,24 +1,19 @@
  [tool.poetry]
  name = "scdataloader"
- version = "0.0.2"
+ version = "0.0.4"
  description = "a dataloader for single cell data in lamindb"
  authors = ["jkobject"]
  license = "GPL3"
  readme = ["README.md", "LICENSE"]
- repository = "https://github.com/jkobject/scPrint"
- keywords = [
-     "scRNAseq",
-     "dataloader",
-     "pytorch",
-     "lamindb",
-     "scPrint",
- ]
+ repository = "https://github.com/jkobject/scDataLoader"
+ keywords = ["scRNAseq", "dataloader", "pytorch", "lamindb", "scPrint"]
 
  [tool.poetry.dependencies]
- python = "^3.10"
+ python = "3.10.*"
  lamindb = "*"
  cellxgene-census = "*"
  torch = "*"
+ lightning = "*"
  anndata = "*"
  matplotlib = "*"
  seaborn = "*"
@@ -29,6 +24,9 @@ pandas = ">=2.0.0"
  leidenalg = "*"
  decoupler = "*"
  django = "*"
+ lnschema-bionty = "*"
+ bionty = "*"
+ scikit-misc = "*"
 
  [tool.poetry.group.dev.dependencies]
  pytest = "^7.4.3"
@@ -46,6 +44,7 @@ mkdocs-git-authors-plugin = "*"
  mkdocs-jupyter = "*"
  mkdocstrings-python = "*"
 
+
  [build-system]
  requires = ["poetry-core"]
  build-backend = "poetry.core.masonry.api"
@@ -0,0 +1 @@
+ 0.7.0
@@ -0,0 +1,4 @@
+ from .data import Dataset
+ from .datamodule import DataModule
+ from .preprocess import Preprocessor
+ from .collator import *
@@ -0,0 +1,209 @@
+ import argparse
+ from scdataloader.preprocess import (
+     LaminPreprocessor,
+     additional_preprocess,
+     additional_postprocess,
+ )
+ import lamindb as ln
+
+
+ # example invocation:
+ # scdataloader --instance="laminlabs/cellxgene" --name="cellxgene-census" --version="2023-12-15" --description="preprocessed for scprint" --new_name="scprint main" --start_at=39
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Preprocess datasets in a given lamindb collection."
+     )
+     parser.add_argument(
+         "--name", type=str, required=True, help="Name of the input dataset."
+     )
+     parser.add_argument(
+         "--new_name",
+         type=str,
+         default="preprocessed dataset",
+         help="Name of the preprocessed dataset.",
+     )
+     parser.add_argument(
+         "--description",
+         type=str,
+         default="preprocessed by scDataLoader",
+         help="Description of the preprocessed dataset.",
+     )
+     parser.add_argument(
+         "--start_at", type=int, default=0, help="Position to start preprocessing at."
+     )
+     parser.add_argument(
+         "--new_version",
+         type=str,
+         default="2",
+         help="Version of the output dataset and files.",
+     )
+     parser.add_argument(
+         "--instance",
+         type=str,
+         default=None,
+         help="Instance storing the input dataset, if not local.",
+     )
+     parser.add_argument(
+         "--version", type=str, default=None, help="Version of the input dataset."
+     )
+     # argparse needs a callable `type`, so a Union[int, bool] annotation cannot be
+     # used here; 0 keeps the falsy "disabled" default and a positive int sets the
+     # filtering threshold.
+     parser.add_argument(
+         "--filter_gene_by_counts",
+         type=int,
+         default=0,
+         help="Minimum counts to keep a gene; 0 disables gene filtering.",
+     )
+     parser.add_argument(
+         "--filter_cell_by_counts",
+         type=int,
+         default=0,
+         help="Minimum counts to keep a cell; 0 disables cell filtering.",
+     )
+     parser.add_argument(
+         "--normalize_sum",
+         type=float,
+         default=1e4,
+         help="Value to normalize the total counts of each cell to.",
+     )
+     parser.add_argument(
+         "--subset_hvg",
+         type=int,
+         default=0,
+         help="Number of highly variable genes to subset to; 0 disables subsetting.",
+     )
+     parser.add_argument(
+         "--hvg_flavor",
+         type=str,
+         default="seurat_v3",
+         help="Flavor of the highly variable gene selection.",
+     )
+     parser.add_argument(
+         "--binning",
+         type=int,
+         default=None,
+         help="Number of bins to discretize the data into; omit to disable binning.",
+     )
+     parser.add_argument(
+         "--result_binned_key",
+         type=str,
+         default="X_binned",
+         help="Key of AnnData to store the binned data in.",
+     )
+     # `type=bool` is a trap with argparse (bool("False") is True), so the boolean
+     # options are plain flags that default to False.
+     parser.add_argument(
+         "--length_normalize",
+         action="store_true",
+         help="Length-normalize the data.",
+     )
+     parser.add_argument(
+         "--force_preprocess",
+         action="store_true",
+         help="Force preprocessing.",
+     )
+     parser.add_argument(
+         "--min_dataset_size",
+         type=int,
+         default=100,
+         help="Specifies the minimum dataset size.",
+     )
+     parser.add_argument(
+         "--min_valid_genes_id",
+         type=int,
+         default=10_000,
+         help="Specifies the minimum number of valid gene ids.",
+     )
+     parser.add_argument(
+         "--min_nnz_genes",
+         type=int,
+         default=400,
+         help="Specifies the minimum number of nonzero genes.",
+     )
+     parser.add_argument(
+         "--maxdropamount",
+         type=int,
+         default=50,
+         help="Specifies the maximum drop amount.",
+     )
+     parser.add_argument(
+         "--madoutlier", type=int, default=5, help="Specifies the MAD outlier threshold."
+     )
+     parser.add_argument(
+         "--pct_mt_outlier",
+         type=int,
+         default=8,
+         help="Specifies the mitochondrial-percentage outlier threshold.",
+     )
+     parser.add_argument(
+         "--batch_key", type=str, default=None, help="Specifies the batch key."
+     )
+     parser.add_argument(
+         "--skip_validate",
+         action="store_true",
+         help="Skip validation.",
+     )
+     parser.add_argument(
+         "--do_postp",
+         action="store_true",
+         help="Run postprocessing.",
+     )
+     args = parser.parse_args()
+
+     # Load the collection
+     if args.instance is not None:
+         collection = (
+             ln.Collection.using(instance=args.instance)
+             .filter(name=args.name, version=args.version)
+             .first()
+         )
+     else:
+         collection = ln.Collection.filter(name=args.name, version=args.version).first()
+
+     print(f"using the dataset {collection} of size {len(collection.artifacts.all())}")
+     # Initialize the preprocessor
+     preprocessor = LaminPreprocessor(
+         filter_gene_by_counts=args.filter_gene_by_counts,
+         filter_cell_by_counts=args.filter_cell_by_counts,
+         normalize_sum=args.normalize_sum,
+         subset_hvg=args.subset_hvg,
+         hvg_flavor=args.hvg_flavor,
+         binning=args.binning,
+         result_binned_key=args.result_binned_key,
+         length_normalize=args.length_normalize,
+         force_preprocess=args.force_preprocess,
+         min_dataset_size=args.min_dataset_size,
+         min_valid_genes_id=args.min_valid_genes_id,
+         min_nnz_genes=args.min_nnz_genes,
+         maxdropamount=args.maxdropamount,
+         madoutlier=args.madoutlier,
+         pct_mt_outlier=args.pct_mt_outlier,
+         batch_key=args.batch_key,
+         skip_validate=args.skip_validate,
+         do_postp=args.do_postp,
+         additional_preprocess=additional_preprocess,
+         additional_postprocess=additional_postprocess,
+         keep_files=False,
+     )
+
+     # Preprocess the dataset
+     preprocessor(
+         collection,
+         name=args.new_name,
+         description=args.description,
+         start_at=args.start_at,
+         version=args.new_version,
+     )
+
+     print(
+         f"Preprocessed dataset saved with version {args.new_version} and name {args.new_name}."
+     )
+
+
+ if __name__ == "__main__":
+     main()