mapmycells2cl 0.1.0__tar.gz

@@ -0,0 +1,32 @@
1
+ # Python-generated files
2
+ __pycache__/
3
+ *.py[oc]
4
+ build/
5
+ dist/
6
+ wheels/
7
+ *.egg-info
8
+
9
+ # Virtual environments
10
+ .venv
11
+
12
+ # Coverage
13
+ .coverage
14
+ htmlcov/
15
+
16
+ # Large OWL/ontology source files (use update-mappings to regenerate)
17
+ cl.owl
18
+ cl-full.owl
19
+ cl-full.json
20
+ pcl.owl
21
+
22
+ # MapMyCells annotated output files
23
+ *_annotated.csv
24
+ *_annotated.json
25
+
26
+ # Pytest cache
27
+ .pytest_cache/
28
+
29
+ # Sphinx docs build output and generated AutoAPI source files
30
+ docs/_build/
31
+ docs/autoapi/
32
+
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Cellular Semantics
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,349 @@
1
+ Metadata-Version: 2.4
2
+ Name: mapmycells2cl
3
+ Version: 0.1.0
4
+ Summary: Map MapMyCells ABA taxonomy IDs to Cell Ontology (CL) terms
5
+ Project-URL: Homepage, https://github.com/Cellular-Semantics/MapMyCells2CL
6
+ Project-URL: Repository, https://github.com/Cellular-Semantics/MapMyCells2CL
7
+ Project-URL: Issues, https://github.com/Cellular-Semantics/MapMyCells2CL/issues
8
+ Author: Cellular Semantics
9
+ License: MIT
10
+ License-File: LICENSE
11
+ Keywords: allen-brain-atlas,cell-ontology,mapmycells,single-cell
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Intended Audience :: Science/Research
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.13
17
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
18
+ Requires-Python: >=3.13
19
+ Requires-Dist: anndata>=0.10
20
+ Requires-Dist: click>=8.1
21
+ Description-Content-Type: text/markdown
22
+
23
+ # MapMyCells2CL
24
+
25
+ Annotate [MapMyCells](https://brain-map.org/bkp/analyze/mapmycells) output with [Cell Ontology (CL)](https://obofoundry.org/ontology/cl.html) terms.
26
+
27
+ MapMyCells assigns cells to Allen Brain Atlas (ABA) taxonomy nodes (e.g. `CS20230722_SUBC_053`). This library maps those IDs to CL or Provisional Cell Ontology (PCL) terms and selects the **most specific CL term** using information-content (IC) ranking — ready for [CELLxGENE](https://cellxgene.cziscience.com/) schema compliance.
28
+
29
+ ---
30
+
31
+ ## Quick start
32
+
33
+ ```bash
34
+ pip install mapmycells2cl
35
+
36
+ # Annotate a MapMyCells CSV
37
+ mmc2cl annotate results.csv
38
+ # → results_annotated.csv
39
+
40
+ # Annotate an h5ad file (CxG-compliant obs columns)
41
+ mmc2cl annotate-h5ad results.csv cells.h5ad
42
+ # → cells_annotated.h5ad
43
+ ```
44
+
45
+ ---
46
+
47
+ ## Installation
48
+
49
+ ```bash
50
+ pip install mapmycells2cl
51
+ # or
52
+ uv add mapmycells2cl
53
+ ```
54
+
55
+ Installing the package puts two equivalent commands on your PATH: `mmc2cl` (short form) and `mapmycells2cl`.
56
+
57
+ ### From source (development)
58
+
59
+ Requires [uv](https://docs.astral.sh/uv/).
60
+
61
+ ```bash
62
+ git clone https://github.com/Cellular-Semantics/MapMyCells2CL.git
63
+ cd MapMyCells2CL
64
+ uv sync
65
+
66
+ # Run via uv (no venv activation needed)
67
+ uv run mmc2cl annotate results.csv
68
+
69
+ # Or activate the venv once and use the command directly
70
+ source .venv/bin/activate
71
+ mmc2cl annotate results.csv
72
+ ```
73
+
74
+ ---
75
+
76
+ ## CLI reference
77
+
78
+ ### `annotate`
79
+
80
+ Annotate a MapMyCells CSV or JSON output file with CL terms.
81
+
82
+ ```bash
83
+ mmc2cl annotate INPUT_FILE [OPTIONS]
84
+ ```
85
+
86
+ | Option | Description |
87
+ |--------|-------------|
88
+ | `-o, --output PATH` | Output file path. Defaults to `<input>_annotated.<ext>` |
89
+ | `--mapping PATH` | Path to a custom mapping JSON (default: bundled `mapping.json`) |
90
+
91
+ **Examples:**
92
+
93
+ ```bash
94
+ # Annotate CSV
95
+ mmc2cl annotate results.csv
96
+
97
+ # Annotate JSON
98
+ mmc2cl annotate results.json
99
+
100
+ # Specify output path
101
+ mmc2cl annotate results.csv -o /data/annotated.csv
102
+ ```
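The default output name follows the `<input>_annotated.<ext>` pattern from the options table. As a rough sketch of that naming rule only (the helper `default_annotated_path` is hypothetical and is not taken from the package's code), it can be reproduced with `pathlib`:

```python
from pathlib import Path


def default_annotated_path(input_path: Path) -> Path:
    """Hypothetical helper: results.csv -> results_annotated.csv (same folder)."""
    return input_path.with_name(f"{input_path.stem}_annotated{input_path.suffix}")


print(default_annotated_path(Path("results.csv")))
print(default_annotated_path(Path("/data/results.json")))
```

This can help when a script needs to predict where the CLI will write its output before calling it.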
103
+
104
+ **CSV output columns** — added after each `{level}_label` column using the CAP/HCA double-dash convention:
105
+
106
+ | Column | Content | When |
107
+ |--------|---------|------|
108
+ | `{level}--cell_type_ontology_term_id` | Most specific CL CURIE (IC-ranked) | Always |
109
+ | `{level}--cell_type` | Label for the above | Always |
110
+ | `{level}--cell_type_pcl_ontology_term_id` | PCL exact match CURIE | PCL exact only |
111
+ | `{level}--cell_type_pcl` | PCL exact label | PCL exact only |
112
+ | `{level}--cell_type_cl_broad_ontology_term_ids` | All CL broad CURIEs, `\|`-joined | PCL exact only |
113
+
114
+ Example (subclass level, PCL exact match):
115
+
116
+ ```
117
+ subclass_label → CS20230722_SUBC_053
118
+ subclass--cell_type_ontology_term_id → CL:4023017
119
+ subclass--cell_type → sst GABAergic cortical interneuron
120
+ subclass--cell_type_pcl_ontology_term_id → PCL:0110113
121
+ subclass--cell_type_pcl → Sst Gaba sst GABAergic cortical interneuron (Mmus)
122
+ subclass--cell_type_cl_broad_ontology_term_ids → CL:4023017|CL:4023069
123
+ ```
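For downstream use, the `|`-joined broad CURIEs usually need to be split back into a list. The sketch below does this with Python's standard `csv` module on an inline sample row that reuses the values from the example above. The helper name `broad_terms` is an assumption for illustration and is not part of mapmycells2cl.

```python
import csv
import io

# Sample row built from the example above (subclass level, PCL exact match).
SAMPLE = (
    "subclass_label,subclass--cell_type_ontology_term_id,"
    "subclass--cell_type_cl_broad_ontology_term_ids\n"
    "CS20230722_SUBC_053,CL:4023017,CL:4023017|CL:4023069\n"
)


def broad_terms(row: dict[str, str], level: str) -> list[str]:
    """Split the '|'-joined broad CL CURIEs for one level; empty cell -> []."""
    cell = row.get(f"{level}--cell_type_cl_broad_ontology_term_ids", "")
    return [curie for curie in cell.split("|") if curie]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(broad_terms(rows[0], "subclass"))
```

Filtering out empty fragments means rows without a PCL exact match (where the column is blank) yield an empty list rather than `[""]`.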
124
+
125
+ **JSON output** — `cell_type_ontology_term_id`, `cell_type`, and (for PCL) `cell_type_pcl_ontology_term_id`, `cell_type_pcl`, `cell_type_cl_broad_ontology_term_ids` are added to each level's assignment dict.
126
+
127
+ ---
128
+
129
+ ### `annotate-h5ad`
130
+
131
+ Annotate an AnnData h5ad file with CL terms from a MapMyCells CSV. Adds CL columns directly to `adata.obs`, including the unprefixed `cell_type_ontology_term_id` / `cell_type` pair required by the [CELLxGENE schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.3.0/schema.md).
132
+
133
+ ```bash
134
+ mmc2cl annotate-h5ad MMC_CSV H5AD_IN [OPTIONS]
135
+ ```
136
+
137
+ | Option | Description |
138
+ |--------|-------------|
139
+ | `-o, --output PATH` | Output h5ad path. Defaults to `<input>_annotated.h5ad` |
140
+ | `--cxg-level TEXT` | Taxonomy level used for unprefixed CxG columns (default: `cluster`) |
141
+ | `--mapping PATH` | Path to a custom mapping JSON |
142
+
143
+ **Examples:**
144
+
145
+ ```bash
146
+ # Annotate h5ad — output written to cells_annotated.h5ad
147
+ mmc2cl annotate-h5ad results.csv cells.h5ad
148
+
149
+ # Use supertype level for the CxG cell_type columns
150
+ mmc2cl annotate-h5ad results.csv cells.h5ad --cxg-level supertype
151
+ ```
152
+
153
+ **obs columns added:**
154
+
155
+ | Column | Content |
156
+ |--------|---------|
157
+ | `cell_type_ontology_term_id` | IC-best CL CURIE from `--cxg-level` (CxG required) |
158
+ | `cell_type` | Label for the above (CxG required) |
159
+ | `{level}--cell_type_ontology_term_id` | Per-level IC-best CL CURIE |
160
+ | `{level}--cell_type` | Per-level label |
161
+ | `{level}--cell_type_pcl_ontology_term_id` | PCL CURIE (PCL exact only) |
162
+ | `{level}--cell_type_pcl` | PCL label (PCL exact only) |
163
+ | `{level}--cell_type_cl_broad_ontology_term_ids` | `\|`-joined broad CL CURIEs (PCL exact only) |
164
+
165
+ Cells present in the h5ad but absent from the MapMyCells CSV get empty strings in these columns.
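To make the column logic concrete, here is a minimal sketch using plain dictionaries instead of AnnData. It shows how the unprefixed CxG pair could be taken from the chosen level and how a cell with no MapMyCells row ends up with empty strings. The function `cxg_columns`, the cell IDs, and the sample data layout are all assumptions for illustration; the package's own implementation may differ.

```python
# Hypothetical per-cell annotations keyed by cell ID, as if read from the CSV.
ANNOTATIONS = {
    "cell_1": {
        "subclass--cell_type_ontology_term_id": "CL:4023017",
        "subclass--cell_type": "sst GABAergic cortical interneuron",
    },
}


def cxg_columns(cell_id: str, annotations: dict, cxg_level: str) -> dict[str, str]:
    """Copy the chosen level's CL term into the unprefixed CxG columns."""
    row = annotations.get(cell_id, {})  # cell missing from the CSV -> empty row
    return {
        "cell_type_ontology_term_id": row.get(
            f"{cxg_level}--cell_type_ontology_term_id", ""
        ),
        "cell_type": row.get(f"{cxg_level}--cell_type", ""),
    }


print(cxg_columns("cell_1", ANNOTATIONS, "subclass"))
print(cxg_columns("cell_2", ANNOTATIONS, "subclass"))  # not in the CSV
```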
166
+
167
+ ---
168
+
169
+ ### `update-mappings`
170
+
171
+ Download the latest `pcl.owl` and regenerate the bundled `mapping.json`. Pass `--cl-owl` to include IC-ranked best-CL data (strongly recommended).
172
+
173
+ ```bash
174
+ mmc2cl update-mappings [OPTIONS]
175
+ ```
176
+
177
+ | Option | Description |
178
+ |--------|-------------|
179
+ | `--owl PATH` | Use a local `pcl.owl` instead of downloading |
180
+ | `--cl-owl PATH` | Path to base `cl.owl` for IC computation. Downloads if omitted |
181
+ | `--output PATH` | Output path (default: bundled `src/mapmycells2cl/data/mapping.json`) |
182
+
183
+ **Examples:**
184
+
185
+ ```bash
186
+ # Download latest pcl.owl and regenerate (no IC)
187
+ mmc2cl update-mappings
188
+
189
+ # With IC ranking (recommended) — requires cl.owl (~63 MB)
190
+ mmc2cl update-mappings --cl-owl cl.owl
191
+
192
+ # Use locally cached files
193
+ mmc2cl update-mappings --owl pcl.owl --cl-owl cl.owl
194
+ ```
195
+
196
+ > **Note:** `cl.owl` is large (~63 MB). The PURL `http://purl.obolibrary.org/obo/cl.owl` redirects to GitHub; download it manually if needed and pass the path with `--cl-owl`.
197
+
198
+ ---
199
+
200
+ ## Python API
201
+
202
+ ### `CellTypeMapper`
203
+
204
+ ```python
205
+ from mapmycells2cl import CellTypeMapper
206
+
207
+ mapper = CellTypeMapper() # bundled mapping
208
+ print(mapper.mapping_version) # e.g. "2025-07-07"
209
+ print(mapper.has_ic) # True when mapping includes IC data
210
+ ```
211
+
212
+ ### Single lookup
213
+
214
+ ```python
215
+ result = mapper.lookup("CS20230722_SUBC_313")
216
+
217
+ result.found # True
218
+ result.exact_id # "CL:4300353"
219
+ result.exact_label # "Purkinje cell (Mmus)"
220
+ result.ontology # "CL"
221
+ result.broad # [] — already CL, no broad match needed
222
+ result.best_cl_id # "CL:4300353" — IC-ranked most specific CL term
223
+ result.best_cl_label # "Purkinje cell (Mmus)"
224
+ result.best_cl_ic # IC score (higher = more specific)
225
+ result.mapping_version # "2025-07-07"
226
+ ```
227
+
228
+ ```python
229
+ result = mapper.lookup("CS20230722_SUBC_053")
230
+
231
+ result.exact_id # "PCL:0110113"
232
+ result.ontology # "PCL"
233
+ result.best_cl_id # "CL:4023017" — IC-ranked best CL broad match
234
+ result.broad # [BroadMatch(id="CL:4023017", ...), BroadMatch(id="CL:4023069", ...)]
235
+
236
+ for b in result.broad:
237
+ print(b.id, b.label, b.via)
238
+ ```
239
+
240
+ ```python
241
+ result = mapper.lookup("CS20230722_UNKNOWN_999")
242
+ result.found # False
243
+ result.best_cl_id # ""
244
+ ```
245
+
246
+ ### Batch lookup
247
+
248
+ ```python
249
+ results = mapper.lookup_many([
250
+ "CS20230722_SUBC_313",
251
+ "CS20230722_SUBC_053",
252
+ "CS20230722_CLUS_0768",
253
+ ])
254
+ # Returns List[MatchResult] in the same order
255
+ ```
256
+
257
+ ### Annotator (programmatic use)
258
+
259
+ ```python
260
+ from pathlib import Path
261
+ from mapmycells2cl import CellTypeMapper
262
+ from mapmycells2cl.annotator import annotate_csv, annotate_json, annotate_h5ad
263
+
264
+ mapper = CellTypeMapper()
265
+
266
+ # CSV / JSON
267
+ annotate_csv(Path("results.csv"), Path("results_annotated.csv"), mapper)
268
+ annotate_json(Path("results.json"), Path("results_annotated.json"), mapper)
269
+
270
+ # h5ad — CxG-compliant obs columns
271
+ annotate_h5ad(
272
+ Path("results.csv"),
273
+ Path("cells.h5ad"),
274
+ Path("cells_annotated.h5ad"),
275
+ mapper,
276
+ cxg_level="cluster", # level used for unprefixed cell_type columns
277
+ )
278
+ ```
279
+
280
+ ---
281
+
282
+ ## How it works
283
+
284
+ ### Exact matches
285
+
286
+ Extracted from `owl:equivalentClass` axioms in `pcl.owl`:
287
+
288
+ ```
289
+ CL/PCL_class ≡ CL_0000000 ∧ (RO_0015001 hasValue <ABA_individual>)
290
+ ```
291
+
292
+ Every ABA taxonomy ID maps to either a **CL term** (direct Cell Ontology entry) or a **PCL term** (Provisional Cell Ontology — finer-grained types not yet promoted to CL).
293
+
294
+ ### Broad matches
295
+
296
+ For PCL exact matches, the library walks `rdfs:subClassOf` edges upward until CL terms are reached. Because the hierarchy is a DAG (not a tree), a single PCL term may yield **multiple CL broad matches** (polyhierarchy).
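The upward walk can be sketched on a toy graph. This is an illustration of the idea, not the library's code: it assumes each path stops at the first CL term it reaches, and the two intermediate `PCL:EXAMPLE_*` nodes are invented placeholders. Only the start and end CURIEs come from the example earlier in this README.

```python
from collections import deque

# Toy subClassOf edges (child -> parents). Intermediate nodes are placeholders.
PARENTS: dict[str, list[str]] = {
    "PCL:0110113": ["PCL:EXAMPLE_A", "PCL:EXAMPLE_B"],
    "PCL:EXAMPLE_A": ["CL:4023017"],
    "PCL:EXAMPLE_B": ["CL:4023069"],
    "CL:4023017": ["CL:0000000"],
    "CL:4023069": ["CL:0000000"],
}


def broad_cl_matches(term: str, parents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk upward, stopping each path at its first CL term."""
    found: list[str] = []
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent in seen:
                continue  # DAG: a node may be reachable via several paths
            seen.add(parent)
            if parent.startswith("CL:"):
                found.append(parent)  # reached CL: record it, do not climb further
            else:
                queue.append(parent)  # still PCL: keep walking upward
    return found


print(broad_cl_matches("PCL:0110113", PARENTS))
```

Because two distinct paths lead to two different CL terms, the walk returns both, which is the polyhierarchy case described above.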
297
+
298
+ ### IC-ranked best CL term
299
+
300
+ When multiple CL broad matches exist, the **most specific** is selected using structure-based Information Content computed over the base CL hierarchy (no PCL):
301
+
302
+ ```
303
+ IC(c) = -log2(|distinct leaf descendants of c| / |total CL leaves|)
304
+ ```
305
+
306
+ Higher IC = more specific. This is pre-computed at `update-mappings` time and stored in `mapping.json`, so there is no runtime CL dependency.
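A small worked example may help. The sketch below applies the formula to a toy hierarchy with placeholder term names (not real CL IDs) and assumes that a leaf counts as its own single leaf descendant, so its IC is finite. It is not the package's implementation.

```python
import math

# Toy hierarchy (parent -> children); term names are placeholders, not CL IDs.
CHILDREN: dict[str, list[str]] = {
    "root": ["neuron", "glia"],
    "neuron": ["interneuron", "purkinje"],
    "interneuron": ["sst", "pvalb"],
    "glia": ["astrocyte"],
}


def leaf_descendants(term: str) -> set[str]:
    """Distinct leaves under `term`; a leaf counts as its own single leaf."""
    kids = CHILDREN.get(term, [])
    if not kids:
        return {term}
    leaves: set[str] = set()
    for kid in kids:
        leaves |= leaf_descendants(kid)  # set union keeps leaves distinct in a DAG
    return leaves


TOTAL_LEAVES = len(leaf_descendants("root"))


def ic(term: str) -> float:
    """IC(c) = -log2(n_leaves(c) / total), written as log2(total / n_leaves(c))."""
    return math.log2(TOTAL_LEAVES / len(leaf_descendants(term)))


for term in ("root", "neuron", "interneuron", "sst"):
    print(term, round(ic(term), 3))

# Picking the most specific of several candidate broad matches:
best = max(["neuron", "interneuron"], key=ic)
print(best)
```

With four leaves in total, the root scores 0, `interneuron` (two leaves) scores 1, and the leaf `sst` scores 2, so `interneuron` is preferred over `neuron` as the more specific term.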
307
+
308
+ ### Coverage (CCN20230722 taxonomy)
309
+
310
+ | Level | → CL | → PCL | Total |
311
+ |-------|------|-------|-------|
312
+ | CLAS (class) | 3 | 24 | 27 |
313
+ | SUBC (subclass) | 15 | 230 | 245 |
314
+ | SUPT (supertype) | 32 | 983 | 1,015 |
315
+ | CLUS (cluster) | 80 | 5,234 | 5,314 |
316
+ | **Total** | **130** | **6,471** | **6,601** |
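
As a quick arithmetic check of the table, the snippet below recomputes the totals from the per-level counts and derives the share of IDs whose exact match is a PCL term. The numbers are copied from the table; nothing else is assumed.

```python
# (level, mapped to CL, mapped to PCL) as listed in the coverage table.
COVERAGE = [
    ("CLAS", 3, 24),
    ("SUBC", 15, 230),
    ("SUPT", 32, 983),
    ("CLUS", 80, 5234),
]

total_cl = sum(cl for _, cl, _ in COVERAGE)
total_pcl = sum(pcl for _, _, pcl in COVERAGE)
total = total_cl + total_pcl

print(total_cl, total_pcl, total)
print(f"{total_pcl / total:.1%} of IDs have a PCL exact match")
```

Roughly 98% of taxonomy IDs therefore depend on the broad-match and IC-ranking steps to obtain a CL term.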
317
+
318
+ ---
319
+
320
+ ## Data sources
321
+
322
+ - [`pcl.owl`](http://purl.obolibrary.org/obo/pcl.owl) — Provisional Cell Ontology; primary mapping source
323
+ - [`cl.owl`](http://purl.obolibrary.org/obo/cl.owl) — Base Cell Ontology (no imports); used for IC computation
324
+
325
+ Both large OWL files are excluded from the repo. The bundled `mapping.json` is versioned with the PCL release date and includes all pre-computed IC scores.
326
+
327
+ ---
328
+
329
+ ## Development
330
+
331
+ ```bash
332
+ uv sync --dev
333
+
334
+ uv run mypy src/ # type check
335
+ uv run ruff check --fix src/ tests/ # lint
336
+ uv run ruff format src/ tests/ # format
337
+
338
+ uv run pytest -m unit --cov # unit tests (fast, no external deps)
339
+ uv run pytest -m integration # integration tests (requires test_resources/)
340
+ ```
341
+
342
+ CI runs mypy, ruff, and unit tests on every PR via GitHub Actions.
343
+
344
+ ---
345
+
346
+ ## Known gaps
347
+
348
+ - Basal Ganglia ABA mappings are absent from CL — fix planned for a future CL release.
349
+ - `oaklib` integration deferred (not yet needed for current use cases).