tessera-foundation 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. tessera_foundation-0.1.0/LICENSE +139 -0
  2. tessera_foundation-0.1.0/PKG-INFO +340 -0
  3. tessera_foundation-0.1.0/README.md +167 -0
  4. tessera_foundation-0.1.0/pyproject.toml +75 -0
  5. tessera_foundation-0.1.0/setup.cfg +4 -0
  6. tessera_foundation-0.1.0/tessera/__init__.py +20 -0
  7. tessera_foundation-0.1.0/tessera/_legacy.py +187 -0
  8. tessera_foundation-0.1.0/tessera/base.py +1117 -0
  9. tessera_foundation-0.1.0/tessera/data/__init__.py +1 -0
  10. tessera_foundation-0.1.0/tessera/data/liftover.py +193 -0
  11. tessera_foundation-0.1.0/tessera/data/preprocessing.py +749 -0
  12. tessera_foundation-0.1.0/tessera/hub.py +166 -0
  13. tessera_foundation-0.1.0/tessera/input_keys.py +95 -0
  14. tessera_foundation-0.1.0/tessera/layers/__init__.py +208 -0
  15. tessera_foundation-0.1.0/tessera/layers/act_functions.py +124 -0
  16. tessera_foundation-0.1.0/tessera/layers/attention.py +699 -0
  17. tessera_foundation-0.1.0/tessera/layers/cna_features.py +647 -0
  18. tessera_foundation-0.1.0/tessera/layers/cross_modal.py +317 -0
  19. tessera_foundation-0.1.0/tessera/layers/masking.py +504 -0
  20. tessera_foundation-0.1.0/tessera/layers/mil.py +499 -0
  21. tessera_foundation-0.1.0/tessera/layers/pipelines.py +730 -0
  22. tessera_foundation-0.1.0/tessera/layers/pooling.py +170 -0
  23. tessera_foundation-0.1.0/tessera/layers/positional.py +442 -0
  24. tessera_foundation-0.1.0/tessera/layers/utils.py +343 -0
  25. tessera_foundation-0.1.0/tessera/layers/variant_features.py +1054 -0
  26. tessera_foundation-0.1.0/tessera/model.py +2405 -0
  27. tessera_foundation-0.1.0/tessera/ref_genomes/GRCh37_chr_sizes.txt +93 -0
  28. tessera_foundation-0.1.0/tessera/ref_genomes/GRCh38_chr_sizes.txt +455 -0
  29. tessera_foundation-0.1.0/tessera/ref_genomes/download_ref_genomes.sh +108 -0
  30. tessera_foundation-0.1.0/tessera/training/__init__.py +83 -0
  31. tessera_foundation-0.1.0/tessera/training/callbacks.py +484 -0
  32. tessera_foundation-0.1.0/tessera/training/logging.py +233 -0
  33. tessera_foundation-0.1.0/tessera/training/losses.py +847 -0
  34. tessera_foundation-0.1.0/tessera/training/metrics.py +217 -0
  35. tessera_foundation-0.1.0/tessera/training/models.py +1610 -0
  36. tessera_foundation-0.1.0/tessera/training/schedules.py +160 -0
  37. tessera_foundation-0.1.0/tessera/training/utils.py +109 -0
  38. tessera_foundation-0.1.0/tessera_foundation.egg-info/PKG-INFO +340 -0
  39. tessera_foundation-0.1.0/tessera_foundation.egg-info/SOURCES.txt +40 -0
  40. tessera_foundation-0.1.0/tessera_foundation.egg-info/dependency_links.txt +1 -0
  41. tessera_foundation-0.1.0/tessera_foundation.egg-info/requires.txt +9 -0
  42. tessera_foundation-0.1.0/tessera_foundation.egg-info/top_level.txt +1 -0
@@ -0,0 +1,139 @@
1
+ PolyForm Noncommercial License 1.0.0
2
+
3
+ <https://polyformproject.org/licenses/noncommercial/1.0.0>
4
+
5
+ ## Acceptance
6
+
7
+ In order to get any license under these terms, you must agree
8
+ to them as both strict obligations and conditions to all
9
+ your licenses.
10
+
11
+ ## Copyright License
12
+
13
+ The licensor grants you a copyright license for the
14
+ software to do everything you might do with the software
15
+ that would otherwise infringe the licensor's copyright
16
+ in it for any permitted purpose. However, you may
17
+ only distribute the software according to [Distribution
18
+ License](#distribution-license) and make changes or new works
19
+ based on the software according to [Changes and New Works
20
+ License](#changes-and-new-works-license).
21
+
22
+ ## Distribution License
23
+
24
+ The licensor grants you an additional copyright license
25
+ to distribute copies of the software. Your license
26
+ to distribute covers distributing the software with
27
+ changes and new works permitted by [Changes and New Works
28
+ License](#changes-and-new-works-license).
29
+
30
+ ## Notices
31
+
32
+ You must ensure that anyone who gets a copy of any part of
33
+ the software from you also gets a copy of these terms or the
34
+ URL for them above, as well as copies of any plain-text lines
35
+ beginning with `Required Notice:` that the licensor provided
36
+ with the software. For example:
37
+
38
+ > Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
39
+ > TESSERA is licensed for academic and non-commercial use only.
40
+ > Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
41
+
42
+ ## Changes and New Works License
43
+
44
+ The licensor grants you an additional copyright license to
45
+ make changes and new works based on the software for any
46
+ permitted purpose.
47
+
48
+ ## Patent License
49
+
50
+ The licensor grants you a patent license for the software that
51
+ covers patent claims the licensor can license, or becomes able
52
+ to license, that you would infringe by using the software.
53
+
54
+ ## Noncommercial Purposes
55
+
56
+ Any noncommercial purpose is a permitted purpose.
57
+
58
+ ## Personal Uses
59
+
60
+ Personal use for research, experiment, and testing for
61
+ the benefit of public knowledge, personal study, private
62
+ entertainment, hobby projects, amateur pursuits, or religious
63
+ observance, without any anticipated commercial application,
64
+ is use for a permitted purpose.
65
+
66
+ ## Noncommercial Organizations
67
+
68
+ Use by any charitable organization, educational institution,
69
+ public research organization, public safety or health
70
+ organization, environmental protection organization,
71
+ or government institution is use for a permitted purpose
72
+ regardless of the source of funding or obligations resulting
73
+ from the funding.
74
+
75
+ ## Fair Use
76
+
77
+ You may have "fair use" rights for the software under the
78
+ law. These terms do not limit them.
79
+
80
+ ## No Other Rights
81
+
82
+ These terms do not allow you to sublicense or transfer any of
83
+ your licenses to anyone else, or prevent the licensor from
84
+ granting licenses to anyone else. These terms do not imply
85
+ any other licenses.
86
+
87
+ ## Patent Defense
88
+
89
+ If you make any written claim that the software infringes or
90
+ contributes to infringement of any patent, your patent license
91
+ for the software granted under these terms ends immediately. If
92
+ your company makes such a claim, your patent license ends
93
+ immediately for work on behalf of your company.
94
+
95
+ ## Violations
96
+
97
+ The first time you are notified in writing that you have
98
+ violated any of these terms, or done anything with the software
99
+ not covered by your licenses, your licenses can nonetheless
100
+ continue if you come into full compliance with these terms,
101
+ and take practical steps to correct past violations, within
102
+ 32 days of receiving notice. Otherwise, all your licenses
103
+ end immediately.
104
+
105
+ ## No Liability
106
+
107
+ ***As far as the law allows, the software comes as is, without
108
+ any warranty or condition, and the licensor will not be liable
109
+ to you for any damages arising out of these terms or the use
110
+ or nature of the software, under any kind of legal claim.***
111
+
112
+ ## Definitions
113
+
114
+ The **licensor** is the individual or entity offering these
115
+ terms, and the **software** is the software the licensor makes
116
+ available under these terms.
117
+
118
+ **You** refers to the individual or entity agreeing to these
119
+ terms.
120
+
121
+ **Your company** is any legal entity, sole proprietorship,
122
+ or other kind of organization that you work for, plus all
123
+ organizations that have control over, are under the control of,
124
+ or are under common control with that organization. **Control**
125
+ means ownership of substantially all the assets of an entity,
126
+ or the power to direct its management and policies by vote,
127
+ contract, or otherwise. Control can be direct or indirect.
128
+
129
+ **Your licenses** are all the licenses granted to you for the
130
+ software under these terms.
131
+
132
+ **Use** means anything you do with the software requiring one
133
+ of your licenses.
134
+
135
+ ---
136
+
137
+ Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
138
+ TESSERA is licensed for academic and non-commercial use only.
139
+ Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
@@ -0,0 +1,340 @@
1
+ Metadata-Version: 2.4
2
+ Name: tessera-foundation
3
+ Version: 0.1.0
4
+ Summary: TESSERA: a foundation model for the cancer genome (joint SNV+CNA self-supervised pretraining).
5
+ Author-email: John-William Sidhom <johnwilliamsidhom@gmail.com>
6
+ License: PolyForm Noncommercial License 1.0.0
7
+
8
+ <https://polyformproject.org/licenses/noncommercial/1.0.0>
9
+
10
+ ## Acceptance
11
+
12
+ In order to get any license under these terms, you must agree
13
+ to them as both strict obligations and conditions to all
14
+ your licenses.
15
+
16
+ ## Copyright License
17
+
18
+ The licensor grants you a copyright license for the
19
+ software to do everything you might do with the software
20
+ that would otherwise infringe the licensor's copyright
21
+ in it for any permitted purpose. However, you may
22
+ only distribute the software according to [Distribution
23
+ License](#distribution-license) and make changes or new works
24
+ based on the software according to [Changes and New Works
25
+ License](#changes-and-new-works-license).
26
+
27
+ ## Distribution License
28
+
29
+ The licensor grants you an additional copyright license
30
+ to distribute copies of the software. Your license
31
+ to distribute covers distributing the software with
32
+ changes and new works permitted by [Changes and New Works
33
+ License](#changes-and-new-works-license).
34
+
35
+ ## Notices
36
+
37
+ You must ensure that anyone who gets a copy of any part of
38
+ the software from you also gets a copy of these terms or the
39
+ URL for them above, as well as copies of any plain-text lines
40
+ beginning with `Required Notice:` that the licensor provided
41
+ with the software. For example:
42
+
43
+ > Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
44
+ > TESSERA is licensed for academic and non-commercial use only.
45
+ > Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
46
+
47
+ ## Changes and New Works License
48
+
49
+ The licensor grants you an additional copyright license to
50
+ make changes and new works based on the software for any
51
+ permitted purpose.
52
+
53
+ ## Patent License
54
+
55
+ The licensor grants you a patent license for the software that
56
+ covers patent claims the licensor can license, or becomes able
57
+ to license, that you would infringe by using the software.
58
+
59
+ ## Noncommercial Purposes
60
+
61
+ Any noncommercial purpose is a permitted purpose.
62
+
63
+ ## Personal Uses
64
+
65
+ Personal use for research, experiment, and testing for
66
+ the benefit of public knowledge, personal study, private
67
+ entertainment, hobby projects, amateur pursuits, or religious
68
+ observance, without any anticipated commercial application,
69
+ is use for a permitted purpose.
70
+
71
+ ## Noncommercial Organizations
72
+
73
+ Use by any charitable organization, educational institution,
74
+ public research organization, public safety or health
75
+ organization, environmental protection organization,
76
+ or government institution is use for a permitted purpose
77
+ regardless of the source of funding or obligations resulting
78
+ from the funding.
79
+
80
+ ## Fair Use
81
+
82
+ You may have "fair use" rights for the software under the
83
+ law. These terms do not limit them.
84
+
85
+ ## No Other Rights
86
+
87
+ These terms do not allow you to sublicense or transfer any of
88
+ your licenses to anyone else, or prevent the licensor from
89
+ granting licenses to anyone else. These terms do not imply
90
+ any other licenses.
91
+
92
+ ## Patent Defense
93
+
94
+ If you make any written claim that the software infringes or
95
+ contributes to infringement of any patent, your patent license
96
+ for the software granted under these terms ends immediately. If
97
+ your company makes such a claim, your patent license ends
98
+ immediately for work on behalf of your company.
99
+
100
+ ## Violations
101
+
102
+ The first time you are notified in writing that you have
103
+ violated any of these terms, or done anything with the software
104
+ not covered by your licenses, your licenses can nonetheless
105
+ continue if you come into full compliance with these terms,
106
+ and take practical steps to correct past violations, within
107
+ 32 days of receiving notice. Otherwise, all your licenses
108
+ end immediately.
109
+
110
+ ## No Liability
111
+
112
+ ***As far as the law allows, the software comes as is, without
113
+ any warranty or condition, and the licensor will not be liable
114
+ to you for any damages arising out of these terms or the use
115
+ or nature of the software, under any kind of legal claim.***
116
+
117
+ ## Definitions
118
+
119
+ The **licensor** is the individual or entity offering these
120
+ terms, and the **software** is the software the licensor makes
121
+ available under these terms.
122
+
123
+ **You** refers to the individual or entity agreeing to these
124
+ terms.
125
+
126
+ **Your company** is any legal entity, sole proprietorship,
127
+ or other kind of organization that you work for, plus all
128
+ organizations that have control over, are under the control of,
129
+ or are under common control with that organization. **Control**
130
+ means ownership of substantially all the assets of an entity,
131
+ or the power to direct its management and policies by vote,
132
+ contract, or otherwise. Control can be direct or indirect.
133
+
134
+ **Your licenses** are all the licenses granted to you for the
135
+ software under these terms.
136
+
137
+ **Use** means anything you do with the software requiring one
138
+ of your licenses.
139
+
140
+ ---
141
+
142
+ Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
143
+ TESSERA is licensed for academic and non-commercial use only.
144
+ Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
145
+
146
+ Project-URL: Homepage, https://github.com/JW-Sidhom-Lab/tessera
147
+ Project-URL: Repository, https://github.com/JW-Sidhom-Lab/tessera
148
+ Project-URL: Model weights, https://huggingface.co/JW-Sidhom-Lab/tessera-foundation
149
+ Project-URL: Issues, https://github.com/JW-Sidhom-Lab/tessera/issues
150
+ Keywords: cancer-genomics,foundation-model,self-supervised-learning,tcga,somatic-variants,copy-number-alterations,bioinformatics
151
+ Classifier: Development Status :: 4 - Beta
152
+ Classifier: Intended Audience :: Science/Research
153
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
154
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
155
+ Classifier: Programming Language :: Python :: 3
156
+ Classifier: Programming Language :: Python :: 3.10
157
+ Classifier: Programming Language :: Python :: 3.11
158
+ Classifier: Programming Language :: Python :: 3.12
159
+ Classifier: Operating System :: OS Independent
160
+ Requires-Python: >=3.10
161
+ Description-Content-Type: text/markdown
162
+ License-File: LICENSE
163
+ Requires-Dist: tensorflow>=2.16
164
+ Requires-Dist: numpy
165
+ Requires-Dist: pandas>=2.0
166
+ Requires-Dist: scipy>=1.10
167
+ Requires-Dist: scikit-learn>=1.3
168
+ Requires-Dist: pyfaidx>=0.7
169
+ Requires-Dist: pyliftover>=0.4
170
+ Requires-Dist: tqdm>=4.66
171
+ Requires-Dist: huggingface_hub>=0.20
172
+ Dynamic: license-file
173
+
174
+ <p align="center">
175
+ <img src="logo.png" alt="TESSERA logo" width="220">
176
+ </p>
177
+
178
+ <p align="center">
179
+ <em>Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations</em><br>
180
+ A foundation model for the cancer genome.
181
+ </p>
182
+
183
+ ---
184
+
185
+ TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.
186
+
187
+ This repository contains the reference implementation, the pretrained-weights pointer, the inference utilities described in the accompanying paper, and the end-to-end analysis pipelines that reproduce every panel of Figures 1-6 and Supplementary Figures 1-12.
188
+
189
+ ## Quick start
190
+
191
+ The fastest way to use TESSERA is via the public inference API on Hugging Face; no local installation required. Upload SNV and/or CNA data, get back per-variant predictions and embeddings:
192
+
193
+ 🔗 **Inference API**: [huggingface.co/spaces/JW-Sidhom-Lab/tessera](https://huggingface.co/spaces/JW-Sidhom-Lab/tessera) *(coming soon)*
194
+
195
+ From Python (`pip install gradio_client`):
196
+
197
+ ```python
198
+ import time
199
+ from gradio_client import Client, handle_file
200
+
201
+ client = Client("JW-Sidhom-Lab/tessera") # the public Spaces URL also works
202
+
203
+ # Submit returns (status_html, job_id) immediately; inference runs async
204
+ _, job_id = client.predict(
205
+ handle_file("snv.csv"), # SNV CSV; or None
206
+ handle_file("cna.csv"), # CNA CSV; or None. At least one required.
207
+ True, # apply TCGA quantile normalization to CNA
208
+ "you@example.com", # email address for the download link
209
+ "GRCh37", # genome assembly: "GRCh37" or "GRCh38"
210
+ api_name="/submit",
211
+ )
212
+
213
+ # Poll for completion (the same URL also gets emailed when the job finishes)
214
+ while True:
215
+ status = client.predict(job_id, api_name="/status")
216
+ if status["status"] in ("done", "failed"):
217
+ break
218
+ time.sleep(10)
219
+
220
+ print(status["url"]) # 24h pre-signed S3 download URL with the result ZIP
221
+ ```
222
+
223
+ The API serves the foundation-model outputs only (per-token embeddings + per-token reconstruction predictions, returned as `.npy` files inside the result ZIP). Downstream task heads (tumour-type classifier, treatment-effect score) are available on request under a Data Use Agreement.
224
+
225
+ CSV column conventions:
226
+
227
+ - **SNV**: `Tumor_Sample_Barcode`, `Chromosome` (no `chr` prefix), `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, plus either `vaf` or both `t_alt_count` + `t_ref_count`. Single-base substitutions only.
228
+ - **CNA**: `Tumor_Sample_Barcode`, `Chromosome`, `Start`, `End`, `Segment_Mean` (log2 ratio); optional `LOH` column triggers the with-LoH model variant.
229
+
230
+ ## Local installation
231
+
232
+ For users who want to run inference offline, integrate TESSERA into a custom pipeline, or retrain on their own data:
233
+
234
+ ```bash
235
+ # Clone
236
+ git clone https://github.com/JW-Sidhom-Lab/tessera.git
237
+ cd tessera
238
+
239
+ # Recommended: a virtual environment so deps don't clash with system Python
240
+ python3 -m venv .venv && source .venv/bin/activate
241
+
242
+ # Install all dependencies
243
+ pip install -r requirements.txt
244
+
245
+ # Download reference genome (default: GRCh37)
246
+ bash tessera/ref_genomes/download_ref_genomes.sh
247
+ ```
248
+
249
+ `requirements.txt` covers the foundation-model package, all manuscript-reproduction scripts (pretraining, classifiers, prognostic / predictive-biomarker analyses), and the Gradio inference API. A trimmer subset for deploying only the inference API is at [`inference_api/requirements.txt`](inference_api/requirements.txt).
250
+
251
+ Weights are hosted on Hugging Face Hub at [huggingface.co/JW-Sidhom-Lab/tessera-foundation](https://huggingface.co/JW-Sidhom-Lab/tessera-foundation) under CC-BY-NC-4.0. The shortest path from raw dataframes to feature tensors is the `featurize` one-liner, which downloads weights on first call (cached afterwards), lifts non-hg19 coordinates, builds the dataset, and runs both per-modality feature heads:
252
+
253
+ ```python
254
+ import tessera
255
+
256
+ result = tessera.featurize(
257
+ snv_df=snv_df, # columns: Tumor_Sample_Barcode, Chromosome, Start_Position,
258
+ # Reference_Allele, Tumor_Seq_Allele2, vaf
259
+ cna_df=cna_df, # columns: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean
260
+ variant="joint_snv_cna_noloh", # or "joint_snv_cna" for the with-LoH variant
261
+ from_assembly="GRCh38", # "GRCh37" / "hg19" is a no-op; otherwise UCSC liftover runs
262
+ )
263
+
264
+ result.snv_features # (n_variants, 1169) per-variant embeddings, row-aligned with result.snv_table
265
+ result.cna_features # (n_segments, 688) per-segment embeddings, row-aligned with result.cna_table
266
+ result.liftover_stats # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
267
+ ```
268
+
269
+ For finer-grained control, the underlying building blocks are also available:
270
+
271
+ ```python
272
+ from tessera import load_pretrained, lift_snv, lift_cna
273
+
274
+ model = load_pretrained("joint_snv_cna_noloh") # download + instantiate, ~3 s cold
275
+ snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38") # identity if from_assembly=="GRCh37"
276
+ cna_df, _ = lift_cna(cna_df, from_assembly="GRCh38")
277
+ result = model.featurize(snv_df=snv_df, cna_df=cna_df) # repeat without re-downloading
278
+ ```
279
+
280
+ UCSC chain files are downloaded on first use and cached at `~/.cache/pyliftover/`; offline environments can point the loader at a bundled chain file via the `chain_file=` argument or the `TESSERA_LIFTOVER_CHAIN` environment variable.
281
+
282
+ ## Reproducing the manuscript
283
+
284
+ Every published panel is backed by a script in this repository. The
285
+ pipeline runs in three stages:
286
+
287
+ 1. **Data preparation** ([`data/`](data/README.md)): per-cohort
288
+ download instructions, source-table provenance, and the
289
+ `create_training_data*.py` / `build_<cohort>_metadata.py` builders
290
+ that turn raw releases into the analysis-ready CSVs.
291
+ 2. **Foundation-model pretraining**
292
+ ([`scripts/tcga_pancan_*/`](scripts/README.md)): trains the SNV
293
+ models, the CNA models, and the joint SNV+CNA InfoNCE-aligned
294
+ foundation model on the TCGA Pan-Cancer Atlas.
295
+ 3. **Downstream analyses** ([`scripts/`](scripts/README.md)):
296
+ variant-pathogenicity (Fig. 1 h-o), cross-platform validation
297
+ (Fig. 1 f-g, Fig. 2 d), tumour-type classification (Fig. 3,
298
+ Fig. 4 b-e), prognostic UMAP + joint Cox (Fig. 5), doubly-robust
299
+ counterfactual treatment-effect (Fig. 6 a-m), and DepMap
300
+ cell-line transfer (Fig. 6 n).
301
+
302
+ [`scripts/README.md`](scripts/README.md) and
303
+ [`data/README.md`](data/README.md) hold the full per-directory tables
304
+ mapping each script and cohort to its manuscript figure.
305
+
306
+ ## Repository layout
307
+
308
+ ```
309
+ tessera/
310
+ ├── tessera/ # foundation-model package
311
+ │ ├── base.py # BaseModel: shared data + training infrastructure
312
+ │ ├── input_keys.py # input-key helpers
313
+ │ ├── model.py # TESSERA: foundation-model class
314
+ │ ├── data/
315
+ │ │ └── preprocessing.py # SNV/CNA tokenization, FASTA lookup, sample bagging
316
+ │ ├── layers/ # custom Keras layers (attention, masking, MIL, ...)
317
+ │ ├── training/ # training utilities (callbacks, losses, schedules)
318
+ │ └── ref_genomes/ # reference-genome download script + indices
319
+ ├── data/ # per-cohort data preparation pipelines (data/README.md)
320
+ ├── scripts/ # analysis pipelines backing the manuscript figures (scripts/README.md)
321
+ └── README.md
322
+ ```
323
+
324
+ ## Citing TESSERA
325
+
326
+ If you use TESSERA in your work, please cite:
327
+
328
+ > *citation pending publication*
329
+
330
+ A BibTeX entry will be added on acceptance.
331
+
332
+ ## License
333
+
334
+ This repository is distributed under the **PolyForm Noncommercial License 1.0.0** (see [`LICENSE`](LICENSE)). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license. Pretrained foundation-model weights are released on the Hugging Face Hub under **CC-BY-NC-4.0** (non-commercial, attribution required). Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement. Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial licensing inquiries should be directed to NYP's technology transfer office.
335
+
336
+ ## Lab
337
+
338
+ TESSERA is developed in the [JW Sidhom Lab](https://github.com/JW-Sidhom-Lab) at Weill Cornell Medicine.
339
+
340
+ For questions, collaborations, or commercial-licensing enquiries, contact the corresponding author.
@@ -0,0 +1,167 @@
1
+ <p align="center">
2
+ <img src="logo.png" alt="TESSERA logo" width="220">
3
+ </p>
4
+
5
+ <p align="center">
6
+ <em>Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations</em><br>
7
+ A foundation model for the cancer genome.
8
+ </p>
9
+
10
+ ---
11
+
12
+ TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.
13
+
14
+ This repository contains the reference implementation, the pretrained-weights pointer, the inference utilities described in the accompanying paper, and the end-to-end analysis pipelines that reproduce every panel of Figures 1-6 and Supplementary Figures 1-12.
15
+
16
+ ## Quick start
17
+
18
+ The fastest way to use TESSERA is via the public inference API on Hugging Face; no local installation required. Upload SNV and/or CNA data, get back per-variant predictions and embeddings:
19
+
20
+ 🔗 **Inference API**: [huggingface.co/spaces/JW-Sidhom-Lab/tessera](https://huggingface.co/spaces/JW-Sidhom-Lab/tessera) *(coming soon)*
21
+
22
+ From Python (`pip install gradio_client`):
23
+
24
+ ```python
25
+ import time
26
+ from gradio_client import Client, handle_file
27
+
28
+ client = Client("JW-Sidhom-Lab/tessera") # the public Spaces URL also works
29
+
30
+ # Submit returns (status_html, job_id) immediately; inference runs async
31
+ _, job_id = client.predict(
32
+ handle_file("snv.csv"), # SNV CSV; or None
33
+ handle_file("cna.csv"), # CNA CSV; or None. At least one required.
34
+ True, # apply TCGA quantile normalization to CNA
35
+ "you@example.com", # email address for the download link
36
+ "GRCh37", # genome assembly: "GRCh37" or "GRCh38"
37
+ api_name="/submit",
38
+ )
39
+
40
+ # Poll for completion (the same URL also gets emailed when the job finishes)
41
+ while True:
42
+ status = client.predict(job_id, api_name="/status")
43
+ if status["status"] in ("done", "failed"):
44
+ break
45
+ time.sleep(10)
46
+
47
+ print(status["url"]) # 24h pre-signed S3 download URL with the result ZIP
48
+ ```
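The polling loop above waits until the job reports `done` or `failed`, with no upper bound on the wait, and it prints `status["url"]` even when the job failed. A minimal sketch of a bounded variant follows. `poll_until_done`, its parameters, and the `fetch_status` callable are hypothetical names introduced here for illustration; in practice `fetch_status` would wrap `client.predict(job_id, api_name="/status")`. The status dictionary is assumed to carry the `"status"` and `"url"` keys shown above.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval_s: float = 10.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until it reports a terminal state or the deadline passes.

    Hypothetical helper; not part of TESSERA or gradio_client.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status") in ("done", "failed"):
            return status
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"job still running after {timeout_s} s")
        sleep(interval_s)


# Stand-in for the remote call so the sketch runs offline.
_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "done", "url": "https://example.invalid/result.zip"},
])
final = poll_until_done(lambda: next(_responses), interval_s=0.0, timeout_s=5.0)
if final["status"] == "done":
    print(final["url"])
else:
    print("job failed")
```

Passing the status call in as a function keeps the retry logic independent of the client, so it can be exercised without network access.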
49
+
50
+ The API serves the foundation-model outputs only (per-token embeddings + per-token reconstruction predictions, returned as `.npy` files inside the result ZIP). Downstream task heads (tumour-type classifier, treatment-effect score) are available on request under a Data Use Agreement.
51
+
52
+ CSV column conventions:
53
+
54
+ - **SNV**: `Tumor_Sample_Barcode`, `Chromosome` (no `chr` prefix), `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, plus either `vaf` or both `t_alt_count` + `t_ref_count`. Single-base substitutions only.
55
+ - **CNA**: `Tumor_Sample_Barcode`, `Chromosome`, `Start`, `End`, `Segment_Mean` (log2 ratio); optional `LOH` column triggers the with-LoH model variant.
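The SNV conventions above can be checked before upload. The sketch below is one way to do so with the standard library only, working on plain dictionaries rather than dataframes. The helper names are hypothetical, and the allele-frequency formula (alternate reads divided by total reads) is an assumption: the README does not state how `vaf` is derived from `t_alt_count` and `t_ref_count`.

```python
from typing import Dict, Iterable, List, Optional

SNV_REQUIRED = ("Tumor_Sample_Barcode", "Chromosome", "Start_Position",
                "Reference_Allele", "Tumor_Seq_Allele2")
BASES = {"A", "C", "G", "T"}


def missing_snv_columns(columns: Iterable[str]) -> List[str]:
    """Return the required SNV columns absent from `columns` (empty list if OK)."""
    cols = set(columns)
    missing = [c for c in SNV_REQUIRED if c not in cols]
    # Allele frequency: either `vaf` directly, or both read-count columns.
    if "vaf" not in cols and not {"t_alt_count", "t_ref_count"} <= cols:
        missing.append("vaf (or t_alt_count + t_ref_count)")
    return missing


def normalize_snv_row(row: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Normalize one row; return None if it is not a usable single-base substitution."""
    ref, alt = row["Reference_Allele"].upper(), row["Tumor_Seq_Allele2"].upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        return None
    chrom = row["Chromosome"]
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # the conventions above ask for no `chr` prefix
    if row.get("vaf") not in (None, ""):
        vaf = float(row["vaf"])
    else:
        alt_n, ref_n = int(row["t_alt_count"]), int(row["t_ref_count"])
        if alt_n + ref_n == 0:
            return None  # no reads, so no frequency can be computed
        vaf = alt_n / (alt_n + ref_n)  # assumed definition: alt reads / total reads
    return {**row, "Chromosome": chrom, "Reference_Allele": ref,
            "Tumor_Seq_Allele2": alt, "vaf": vaf}


row = {"Tumor_Sample_Barcode": "S1", "Chromosome": "chr7",
       "Start_Position": "140453136", "Reference_Allele": "A",
       "Tumor_Seq_Allele2": "T", "t_alt_count": "30", "t_ref_count": "70"}
print(missing_snv_columns(row.keys()))  # []
print(normalize_snv_row(row)["vaf"])    # 0.3
```

Rows that return `None` (indels, multi-base substitutions, zero-depth sites) would be filtered out before writing the CSV.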
56
+
57
+ ## Local installation
58
+
59
+ For users who want to run inference offline, integrate TESSERA into a custom pipeline, or retrain on their own data:
60
+
61
+ ```bash
62
+ # Clone
63
+ git clone https://github.com/JW-Sidhom-Lab/tessera.git
64
+ cd tessera
65
+
66
+ # Recommended: a virtual environment so deps don't clash with system Python
67
+ python3 -m venv .venv && source .venv/bin/activate
68
+
69
+ # Install all dependencies
70
+ pip install -r requirements.txt
71
+
72
+ # Download reference genome (default: GRCh37)
73
+ bash tessera/ref_genomes/download_ref_genomes.sh
74
+ ```
75
+
76
+ `requirements.txt` covers the foundation-model package, all manuscript-reproduction scripts (pretraining, classifiers, prognostic / predictive-biomarker analyses), and the Gradio inference API. A trimmer subset for deploying only the inference API is at [`inference_api/requirements.txt`](inference_api/requirements.txt).
77
+
78
+ Weights are hosted on Hugging Face Hub at [huggingface.co/JW-Sidhom-Lab/tessera-foundation](https://huggingface.co/JW-Sidhom-Lab/tessera-foundation) under CC-BY-NC-4.0. The shortest path from raw dataframes to feature tensors is the `featurize` one-liner, which downloads weights on first call (cached afterwards), lifts non-hg19 coordinates, builds the dataset, and runs both per-modality feature heads:
79
+
80
+ ```python
81
+ import tessera
82
+
83
+ result = tessera.featurize(
84
+ snv_df=snv_df, # columns: Tumor_Sample_Barcode, Chromosome, Start_Position,
85
+ # Reference_Allele, Tumor_Seq_Allele2, vaf
86
+ cna_df=cna_df, # columns: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean
87
+ variant="joint_snv_cna_noloh", # or "joint_snv_cna" for the with-LoH variant
88
+ from_assembly="GRCh38", # "GRCh37" / "hg19" is a no-op; otherwise UCSC liftover runs
89
+ )
90
+
91
+ result.snv_features # (n_variants, 1169) per-variant embeddings, row-aligned with result.snv_table
92
+ result.cna_features # (n_segments, 688) per-segment embeddings, row-aligned with result.cna_table
93
+ result.liftover_stats # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
94
+ ```
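Because liftover can drop rows, it may be worth inspecting `result.liftover_stats` before using the features. The sketch below assumes the dictionary has the shape shown in the comment above, with integer counts; `liftover_drop_fractions` and the 5% threshold are illustrative choices, not part of the package.

```python
from typing import Dict


def liftover_drop_fractions(stats: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Fraction of input rows dropped during liftover, per modality."""
    fractions = {}
    for modality, counts in stats.items():
        n_in = counts["n_in"]
        fractions[modality] = counts["n_dropped"] / n_in if n_in else 0.0
    return fractions


# Made-up numbers for illustration, in the shape documented above.
example_stats = {
    "snv": {"n_in": 1000, "n_out": 900, "n_dropped": 100},
    "cna": {"n_in": 200, "n_out": 200, "n_dropped": 0},
}
for modality, frac in liftover_drop_fractions(example_stats).items():
    if frac > 0.05:  # arbitrary threshold chosen for this sketch
        print(f"warning: {frac:.1%} of {modality} rows were dropped by liftover")
```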
95
+
96
+ For finer-grained control, the underlying building blocks are also available:
97
+
98
+ ```python
99
+ from tessera import load_pretrained, lift_snv, lift_cna
100
+
101
+ model = load_pretrained("joint_snv_cna_noloh") # download + instantiate, ~3 s cold
102
+ snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38") # identity if from_assembly=="GRCh37"
103
+ cna_df, _ = lift_cna(cna_df, from_assembly="GRCh38")
104
+ result = model.featurize(snv_df=snv_df, cna_df=cna_df) # repeat without re-downloading
105
+ ```
106
+
107
+ UCSC chain files are downloaded on first use and cached at `~/.cache/pyliftover/`; offline environments can point the loader at a bundled chain file via the `chain_file=` argument or the `TESSERA_LIFTOVER_CHAIN` environment variable.
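For offline environments, the two override mechanisms described above could be combined as in the following sketch. `resolve_chain_file` is a hypothetical helper, and the precedence it applies (explicit argument first, then `TESSERA_LIFTOVER_CHAIN`) is an assumption; the README does not say which one wins when both are set.

```python
import os
from typing import Optional

ENV_VAR = "TESSERA_LIFTOVER_CHAIN"


def resolve_chain_file(chain_file: Optional[str] = None) -> Optional[str]:
    """Pick a local chain file, or return None to fall back to the default download.

    Assumed precedence (not confirmed by the README): explicit argument first,
    then the environment variable.
    """
    candidate = chain_file or os.environ.get(ENV_VAR)
    if not candidate:
        return None  # nothing configured: let the library download and cache
    path = os.path.expanduser(candidate)
    if not os.path.isfile(path):
        # Fail early with a clear message instead of deep inside the liftover call.
        raise FileNotFoundError(f"chain file not found: {path}")
    return path
```

The returned path would then be passed as `chain_file=` to whichever call accepts that argument. Checking that the file exists up front is useful on air-gapped machines, where a silent fallback to downloading would otherwise fail later.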
108
+
109
+ ## Reproducing the manuscript
110
+
111
+ Every published panel is backed by a script in this repository. The
112
+ pipeline runs in three stages:
113
+
114
+ 1. **Data preparation** ([`data/`](data/README.md)): per-cohort
115
+ download instructions, source-table provenance, and the
116
+ `create_training_data*.py` / `build_<cohort>_metadata.py` builders
117
+ that turn raw releases into the analysis-ready CSVs.
118
+ 2. **Foundation-model pretraining**
119
+ ([`scripts/tcga_pancan_*/`](scripts/README.md)): trains the SNV
120
+ models, the CNA models, and the joint SNV+CNA InfoNCE-aligned
121
+ foundation model on the TCGA Pan-Cancer Atlas.
122
+ 3. **Downstream analyses** ([`scripts/`](scripts/README.md)):
123
+ variant-pathogenicity (Fig. 1 h-o), cross-platform validation
124
+ (Fig. 1 f-g, Fig. 2 d), tumour-type classification (Fig. 3,
125
+ Fig. 4 b-e), prognostic UMAP + joint Cox (Fig. 5), doubly-robust
126
+ counterfactual treatment-effect (Fig. 6 a-m), and DepMap
127
+ cell-line transfer (Fig. 6 n).
128
+
129
+ [`scripts/README.md`](scripts/README.md) and
130
+ [`data/README.md`](data/README.md) hold the full per-directory tables
131
+ mapping each script and cohort to its manuscript figure.
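Stage 2 above refers to an InfoNCE-aligned joint SNV+CNA model. For readers unfamiliar with the term, the sketch below shows the textbook one-directional InfoNCE loss on paired embeddings. It is not taken from the TESSERA source: the cosine similarity, the temperature value, the function names, and the choice of which embeddings are paired are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(snv: List[Sequence[float]], cna: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """One-directional InfoNCE: row i of `snv` should match row i of `cna`."""
    total = 0.0
    for i, anchor in enumerate(snv):
        logits = [_cosine(anchor, other) / temperature for other in cna]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log softmax probability assigned to the true pair
        total += log_denominator - logits[i]
    return total / len(snv)


snv_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(snv_emb, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce(snv_emb, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)  # True
```

The loss is small when each SNV embedding is closest to its own CNA partner and large when the pairing is scrambled, which is the property an alignment objective of this kind relies on.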
132
+
133
+ ## Repository layout
134
+
135
+ ```
136
+ tessera/
137
+ ├── tessera/ # foundation-model package
138
+ │ ├── base.py # BaseModel: shared data + training infrastructure
139
+ │ ├── input_keys.py # input-key helpers
140
+ │ ├── model.py # TESSERA: foundation-model class
141
+ │ ├── data/
142
+ │ │ └── preprocessing.py # SNV/CNA tokenization, FASTA lookup, sample bagging
143
+ │ ├── layers/ # custom Keras layers (attention, masking, MIL, ...)
144
+ │ ├── training/ # training utilities (callbacks, losses, schedules)
145
+ │ └── ref_genomes/ # reference-genome download script + indices
146
+ ├── data/ # per-cohort data preparation pipelines (data/README.md)
147
+ ├── scripts/ # analysis pipelines backing the manuscript figures (scripts/README.md)
148
+ └── README.md
149
+ ```
150
+
151
+ ## Citing TESSERA
152
+
153
+ If you use TESSERA in your work, please cite:
154
+
155
+ > *citation pending publication*
156
+
157
+ A BibTeX entry will be added on acceptance.
158
+
159
+ ## License
160
+
161
+ This repository is distributed under the **PolyForm Noncommercial License 1.0.0** (see [`LICENSE`](LICENSE)). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license. Pretrained foundation-model weights are released on the Hugging Face Hub under **CC-BY-NC-4.0** (non-commercial, attribution required). Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement. Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial licensing inquiries should be directed to NYP's technology transfer office.
162
+
163
+ ## Lab
164
+
165
+ TESSERA is developed in the [JW Sidhom Lab](https://github.com/JW-Sidhom-Lab) at Weill Cornell Medicine.
166
+
167
+ For questions, collaborations, or commercial-licensing enquiries, contact the corresponding author.