tessera-foundation 0.1.0 (tar.gz)
- tessera_foundation-0.1.0/LICENSE +139 -0
- tessera_foundation-0.1.0/PKG-INFO +340 -0
- tessera_foundation-0.1.0/README.md +167 -0
- tessera_foundation-0.1.0/pyproject.toml +75 -0
- tessera_foundation-0.1.0/setup.cfg +4 -0
- tessera_foundation-0.1.0/tessera/__init__.py +20 -0
- tessera_foundation-0.1.0/tessera/_legacy.py +187 -0
- tessera_foundation-0.1.0/tessera/base.py +1117 -0
- tessera_foundation-0.1.0/tessera/data/__init__.py +1 -0
- tessera_foundation-0.1.0/tessera/data/liftover.py +193 -0
- tessera_foundation-0.1.0/tessera/data/preprocessing.py +749 -0
- tessera_foundation-0.1.0/tessera/hub.py +166 -0
- tessera_foundation-0.1.0/tessera/input_keys.py +95 -0
- tessera_foundation-0.1.0/tessera/layers/__init__.py +208 -0
- tessera_foundation-0.1.0/tessera/layers/act_functions.py +124 -0
- tessera_foundation-0.1.0/tessera/layers/attention.py +699 -0
- tessera_foundation-0.1.0/tessera/layers/cna_features.py +647 -0
- tessera_foundation-0.1.0/tessera/layers/cross_modal.py +317 -0
- tessera_foundation-0.1.0/tessera/layers/masking.py +504 -0
- tessera_foundation-0.1.0/tessera/layers/mil.py +499 -0
- tessera_foundation-0.1.0/tessera/layers/pipelines.py +730 -0
- tessera_foundation-0.1.0/tessera/layers/pooling.py +170 -0
- tessera_foundation-0.1.0/tessera/layers/positional.py +442 -0
- tessera_foundation-0.1.0/tessera/layers/utils.py +343 -0
- tessera_foundation-0.1.0/tessera/layers/variant_features.py +1054 -0
- tessera_foundation-0.1.0/tessera/model.py +2405 -0
- tessera_foundation-0.1.0/tessera/ref_genomes/GRCh37_chr_sizes.txt +93 -0
- tessera_foundation-0.1.0/tessera/ref_genomes/GRCh38_chr_sizes.txt +455 -0
- tessera_foundation-0.1.0/tessera/ref_genomes/download_ref_genomes.sh +108 -0
- tessera_foundation-0.1.0/tessera/training/__init__.py +83 -0
- tessera_foundation-0.1.0/tessera/training/callbacks.py +484 -0
- tessera_foundation-0.1.0/tessera/training/logging.py +233 -0
- tessera_foundation-0.1.0/tessera/training/losses.py +847 -0
- tessera_foundation-0.1.0/tessera/training/metrics.py +217 -0
- tessera_foundation-0.1.0/tessera/training/models.py +1610 -0
- tessera_foundation-0.1.0/tessera/training/schedules.py +160 -0
- tessera_foundation-0.1.0/tessera/training/utils.py +109 -0
- tessera_foundation-0.1.0/tessera_foundation.egg-info/PKG-INFO +340 -0
- tessera_foundation-0.1.0/tessera_foundation.egg-info/SOURCES.txt +40 -0
- tessera_foundation-0.1.0/tessera_foundation.egg-info/dependency_links.txt +1 -0
- tessera_foundation-0.1.0/tessera_foundation.egg-info/requires.txt +9 -0
- tessera_foundation-0.1.0/tessera_foundation.egg-info/top_level.txt +1 -0
tessera_foundation-0.1.0/LICENSE
@@ -0,0 +1,139 @@
+PolyForm Noncommercial License 1.0.0
+
+<https://polyformproject.org/licenses/noncommercial/1.0.0>
+
+## Acceptance
+
+In order to get any license under these terms, you must agree
+to them as both strict obligations and conditions to all
+your licenses.
+
+## Copyright License
+
+The licensor grants you a copyright license for the
+software to do everything you might do with the software
+that would otherwise infringe the licensor's copyright
+in it for any permitted purpose. However, you may
+only distribute the software according to [Distribution
+License](#distribution-license) and make changes or new works
+based on the software according to [Changes and New Works
+License](#changes-and-new-works-license).
+
+## Distribution License
+
+The licensor grants you an additional copyright license
+to distribute copies of the software. Your license
+to distribute covers distributing the software with
+changes and new works permitted by [Changes and New Works
+License](#changes-and-new-works-license).
+
+## Notices
+
+You must ensure that anyone who gets a copy of any part of
+the software from you also gets a copy of these terms or the
+URL for them above, as well as copies of any plain-text lines
+beginning with `Required Notice:` that the licensor provided
+with the software. For example:
+
+> Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
+> TESSERA is licensed for academic and non-commercial use only.
+> Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
+
+## Changes and New Works License
+
+The licensor grants you an additional copyright license to
+make changes and new works based on the software for any
+permitted purpose.
+
+## Patent License
+
+The licensor grants you a patent license for the software that
+covers patent claims the licensor can license, or becomes able
+to license, that you would infringe by using the software.
+
+## Noncommercial Purposes
+
+Any noncommercial purpose is a permitted purpose.
+
+## Personal Uses
+
+Personal use for research, experiment, and testing for
+the benefit of public knowledge, personal study, private
+entertainment, hobby projects, amateur pursuits, or religious
+observance, without any anticipated commercial application,
+is use for a permitted purpose.
+
+## Noncommercial Organizations
+
+Use by any charitable organization, educational institution,
+public research organization, public safety or health
+organization, environmental protection organization,
+or government institution is use for a permitted purpose
+regardless of the source of funding or obligations resulting
+from the funding.
+
+## Fair Use
+
+You may have "fair use" rights for the software under the
+law. These terms do not limit them.
+
+## No Other Rights
+
+These terms do not allow you to sublicense or transfer any of
+your licenses to anyone else, or prevent the licensor from
+granting licenses to anyone else. These terms do not imply
+any other licenses.
+
+## Patent Defense
+
+If you make any written claim that the software infringes or
+contributes to infringement of any patent, your patent license
+for the software granted under these terms ends immediately. If
+your company makes such a claim, your patent license ends
+immediately for work on behalf of your company.
+
+## Violations
+
+The first time you are notified in writing that you have
+violated any of these terms, or done anything with the software
+not covered by your licenses, your licenses can nonetheless
+continue if you come into full compliance with these terms,
+and take practical steps to correct past violations, within
+32 days of receiving notice. Otherwise, all your licenses
+end immediately.
+
+## No Liability
+
+***As far as the law allows, the software comes as is, without
+any warranty or condition, and the licensor will not be liable
+to you for any damages arising out of these terms or the use
+or nature of the software, under any kind of legal claim.***
+
+## Definitions
+
+The **licensor** is the individual or entity offering these
+terms, and the **software** is the software the licensor makes
+available under these terms.
+
+**You** refers to the individual or entity agreeing to these
+terms.
+
+**Your company** is any legal entity, sole proprietorship,
+or other kind of organization that you work for, plus all
+organizations that have control over, are under the control of,
+or are under common control with that organization. **Control**
+means ownership of substantially all the assets of an entity,
+or the power to direct its management and policies by vote,
+contract, or otherwise. Control can be direct or indirect.
+
+**Your licenses** are all the licenses granted to you for the
+software under these terms.
+
+**Use** means anything you do with the software requiring one
+of your licenses.
+
+---
+
+Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
+TESSERA is licensed for academic and non-commercial use only.
+Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
tessera_foundation-0.1.0/PKG-INFO
@@ -0,0 +1,340 @@
+Metadata-Version: 2.4
+Name: tessera-foundation
+Version: 0.1.0
+Summary: TESSERA: a foundation model for the cancer genome (joint SNV+CNA self-supervised pretraining).
+Author-email: John-William Sidhom <johnwilliamsidhom@gmail.com>
+License: PolyForm Noncommercial License 1.0.0
+
+<https://polyformproject.org/licenses/noncommercial/1.0.0>
+
+## Acceptance
+
+In order to get any license under these terms, you must agree
+to them as both strict obligations and conditions to all
+your licenses.
+
+## Copyright License
+
+The licensor grants you a copyright license for the
+software to do everything you might do with the software
+that would otherwise infringe the licensor's copyright
+in it for any permitted purpose. However, you may
+only distribute the software according to [Distribution
+License](#distribution-license) and make changes or new works
+based on the software according to [Changes and New Works
+License](#changes-and-new-works-license).
+
+## Distribution License
+
+The licensor grants you an additional copyright license
+to distribute copies of the software. Your license
+to distribute covers distributing the software with
+changes and new works permitted by [Changes and New Works
+License](#changes-and-new-works-license).
+
+## Notices
+
+You must ensure that anyone who gets a copy of any part of
+the software from you also gets a copy of these terms or the
+URL for them above, as well as copies of any plain-text lines
+beginning with `Required Notice:` that the licensor provided
+with the software. For example:
+
+> Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
+> TESSERA is licensed for academic and non-commercial use only.
+> Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
+
+## Changes and New Works License
+
+The licensor grants you an additional copyright license to
+make changes and new works based on the software for any
+permitted purpose.
+
+## Patent License
+
+The licensor grants you a patent license for the software that
+covers patent claims the licensor can license, or becomes able
+to license, that you would infringe by using the software.
+
+## Noncommercial Purposes
+
+Any noncommercial purpose is a permitted purpose.
+
+## Personal Uses
+
+Personal use for research, experiment, and testing for
+the benefit of public knowledge, personal study, private
+entertainment, hobby projects, amateur pursuits, or religious
+observance, without any anticipated commercial application,
+is use for a permitted purpose.
+
+## Noncommercial Organizations
+
+Use by any charitable organization, educational institution,
+public research organization, public safety or health
+organization, environmental protection organization,
+or government institution is use for a permitted purpose
+regardless of the source of funding or obligations resulting
+from the funding.
+
+## Fair Use
+
+You may have "fair use" rights for the software under the
+law. These terms do not limit them.
+
+## No Other Rights
+
+These terms do not allow you to sublicense or transfer any of
+your licenses to anyone else, or prevent the licensor from
+granting licenses to anyone else. These terms do not imply
+any other licenses.
+
+## Patent Defense
+
+If you make any written claim that the software infringes or
+contributes to infringement of any patent, your patent license
+for the software granted under these terms ends immediately. If
+your company makes such a claim, your patent license ends
+immediately for work on behalf of your company.
+
+## Violations
+
+The first time you are notified in writing that you have
+violated any of these terms, or done anything with the software
+not covered by your licenses, your licenses can nonetheless
+continue if you come into full compliance with these terms,
+and take practical steps to correct past violations, within
+32 days of receiving notice. Otherwise, all your licenses
+end immediately.
+
+## No Liability
+
+***As far as the law allows, the software comes as is, without
+any warranty or condition, and the licensor will not be liable
+to you for any damages arising out of these terms or the use
+or nature of the software, under any kind of legal claim.***
+
+## Definitions
+
+The **licensor** is the individual or entity offering these
+terms, and the **software** is the software the licensor makes
+available under these terms.
+
+**You** refers to the individual or entity agreeing to these
+terms.
+
+**Your company** is any legal entity, sole proprietorship,
+or other kind of organization that you work for, plus all
+organizations that have control over, are under the control of,
+or are under common control with that organization. **Control**
+means ownership of substantially all the assets of an entity,
+or the power to direct its management and policies by vote,
+contract, or otherwise. Control can be direct or indirect.
+
+**Your licenses** are all the licenses granted to you for the
+software under these terms.
+
+**Use** means anything you do with the software requiring one
+of your licenses.
+
+---
+
+Required Notice: Copyright 2026 NewYork-Presbyterian and Weill Cornell Medicine.
+TESSERA is licensed for academic and non-commercial use only.
+Commercial licensing: contact NewYork-Presbyterian's technology transfer office.
+
+Project-URL: Homepage, https://github.com/JW-Sidhom-Lab/tessera
+Project-URL: Repository, https://github.com/JW-Sidhom-Lab/tessera
+Project-URL: Model weights, https://huggingface.co/JW-Sidhom-Lab/tessera-foundation
+Project-URL: Issues, https://github.com/JW-Sidhom-Lab/tessera/issues
+Keywords: cancer-genomics,foundation-model,self-supervised-learning,tcga,somatic-variants,copy-number-alterations,bioinformatics
+Classifier: Development Status :: 4 - Beta
+Classifier: Intended Audience :: Science/Research
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Operating System :: OS Independent
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: tensorflow>=2.16
+Requires-Dist: numpy
+Requires-Dist: pandas>=2.0
+Requires-Dist: scipy>=1.10
+Requires-Dist: scikit-learn>=1.3
+Requires-Dist: pyfaidx>=0.7
+Requires-Dist: pyliftover>=0.4
+Requires-Dist: tqdm>=4.66
+Requires-Dist: huggingface_hub>=0.20
+Dynamic: license-file
+
+<p align="center">
+<img src="logo.png" alt="TESSERA logo" width="220">
+</p>
+
+<p align="center">
+<em>Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations</em><br>
+A foundation model for the cancer genome.
+</p>
+
+---
+
+TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.
+
+This repository contains the reference implementation, the pretrained-weights pointer, the inference utilities described in the accompanying paper, and the end-to-end analysis pipelines that reproduce every panel of Figures 1-6 and Supplementary Figures 1-12.
+
+## Quick start
+
+The fastest way to use TESSERA is via the public inference API on Hugging Face; no local installation required. Upload SNV and/or CNA data, get back per-variant predictions and embeddings:
+
+🔗 **Inference API**: [huggingface.co/spaces/JW-Sidhom-Lab/tessera](https://huggingface.co/spaces/JW-Sidhom-Lab/tessera) *(coming soon)*
+
+From Python (`pip install gradio_client`):
+
+```python
+import time
+from gradio_client import Client, handle_file
+
+client = Client("JW-Sidhom-Lab/tessera")  # the public Spaces URL also works
+
+# Submit returns (status_html, job_id) immediately; inference runs async
+_, job_id = client.predict(
+    handle_file("snv.csv"),  # SNV CSV; or None
+    handle_file("cna.csv"),  # CNA CSV; or None. At least one required.
+    True,  # apply TCGA quantile normalization to CNA
+    "you@example.com",  # email address for the download link
+    "GRCh37",  # genome assembly: "GRCh37" or "GRCh38"
+    api_name="/submit",
+)
+
+# Poll for completion (the same URL also gets emailed when the job finishes)
+while True:
+    status = client.predict(job_id, api_name="/status")
+    if status["status"] in ("done", "failed"):
+        break
+    time.sleep(10)
+
+print(status["url"])  # 24h pre-signed S3 download URL with the result ZIP
+```
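The polling loop above runs until the job reports `done` or `failed`, with no upper bound on the wait, and reads `status["url"]` in either case. A minimal sketch of a bounded variant follows. `wait_for_job`, its `timeout_s` and `interval_s` parameters, and the assumption that only a `done` status carries a usable `url` are illustrative and not part of the TESSERA API. The status fetcher is passed in as a callable, so the helper itself does not depend on `gradio_client`.

```python
import time
from typing import Callable, Mapping


def wait_for_job(
    fetch_status: Callable[[], Mapping[str, str]],
    timeout_s: float = 3600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until the job finishes and return the download URL.

    Hypothetical helper: assumes the status payload is a mapping with a
    "status" key and, on success, a "url" key, as in the README example.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status["status"] == "done":
            return status["url"]
        if status["status"] == "failed":
            # Surface the failure instead of reading a URL that may not exist.
            raise RuntimeError(f"inference job failed: {dict(status)}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout_s} s")
        sleep(interval_s)
```

With the client from the example above, this could be called as `wait_for_job(lambda: client.predict(job_id, api_name="/status"))`.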
+
+The API serves the foundation-model outputs only (per-token embeddings + per-token reconstruction predictions, returned as `.npy` files inside the result ZIP). Downstream task heads (tumour-type classifier, treatment-effect score) are available on request under a Data Use Agreement.
+
+CSV column conventions:
+
+- **SNV**: `Tumor_Sample_Barcode`, `Chromosome` (no `chr` prefix), `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, plus either `vaf` or both `t_alt_count` + `t_ref_count`. Single-base substitutions only.
+- **CNA**: `Tumor_Sample_Barcode`, `Chromosome`, `Start`, `End`, `Segment_Mean` (log2 ratio); optional `LOH` column triggers the with-LoH model variant.
+
+## Local installation
+
+For users who want to run inference offline, integrate TESSERA into a custom pipeline, or retrain on their own data:
+
+```bash
+# Clone
+git clone https://github.com/JW-Sidhom-Lab/tessera.git
+cd tessera
+
+# Recommended: a virtual environment so deps don't clash with system Python
+python3 -m venv .venv && source .venv/bin/activate
+
+# Install all dependencies
+pip install -r requirements.txt
+
+# Download reference genome (default: GRCh37)
+bash tessera/ref_genomes/download_ref_genomes.sh
+```
+
+`requirements.txt` covers the foundation-model package, all manuscript-reproduction scripts (pretraining, classifiers, prognostic / predictive-biomarker analyses), and the Gradio inference API. A trimmer subset for deploying only the inference API is at [`inference_api/requirements.txt`](inference_api/requirements.txt).
+
+Weights are hosted on Hugging Face Hub at [huggingface.co/JW-Sidhom-Lab/tessera-foundation](https://huggingface.co/JW-Sidhom-Lab/tessera-foundation) under CC-BY-NC-4.0. The shortest path from raw dataframes to feature tensors is the `featurize` one-liner, which downloads weights on first call (cached afterwards), lifts non-hg19 coordinates, builds the dataset, and runs both per-modality feature heads:
+
+```python
+import tessera
+
+result = tessera.featurize(
+    snv_df=snv_df,  # columns: Tumor_Sample_Barcode, Chromosome, Start_Position,
+                    # Reference_Allele, Tumor_Seq_Allele2, vaf
+    cna_df=cna_df,  # columns: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean
+    variant="joint_snv_cna_noloh",  # or "joint_snv_cna" for the with-LoH variant
+    from_assembly="GRCh38",  # "GRCh37" / "hg19" is a no-op; otherwise UCSC liftover runs
+)
+
+result.snv_features  # (n_variants, 1169) per-variant embeddings, row-aligned with result.snv_table
+result.cna_features  # (n_segments, 688) per-segment embeddings, row-aligned with result.cna_table
+result.liftover_stats  # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
+```
+
+For finer-grained control there are still building blocks:
+
+```python
+from tessera import load_pretrained, lift_snv, lift_cna
+
+model = load_pretrained("joint_snv_cna_noloh")  # download + instantiate, ~3 s cold
+snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38")  # identity if from_assembly=="GRCh37"
+cna_df, _ = lift_cna(cna_df, from_assembly="GRCh38")
+result = model.featurize(snv_df=snv_df, cna_df=cna_df)  # repeat without re-downloading
+```
+
+UCSC chain files are downloaded on first use and cached at `~/.cache/pyliftover/`; offline environments can point the loader at a bundled chain file via the `chain_file=` argument or the `TESSERA_LIFTOVER_CHAIN` environment variable.
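For offline use, the choice between the `chain_file=` argument, the `TESSERA_LIFTOVER_CHAIN` variable, and the download-and-cache default can be made explicit before any liftover call. This is a minimal sketch: `resolve_chain_file` is a hypothetical helper, and the precedence shown (explicit argument first, then the environment variable, then `None` to fall back to the default download) is an assumption rather than documented TESSERA behaviour.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_chain_file(chain_file: Optional[str] = None) -> Optional[Path]:
    """Pick a local liftover chain file, or None to use the default download.

    Assumed precedence: explicit argument, then TESSERA_LIFTOVER_CHAIN.
    """
    candidate = chain_file or os.environ.get("TESSERA_LIFTOVER_CHAIN")
    if not candidate:
        return None
    path = Path(candidate).expanduser()
    # Fail early in offline environments rather than at download time.
    if not path.is_file():
        raise FileNotFoundError(f"chain file not found: {path}")
    return path
```

A non-`None` result could then be passed on as the `chain_file=` argument mentioned above.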
+
+## Reproducing the manuscript
+
+Every published panel is backed by a script in this repository. The
+pipeline runs in three stages:
+
+1. **Data preparation** ([`data/`](data/README.md)): per-cohort
+   download instructions, source-table provenance, and the
+   `create_training_data*.py` / `build_<cohort>_metadata.py` builders
+   that turn raw releases into the analysis-ready CSVs.
+2. **Foundation-model pretraining**
+   ([`scripts/tcga_pancan_*/`](scripts/README.md)): trains the SNV
+   models, the CNA models, and the joint SNV+CNA InfoNCE-aligned
+   foundation model on the TCGA Pan-Cancer Atlas.
+3. **Downstream analyses** ([`scripts/`](scripts/README.md)):
+   variant-pathogenicity (Fig. 1 h-o), cross-platform validation
+   (Fig. 1 f-g, Fig. 2 d), tumour-type classification (Fig. 3,
+   Fig. 4 b-e), prognostic UMAP + joint Cox (Fig. 5), doubly-robust
+   counterfactual treatment-effect (Fig. 6 a-m), and DepMap
+   cell-line transfer (Fig. 6 n).
+
+[`scripts/README.md`](scripts/README.md) and
+[`data/README.md`](data/README.md) hold the full per-directory tables
+mapping each script and cohort to its manuscript figure.
+
+## Repository layout
+
+```
+tessera/
+├── tessera/                  # foundation-model package
+│   ├── base.py               # BaseModel: shared data + training infrastructure
+│   ├── input_keys.py         # input-key helpers
+│   ├── model.py              # TESSERA: foundation-model class
+│   ├── data/
+│   │   └── preprocessing.py  # SNV/CNA tokenization, FASTA lookup, sample bagging
+│   ├── layers/               # custom Keras layers (attention, masking, MIL, ...)
+│   ├── training/             # training utilities (callbacks, losses, schedules)
+│   └── ref_genomes/          # reference-genome download script + indices
+├── data/                     # per-cohort data preparation pipelines (data/README.md)
+├── scripts/                  # analysis pipelines backing the manuscript figures (scripts/README.md)
+└── README.md
+```
+
+## Citing TESSERA
+
+If you use TESSERA in your work, please cite:
+
+> *citation pending publication*
+
+A BibTeX entry will be added on acceptance.
+
+## License
+
+This repository is distributed under the **PolyForm Noncommercial License 1.0.0** (see [`LICENSE`](LICENSE)). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license. Pretrained foundation-model weights are released on the Hugging Face Hub under **CC-BY-NC-4.0** (non-commercial, attribution required). Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement. Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial licensing inquiries should be directed to NYP's technology transfer office.
+
+## Lab
+
+TESSERA is developed in the [JW Sidhom Lab](https://github.com/JW-Sidhom-Lab) at Weill Cornell Medicine.
+
+For questions, collaborations, or commercial-licensing enquiries, contact the corresponding author.
tessera_foundation-0.1.0/README.md
@@ -0,0 +1,167 @@
+<p align="center">
+<img src="logo.png" alt="TESSERA logo" width="220">
+</p>
+
+<p align="center">
+<em>Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations</em><br>
+A foundation model for the cancer genome.
+</p>
+
+---
+
+TESSERA is a self-supervised foundation model jointly pretrained on somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) from the TCGA Pan-Cancer Atlas. A single learned representation, produced once and reused without retraining, supports variant pathogenicity prediction, pan-cancer tumour-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation.
+
+This repository contains the reference implementation, the pretrained-weights pointer, the inference utilities described in the accompanying paper, and the end-to-end analysis pipelines that reproduce every panel of Figures 1-6 and Supplementary Figures 1-12.
+
+## Quick start
+
+The fastest way to use TESSERA is via the public inference API on Hugging Face; no local installation required. Upload SNV and/or CNA data, get back per-variant predictions and embeddings:
+
+🔗 **Inference API**: [huggingface.co/spaces/JW-Sidhom-Lab/tessera](https://huggingface.co/spaces/JW-Sidhom-Lab/tessera) *(coming soon)*
+
+From Python (`pip install gradio_client`):
+
+```python
+import time
+from gradio_client import Client, handle_file
+
+client = Client("JW-Sidhom-Lab/tessera")  # the public Spaces URL also works
+
+# Submit returns (status_html, job_id) immediately; inference runs async
+_, job_id = client.predict(
+    handle_file("snv.csv"),  # SNV CSV; or None
+    handle_file("cna.csv"),  # CNA CSV; or None. At least one required.
+    True,  # apply TCGA quantile normalization to CNA
+    "you@example.com",  # email address for the download link
+    "GRCh37",  # genome assembly: "GRCh37" or "GRCh38"
+    api_name="/submit",
+)
+
+# Poll for completion (the same URL also gets emailed when the job finishes)
+while True:
+    status = client.predict(job_id, api_name="/status")
+    if status["status"] in ("done", "failed"):
+        break
+    time.sleep(10)
+
+print(status["url"])  # 24h pre-signed S3 download URL with the result ZIP
+```
+
+The API serves the foundation-model outputs only (per-token embeddings + per-token reconstruction predictions, returned as `.npy` files inside the result ZIP). Downstream task heads (tumour-type classifier, treatment-effect score) are available on request under a Data Use Agreement.
+
+CSV column conventions:
+
+- **SNV**: `Tumor_Sample_Barcode`, `Chromosome` (no `chr` prefix), `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, plus either `vaf` or both `t_alt_count` + `t_ref_count`. Single-base substitutions only.
+- **CNA**: `Tumor_Sample_Barcode`, `Chromosome`, `Start`, `End`, `Segment_Mean` (log2 ratio); optional `LOH` column triggers the with-LoH model variant.
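The SNV conventions above can be checked row by row before upload. The following is a minimal sketch and not part of the package: `normalize_snv_row` is a hypothetical helper that strips a `chr` prefix, keeps single-base substitutions only, and derives `vaf` as `t_alt_count / (t_alt_count + t_ref_count)` when no `vaf` value is present. That formula is the usual definition of variant allele fraction and is an assumption here; the README does not state how TESSERA derives it from the counts.

```python
from typing import Optional

BASES = {"A", "C", "G", "T"}
REQUIRED = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def normalize_snv_row(row: dict) -> Optional[dict]:
    """Return a cleaned copy of one SNV row, or None if the row is unusable."""
    missing = [col for col in REQUIRED if col not in row]
    if missing:
        raise KeyError(f"missing SNV columns: {missing}")

    ref = str(row["Reference_Allele"]).upper()
    alt = str(row["Tumor_Seq_Allele2"]).upper()
    # Single-base substitutions only: one base each, and actually different.
    if ref not in BASES or alt not in BASES or ref == alt:
        return None

    out = dict(row)
    out["Reference_Allele"], out["Tumor_Seq_Allele2"] = ref, alt

    # The convention is no "chr" prefix, e.g. "17" rather than "chr17".
    chrom = str(row["Chromosome"])
    out["Chromosome"] = chrom[3:] if chrom.lower().startswith("chr") else chrom

    if row.get("vaf") in (None, ""):
        alt_n = float(row["t_alt_count"])
        ref_n = float(row["t_ref_count"])
        depth = alt_n + ref_n
        if depth <= 0:
            return None  # no reads, so the fraction is undefined
        out["vaf"] = alt_n / depth  # assumed definition of VAF
    else:
        out["vaf"] = float(row["vaf"])
    return out
```

Rows read with `csv.DictReader` can be passed through this function and the `None` results dropped before the file is written back out.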

## Local installation

For users who want to run inference offline, integrate TESSERA into a custom pipeline, or retrain on their own data:

```bash
# Clone
git clone https://github.com/JW-Sidhom-Lab/tessera.git
cd tessera

# Recommended: a virtual environment so deps don't clash with system Python
python3 -m venv .venv && source .venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Download reference genome (default: GRCh37)
bash tessera/ref_genomes/download_ref_genomes.sh
```

`requirements.txt` covers the foundation-model package, all manuscript-reproduction scripts (pretraining, classifiers, prognostic / predictive-biomarker analyses), and the Gradio inference API. A trimmer subset for deploying only the inference API is at [`inference_api/requirements.txt`](inference_api/requirements.txt).

Weights are hosted on Hugging Face Hub at [huggingface.co/JW-Sidhom-Lab/tessera-foundation](https://huggingface.co/JW-Sidhom-Lab/tessera-foundation) under CC-BY-NC-4.0. The shortest path from raw dataframes to feature tensors is the `featurize` one-liner, which downloads weights on first call (cached afterwards), lifts non-hg19 coordinates, builds the dataset, and runs both per-modality feature heads:

```python
import tessera

result = tessera.featurize(
    snv_df=snv_df,                  # columns: Tumor_Sample_Barcode, Chromosome, Start_Position,
                                    #          Reference_Allele, Tumor_Seq_Allele2, vaf
    cna_df=cna_df,                  # columns: Tumor_Sample_Barcode, Chromosome, Start, End, Segment_Mean
    variant="joint_snv_cna_noloh",  # or "joint_snv_cna" for the with-LoH variant
    from_assembly="GRCh38",         # "GRCh37" / "hg19" is a no-op; otherwise UCSC liftover runs
)

result.snv_features    # (n_variants, 1169) per-variant embeddings, row-aligned with result.snv_table
result.cna_features    # (n_segments, 688) per-segment embeddings, row-aligned with result.cna_table
result.liftover_stats  # {"snv": {"n_in", "n_out", "n_dropped"}, "cna": {...}}
```
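
Because liftover can silently drop rows that do not map between assemblies, it may be worth inspecting `liftover_stats` before using the features. The sketch below assumes the structure shown in the comment above, with integer counts under `n_in`, `n_out`, and `n_dropped`, and assumes the `cna` entry has the same keys as `snv`. The helper name and the example numbers are hypothetical.

```python
from typing import Dict


def liftover_drop_rates(stats: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Fraction of input rows dropped by liftover, per modality."""
    rates = {}
    for modality, counts in stats.items():
        n_in = counts["n_in"]
        # Guard against an empty modality (e.g. no CNA table supplied)
        rates[modality] = counts["n_dropped"] / n_in if n_in else 0.0
    return rates


# Made-up numbers shaped like result.liftover_stats
example = {
    "snv": {"n_in": 1000, "n_out": 990, "n_dropped": 10},
    "cna": {"n_in": 200, "n_out": 200, "n_dropped": 0},
}
print(liftover_drop_rates(example))  # {'snv': 0.01, 'cna': 0.0}
```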

For finer-grained control, the individual building blocks are also available:

```python
from tessera import load_pretrained, lift_snv, lift_cna

model = load_pretrained("joint_snv_cna_noloh")           # download + instantiate, ~3 s cold
snv_df, _ = lift_snv(snv_df, from_assembly="GRCh38")     # identity if from_assembly=="GRCh37"
cna_df, _ = lift_cna(cna_df, from_assembly="GRCh38")
result = model.featurize(snv_df=snv_df, cna_df=cna_df)   # repeat without re-downloading
```

UCSC chain files are downloaded on first use and cached at `~/.cache/pyliftover/`; offline environments can point the loader at a bundled chain file via the `chain_file=` argument or the `TESSERA_LIFTOVER_CHAIN` environment variable.
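
For an offline machine, one way to use the environment variable is to validate the path and export it before any liftover call. This is a sketch only: `use_local_chain` is a hypothetical helper, and it is an assumption that TESSERA reads `TESSERA_LIFTOVER_CHAIN` when liftover runs rather than at import time. Setting it before importing `tessera` is the safer order.

```python
import os
from pathlib import Path


def use_local_chain(chain_path: str) -> Path:
    """Export TESSERA_LIFTOVER_CHAIN after checking that the file exists."""
    path = Path(chain_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"chain file not found: {path}")
    os.environ["TESSERA_LIFTOVER_CHAIN"] = str(path)
    return path
```

For GRCh38 input, the relevant UCSC chain file is `hg38ToHg19.over.chain.gz`, so a call might look like `use_local_chain("/data/chains/hg38ToHg19.over.chain.gz")`, where the directory is a placeholder.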

## Reproducing the manuscript

Every published panel is backed by a script in this repository. The
pipeline runs in three stages:

1. **Data preparation** ([`data/`](data/README.md)): per-cohort
   download instructions, source-table provenance, and the
   `create_training_data*.py` / `build_<cohort>_metadata.py` builders
   that turn raw releases into the analysis-ready CSVs.
2. **Foundation-model pretraining**
   ([`scripts/tcga_pancan_*/`](scripts/README.md)): trains the SNV
   models, the CNA models, and the joint SNV+CNA InfoNCE-aligned
   foundation model on the TCGA Pan-Cancer Atlas.
3. **Downstream analyses** ([`scripts/`](scripts/README.md)):
   variant-pathogenicity (Fig. 1 h-o), cross-platform validation
   (Fig. 1 f-g, Fig. 2 d), tumour-type classification (Fig. 3,
   Fig. 4 b-e), prognostic UMAP + joint Cox (Fig. 5), doubly-robust
   counterfactual treatment-effect (Fig. 6 a-m), and DepMap
   cell-line transfer (Fig. 6 n).

[`scripts/README.md`](scripts/README.md) and
[`data/README.md`](data/README.md) hold the full per-directory tables
mapping each script and cohort to its manuscript figure.

## Repository layout

```
tessera/
├── tessera/                   # foundation-model package
│   ├── base.py                # BaseModel: shared data + training infrastructure
│   ├── input_keys.py          # input-key helpers
│   ├── model.py               # TESSERA: foundation-model class
│   ├── data/
│   │   └── preprocessing.py   # SNV/CNA tokenization, FASTA lookup, sample bagging
│   ├── layers/                # custom Keras layers (attention, masking, MIL, ...)
│   ├── training/              # training utilities (callbacks, losses, schedules)
│   └── ref_genomes/           # reference-genome download script + indices
├── data/                      # per-cohort data preparation pipelines (data/README.md)
├── scripts/                   # analysis pipelines backing the manuscript figures (scripts/README.md)
└── README.md
```

## Citing TESSERA

If you use TESSERA in your work, please cite:

> *citation pending publication*

A BibTeX entry will be added on acceptance.

## License

This repository is distributed under the **PolyForm Noncommercial License 1.0.0** (see [`LICENSE`](LICENSE)). Use is permitted for academic research, education, public-research-organization use, and personal experimentation; commercial use is not permitted without a separate license. Pretrained foundation-model weights are released on the Hugging Face Hub under **CC-BY-NC-4.0** (non-commercial, attribution required). Pretrained weights for downstream clinical task heads (CRC and PDAC treatment-effect models) remain available on request under a Data Use Agreement. Patents covering clinical applications of TESSERA are assigned to NewYork-Presbyterian; commercial licensing inquiries should be directed to NYP's technology transfer office.

## Lab

TESSERA is developed in the [JW Sidhom Lab](https://github.com/JW-Sidhom-Lab) at Weill Cornell Medicine.

For questions, collaborations, or commercial-licensing enquiries, contact the corresponding author.