just-prs 0.2.0__tar.gz

Metadata-Version: 2.3
Name: just-prs
Version: 0.2.0
Summary: Polars-bio based tool to compute polygenic risk scores from PGS Catalog
Author: antonkulaga
Author-email: antonkulaga <antonkulaga@gmail.com>
Requires-Dist: polars-bio>=0.23.0
Requires-Dist: pycomfort>=0.0.18
Requires-Dist: typer[all]>=0.15.0
Requires-Dist: httpx>=0.28.0
Requires-Dist: pydantic>=2.0
Requires-Dist: eliot>=1.15.0
Requires-Dist: fsspec[http]>=2026.2.0
Requires-Dist: huggingface-hub>=0.28.0
Requires-Dist: python-dotenv>=1.1.0
Requires-Dist: prs-ui ; extra == 'ui'
Requires-Python: >=3.14
Provides-Extra: ui
Description-Content-Type: text/markdown

# just-prs

[![PyPI version](https://badge.fury.io/py/just-prs.svg)](https://badge.fury.io/py/just-prs)

A [Polars](https://pola.rs/)-bio-based tool to compute **Polygenic Risk Scores (PRS)** from the [PGS Catalog](https://www.pgscatalog.org/).

## Features

- **`PRSCatalog`** — high-level class for searching scores, computing PRS, and estimating percentiles using cleaned bulk metadata (no REST API calls needed)
- **Cleanup pipeline** — normalizes genome builds (hg19/hg38/NCBI36 → GRCh37/GRCh38/GRCh36), renames columns to snake_case, parses performance metric strings into structured numeric fields
- **HuggingFace sync** — cleaned metadata parquets are published to [just-dna-seq/polygenic_risk_scores](https://huggingface.co/datasets/just-dna-seq/polygenic_risk_scores) and auto-downloaded on first use
- Compute PRS for one or many scores against a VCF file
- Search and inspect PGS Catalog scores and traits via the REST API
- **Bulk download** the entire PGS Catalog metadata (all ~5,000+ scores) via EBI FTP — one HTTP request per sheet, not hundreds of API pages
- Stream harmonized scoring files directly from EBI FTP without storing intermediate `.gz` files
- All data saved as **Parquet** for fast, efficient downstream analysis with Polars
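
The genome-build normalization mentioned above can be pictured as a simple alias lookup. This is a minimal illustration only — the actual mapping lives in `just_prs.cleanup` and may handle more variants:

```python
# Minimal sketch of genome-build normalization (hypothetical helper;
# the real implementation in just_prs.cleanup may differ).
BUILD_ALIASES = {
    "hg19": "GRCh37",
    "hg38": "GRCh38",
    "NCBI36": "GRCh36",
    "GRCh37": "GRCh37",
    "GRCh38": "GRCh38",
}

def normalize_build(build: str) -> str:
    """Map a raw genome-build label to its normalized GRCh name."""
    return BUILD_ALIASES.get(build.strip(), build)
```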

## Validation against PLINK2

Our PRS computation is validated against [PLINK2](https://www.cog-genomics.org/plink/2.0/) `--score` on real genomic data. The integration test suite downloads a whole-genome VCF from Zenodo, computes PRS for multiple GRCh38 scores using both `just-prs` and PLINK2, and asserts agreement:

| PGS ID | just-prs | PLINK2 | Relative diff | Variants matched |
|--------|----------|--------|---------------|-----------------|
| PGS000001 | 0.030123 | 0.030123 | 6.5e-7 | 51 / 77 |
| PGS000002 | -0.137089 | -0.137089 | 1.1e-7 | 51 / 77 |
| PGS000003 | 0.588127 | 0.588127 | 8.1e-9 | 51 / 77 |
| PGS000004 | -0.7158 | -0.7158 | 3.1e-16 | 170 / 313 |
| PGS000005 | -0.8903 | -0.8903 | 5.0e-16 | 170 / 313 |

All differences are within floating-point precision. PLINK2 is auto-downloaded if not already installed, so the tests run on any Linux, macOS, or Windows machine:

```bash
uv run pytest tests/test_plink.py -v
```
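
The "Relative diff" column above boils down to a standard relative-difference check. A minimal sketch (the actual tolerance handling in `tests/test_plink.py` may differ):

```python
# Sketch of the relative-difference agreement check (illustrative only;
# the real assertion lives in tests/test_plink.py).
def relative_diff(a: float, b: float) -> float:
    """Relative difference between two PRS values, safe for near-zero scores."""
    denom = max(abs(a), abs(b), 1e-12)
    return abs(a - b) / denom

# Two scores agreeing to ~2e-8 absolutely give a relative diff ≈ 6.5e-7
demo = relative_diff(0.030123, 0.0301230195)
```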

## Installation

Requires Python ≥ 3.14. Uses [uv](https://github.com/astral-sh/uv) for dependency management.

**From PyPI:**

```bash
uv add just-prs
# or
pip install just-prs
```

**From source:**

```bash
git clone https://github.com/antonkulaga/just-prs
cd just-prs
uv sync
```

For the optional web UI: `pip install just-prs[ui]` or `uv sync --all-packages` when developing from source.

The CLI is available as both `just-prs` and `prs`.

## CLI Reference

### Top-level commands

```
prs --help
prs compute --help
prs catalog --help
```

---

### `prs compute` — Compute PRS for a VCF

```bash
prs compute --vcf sample.vcf.gz --pgs-id PGS000001
prs compute --vcf sample.vcf.gz --pgs-id PGS000001,PGS000002,PGS000003
prs compute --vcf sample.vcf.gz --pgs-id PGS000001 --build GRCh37 --output results.json
```

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--vcf / -v` | — | Path to VCF file (required) |
| `--pgs-id / -p` | — | Comma-separated PGS ID(s) (required) |
| `--build / -b` | `GRCh38` | Genome build |
| `--cache-dir` | `~/.cache/just-prs/scores` | Cache directory for scoring files |
| `--output / -o` | — | Save results as JSON |

---

### `prs catalog scores` — Search and inspect scores (REST API)

```bash
prs catalog scores list          # first 100 scores
prs catalog scores list --all    # every score in catalog
prs catalog scores search --term "breast cancer"
prs catalog scores info PGS000001
```

### `prs catalog traits` — Search and inspect traits (REST API)

```bash
prs catalog traits search --term "diabetes"
prs catalog traits info EFO_0001645
```

### `prs catalog download` — Download a single scoring file

Downloads the harmonized `.txt.gz` scoring file for one score and caches it locally.

```bash
prs catalog download PGS000001
prs catalog download PGS000001 --output-dir ./my_scores --build GRCh37
```

---

### `prs catalog bulk` — Bulk FTP downloads (fast, parquet output)

These commands use the [EBI FTP HTTPS mirror](https://ftp.ebi.ac.uk/pub/databases/spot/pgs/) via **fsspec** to download pre-built catalog-wide files directly — far faster than paginating the REST API.
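
The no-intermediate-file streaming works by decompressing on the fly instead of saving the archive first. A self-contained stdlib sketch of that idea, using a local byte buffer where the package itself streams over the EBI HTTPS mirror with fsspec:

```python
# Minimal stdlib sketch of decompress-on-the-fly streaming (illustrative;
# the package itself uses fsspec over the EBI HTTPS mirror, not gzip+BytesIO).
import gzip
import io

def stream_lines(gz_bytes: bytes):
    """Yield decoded text lines from gzipped bytes without a temp file."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        yield from fh

# A tiny stand-in for a harmonized scoring file
payload = gzip.compress(b"rsid\teffect_allele\teffect_weight\nrs123\tA\t0.05\n")
rows = [line.rstrip("\n") for line in stream_lines(payload)]
```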

#### `prs catalog bulk metadata` — All catalog metadata as parquet

Downloads the PGS Catalog bulk metadata CSVs and converts each to a parquet file.
The full catalog (~5,000+ scores) downloads in seconds as a single HTTP request per sheet.

```bash
# Download all 7 metadata sheets → ./output/pgs_metadata/*.parquet
prs catalog bulk metadata

# Download only the scores sheet
prs catalog bulk metadata --sheet scores

# Specify output directory; force re-download
prs catalog bulk metadata --output-dir /data/pgs --overwrite
```

Available sheets:

| Sheet | Contents |
|-------|----------|
| `scores` | All PGS scores and their metadata |
| `publications` | Publication sources for each PGS |
| `efo_traits` | Ontology-mapped trait information |
| `score_development_samples` | GWAS and training samples |
| `performance_metrics` | Evaluation performance metrics |
| `evaluation_sample_sets` | Evaluation sample set descriptions |
| `cohorts` | Cohort information |

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir / -o` | `./output/pgs_metadata` | Directory for parquet output |
| `--sheet / -s` | all sheets | Single sheet name to download |
| `--overwrite` | `False` | Re-download existing files |

#### `prs catalog bulk scores` — All scoring files as parquet

Streams each harmonized scoring file from EBI FTP and saves it as a parquet file
(with an added `pgs_id` column). No intermediate `.gz` files are written to disk.

```bash
# Download ALL ~5,000+ scoring files (GRCh38) → ./output/pgs_scores/PGS######.parquet
prs catalog bulk scores

# Download a specific subset
prs catalog bulk scores --ids PGS000001,PGS000002,PGS000003

# GRCh37 build, custom output dir
prs catalog bulk scores --build GRCh37 --output-dir /data/scores

# Force re-download of existing files
prs catalog bulk scores --ids PGS000001 --overwrite
```

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir / -o` | `./output/pgs_scores` | Directory for parquet output |
| `--build / -b` | `GRCh38` | Genome build (`GRCh37` or `GRCh38`) |
| `--ids` | all | Comma-separated PGS IDs to download |
| `--overwrite` | `False` | Re-download existing parquet files |

#### `prs catalog bulk clean-metadata` — Build cleaned metadata parquets

Downloads raw metadata from EBI FTP, runs the cleanup pipeline (genome build normalization, column renaming, metric parsing, performance flattening), and saves three cleaned parquet files.

```bash
# Build cleaned parquets → ./output/pgs_metadata/
prs catalog bulk clean-metadata

# Custom output directory
prs catalog bulk clean-metadata --output-dir /data/cleaned
```

Output files:

| File | Contents |
|------|----------|
| `scores.parquet` | All PGS scores with snake_case columns, normalized genome builds |
| `performance.parquet` | Performance metrics joined with evaluation samples, parsed numeric columns |
| `best_performance.parquet` | One best row per PGS ID (largest sample, European-preferred) |

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir / -o` | `./output/pgs_metadata` | Directory for cleaned parquet output |

#### `prs catalog bulk push-hf` — Push cleaned parquets to HuggingFace

Uploads cleaned metadata parquets to a HuggingFace dataset repository. Builds them first if not already present. The token is read from a `.env` file or the `HF_TOKEN` environment variable.

```bash
# Push to default repo (just-dna-seq/polygenic_risk_scores)
prs catalog bulk push-hf

# Push from a custom directory to a custom repo
prs catalog bulk push-hf --output-dir /data/cleaned --repo my-org/my-dataset
```

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir / -o` | `./output/pgs_metadata` | Directory containing cleaned parquets |
| `--repo / -r` | `just-dna-seq/polygenic_risk_scores` | HuggingFace dataset repo ID |

#### `prs catalog bulk pull-hf` — Pull cleaned parquets from HuggingFace

Downloads cleaned metadata parquets from a HuggingFace dataset repository. Useful for bootstrapping a local cache without running the cleanup pipeline.

```bash
# Pull to default directory
prs catalog bulk pull-hf

# Pull to custom directory from custom repo
prs catalog bulk pull-hf --output-dir /data/cleaned --repo my-org/my-dataset
```

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--output-dir / -o` | `./output/pgs_metadata` | Directory to save pulled parquets |
| `--repo / -r` | `just-dna-seq/polygenic_risk_scores` | HuggingFace dataset repo ID |

#### `prs catalog bulk ids` — List all PGS IDs

Fetches `pgs_scores_list.txt` from EBI FTP (one request) and prints every PGS ID.

```bash
prs catalog bulk ids
prs catalog bulk ids | wc -l   # count total scores
```

---

## Python API

### Bulk FTP downloads (`just_prs.ftp`)

```python
from just_prs.ftp import (
    list_all_pgs_ids,
    download_metadata_sheet,
    download_all_metadata,
    stream_scoring_file,
    download_scoring_as_parquet,
    bulk_download_scoring_parquets,
)
from pathlib import Path

# Full ID list in one request
ids = list_all_pgs_ids()  # ['PGS000001', 'PGS000002', ...]

# All score metadata as a Polars DataFrame, saved to parquet
df = download_metadata_sheet("scores", Path("./output/pgs_metadata/scores.parquet"))

# All 7 sheets at once
sheets = download_all_metadata(Path("./output/pgs_metadata"))

# Stream a scoring file as a LazyFrame (no local .gz written)
lf = stream_scoring_file("PGS000001", genome_build="GRCh38")

# Download one scoring file as parquet (adds pgs_id column)
path = download_scoring_as_parquet("PGS000001", Path("./output/pgs_scores"))

# Bulk download a list (or all) scoring files as parquet
paths = bulk_download_scoring_parquets(Path("./output/pgs_scores"), pgs_ids=["PGS000001", "PGS000002"])
paths = bulk_download_scoring_parquets(Path("./output/pgs_scores"))  # all ~5,000+
```

### REST API client (`just_prs.catalog`)

```python
from just_prs.catalog import PGSCatalogClient

with PGSCatalogClient() as client:
    score = client.get_score("PGS000001")
    results = client.search_scores("breast cancer", limit=10)
    trait = client.get_trait("EFO_0001645")
    for score in client.iter_all_scores(page_size=100):
        print(score.id, score.trait_reported)
```

### PRSCatalog — search, compute, and percentile (`just_prs.prs_catalog`)

`PRSCatalog` is the recommended high-level interface. It persists three cleaned parquet files locally and loads them on access using a three-tier fallback chain: local files → HuggingFace pull → raw FTP download + cleanup. All lookups, searches, and PRS computations use cleaned data with no per-score REST API calls.
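
The fallback chain can be pictured as a list of loaders tried in order until one succeeds. An illustrative sketch with hypothetical loader callables, not the actual `PRSCatalog` internals:

```python
# Illustrative sketch of a three-tier fallback chain (hypothetical
# loader callables; PRSCatalog's real internals may differ).
from typing import Callable, Optional

def load_with_fallback(loaders: list[Callable[[], Optional[str]]]) -> str:
    """Try each loader in order; return the first non-None result."""
    for loader in loaders:
        result = loader()
        if result is not None:
            return result
    raise RuntimeError("all sources failed")

# Local cache misses, the HuggingFace pull succeeds, FTP is never reached.
source = load_with_fallback([
    lambda: None,            # 1. local parquet files (cache miss)
    lambda: "huggingface",   # 2. pull from HuggingFace dataset repo
    lambda: "ftp+cleanup",   # 3. raw FTP download + cleanup pipeline
])
```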

```python
from pathlib import Path
from just_prs import PRSCatalog

catalog = PRSCatalog()  # uses ~/.cache/just-prs by default

# Browse cleaned scores (genome builds normalized, snake_case columns)
scores_df = catalog.scores(genome_build="GRCh38").collect()

# Search across pgs_id, name, trait_reported, and trait_efo
results = catalog.search("breast cancer", genome_build="GRCh38").collect()

# Get cleaned metadata for a single score
info = catalog.score_info_row("PGS000001")  # dict or None

# Best performance metric per score (largest sample, European-preferred)
best = catalog.best_performance(pgs_id="PGS000001").collect()

# Compute PRS (trait lookup from cached metadata, not REST API)
result = catalog.compute_prs(vcf_path="sample.vcf.gz", pgs_id="PGS000001")
print(result.score, result.match_rate)

# Batch computation
results = catalog.compute_prs_batch(
    vcf_path="sample.vcf.gz",
    pgs_ids=["PGS000001", "PGS000002"],
)

# Percentile estimation (AUROC-based or explicit mean/std)
pct = catalog.percentile(prs_score=1.5, pgs_id="PGS000014")
pct = catalog.percentile(prs_score=1.5, pgs_id="PGS000014", mean=0.0, std=1.0)

# Build cleaned parquets explicitly (download from FTP + cleanup)
paths = catalog.build_cleaned_parquets(output_dir=Path("./output/pgs_metadata"))
# {'scores': Path('output/pgs_metadata/scores.parquet'), 'performance': ..., 'best_performance': ...}

# Push cleaned parquets to HuggingFace
catalog.push_to_hf()  # token from .env / HF_TOKEN
catalog.push_to_hf(token="hf_...", repo_id="my-org/my-dataset")
```
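
With an explicit mean and std, percentile estimation reduces to a normal-CDF lookup. A stdlib sketch under the assumption of a normally distributed reference population — the AUROC-based path inside `PRSCatalog.percentile` is more involved and not reproduced here:

```python
# Sketch of percentile estimation from an explicit mean/std, assuming a
# normally distributed PRS in the reference population (illustrative;
# PRSCatalog.percentile also handles the AUROC-based case).
from statistics import NormalDist

def prs_percentile(prs_score: float, mean: float, std: float) -> float:
    """Percentile (0-100) of a PRS under a Normal(mean, std) reference."""
    return 100.0 * NormalDist(mu=mean, sigma=std).cdf(prs_score)

pct = prs_percentile(1.5, mean=0.0, std=1.0)  # ≈ 93.3
```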

### HuggingFace sync (`just_prs.hf`)

```python
from just_prs.hf import push_cleaned_parquets, pull_cleaned_parquets
from pathlib import Path

# Push cleaned parquets to HF dataset repo
push_cleaned_parquets(Path("./output/pgs_metadata"))  # default: just-dna-seq/polygenic_risk_scores

# Pull cleaned parquets from HF
downloaded = pull_cleaned_parquets(Path("./local_cache"))
# [Path('local_cache/scores.parquet'), Path('local_cache/performance.parquet'), ...]
```

### Cleanup pipeline (`just_prs.cleanup`)

The cleanup functions can be used independently of `PRSCatalog`:

```python
from just_prs.cleanup import clean_scores, clean_performance_metrics, parse_metric_string
from just_prs.ftp import download_metadata_sheet
from pathlib import Path

# Clean scores: rename columns, normalize genome builds
raw_df = download_metadata_sheet("scores", Path("./output/pgs_metadata/scores_raw.parquet"))
cleaned_lf = clean_scores(raw_df)  # LazyFrame with snake_case columns

# Parse a metric string
parse_metric_string("1.55 [1.52,1.58]")
# {'estimate': 1.55, 'ci_lower': 1.52, 'ci_upper': 1.58, 'se': None}

# Clean performance metrics: parse strings, join with evaluation samples
perf_df = download_metadata_sheet("performance_metrics", Path("./output/pgs_metadata/perf.parquet"))
eval_df = download_metadata_sheet("evaluation_sample_sets", Path("./output/pgs_metadata/eval.parquet"))
cleaned_perf_lf = clean_performance_metrics(perf_df, eval_df)
```
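
A metric string like `"1.55 [1.52,1.58]"` can be parsed with a small regex. A self-contained sketch that mimics the documented output shape — the real `parse_metric_string` may handle more formats (e.g. standard errors):

```python
# Regex sketch mimicking the documented parse_metric_string output
# (illustrative; the real function in just_prs.cleanup may cover more
# input formats).
import re

_METRIC_RE = re.compile(
    r"^\s*(-?[\d.]+)"                                   # point estimate
    r"(?:\s*\[\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\])?"    # optional [lower,upper] CI
)

def parse_metric(text: str) -> dict:
    """Parse 'estimate [ci_lower,ci_upper]' into structured numeric fields."""
    m = _METRIC_RE.match(text)
    if m is None:
        return {"estimate": None, "ci_lower": None, "ci_upper": None, "se": None}
    est, lo, hi = m.groups()
    return {
        "estimate": float(est),
        "ci_lower": float(lo) if lo else None,
        "ci_upper": float(hi) if hi else None,
        "se": None,
    }
```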

### Low-level PRS computation (`just_prs.prs`)

```python
from pathlib import Path
from just_prs.prs import compute_prs, compute_prs_batch

result = compute_prs(
    vcf_path=Path("sample.vcf.gz"),
    scoring_file="PGS000001",  # PGS ID, local path, or LazyFrame
    genome_build="GRCh38",
)
print(result.score, result.match_rate)

results = compute_prs_batch(
    vcf_path=Path("sample.vcf.gz"),
    pgs_ids=["PGS000001", "PGS000002"],
)
```
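
Conceptually, a PRS is a weighted sum of effect-allele dosages over the variants matched between the VCF and the scoring file, and the match rate is the fraction of scoring-file variants found in the VCF. A toy sketch of that arithmetic — not the actual implementation, which runs over polars-bio frames:

```python
# Toy sketch of the core PRS arithmetic (illustrative only; the real
# computation in just_prs.prs operates on polars-bio frames).
def toy_prs(dosages: dict[str, int], weights: dict[str, float]) -> tuple[float, float]:
    """Return (score, match_rate): sum of dosage*weight over matched variants."""
    matched = [v for v in weights if v in dosages]
    score = sum(dosages[v] * weights[v] for v in matched)
    match_rate = len(matched) / len(weights)
    return score, match_rate

weights = {"rs1": 0.05, "rs2": -0.02, "rs3": 0.10}  # effect weights per variant
dosages = {"rs1": 2, "rs3": 1}                      # effect-allele counts (0/1/2)
score, match_rate = toy_prs(dosages, weights)
```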

## Web UI (`prs-ui`)

An interactive [Reflex](https://reflex.dev/) web application for browsing PGS Catalog data and computing PRS scores.

```bash
cd prs-ui
uv run reflex run
```

The UI has three tabs:

### Metadata Sheets

Browse all 7 PGS Catalog metadata sheets (Scores, Publications, EFO Traits, etc.) in a MUI DataGrid with server-side filtering and sorting. Select rows with checkboxes and download their scoring files to the local cache with a single **Download Selected** button.

### Scoring File

Stream any harmonized scoring file by PGS ID directly from EBI FTP and view it in the grid. Select the genome build (GRCh37 / GRCh38) before loading.

### Compute PRS

End-to-end PRS computation workflow:

1. **Upload a VCF** — drag-and-drop or browse; the genome build is auto-detected from `##reference` and `##contig` headers
2. **Load Scores** — fetches the PGS Catalog scores metadata, pre-filtered by the detected (or manually selected) genome build. Scores are shown in a paginated, searchable table
3. **Select scores** — use checkboxes to pick individual scores, or "Select All" to select everything matching the current search
4. **Compute** — runs PRS for each selected score against the uploaded VCF and shows results with match rates, effect sizes, and classification metrics from PGS Catalog evaluation studies

Configuration:

| Environment variable | Default | Description |
|---------------------|---------|-------------|
| `PRS_CACHE_DIR` | `~/.cache/just-prs` | Root directory for cached metadata and scoring files |

---

## Data sources

- PGS Catalog REST API: <https://www.pgscatalog.org/rest/>
- EBI FTP bulk downloads: <https://ftp.ebi.ac.uk/pub/databases/spot/pgs/>
- PGS Catalog download documentation: <https://www.pgscatalog.org/downloads/>
- Cleaned metadata parquets on HuggingFace: <https://huggingface.co/datasets/just-dna-seq/polygenic_risk_scores>