gentroutils 1.5.0__tar.gz → 1.6.0.dev2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (86)
  1. gentroutils-1.6.0.dev2/.RData +0 -0
  2. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.gitignore +14 -0
  3. gentroutils-1.6.0.dev2/.vscode/extensions.json +7 -0
  4. gentroutils-1.6.0.dev2/.vscode/settings.json +19 -0
  5. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/CHANGELOG.md +16 -0
  6. gentroutils-1.6.0.dev2/Dockerfile +16 -0
  7. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/LICENSE +1 -1
  8. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/Makefile +1 -1
  9. gentroutils-1.6.0.dev2/PKG-INFO +274 -0
  10. gentroutils-1.6.0.dev2/README.md +249 -0
  11. gentroutils-1.6.0.dev2/config.yaml +32 -0
  12. gentroutils-1.6.0.dev2/conftest.py +1 -0
  13. gentroutils-1.6.0.dev2/docs/00_prepare_tables_for_curation.R +126 -0
  14. gentroutils-1.6.0.dev2/docs/gwas_catalog_curation.md +45 -0
  15. gentroutils-1.6.0.dev2/pyproject.toml +333 -0
  16. gentroutils-1.6.0.dev2/src/gentroutils/__init__.py +11 -0
  17. gentroutils-1.6.0.dev2/src/gentroutils/errors.py +39 -0
  18. gentroutils-1.6.0.dev2/src/gentroutils/io/path/__init__.py +6 -0
  19. gentroutils-1.6.0.dev2/src/gentroutils/io/path/ftp.py +48 -0
  20. gentroutils-1.6.0.dev2/src/gentroutils/io/path/gcs.py +45 -0
  21. gentroutils-1.6.0.dev2/src/gentroutils/io/transfer/__init__.py +6 -0
  22. gentroutils-1.6.0.dev2/src/gentroutils/io/transfer/ftp_to_gcs.py +49 -0
  23. gentroutils-1.6.0.dev2/src/gentroutils/io/transfer/model.py +36 -0
  24. gentroutils-1.6.0.dev2/src/gentroutils/io/transfer/polars_to_gcs.py +20 -0
  25. gentroutils-1.6.0.dev2/src/gentroutils/parsers/__init__.py +1 -0
  26. gentroutils-1.6.0.dev2/src/gentroutils/parsers/curation.py +168 -0
  27. gentroutils-1.6.0.dev2/src/gentroutils/tasks/__init__.py +90 -0
  28. gentroutils-1.6.0.dev2/src/gentroutils/tasks/crawl.py +156 -0
  29. gentroutils-1.6.0.dev2/src/gentroutils/tasks/curation.py +110 -0
  30. gentroutils-1.6.0.dev2/src/gentroutils/tasks/fetch.py +141 -0
  31. gentroutils-1.6.0.dev2/src/gentroutils/transfer.py +81 -0
  32. gentroutils-1.6.0.dev2/tests/data/ftp/test/databases/gwas/summary_statistics/harmonised_list.txt +0 -0
  33. gentroutils-1.6.0.dev2/tests/data/gsutil_list.txt +95567 -0
  34. gentroutils-1.6.0.dev2/tests/data/test.h.tsv.gz +0 -0
  35. gentroutils-1.6.0.dev2/tests/io/conftest.py +0 -0
  36. gentroutils-1.6.0.dev2/tests/io/path/conftest.py +0 -0
  37. gentroutils-1.6.0.dev2/tests/io/path/test_ftp.py +36 -0
  38. gentroutils-1.6.0.dev2/tests/io/path/test_gcs.py +35 -0
  39. gentroutils-1.6.0.dev2/tests/io/transfer/conftest.py +0 -0
  40. gentroutils-1.6.0.dev2/tests/io/transfer/test_ftp_to_gcs.py +87 -0
  41. gentroutils-1.6.0.dev2/tests/io/transfer/test_model.py +23 -0
  42. gentroutils-1.6.0.dev2/tests/io/transfer/test_polars_to_gcs.py +45 -0
  43. gentroutils-1.6.0.dev2/tests/parsers/conftest.py +0 -0
  44. gentroutils-1.6.0.dev2/tests/parsers/test_curation.py +157 -0
  45. gentroutils-1.6.0.dev2/tests/tasks/conftest.py +84 -0
  46. gentroutils-1.6.0.dev2/tests/tasks/test_crawl_task.py +219 -0
  47. gentroutils-1.6.0.dev2/tests/tasks/test_curation_task.py +220 -0
  48. gentroutils-1.6.0.dev2/tests/tasks/test_fetch_task.py +198 -0
  49. gentroutils-1.6.0.dev2/tests/test_transfer.py +94 -0
  50. gentroutils-1.6.0.dev2/uv.lock +2063 -0
  51. gentroutils-1.5.0/PKG-INFO +0 -135
  52. gentroutils-1.5.0/README.md +0 -110
  53. gentroutils-1.5.0/pyproject.toml +0 -218
  54. gentroutils-1.5.0/src/gentroutils/__init__.py +0 -46
  55. gentroutils-1.5.0/src/gentroutils/commands/__init__.py +0 -11
  56. gentroutils-1.5.0/src/gentroutils/commands/update_gwas_curation_metadata.py +0 -287
  57. gentroutils-1.5.0/src/gentroutils/commands/utils.py +0 -152
  58. gentroutils-1.5.0/src/gentroutils/commands/validate_gwas_curation.py +0 -165
  59. gentroutils-1.5.0/tests/conftest.py +0 -132
  60. gentroutils-1.5.0/tests/test_cli.py +0 -23
  61. gentroutils-1.5.0/tests/test_update_gwas_curation_metadata.py +0 -205
  62. gentroutils-1.5.0/tests/test_validate_gwas_curation.py +0 -60
  63. gentroutils-1.5.0/uv.lock +0 -1796
  64. /gentroutils-1.5.0/src/gentroutils/py.typed → /gentroutils-1.6.0.dev2/.Rhistory +0 -0
  65. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.github/workflows/labeler.yaml +0 -0
  66. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.github/workflows/pr.yaml +0 -0
  67. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.github/workflows/release.yaml +0 -0
  68. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.github/workflows/release_pr.yaml +0 -0
  69. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.github/workflows/tag.yaml +0 -0
  70. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/.pre-commit-config.yaml +0 -0
  71. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/commitlint.config.js +0 -0
  72. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/setup.sh +0 -0
  73. /gentroutils-1.5.0/tests/data/harmonised_list.txt → /gentroutils-1.6.0.dev2/src/gentroutils/py.typed +0 -0
  74. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/correct_curation.tsv +0 -0
  75. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_analysisFlag_type.tsv +0 -0
  76. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_analysisFlag_value.tsv +0 -0
  77. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_columns_curation.tsv +0 -0
  78. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_publicationTitle_type.tsv +0 -0
  79. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_pubmedId_type.tsv +0 -0
  80. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_studyId_type.tsv +0 -0
  81. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_studyId_value.tsv +0 -0
  82. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_studyType_type.tsv +0 -0
  83. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_studyType_value.tsv +0 -0
  84. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/incorrect_traitFromSource_type.tsv +0 -0
  85. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/non_unique_studyId.tsv +0 -0
  86. {gentroutils-1.5.0 → gentroutils-1.6.0.dev2}/tests/data/manual_curation/null_value_in_studyId.tsv +0 -0
Binary file
@@ -17,3 +17,17 @@ coverage.xml
17
17
  interrogate_badge.svg
18
18
 
19
19
  .idea
20
+ .cloudstorage
21
+ .mypy_cache
22
+ .pytest_cache
23
+ .ruff_cache
24
+
25
+
26
+ # R
27
+ .RData
28
+ .Rhistory
29
+ data
30
+
31
+
32
+ # otter
33
+ work/
@@ -0,0 +1,7 @@
1
+ {
2
+ "recommendations": [
3
+ "charliermarsh.ruff",
4
+ "ms-python.python",
5
+ "ms-python.vscode-pylance",
6
+ ],
7
+ }
@@ -0,0 +1,19 @@
1
+ {
2
+ "python.analysis.typeCheckingMode": "standard",
3
+ "[python]": {
4
+ "editor.formatOnSave": true,
5
+ "editor.defaultFormatter": "charliermarsh.ruff",
6
+ "editor.codeActionsOnSave": {
7
+ "source.organizeImports": "explicit"
8
+ }
9
+ },
10
+ "python.testing.pytestArgs": ["tests", "src"],
11
+ "python.testing.unittestEnabled": false,
12
+ "python.testing.pytestEnabled": true,
13
+ "cSpell.words": [
14
+ "aioftp",
15
+ "gentroutils",
16
+ "harmonised",
17
+ "sumstat"
18
+ ]
19
+ }
@@ -1,6 +1,22 @@
1
1
  # CHANGELOG
2
2
 
3
3
 
4
+ ## v1.6.0-dev.2 (2025-08-12)
5
+
6
+ ### Features
7
+
8
+ - Update readme
9
+ ([`e966927`](https://github.com/opentargets/gentroutils/commit/e966927f8c4b3c670c694258c65eb6b3da6eeb49))
10
+
11
+
12
+ ## v1.6.0-dev.1 (2025-08-12)
13
+
14
+ ### Features
15
+
16
+ - Version 2.0.0
17
+ ([`47c9690`](https://github.com/opentargets/gentroutils/commit/47c9690ffc23be713ef0246aae5271ebe2ab5e3a))
18
+
19
+
4
20
  ## v1.5.0 (2025-02-12)
5
21
 
6
22
 
@@ -0,0 +1,16 @@
1
+ # Description: Dockerfile for the gentroutils package
2
+ #
3
+ # To run locally, you must have a credentials file for GCP. Assuming you do,
4
+ # you can run the following command:
5
+ #
6
+ # docker run -v /path/to/credentials.json:/app/credentials.json -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json gentroutils -s gwas_catalog_release
7
+
8
+ FROM python:3.13.1-alpine3.21
9
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
10
+
11
+ ADD . /app
12
+
13
+ WORKDIR /app
14
+ RUN uv sync --frozen
15
+
16
+ ENTRYPOINT ["uv", "run", "gentroutils"]
@@ -186,7 +186,7 @@ APPENDIX: How to apply the Apache License to your work.
186
186
  same "printed page" as the copyright notice for easier
187
187
  identification within third-party archives.
188
188
 
189
- Copyright [yyyy] [name of copyright owner]
189
+ Copyright 2025 [name of copyright owner]
190
190
 
191
191
  Licensed under the Apache License, Version 2.0 (the "License");
192
192
  you may not use this file except in compliance with the License.
@@ -19,7 +19,7 @@ lint: ## run linting
19
19
  @echo "Running linting tools..."
20
20
  @uv run --frozen ruff check --fix --select I src/$(APP_NAME) tests
21
21
  @uv run --frozen pydoclint --config=pyproject.toml src tests
22
- @uv run --frozen interrogate -vv src/$(APP_NAME) tests
22
+ @uv run --frozen interrogate -vv src/$(APP_NAME)
23
23
 
24
24
  type-check: ## run mypy and check types
25
25
  @echo "Running type checks..."
@@ -0,0 +1,274 @@
1
+ Metadata-Version: 2.4
2
+ Name: gentroutils
3
+ Version: 1.6.0.dev2
4
+ Summary: Open Targets python genetics utility CLI tools
5
+ Author-email: Szymon Szyszkowski <ss60@sanger.ac.uk>
6
+ License-Expression: Apache-2.0
7
+ License-File: LICENSE
8
+ Classifier: Development Status :: 3 - Alpha
9
+ Classifier: Intended Audience :: Healthcare Industry
10
+ Classifier: Intended Audience :: Science/Research
11
+ Classifier: License :: OSI Approved :: Apache Software License
12
+ Classifier: Operating System :: Unix
13
+ Classifier: Programming Language :: Python :: 3.13
14
+ Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
15
+ Requires-Python: >=3.13
16
+ Requires-Dist: aioftp>=0.25.1
17
+ Requires-Dist: aiohttp>=3.11.18
18
+ Requires-Dist: google-cloud-storage>=3.1.1
19
+ Requires-Dist: loguru>=0.7.3
20
+ Requires-Dist: opentargets-otter>=25.0.2
21
+ Requires-Dist: polars>=1.31.0
22
+ Requires-Dist: pydantic>=2.10.6
23
+ Requires-Dist: tqdm>=4.67.1
24
+ Description-Content-Type: text/markdown
25
+
26
+ # gentroutils
27
+
28
+ [![checks](https://github.com/opentargets/gentroutils/actions/workflows/pr.yaml/badge.svg?branch=dev)](https://github.com/opentargets/gentroutils/actions/workflows/pr.yaml)
29
+ ![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)
30
+ [![release](https://github.com/opentargets/gentroutils/actions/workflows/release.yaml/badge.svg)](https://github.com/opentargets/gentroutils/actions/workflows/release.yaml)
31
+
32
+ Set of Command Line Interface tools to process Open Targets Genetics GWAS data.
33
+
34
+ ## Installation
35
+
36
+ ```
37
+ pip install gentroutils
38
+ ```
39
+
40
+ ## Available commands
41
+
42
+ To see all available commands after installation run
43
+
44
+ ```{bash}
45
+ gentroutils --help
46
+ ```
47
+
48
+ ## Usage
49
+
50
+ To run a single step run
51
+ ```{bash}
52
+ uv run gentroutils -s gwas_catalog_release # After cloning the repository
53
+ gentroutils -s gwas_catalog_release -c otter_config.yaml # When installed by pip
54
+ ```
55
+
56
+ The `gentroutils` repository uses the [otter](https://github.com/opentargets/otter) framework to build the set of tasks to run. The current implementation of tasks can be found in the `config.yaml` file in the root of the repository. To run gentroutils installed via `pip` you need to define the otter config that looks like the `config.yaml` file.
57
+
58
+ <details>
59
+ <summary>Example config</summary>
60
+
61
+ For the top level fields refer to the [otter documentation](https://opentargets.github.io/otter/otter.config.html)
62
+
63
+ ```yaml
64
+ ---
65
+ work_path: ./work
66
+ log_level: DEBUG
67
+ scratchpad:
68
+ steps:
69
+ gwas_catalog_release:
70
+ - name: crawl release metadata
71
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
72
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
73
+ promote: "true"
74
+ - name: fetch associations
75
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
76
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
77
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
78
+ promote: true
79
+ - name: fetch studies
80
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
81
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
82
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
83
+ promote: true
84
+ - name: fetch ancestries
85
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
86
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
87
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
88
+ promote: true
89
+ - name: curation study
90
+ requires:
91
+ - fetch studies
92
+ previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
93
+ studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
94
+ destination_template: ./work/curation_{release_date}.tsv
95
+ promote: true
96
+ ```
97
+
98
+ The config above defines the steps that are run in parallel by the `otter` framework.
99
+
100
+ </details>
101
+
102
+ ### Available tasks
103
+
104
+ The list of tasks (defined in the `config.yaml` file) that can be run are:
105
+
106
+ #### Crawl release metadata
107
+
108
+ ```yaml
109
+ - name: crawl release metadata
110
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
111
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
112
+ promote: "true"
113
+ ```
114
+
115
+ This task fetches the latest GWAS Catalog release metadata from the `https://www.ebi.ac.uk/gwas/api/search/stats` endpoint and saves it to the specified destination.
116
+
117
+ > [!NOTE]
118
+ > **Task parameters**
119
+ >
120
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
121
+ > - The `destination_template` is where the metadata will be saved; it uses the `{release_date}` placeholder to insert the release date dynamically. By default the release date is read directly from the JSON returned by the `stats_uri` endpoint.
122
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/gentroutils/latest/stats.json`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
123
+
124
+ ---
125
+
126
+ ### Fetch associations
127
+
128
+ ```yaml
129
+ - name: fetch associations
130
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
131
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
132
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
133
+ promote: true
134
+ ```
135
+
136
+ This task fetches the GWAS Catalog associations file from the specified FTP server and saves it to the specified destination.
137
+
138
+ > [!NOTE]
139
+ > **Task parameters**
140
+ >
141
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
142
+ > - The `source_template` is the URL of the GWAS Catalog associations file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
143
+ > - The `destination_template` is where the associations file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
144
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_associations_ontology_annotated.tsv`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
145
+
146
+ ---
147
+
148
+ ### Fetch studies
149
+
150
+ ```yaml
151
+ - name: fetch studies
152
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
153
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
154
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
155
+ promote: true
156
+ ```
157
+
158
+ This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.
159
+
160
+ > [!NOTE]
161
+ > **Task parameters**
162
+ >
163
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
164
+ > - The `source_template` is the URL of the GWAS Catalog studies file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
165
+ > - The `destination_template` is where the studies file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
166
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
167
+
168
+ ---
169
+
170
+ ### Fetch ancestries
171
+
172
+ ```yaml
173
+ - name: fetch ancestries
174
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
175
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
176
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
177
+ promote: true
178
+ ```
179
+
180
+ This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.
181
+
182
+ > [!NOTE]
183
+ > **Task parameters**
184
+ >
185
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
186
+ > - The `source_template` is the URL of the GWAS Catalog ancestries file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
187
+ > - The `destination_template` is where the ancestries file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
188
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_ancestries.tsv`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
189
+
190
+ ---
191
+
192
+ ### Curation
193
+
194
+ ```yaml
195
+ - name: curation study
196
+ requires:
197
+ - fetch studies
198
+ previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
199
+ studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
200
+ destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
201
+ promote: true
202
+ ```
203
+
204
+ This task builds the GWAS Catalog curation file that is later used as a template for manual curation. It requires the `fetch studies` task to complete first, because the curation file is built from the list of studies in the `download studies` file.
205
+
206
+ > [!NOTE]
207
+ > **Task parameters**
208
+ >
209
+ > - The `requires` field specifies that this task depends on the `fetch studies` task, meaning it will only run after the studies have been fetched.
210
+ > - The `previous_curation` field is used to specify the path to the previous curation file. This is used to build the new curation file based on the previous one.
211
+ > - The `studies` field is the path to the studies file that was fetched in the `fetch studies` task. This file is used to build the curation file.
212
+ > - The `destination_template` is where the curation file will be saved, and it uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
213
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
214
+
215
+ ---
216
+
217
+ ## Curation process
218
+
219
+ The basis of the curation process for GWAS Catalog data is described in [docs/gwas_catalog_curation.md](docs/gwas_catalog_curation.md). The original solution uses an R script to prepare the data, which is then curated manually. The `curation` task automates the data preparation and provides a template for manual curation. Manual curation is still required; only the data preparation is automated.
220
+
221
+ The automated process includes:
222
+
223
+ 1. Reading the `download studies` file, which lists the studies in the latest GWAS Catalog release.
224
+ 2. Reading the `previous curation` file, which lists the studies curated in the previous release.
225
+ 3. Comparing the two datasets with the following logic:
226
+ - If the study is present in both `previous curation` and `download studies`, it is marked as `curated`
227
+ - If the study is present in `download studies` but not in `previous curation`, it is marked as `new`
228
+ - If the study is present in `previous curation` but not in `download studies`, it is marked as `removed`
229
+ 4. The output of the curation process is a file that lists the studies with their status (curated, new, removed) and the fields required for manual curation. It is saved to the `destination_template` path specified in the task configuration, that is, under `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv`.
230
+ 5. The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
231
+ 6. Manual curation is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. This step is not automated and requires manual intervention. The manually curated output should then be saved to both `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv`. The curated file is then used by the [Open Targets Staging Dags](https://github.com/opentargets/orchestration).
232
+
233
+ ---
234
+
235
+ ## Contribute
236
+
237
+ To be able to contribute to the project you need to set it up. This project
238
+ runs on:
239
+
240
+ - [x] python 3.13
241
+ - [x] uv (dependency manager)
242
+
243
+ To set up the project run
244
+
245
+ ```{bash}
246
+ make dev
247
+ ```
248
+
249
+ The command will install the above dependencies (the only prerequisites are curl and bash) if not present and
250
+ install all python dependencies listed in `pyproject.toml`. Finally the command will install `pre-commit` hooks
251
+ required to be run before the commit is created.
252
+
253
+ The project has additional `dev` dependencies that include the list of packages used for testing purposes.
254
+ All of the `dev` dependencies are automatically installed by `uv`.
255
+
256
+ To see all available dev commands
257
+
258
+ Run the following command to see all available dev commands
259
+
260
+ ```{bash}
261
+ make help
262
+ ```
263
+
264
+ ### Manual testing of CLI module
265
+
266
+ To check CLI execution manually you need to run
267
+
268
+ ```{bash}
269
+ uv run gentroutils
270
+ ```
271
+ ---
272
+
273
+ This software was developed as part of the Open Targets project. For more
274
+ information please see: http://www.opentargets.org
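Note on the "Curation process" section of the README added above: the comparison in step 3 and the dated and `latest` paths in steps 4 and 5 can be sketched as follows. This is a minimal illustration that uses only the Python standard library. It is not the package's implementation, which is not shown in this part of the diff. The function names `classify_studies` and `resolve_destinations`, the example study IDs, and the release date string are hypothetical. The sketch also assumes that promotion amounts to substituting `latest` for the `{release_date}` placeholder, which matches the example paths in the README but is not confirmed here.

```python
def classify_studies(previous_curation_ids, download_study_ids):
    """Assign a curation status to every study ID seen in either input.

    curated: present in both the previous curation and the download studies
    new:     present only in the download studies
    removed: present only in the previous curation
    """
    previous = set(previous_curation_ids)
    current = set(download_study_ids)
    statuses = {}
    for study_id in sorted(previous | current):
        if study_id in previous and study_id in current:
            statuses[study_id] = "curated"
        elif study_id in current:
            statuses[study_id] = "new"
        else:
            statuses[study_id] = "removed"
    return statuses


def resolve_destinations(destination_template, release_date):
    """Return the release-dated path and the assumed promoted `latest` path."""
    dated = destination_template.format(release_date=release_date)
    latest = destination_template.format(release_date="latest")
    return dated, latest


if __name__ == "__main__":
    # Hypothetical study IDs, for illustration only.
    print(classify_studies(["GCST000001", "GCST000002"], ["GCST000002", "GCST000003"]))
    # Template taken from the README; the date value is made up.
    template = "gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv"
    print(resolve_destinations(template, "2025-08-01"))
```

Sets keep the membership checks simple and make the three cases mutually exclusive, so every study ID receives exactly one status.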
@@ -0,0 +1,249 @@
1
+ # gentroutils
2
+
3
+ [![checks](https://github.com/opentargets/gentroutils/actions/workflows/pr.yaml/badge.svg?branch=dev)](https://github.com/opentargets/gentroutils/actions/workflows/pr.yaml)
4
+ ![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)
5
+ [![release](https://github.com/opentargets/gentroutils/actions/workflows/release.yaml/badge.svg)](https://github.com/opentargets/gentroutils/actions/workflows/release.yaml)
6
+
7
+ Set of Command Line Interface tools to process Open Targets Genetics GWAS data.
8
+
9
+ ## Installation
10
+
11
+ ```
12
+ pip install gentroutils
13
+ ```
14
+
15
+ ## Available commands
16
+
17
+ To see all available commands after installation run
18
+
19
+ ```{bash}
20
+ gentroutils --help
21
+ ```
22
+
23
+ ## Usage
24
+
25
+ To run a single step run
26
+ ```{bash}
27
+ uv run gentroutils -s gwas_catalog_release # After cloning the repository
28
+ gentroutils -s gwas_catalog_release -c otter_config.yaml # When installed by pip
29
+ ```
30
+
31
+ The `gentroutils` repository uses the [otter](https://github.com/opentargets/otter) framework to build the set of tasks to run. The current implementation of tasks can be found in the `config.yaml` file in the root of the repository. To run gentroutils installed via `pip` you need to define the otter config that looks like the `config.yaml` file.
32
+
33
+ <details>
34
+ <summary>Example config</summary>
35
+
36
+ For the top level fields refer to the [otter documentation](https://opentargets.github.io/otter/otter.config.html)
37
+
38
+ ```yaml
39
+ ---
40
+ work_path: ./work
41
+ log_level: DEBUG
42
+ scratchpad:
43
+ steps:
44
+ gwas_catalog_release:
45
+ - name: crawl release metadata
46
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
47
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
48
+ promote: "true"
49
+ - name: fetch associations
50
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
51
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
52
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
53
+ promote: true
54
+ - name: fetch studies
55
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
56
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
57
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
58
+ promote: true
59
+ - name: fetch ancestries
60
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
61
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
62
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
63
+ promote: true
64
+ - name: curation study
65
+ requires:
66
+ - fetch studies
67
+ previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
68
+ studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
69
+ destination_template: ./work/curation_{release_date}.tsv
70
+ promote: true
71
+ ```
72
+
73
+ The config above defines the steps that are run in parallel by the `otter` framework.
74
+
75
+ </details>
76
+
77
+ ### Available tasks
78
+
79
+ The list of tasks (defined in the `config.yaml` file) that can be run are:
80
+
81
+ #### Crawl release metadata
82
+
83
+ ```yaml
84
+ - name: crawl release metadata
85
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
86
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
87
+ promote: "true"
88
+ ```
89
+
90
+ This task fetches the latest GWAS Catalog release metadata from the `https://www.ebi.ac.uk/gwas/api/search/stats` endpoint and saves it to the specified destination.
91
+
92
+ > [!NOTE]
93
+ > **Task parameters**
94
+ >
95
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
96
+ > - The `destination_template` is where the metadata will be saved; it uses the `{release_date}` placeholder to insert the release date dynamically. By default the release date is read directly from the JSON returned by the `stats_uri` endpoint.
97
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/gentroutils/latest/stats.json`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
98
+
99
+ ---
100
+
101
+ ### Fetch associations
102
+
103
+ ```yaml
104
+ - name: fetch associations
105
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
106
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
107
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
108
+ promote: true
109
+ ```
110
+
111
+ This task fetches the GWAS Catalog associations file from the specified FTP server and saves it to the specified destination.
112
+
113
+ > [!NOTE]
114
+ > **Task parameters**
115
+ >
116
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
117
+ > - The `source_template` is the URL of the GWAS Catalog associations file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
118
+ > - The `destination_template` is where the associations file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
119
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release: after the task completes, the file is saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_associations_ontology_annotated.tsv`. If the `promote` field is set to `false`, the file is not promoted and is saved under the specified path with the release date.
120
+
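
The relationship between `destination_template`, `{release_date}` and `promote` can be illustrated with a minimal sketch. This is not the gentroutils implementation: the helper names are hypothetical, and the release date value and its format (`2025-01-15`) are made up for illustration. The real value comes from the `stats_uri` endpoint.

```python
def resolve_destination(template: str, release_date: str) -> str:
    """Fill the {release_date} placeholder with an already formatted date string."""
    return template.format(release_date=release_date)


def promoted_path(template: str) -> str:
    """Return the 'latest' path described for promote: true."""
    return template.format(release_date="latest")


template = (
    "gs://gwas_catalog_inputs/gentroutils/{release_date}/"
    "gwas_catalog_associations_ontology_annotated.tsv"
)

# Hypothetical release date; the actual format used by gentroutils is an assumption.
dated = resolve_destination(template, "2025-01-15")
latest = promoted_path(template)

print(dated)
print(latest)
```

The same substitution applies to `source_template`, where the placeholder selects the release directory on the FTP server.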
121
+ ---
122
+
123
+ ### Fetch studies
124
+
125
+ ```yaml
126
+ - name: fetch studies
127
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
128
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
129
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
130
+ promote: true
131
+ ```
132
+
133
+ This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.
134
+
135
+ > [!NOTE]
136
+ > **Task parameters**
137
+ >
138
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
139
+ > - The `source_template` is the URL of the GWAS Catalog studies file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
140
+ > - The `destination_template` is where the studies file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
141
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release, i.e. the file will be saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv` after the task is completed. If `promote` is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
142
+
143
+ ---
144
+
145
+ ### Fetch ancestries
146
+
147
+ ```yaml
148
+ - name: fetch ancestries
149
+ stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
150
+ source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
151
+ destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
152
+ promote: true
153
+ ```
154
+
155
+ This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.
156
+
157
+ > [!NOTE]
158
+ > **Task parameters**
159
+ >
160
+ > - The `stats_uri` is used to fetch the latest release date and other metadata.
161
+ > - The `source_template` is the URL of the GWAS Catalog ancestries file, which uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
162
+ > - The `destination_template` is where the ancestries file will be saved, and it also uses the `{release_date}` placeholder. The release date is fetched from the `stats_uri` endpoint.
163
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release, i.e. the file will be saved under `gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_ancestries.tsv` after the task is completed. If `promote` is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
164
+
165
+ ---
166
+
167
+ ### Curation
168
+
169
+ ```yaml
170
+ - name: curation study
171
+ requires:
172
+ - fetch studies
173
+ previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
174
+ studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
175
+ destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
176
+ promote: true
177
+ ```
178
+
179
+ This task builds the GWAS Catalog curation file that is later used as a template for manual curation. It requires the `fetch studies` task to be completed before it can run, because the curation file is built from the list of studies in the `download studies` file.
180
+
181
+ > [!NOTE]
182
+ > **Task parameters**
183
+ >
184
+ > - The `requires` field specifies that this task depends on the `fetch studies` task, meaning it will only run after the studies have been fetched.
185
+ > - The `previous_curation` field is used to specify the path to the previous curation file. This is used to build the new curation file based on the previous one.
186
+ > - The `studies` field is the path to the studies file that was fetched in the `fetch studies` task. This file is used to build the curation file.
187
+ > - The `destination_template` is where the curation file will be saved, and it uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
188
+ > - The `promote` field is set to `true`, which means the output will be promoted to the latest release, i.e. the file will be saved under `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` after the task is completed. If `promote` is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
189
+
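
The ordering implied by `requires` can be sketched with the standard library. This only illustrates the dependency idea; how gentroutils actually schedules tasks is not shown in this document, and the in-memory mapping below is a hypothetical stand-in for the YAML task list.

```python
from graphlib import TopologicalSorter

# Hypothetical view of the configuration: task name -> names of tasks it requires.
requires = {
    "fetch studies": [],
    "curation study": ["fetch studies"],
}

# static_order() yields every task after all of the tasks it depends on.
order = list(TopologicalSorter(requires).static_order())
print(order)
```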
190
+ ---
191
+
192
+ ## Curation process
193
+
194
+ The basis of the curation process for GWAS Catalog data is described in [docs/gwas_catalog_curation.md](docs/gwas_catalog_curation.md). The original solution uses an R script to prepare the data, which is then curated manually. The `curation` task automates the data preparation and provides a template for manual curation. Manual curation is still required; only the preparation step is automated.
195
+
196
+ The automated process includes:
197
+
198
+ 1. Reading the `download studies` file with the list of studies coming from the latest GWAS Catalog release.
199
+ 2. Reading the `previous curation` file that contains the list of curated studies from the previous release.
200
+ 3. Comparing the two datasets with the following logic:
201
+    - If the study is present in both `previous curation` and `download studies`, it is marked as `curated`.
202
+    - If the study is present in `download studies` but not in `previous curation`, it is marked as `new`.
203
+    - If the study is present in `previous curation` but not in `download studies`, it is marked as `removed`.
204
+ 4. The output of the curation process is a file that lists the studies with their status (`curated`, `new`, `removed`) and the fields required for manual curation. The file is saved to the `destination_template` path specified in the task configuration, i.e. `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv`.
205
+ 5. The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
206
+ 6. Manual curation is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. This step is not automated and requires manual intervention. The output of manual curation should then be saved to both `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv`. This file is then used by the [Open Targets Staging Dags](https://github.com/opentargets/orchestration).
207
+
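
The comparison in step 3 can be sketched in plain Python. This is an illustration of the status rules only, not the code in `parsers/curation.py`: the function name is hypothetical, the study identifiers are made up, and the real task also carries over the fields needed for manual curation.

```python
def classify_studies(previous_curation: set[str], download_studies: set[str]) -> dict[str, str]:
    """Assign a status to every study seen in either input."""
    status: dict[str, str] = {}
    for study_id in sorted(previous_curation | download_studies):
        if study_id in previous_curation and study_id in download_studies:
            status[study_id] = "curated"  # present in both files
        elif study_id in download_studies:
            status[study_id] = "new"  # only in the latest release
        else:
            status[study_id] = "removed"  # only in the previous curation
    return status


# Hypothetical study identifiers, for illustration only.
previous = {"GCST000001", "GCST000002"}
current = {"GCST000002", "GCST000003"}
statuses = classify_studies(previous, current)
print(statuses)
```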
208
+ ---
209
+
210
+ ## Contribute
211
+
212
+ To be able to contribute to the project you need to set it up. This project
213
+ runs on:
214
+
215
+ - [x] python 3.13
216
+ - [x] uv (dependency manager)
217
+
218
+ To set up the project, run:
219
+
220
+ ```{bash}
221
+ make dev
222
+ ```
223
+
224
+ The command installs the above dependencies if they are not present (the initial requirements are curl and bash) and
225
+ installs all Python dependencies listed in `pyproject.toml`. Finally, it installs the `pre-commit` hooks
226
+ that must run before a commit is created.
227
+
228
+ The project has additional `dev` dependencies that include the list of packages used for testing purposes.
229
+ All of the `dev` dependencies are automatically installed by `uv`.
230
+
231
+ Run the following command to see all available dev commands:
234
+
235
+ ```{bash}
236
+ make help
237
+ ```
238
+
239
+ ### Manual testing of CLI module
240
+
241
+ To check CLI execution manually, run:
242
+
243
+ ```{bash}
244
+ uv run gentroutils
245
+ ```
246
+ ---
247
+
248
+ This software was developed as part of the Open Targets project. For more
249
+ information please see: http://www.opentargets.org