gentroutils 3.1.0.tar.gz → 4.0.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (75)
  1. {gentroutils-3.1.0 → gentroutils-4.0.0}/CHANGELOG.md +66 -0
  2. {gentroutils-3.1.0 → gentroutils-4.0.0}/PKG-INFO +16 -12
  3. {gentroutils-3.1.0 → gentroutils-4.0.0}/README.md +13 -9
  4. gentroutils-4.0.0/config.yaml +41 -0
  5. {gentroutils-3.1.0 → gentroutils-4.0.0}/pyproject.toml +3 -4
  6. gentroutils-4.0.0/src/gentroutils/io/transfer/ftp_to_gcs.py +143 -0
  7. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/transfer/polars_to_gcs.py +1 -1
  8. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/parsers/curation.py +88 -8
  9. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/tasks/curation.py +9 -1
  10. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/transfer/test_ftp_to_gcs.py +52 -1
  11. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/transfer/test_polars_to_gcs.py +8 -6
  12. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/parsers/test_curation.py +128 -26
  13. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/tasks/test_crawl_task.py +2 -2
  14. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/tasks/test_curation_task.py +11 -4
  15. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/tasks/test_fetch_task.py +17 -13
  16. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/test_transfer.py +21 -15
  17. gentroutils-4.0.0/uv.lock +2344 -0
  18. gentroutils-3.1.0/config.yaml +0 -40
  19. gentroutils-3.1.0/src/gentroutils/io/transfer/ftp_to_gcs.py +0 -61
  20. gentroutils-3.1.0/uv.lock +0 -2132
  21. {gentroutils-3.1.0 → gentroutils-4.0.0}/.github/workflows/build.yaml +0 -0
  22. {gentroutils-3.1.0 → gentroutils-4.0.0}/.github/workflows/labeler.yaml +0 -0
  23. {gentroutils-3.1.0 → gentroutils-4.0.0}/.github/workflows/pr.yaml +0 -0
  24. {gentroutils-3.1.0 → gentroutils-4.0.0}/.github/workflows/release.yaml +0 -0
  25. {gentroutils-3.1.0 → gentroutils-4.0.0}/.github/workflows/release_pr.yaml +0 -0
  26. {gentroutils-3.1.0 → gentroutils-4.0.0}/.github/workflows/tag.yaml +0 -0
  27. {gentroutils-3.1.0 → gentroutils-4.0.0}/.gitignore +0 -0
  28. {gentroutils-3.1.0 → gentroutils-4.0.0}/.pre-commit-config.yaml +0 -0
  29. {gentroutils-3.1.0 → gentroutils-4.0.0}/.vscode/extensions.json +0 -0
  30. {gentroutils-3.1.0 → gentroutils-4.0.0}/.vscode/settings.json +0 -0
  31. {gentroutils-3.1.0 → gentroutils-4.0.0}/Dockerfile +0 -0
  32. {gentroutils-3.1.0 → gentroutils-4.0.0}/LICENSE +0 -0
  33. {gentroutils-3.1.0 → gentroutils-4.0.0}/Makefile +0 -0
  34. {gentroutils-3.1.0 → gentroutils-4.0.0}/commitlint.config.js +0 -0
  35. {gentroutils-3.1.0 → gentroutils-4.0.0}/conftest.py +0 -0
  36. {gentroutils-3.1.0 → gentroutils-4.0.0}/docs/00_prepare_tables_for_curation.R +0 -0
  37. {gentroutils-3.1.0 → gentroutils-4.0.0}/docs/gwas_catalog_curation.md +0 -0
  38. {gentroutils-3.1.0 → gentroutils-4.0.0}/setup.sh +0 -0
  39. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/__init__.py +0 -0
  40. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/errors.py +0 -0
  41. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/path/__init__.py +0 -0
  42. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/path/ftp.py +0 -0
  43. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/path/gcs.py +0 -0
  44. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/transfer/__init__.py +0 -0
  45. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/transfer/model.py +0 -0
  46. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/parsers/__init__.py +0 -0
  47. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/py.typed +0 -0
  48. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/tasks/__init__.py +0 -0
  49. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/tasks/crawl.py +0 -0
  50. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/tasks/fetch.py +0 -0
  51. {gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/transfer.py +0 -0
  52. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/ftp/test/databases/gwas/summary_statistics/harmonised_list.txt +0 -0
  53. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/gsutil_list.txt +0 -0
  54. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/correct_curation.tsv +0 -0
  55. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_analysisFlag_type.tsv +0 -0
  56. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_analysisFlag_value.tsv +0 -0
  57. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_columns_curation.tsv +0 -0
  58. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_publicationTitle_type.tsv +0 -0
  59. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_pubmedId_type.tsv +0 -0
  60. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_studyId_type.tsv +0 -0
  61. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_studyId_value.tsv +0 -0
  62. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_studyType_type.tsv +0 -0
  63. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_studyType_value.tsv +0 -0
  64. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/incorrect_traitFromSource_type.tsv +0 -0
  65. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/non_unique_studyId.tsv +0 -0
  66. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/manual_curation/null_value_in_studyId.tsv +0 -0
  67. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/data/test.h.tsv.gz +0 -0
  68. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/conftest.py +0 -0
  69. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/path/conftest.py +0 -0
  70. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/path/test_ftp.py +0 -0
  71. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/path/test_gcs.py +0 -0
  72. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/transfer/conftest.py +0 -0
  73. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/io/transfer/test_model.py +0 -0
  74. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/parsers/conftest.py +0 -0
  75. {gentroutils-3.1.0 → gentroutils-4.0.0}/tests/tasks/conftest.py +0 -0
{gentroutils-3.1.0 → gentroutils-4.0.0}/CHANGELOG.md
@@ -1,6 +1,72 @@
  # CHANGELOG


+ ## v4.0.0 (2026-02-03)
+
+
+ ## v4.0.0-dev.1 (2026-02-03)
+
+ ### Features
+
+ - Updete dependencies
+ ([`b6af4d2`](https://github.com/opentargets/gentroutils/commit/b6af4d28605e7c687f5ec15cae7187c64e834cb0))
+
+
+ ## v3.2.0 (2026-02-03)
+
+ ### Chores
+
+ - Update uv lock
+ ([`6f13fc0`](https://github.com/opentargets/gentroutils/commit/6f13fc0055ee9a49a215166d3cccb31747602a4f))
+
+
+ ## v3.2.0-dev.2 (2026-02-03)
+
+ ### Bug Fixes
+
+ - Output tsv file instead of csv
+ ([`aff71b1`](https://github.com/opentargets/gentroutils/commit/aff71b16b6c4d273cc851050a793d3798bae27ac))
+
+ - Test
+ ([`f9dd890`](https://github.com/opentargets/gentroutils/commit/f9dd890efc32ab969fbcd14eb0da14e40678e8fb))
+
+ - Test for curation
+ ([`b853358`](https://github.com/opentargets/gentroutils/commit/b85335815d7a22745c61404f81b612a14cce06d5))
+
+ - Test for curation
+ ([`22138ab`](https://github.com/opentargets/gentroutils/commit/22138ab31f7551a4b161f6f1885b7975d57a0ac7))
+
+ ### Chores
+
+ - Cleanup
+ ([`68a3f66`](https://github.com/opentargets/gentroutils/commit/68a3f6607a4a1b61441c1369f2a9d3b4babec30c))
+
+ - Fix glob pattern
+ ([`404b8ca`](https://github.com/opentargets/gentroutils/commit/404b8ca71b95764529ebb3df7c39881a0a12ff5e))
+
+ - Handle mutliple sumstat files
+ ([`1fc8902`](https://github.com/opentargets/gentroutils/commit/1fc8902171a8f6edac407b790c3bcbe691792f96))
+
+ - Update
+ ([`e69575b`](https://github.com/opentargets/gentroutils/commit/e69575b5a6c802b78b959314b84348d7969eeaeb))
+
+ - Update readme
+ ([`12f274c`](https://github.com/opentargets/gentroutils/commit/12f274c5158b3986ba2511791fc2289b24d9aa40))
+
+
+ ## v3.2.0-dev.1 (2025-11-05)
+
+ ### Chores
+
+ - Uncomment config
+ ([`30c4d68`](https://github.com/opentargets/gentroutils/commit/30c4d68e79a35d2c5c83cd17a15f63906ef834d6))
+
+ ### Features
+
+ - **associations**: Allow zip file transfer from ftp
+ ([`662a635`](https://github.com/opentargets/gentroutils/commit/662a63593cd5f340a768974041461cc65e1566b9))
+
+
  ## v3.1.0 (2025-09-02)

  ### Chores
{gentroutils-3.1.0 → gentroutils-4.0.0}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: gentroutils
- Version: 3.1.0
+ Version: 4.0.0
  Summary: Open Targets python genetics utility CLI tools
  Author-email: Szymon Szyszkowski <ss60@sanger.ac.uk>
  License-Expression: Apache-2.0
@@ -12,13 +12,13 @@ Classifier: License :: OSI Approved :: Apache Software License
  Classifier: Operating System :: Unix
  Classifier: Programming Language :: Python :: 3.13
  Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
- Requires-Python: >=3.13
+ Requires-Python: <=3.13,>3.11
  Requires-Dist: aioftp>=0.25.1
  Requires-Dist: aiohttp>=3.11.18
  Requires-Dist: gcsfs>=2025.7.0
  Requires-Dist: google-cloud-storage>=3.1.1
  Requires-Dist: loguru>=0.7.3
- Requires-Dist: opentargets-otter>=25.0.2
+ Requires-Dist: opentargets-otter>=25.0.15
  Requires-Dist: polars[fsspec]>=1.31.0
  Requires-Dist: pydantic>=2.10.6
  Requires-Dist: tqdm>=4.67.1
@@ -99,6 +99,7 @@ steps:
  previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
  studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
  destination_template: gs://gwas_catalog_inputs/gentroutils/curation/{release_date}/GWAS_Catalog_study_curation.tsv
+ summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
  promote: true
  ```

@@ -164,7 +165,7 @@ This task fetches the GWAS Catalog associations file from the specified FTP serv

  This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.

- > [!NOTE]
+ > [!NOTE]
  > **Task parameters**
  >
  > - The `stats_uri` is used to fetch the latest release date and other metadata.
@@ -186,7 +187,7 @@ This task fetches the GWAS Catalog studies file from the specified FTP server an

  This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.

- > [!NOTE]
+ > [!NOTE]
  > **Task parameters**
  >
  > - The `stats_uri` is used to fetch the latest release date and other metadata.
@@ -205,6 +206,7 @@ This task fetches the GWAS Catalog ancestries file from the specified FTP server
  previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
  studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
  destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
+ summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
  promote: true
  ```

@@ -218,24 +220,26 @@ This task is used to build the GWAS Catalog curation file that is later used as
  > - The `studies` field is the path to the studies file that was fetched in the `fetch studies` task. This file is used to build the curation file.
  > - The `destination_template` is where the curation file will be saved, and it uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
  > - The `promote` field is set to `true`, which means the output will be promoted to the latest release, meaning that the file will be saved under `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
+ > - The `summary_statistics_glob` field specifies the glob pattern used to list all synced summary statistics files from GCS. This is used to identify which studies have summary statistics available.

  ---

  ## Curation process

- The base of the curation process for GWAS Catalog data is defined in the [docs/gwas_catalog_curation.md](docs/gwas_catalog_curation.md). The original solution uses R script to prepare the data for curation and then manually curates the data. The solution proposed in the `curation` task autommates the preparation of the data for curation and provides a template for manual curation. The manual curation process is still required, but the data preparation is automated.
+ The base of the curation process for GWAS Catalog data is defined in [docs/gwas_catalog_curation.md](docs/gwas_catalog_curation.md). The original solution uses an R script to prepare the data for curation, after which the data is curated manually. The `curation` task automates the preparation of the data for curation and provides a template for the manual curation. The manual curation step is still required, but the data preparation is automated.

  The automated process includes:

  1. Reading `download studies` file with the list of studies that are currently coming from the latest GWAS Catalog release.
  2. Reading `previous curation` file that contains the list of the curated studies from the previous release.
- 3. Comparing the two datasets with following logic:
+ 3. Listing all synced summary statistics files matching the `summary_statistics_glob` pattern to identify which studies have summary statistics available. Note that this list can contain more studies than the `download studies` file, as syncing also includes unpublished studies.
+ 4. Comparing the three datasets with the following logic:
   - In case the study is present in the `previous curation` and `download studies`, the study is marked as `curated`
- * In case the study is present in the `download studies` but not in the `previous curation`, the study is marked as `new`
- * In case the study is present in the `previous curation` but not in the `download studies`, the study is marked as `removed`
- 4. The output of the curation process is a file that contains the list of studies with their status (curated, new, removed) and the fields that are required for manual curation. The output file is saved to the `destination_template` path specified in the task configuration. The file is saved under `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv` path.
- 5. The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
- 6. The manual curation process is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. The manual curation process is not automated and requires manual intervention. The output from the manual curation process should be saved then to the `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv` file. This file is then used for the [Open Targets Staging Dags](https://github.com/opentargets/orchestration).
+ - In case the study is present in the `download studies` but not in the `previous curation`, the study is marked as `to_curate` or `no_summary_statistics`, depending on whether summary statistics files are present
+ - In case the study is present in the `previous curation` but not in the `download studies`, the study is marked as `removed`
+ 5. The output of the curation process is a file that contains the list of studies with their status (`curated`, `to_curate`, `no_summary_statistics`, `removed`) and the fields that are required for manual curation. The output file is saved to the `destination_template` path specified in the task configuration, i.e. under the `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv` path.
+ 6. The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
+ 7. The manual curation process is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. This step is not automated and requires manual intervention. The output of the manual curation should then be saved to the `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv` files. These files are then used by the [Open Targets Staging Dags](https://github.com/opentargets/orchestration).

  ---

{gentroutils-3.1.0 → gentroutils-4.0.0}/README.md
@@ -73,6 +73,7 @@ steps:
  previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
  studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
  destination_template: gs://gwas_catalog_inputs/gentroutils/curation/{release_date}/GWAS_Catalog_study_curation.tsv
+ summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
  promote: true
  ```

@@ -138,7 +139,7 @@ This task fetches the GWAS Catalog associations file from the specified FTP serv

  This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.

- > [!NOTE]
+ > [!NOTE]
  > **Task parameters**
  >
  > - The `stats_uri` is used to fetch the latest release date and other metadata.
@@ -160,7 +161,7 @@ This task fetches the GWAS Catalog studies file from the specified FTP server an

  This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.

- > [!NOTE]
+ > [!NOTE]
  > **Task parameters**
  >
  > - The `stats_uri` is used to fetch the latest release date and other metadata.
@@ -179,6 +180,7 @@ This task fetches the GWAS Catalog ancestries file from the specified FTP server
  previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
  studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
  destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
+ summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
  promote: true
  ```

@@ -192,24 +194,26 @@ This task is used to build the GWAS Catalog curation file that is later used as
  > - The `studies` field is the path to the studies file that was fetched in the `fetch studies` task. This file is used to build the curation file.
  > - The `destination_template` is where the curation file will be saved, and it uses the `{release_date}` placeholder to specify the release date dynamically. The release date is fetched from the `stats_uri` endpoint.
  > - The `promote` field is set to `true`, which means the output will be promoted to the latest release, meaning that the file will be saved under `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` after the task is completed. If the `promote` field is set to `false`, the file will not be promoted and will be saved under the specified path with the release date.
+ > - The `summary_statistics_glob` field specifies the glob pattern used to list all synced summary statistics files from GCS. This is used to identify which studies have summary statistics available.

  ---

  ## Curation process

- The base of the curation process for GWAS Catalog data is defined in the [docs/gwas_catalog_curation.md](docs/gwas_catalog_curation.md). The original solution uses R script to prepare the data for curation and then manually curates the data. The solution proposed in the `curation` task autommates the preparation of the data for curation and provides a template for manual curation. The manual curation process is still required, but the data preparation is automated.
+ The base of the curation process for GWAS Catalog data is defined in [docs/gwas_catalog_curation.md](docs/gwas_catalog_curation.md). The original solution uses an R script to prepare the data for curation, after which the data is curated manually. The `curation` task automates the preparation of the data for curation and provides a template for the manual curation. The manual curation step is still required, but the data preparation is automated.

  The automated process includes:

  1. Reading `download studies` file with the list of studies that are currently coming from the latest GWAS Catalog release.
  2. Reading `previous curation` file that contains the list of the curated studies from the previous release.
- 3. Comparing the two datasets with following logic:
+ 3. Listing all synced summary statistics files matching the `summary_statistics_glob` pattern to identify which studies have summary statistics available. Note that this list can contain more studies than the `download studies` file, as syncing also includes unpublished studies.
+ 4. Comparing the three datasets with the following logic:
   - In case the study is present in the `previous curation` and `download studies`, the study is marked as `curated`
- * In case the study is present in the `download studies` but not in the `previous curation`, the study is marked as `new`
- * In case the study is present in the `previous curation` but not in the `download studies`, the study is marked as `removed`
- 4. The output of the curation process is a file that contains the list of studies with their status (curated, new, removed) and the fields that are required for manual curation. The output file is saved to the `destination_template` path specified in the task configuration. The file is saved under `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv` path.
- 5. The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
- 6. The manual curation process is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. The manual curation process is not automated and requires manual intervention. The output from the manual curation process should be saved then to the `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv` file. This file is then used for the [Open Targets Staging Dags](https://github.com/opentargets/orchestration).
+ - In case the study is present in the `download studies` but not in the `previous curation`, the study is marked as `to_curate` or `no_summary_statistics`, depending on whether summary statistics files are present
+ - In case the study is present in the `previous curation` but not in the `download studies`, the study is marked as `removed`
+ 5. The output of the curation process is a file that contains the list of studies with their status (`curated`, `to_curate`, `no_summary_statistics`, `removed`) and the fields that are required for manual curation. The output file is saved to the `destination_template` path specified in the task configuration, i.e. under the `gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv` path.
+ 6. The output file is then promoted to the latest release path `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` so that it can be used for manual curation.
+ 7. The manual curation process is then performed on the `gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv` file. This step is not automated and requires manual intervention. The output of the manual curation should then be saved to the `gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv` and `gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv` files. These files are then used by the [Open Targets Staging Dags](https://github.com/opentargets/orchestration).

  ---
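The status assignment described in the curation-process list above comes down to joins over the three inputs. As an editor-added illustration (not part of the package), here is a minimal polars sketch with hypothetical toy frames standing in for the real files; the status strings are the `CuratedStudyStatus` values from `src/gentroutils/parsers/curation.py`, shown later in this diff:

```python
import polars as pl

# Hypothetical toy inputs mirroring the three datasets of the curation process.
previous_curation = pl.DataFrame({"studyId": ["GCST1", "GCST2"]})
download_studies = pl.DataFrame({"studyId": ["GCST1", "GCST3", "GCST4"]})
synced = pl.DataFrame({"studyId": ["GCST3"], "isSynced": [True]})

# In both previous curation and download studies -> curated.
curated = previous_curation.join(download_studies, on="studyId", how="semi").with_columns(
    pl.lit("curated").alias("status")
)
# Only in previous curation -> removed.
removed = previous_curation.join(download_studies, on="studyId", how="anti").with_columns(
    pl.lit("removed").alias("status")
)
# Only in download studies -> to_curate when a synced summary statistics
# file exists for the study, no_summary_statistics otherwise.
new = (
    download_studies.join(previous_curation, on="studyId", how="anti")
    .join(synced, on="studyId", how="left")
    .with_columns(
        pl.when(pl.col("isSynced").is_null())
        .then(pl.lit("no_summary_statistics"))
        .otherwise(pl.lit("to_curate"))
        .alias("status")
    )
    .drop("isSynced")
)

# GCST1 curated, GCST2 removed, GCST3 to_curate, GCST4 no_summary_statistics.
print(pl.concat([curated, removed, new]))
```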
gentroutils-4.0.0/config.yaml
@@ -0,0 +1,41 @@
+ ---
+ work_path: ./work
+ log_level: DEBUG
+ scratchpad:
+   gc_stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
+   gc_bucket: "gs://gwas_catalog_inputs"
+   gc_ftp: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases"
+
+ steps:
+   gwas_catalog_release:
+     - name: crawl release metadata
+       stats_uri: ${gc_stats_uri}
+       destination_template: '${gc_bucket}/gentroutils/{release_date}/stats.json'
+       promote: true
+
+     - name: fetch studies
+       stats_uri: ${gc_stats_uri}
+       source_template: '${gc_ftp}/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt'
+       destination_template: '${gc_bucket}/gentroutils/{release_date}/gwas_catalog_download_studies.tsv'
+       promote: true
+
+     - name: fetch ancestries
+       stats_uri: ${gc_stats_uri}
+       source_template: '${gc_ftp}/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt'
+       destination_template: '${gc_bucket}/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv'
+       promote: true
+
+     - name: fetch associations
+       stats_uri: ${gc_stats_uri}
+       source_template: '${gc_ftp}/{release_date}/gwas-catalog-associations_ontology-annotated-full.zip'
+       destination_template: '${gc_bucket}/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv'
+       promote: true
+
+     - name: curation study
+       requires:
+         - fetch studies
+       previous_curation: '${gc_bucket}/curation/latest/curated/GWAS_Catalog_study_curation.tsv'
+       studies: '${gc_bucket}/gentroutils/latest/gwas_catalog_download_studies.tsv'
+       summary_statistics_glob: '${gc_bucket}/raw_summary_statistics/**.h.tsv.gz'
+       destination_template: '${gc_bucket}/curation/{release_date}/raw/GWAS_Catalog_study_curation.tsv'
+       promote: true
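For illustration, this is how a `destination_template` from the step above might resolve once the `${...}` scratchpad variables and the `{release_date}` placeholder are filled in. The two-stage substitution below is an editor-added sketch of the behaviour, not the otter framework's actual API, and the release date value is hypothetical:

```python
from string import Template

# Hypothetical scratchpad value and release date; the real values come from
# config.yaml and the GWAS Catalog stats endpoint respectively.
scratchpad = {"gc_bucket": "gs://gwas_catalog_inputs"}
release_date = "2026/02/03"

template = "${gc_bucket}/curation/{release_date}/raw/GWAS_Catalog_study_curation.tsv"
# Stage 1: expand the ${...} scratchpad variables; stage 2: fill {release_date}.
resolved = Template(template).substitute(scratchpad).format(release_date=release_date)
print(resolved)
# gs://gwas_catalog_inputs/curation/2026/02/03/raw/GWAS_Catalog_study_curation.tsv
```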
{gentroutils-3.1.0 → gentroutils-4.0.0}/pyproject.toml
@@ -1,7 +1,7 @@
  [project]
  authors = [{ name = "Szymon Szyszkowski", email = "ss60@sanger.ac.uk" }]
  name = "gentroutils"
- version = "3.1.0"
+ version = "4.0.0"
  description = "Open Targets python genetics utility CLI tools"
  dependencies = [
      "aiohttp>=3.11.18",
@@ -10,12 +10,12 @@ dependencies = [
      "pydantic>=2.10.6",
      "loguru>=0.7.3",
      "tqdm>=4.67.1",
-     "opentargets-otter>=25.0.2",
+     "opentargets-otter>=25.0.15",
      "google-cloud-storage>=3.1.1",
      "gcsfs>=2025.7.0",
  ]
  readme = "README.md"
- requires-python = ">=3.13"
+ requires-python = ">3.11,<=3.13"
  license = "Apache-2.0"
  classifiers = [
      "Development Status :: 3 - Alpha",
@@ -50,7 +50,6 @@ dev = [
      "gcloud-storage-emulator>=0.5.0",
      "types-requests>=2.32.0.20240712",
      "pyftpdlib>=2.0.1",
-     "python-semantic-release>=9.19.1",
      "pandas-stubs>=2.2.3.250308",
      "ipython>=8.36.0",
      "pytest-asyncio>=1.1.0",
gentroutils-4.0.0/src/gentroutils/io/transfer/ftp_to_gcs.py
@@ -0,0 +1,143 @@
+ """Transfer files from FTP to Google Cloud Storage (GCS)."""
+
+ import asyncio
+ import io
+ import re
+ from typing import Annotated
+
+ import aioftp
+ from google.cloud import storage
+ from loguru import logger
+ from pydantic import AfterValidator
+
+ from gentroutils.io.path import FTPPath, GCSPath
+ from gentroutils.io.transfer.model import TransferableObject
+
+
+ class FTPtoGCPTransferableObject(TransferableObject):
+     """A class to represent an object that can be transferred from FTP to GCP."""
+
+     source: Annotated[str, AfterValidator(lambda x: str(FTPPath(x)))]
+     destination: Annotated[str, AfterValidator(lambda x: str(GCSPath(x)))]
+
+     async def transfer(self) -> None:
+         """Transfer files from FTP to GCP.
+
+         This function fetches the data for the file provided in the local FTP path, collects the
+         data asynchronously to a buffer, and uploads it to the provided GCP bucket blob.
+
+         Implements retry logic with exponential backoff for handling transient network errors.
+         """
+         max_retries = 3
+         retry_delay = 1  # Initial delay in seconds
+
+         for attempt in range(max_retries):
+             try:
+                 await self._perform_transfer()
+                 return  # Success, exit the retry loop
+             except (ConnectionResetError, OSError, aioftp.errors.AIOFTPException) as e:
+                 if attempt < max_retries - 1:
+                     wait_time = retry_delay * (2**attempt)  # Exponential backoff
+                     logger.warning(
+                         f"Transfer attempt {attempt + 1}/{max_retries} failed for {self.source}: {e}. "
+                         f"Retrying in {wait_time}s..."
+                     )
+                     await asyncio.sleep(wait_time)
+                 else:
+                     logger.error(f"Transfer failed after {max_retries} attempts for {self.source}: {e}")
+                     raise
+             except Exception as e:
+                 # For non-retryable exceptions, log and raise immediately
+                 logger.error(f"Non-retryable error during transfer from {self.source} to {self.destination}: {e}")
+                 raise
+
+     async def _perform_transfer(self) -> None:
+         """Perform the actual transfer operation.
+
+         This is separated from the transfer method to allow for retry logic.
+         """
+         logger.info(f"Attempting to transfer data from {self.source} to {self.destination}.")
+         gcs_obj = GCSPath(self.destination)
+         ftp_obj = FTPPath(self.source)
+
+         async with aioftp.Client.context(ftp_obj.server, user="anonymous", password="anonymous") as ftp:  # noqa: S106
+             bucket = storage.Client().bucket(gcs_obj.bucket)
+             blob = bucket.blob(gcs_obj.object)
+             logger.info(f"Searching for the release date in the provided ftp path: {ftp_obj.base_dir}.")
+             dir_match = re.match(r"^.*(?P<release_date>\d{4}\/\d{2}\/\d{2}){1}$", str(ftp_obj.base_dir))
+
+             if dir_match:
+                 logger.info(f"Found release date to search in the ftp {dir_match.group('release_date')}.")
+                 release_date = dir_match.group("release_date")
+                 try:
+                     logger.debug(f"We are in the directory: {await ftp.get_current_directory()}")
+                     logger.debug(f"Changing directory to: {ftp_obj.base_dir}")
+                     await ftp.change_directory(ftp_obj.base_dir)
+                     logger.success(f"Successfully changed directory to: {ftp_obj.base_dir}")
+                 except aioftp.StatusCodeError as e:
+                     logger.warning(f"Failed to change directory to {ftp_obj.base_dir}: {e}")
+                     logger.warning(f"Probably the release date {release_date} is out of sync with the api endpoint.")
+                     try:
+                         logger.warning("Attempting to load the `latest` release.")
+                         ftp_obj = FTPPath(self.source.replace(release_date, "latest"))
+                         await ftp.change_directory(ftp_obj.base_dir)
+                         logger.success(f"Successfully changed directory to: {ftp_obj.base_dir}")
+
+                     except aioftp.StatusCodeError:
+                         logger.error(f"Failed to find the latest release under {ftp_obj}")
+                         raise
+
+                 logger.debug("Creating in-memory buffer to store downloaded data.")
+                 buffer = io.BytesIO()
+                 logger.debug(f"Downloading data from FTP path: {ftp_obj.filename}")
+                 stream = await ftp.download_stream(ftp_obj.filename)
+                 logger.info("Successfully connected to the FTP stream, beginning data transfer to buffer.")
+                 async with stream:
+                     async for block in stream.iter_by_block():
+                         buffer.write(block)
+                 buffer.seek(0)
+                 if ftp_obj.filename.endswith(".zip"):
+                     logger.info("Uploading zipped content to GCS blob.")
+                     logger.info("Unzipping content before upload.")
+                     content = unzip_buffer(buffer)
+                     blob.upload_from_string(content)
+                 else:
+                     content = buffer.getvalue()
+                     buffer.close()
+                     blob.upload_from_string(content)
+
+             else:
+                 logger.error(f"Failed to extract release date from the provided ftp path: {ftp_obj.base_dir}.")
+                 raise ValueError("Release date could not be extracted from the FTP path.")
+
+
+ def unzip_buffer(buffer: io.BytesIO) -> bytes:
+     """Unzip a BytesIO buffer and return the content of the single file it contains.
+
+     Args:
+         buffer (io.BytesIO): The in-memory buffer containing zipped data.
+
+     Returns:
+         bytes: The unzipped content of the single file.
+
+     Raises:
+         ValueError: If multiple files are found in the zipped buffer or if no files are found.
+     """
+     import zipfile
+
+     unzipped_files: dict[str, bytes] = {}
+     with zipfile.ZipFile(buffer) as z:
+         for file_info in z.infolist():
+             with z.open(file_info) as unzipped_file:
+                 unzipped_files[file_info.filename] = unzipped_file.read()
+
+     if len(unzipped_files) == 0:
+         logger.error("No files were found in the zipped buffer.")
+         raise ValueError("No files were found in the zipped buffer.")
+     if len(unzipped_files) != 1:
+         logger.error("Multiple files were found in the zipped buffer.")
+         raise ValueError("Multiple files were found in the zipped buffer.")
+     keys = list(unzipped_files.keys())
+     logger.info(f"Unzipped file: {keys[0]} with size {len(unzipped_files[keys[0]])} bytes.")
+
+     return unzipped_files[keys[0]]
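A usage sketch for the class above (editor-added, not part of the package): `transfer()` is a coroutine, so a caller drives it with asyncio; on transient FTP errors it retries with exponential backoff (1 s, 2 s, 4 s) before re-raising. The URIs are hypothetical, and GCS credentials are assumed to come from the environment:

```python
import asyncio

from gentroutils.io.transfer.ftp_to_gcs import FTPtoGCPTransferableObject


async def main() -> None:
    # Hypothetical source and destination; the pydantic validators normalise
    # them into FTPPath/GCSPath form on construction.
    obj = FTPtoGCPTransferableObject(
        source="ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/2026/02/03/gwas-catalog-download-studies-v1.0.3.1.txt",
        destination="gs://gwas_catalog_inputs/gentroutils/2026/02/03/gwas_catalog_download_studies.tsv",
    )
    await obj.transfer()


asyncio.run(main())
```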
{gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/io/transfer/polars_to_gcs.py
@@ -16,5 +16,5 @@ class PolarsDataFrameToGCSTransferableObject(TransferableObject):
          """Transfer the Polars DataFrame to the specified GCS destination."""
          # Convert Polars DataFrame to CSV and upload to GCS
          logger.info(f"Transferring Polars DataFrame to {self.destination}.")
-         self.source.write_csv(self.destination)
+         self.source.write_csv(self.destination, separator="\t", include_header=True)
          logger.info(f"Uploading DataFrame to {self.destination}")
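The one-line change above is what the v3.2.0-dev.2 changelog entry "Output tsv file instead of csv" refers to: polars' CSV writer becomes a TSV writer once the separator is overridden. A minimal standalone equivalent, with a hypothetical frame and output path:

```python
import polars as pl

df = pl.DataFrame({"studyId": ["GCST90000001"], "status": ["curated"]})
# Same call as in the transfer object: tab separator plus an explicit header row.
df.write_csv("curation.tsv", separator="\t", include_header=True)
```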
{gentroutils-3.1.0 → gentroutils-4.0.0}/src/gentroutils/parsers/curation.py
@@ -5,6 +5,7 @@ from __future__ import annotations
  from enum import StrEnum

  import polars as pl
+ from google.cloud.storage import Client
  from loguru import logger

  from gentroutils.errors import GentroutilsError, GentroutilsErrorMessage
@@ -69,31 +70,102 @@ class DownloadStudiesSchema(StrEnum):
          return [member.value for member in cls]


+ class SyncedSummaryStatisticsSchema(StrEnum):
+     """Enum to define the columns for synced summary statistics."""
+
+     FILE_PATH = "filePath"
+     """The GCS file path of the summary statistics file."""
+     SYNCED = "isSynced"
+     """Flag indicating whether the file has been synced."""
+     STUDY_ID = "studyId"
+     """The unique identifier for a study."""
+
+     @classmethod
+     def columns(cls) -> list[str]:
+         """Get the list of columns defined in the schema."""
+         return [member.value for member in cls]
+
+
  class CuratedStudyStatus(StrEnum):
      """Enum to define the status of a curated study."""

      REMOVED = "removed"
      """The study has been removed from the GWAS Catalog."""
-     NEW = "new"
-     """The study is new in the GWAS Catalog."""
+     TO_CURATE = "to_curate"
+     """The study is new and needs to be curated."""
      CURATED = "curated"
      """The study has been curated and is still in the GWAS Catalog."""
+     NO_SUMSTATS = "no_summary_statistics"
+     """The study has no associated summary statistics."""
+
+
+ class GCSSummaryStatisticsFileCrawler:
+     """Class to crawl GCS for summary statistics files."""
+
+     def __init__(self, gcs_glob: str):
+         """Initialize the GCSSummaryStatisticsFileCrawler with a GCS glob pattern."""
+         self.gcs_glob = gcs_glob
+         logger.debug("Initialized GCSSummaryStatisticsFileCrawler with glob: {}", gcs_glob)
+
+     def _fetch_paths(self) -> list[str]:
+         """Fetch file paths from GCS based on the glob pattern."""
+         c = Client()
+         bucket_name = self.gcs_glob.split("/")[2]
+         prefix = "/".join(self.gcs_glob.split("/")[3:-1])
+         suffix = self.gcs_glob.split("/")[-1].replace("*", "")
+         logger.debug("Crawling GCS bucket: {}, prefix: {}, suffix: {}", bucket_name, prefix, suffix)
+         bucket = c.bucket(bucket_name)
+         blobs = bucket.list_blobs(prefix=prefix)
+         return [f"gs://{bucket_name}/{blob.name}" for blob in blobs if blob.name.endswith(suffix)]
+
+     def crawl(self) -> pl.DataFrame:
+         """Crawl GCS and return a DataFrame of summary statistics files."""
+         file_paths = self._fetch_paths()
+         logger.debug("Found {} summary statistics files.", len(file_paths))
+         data = pl.DataFrame({
+             SyncedSummaryStatisticsSchema.FILE_PATH: file_paths,
+             SyncedSummaryStatisticsSchema.SYNCED: [True] * len(file_paths),
+         }).with_columns(
+             pl.col(SyncedSummaryStatisticsSchema.FILE_PATH)
+             .str.extract(r"\/(GCST\d+)\/", 1)
+             .alias(SyncedSummaryStatisticsSchema.STUDY_ID)
+         )
+         # Post check to find if there are any studies with multiple files.
+         multi_files = data.group_by(SyncedSummaryStatisticsSchema.STUDY_ID).len().filter(pl.col("len") > 1)
+         if not multi_files.is_empty():
+             logger.warning("Studies with multiple summary statistics files found: {}", multi_files)
+             logger.warning("DataFrame shape before deduplication: {}", data.shape)
+             logger.warning("Synced data preview:\n{}", data.head())
+             data = data.unique(subset=SyncedSummaryStatisticsSchema.STUDY_ID)
+             logger.warning("Synced data after deduplication:\n{}", data.shape)
+         return data


  class GWASCatalogCuration:
      """Class to handle the curation of GWAS Catalog data."""

-     def __init__(self, previous_curation: pl.DataFrame, studies: pl.DataFrame):
+     def __init__(self, previous_curation: pl.DataFrame, studies: pl.DataFrame, synced: pl.DataFrame):
          """Initialize the GWASCatalogCuration with previous curation and studies data."""
          logger.debug("Initializing GWASCatalogCuration with previous curation and studies data.")
          self.previous_curation = previous_curation
          logger.debug("Previous curation data loaded with shape: {}", previous_curation.shape)
          self.studies = studies
          logger.debug("Studies data loaded with shape: {}", studies.shape)
+         self.synced = synced
+         logger.debug("Synced summary statistics data loaded with shape: {}", synced.shape)

      @classmethod
-     def from_prev_curation(cls, previous_curation_path: str, download_studies_path: str) -> GWASCatalogCuration:
+     def from_prev_curation(
+         cls,
+         previous_curation_path: str,
+         download_studies_path: str,
+         summary_statistics_glob: str,
+     ) -> GWASCatalogCuration:
          """Create a GWASCatalogCuration instance from previous curation and studies."""
+         crawled_summary_statistics = GCSSummaryStatisticsFileCrawler(summary_statistics_glob).crawl()
+
          previous_curation_df = pl.read_csv(
              previous_curation_path,
              separator="\t",
@@ -112,7 +184,7 @@ class GWASCatalogCuration:
          if studies_df.is_empty():
              raise GentroutilsError(GentroutilsErrorMessage.DOWNLOAD_STUDIES_EMPTY, path=download_studies_path)
          studies_df = studies_df.rename(mapping=DownloadStudiesSchema.mapping())
-         return cls(previous_curation_df, studies_df)
+         return cls(previous_curation_df, studies_df, crawled_summary_statistics)

      @property
      def result(self) -> pl.DataFrame:
@@ -144,7 +216,11 @@ class GWASCatalogCuration:
          assert all(prev_studies.select(CurationSchema.STUDY_ID).is_unique()), "Study IDs must be unique after merging."

          # Studies that are new in the GWAS Catalog
-         new_studies = self.studies.join(self.previous_curation, on=CurationSchema.STUDY_ID, how="anti").select(
+         new_studies = self.studies.join(self.previous_curation, on=CurationSchema.STUDY_ID, how="anti")
+         # Annotate new studies with whether they have summary statistics synced to the GCS bucket
+         # (the anti join above already dropped previously curated studies).
+         new_studies_annotated = new_studies.join(self.synced, on=CurationSchema.STUDY_ID, how="left")
+         # Assign status NO_SUMSTATS to new studies without synced summary statistics.
+         new_studies_annotated = new_studies_annotated.select(
              CurationSchema.STUDY_ID,
              pl.lit(None).alias(CurationSchema.STUDY_TYPE),
              pl.lit(None).alias(CurationSchema.ANALYSIS_FLAG),
@@ -153,12 +229,16 @@ class GWASCatalogCuration:
              CurationSchema.PUBMED_ID,
              CurationSchema.PUBLICATION_TITLE,
              CurationSchema.TRAIT_FROM_SOURCE,
-             pl.lit(CuratedStudyStatus.NEW).alias("status"),
+             pl.when(pl.col(SyncedSummaryStatisticsSchema.SYNCED).is_null())
+             .then(pl.lit(CuratedStudyStatus.NO_SUMSTATS))
+             .otherwise(pl.lit(CuratedStudyStatus.TO_CURATE))
+             .alias("status"),
          )
          logger.debug("New studies identified: {}", new_studies.shape[0])

          # Union of new studies and previously curated studies
-         all_studies = pl.concat([prev_studies, new_studies], how="vertical")
+         all_studies = pl.concat([prev_studies, new_studies_annotated], how="vertical")
+
          logger.debug("All studies after combining new and previous: {}", all_studies.shape[0])

          # Ensure the contract on the output dataframe
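A closing note on the crawler added above: `_fetch_paths` does not use a real glob engine. It splits the pattern on `/`, takes index 2 as the bucket, the middle segments as the `list_blobs` prefix, and the final segment with `*` stripped as a suffix filter. Traced on the pattern from `config.yaml` (pure string handling, no GCS call):

```python
# Editor-added trace of the string handling in
# GCSSummaryStatisticsFileCrawler._fetch_paths.
gcs_glob = "gs://gwas_catalog_inputs/raw_summary_statistics/**.h.tsv.gz"

bucket_name = gcs_glob.split("/")[2]               # 'gwas_catalog_inputs'
prefix = "/".join(gcs_glob.split("/")[3:-1])       # 'raw_summary_statistics'
suffix = gcs_glob.split("/")[-1].replace("*", "")  # '.h.tsv.gz'

print(bucket_name, prefix, suffix)
```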