genal-python 1.4.7__tar.gz → 1.4.9__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43) hide show
  1. {genal_python-1.4.7 → genal_python-1.4.9}/PKG-INFO +3 -2
  2. {genal_python-1.4.7 → genal_python-1.4.9}/README.md +2 -1
  3. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/introduction.md +1 -1
  4. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/methods.md +9 -0
  5. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/workflows.md +2 -0
  6. {genal_python-1.4.7 → genal_python-1.4.9}/genal/Geno.py +22 -14
  7. {genal_python-1.4.7 → genal_python-1.4.9}/genal/__init__.py +1 -1
  8. {genal_python-1.4.7 → genal_python-1.4.9}/genal/colocalization.py +2 -2
  9. {genal_python-1.4.7 → genal_python-1.4.9}/genal/extract_prs.py +107 -46
  10. {genal_python-1.4.7 → genal_python-1.4.9}/genal/geno_tools.py +116 -6
  11. {genal_python-1.4.7 → genal_python-1.4.9}/genal/lift.py +2 -0
  12. {genal_python-1.4.7 → genal_python-1.4.9}/genal/snp_query.py +1 -1
  13. {genal_python-1.4.7 → genal_python-1.4.9}/genal/tools.py +5 -1
  14. {genal_python-1.4.7 → genal_python-1.4.9}/pyproject.toml +1 -1
  15. {genal_python-1.4.7 → genal_python-1.4.9}/.DS_Store +0 -0
  16. {genal_python-1.4.7 → genal_python-1.4.9}/.gitignore +0 -0
  17. {genal_python-1.4.7 → genal_python-1.4.9}/.readthedocs.yaml +0 -0
  18. {genal_python-1.4.7 → genal_python-1.4.9}/Genal_flowchart.png +0 -0
  19. {genal_python-1.4.7 → genal_python-1.4.9}/LICENSE +0 -0
  20. {genal_python-1.4.7 → genal_python-1.4.9}/docs/.DS_Store +0 -0
  21. {genal_python-1.4.7 → genal_python-1.4.9}/docs/Makefile +0 -0
  22. {genal_python-1.4.7 → genal_python-1.4.9}/docs/make.bat +0 -0
  23. {genal_python-1.4.7 → genal_python-1.4.9}/docs/requirements.txt +0 -0
  24. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/.DS_Store +0 -0
  25. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/Images/Genal_flowchart.png +0 -0
  26. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/Images/MR_plot_SBP_AS.png +0 -0
  27. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/Images/genal_logo.png +0 -0
  28. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/api.md +0 -0
  29. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/concepts.md +0 -0
  30. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/conf.py +0 -0
  31. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/faq.md +0 -0
  32. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/index.md +0 -0
  33. {genal_python-1.4.7 → genal_python-1.4.9}/docs/source/setup.md +0 -0
  34. {genal_python-1.4.7 → genal_python-1.4.9}/genal/MR.py +0 -0
  35. {genal_python-1.4.7 → genal_python-1.4.9}/genal/MR_tools.py +0 -0
  36. {genal_python-1.4.7 → genal_python-1.4.9}/genal/MRpresso.py +0 -0
  37. {genal_python-1.4.7 → genal_python-1.4.9}/genal/association.py +0 -0
  38. {genal_python-1.4.7 → genal_python-1.4.9}/genal/clump.py +0 -0
  39. {genal_python-1.4.7 → genal_python-1.4.9}/genal/constants.py +0 -0
  40. {genal_python-1.4.7 → genal_python-1.4.9}/genal/genes.py +0 -0
  41. {genal_python-1.4.7 → genal_python-1.4.9}/genal/proxy.py +0 -0
  42. {genal_python-1.4.7 → genal_python-1.4.9}/genal_logo.png +0 -0
  43. {genal_python-1.4.7 → genal_python-1.4.9}/gitignore +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: genal-python
3
- Version: 1.4.7
3
+ Version: 1.4.9
4
4
  Summary: A python toolkit for polygenic risk scoring and mendelian randomization.
5
5
  Author-email: Cyprien Rivier <riviercyprien@gmail.com>
6
6
  Requires-Python: >=3.8
@@ -218,6 +218,7 @@ What preprocessing typically does (depending on options):
218
218
  - validates types, formats, and values of CHR/POS/EA/NEA/BETA/SE/P/EAF columns
219
219
  - detects OR vs beta columns (and log-transforms OR when needed)
220
220
  - fills missing columns (e.g., rsID from CHR/POS, SE from BETA+P, P from BETA+SE)
221
+ - computes **FSTAT** (F-statistic) from BETA and SE when possible, with a fallback method when only P is present
221
222
  - handles duplicates and invalid rows under `"Fill_delete"`
222
223
 
223
224
  You can inspect the standardized dataset at any time:
@@ -379,7 +380,7 @@ G_adj.association_test(
379
380
  )
380
381
  ```
381
382
 
382
- This updates `SBP_adj.data[["BETA","SE","P"]]` with cohort-specific estimates.
383
+ This updates `G_adj.data[["BETA","SE","P"]]` with cohort-specific estimates and recomputes `FSTAT` to be consistent with the updated values.
383
384
 
384
385
  ### 8) Lift to a different build
385
386
 
@@ -190,6 +190,7 @@ What preprocessing typically does (depending on options):
190
190
  - validates types, formats, and values of CHR/POS/EA/NEA/BETA/SE/P/EAF columns
191
191
  - detects OR vs beta columns (and log-transforms OR when needed)
192
192
  - fills missing columns (e.g., rsID from CHR/POS, SE from BETA+P, P from BETA+SE)
193
+ - computes **FSTAT** (F-statistic) from BETA and SE when possible, with a fallback method when only P is present
193
194
  - handles duplicates and invalid rows under `"Fill_delete"`
194
195
 
195
196
  You can inspect the standardized dataset at any time:
@@ -351,7 +352,7 @@ G_adj.association_test(
351
352
  )
352
353
  ```
353
354
 
354
- This updates `SBP_adj.data[["BETA","SE","P"]]` with cohort-specific estimates.
355
+ This updates `G_adj.data[["BETA","SE","P"]]` with cohort-specific estimates and recomputes `FSTAT` to be consistent with the updated values.
355
356
 
356
357
  ### 8) Lift to a different build
357
358
 
@@ -2,7 +2,7 @@
2
2
 
3
3
  `genal` is a Python toolkit for common GWAS-derived workflows:
4
4
 
5
- - **Preprocess** GWAS summary statistics into a consistent SNP table (column validation, allele checks, optional filling of missing `SNP`/`CHR`/`POS`/`EA`/`NEA`/`SE`/`P` using reference data).
5
+ - **Preprocess** GWAS summary statistics into a consistent SNP table (column validation, allele checks, optional filling of missing `SNP`/`CHR`/`POS`/`EA`/`NEA`/`SE`/`P` using reference data, and computation of per-variant F-statistic `FSTAT`).
6
6
  - **Select instruments** via LD clumping (PLINK 2).
7
7
  - **Compute PRS** on individual-level genotype data (PLINK 2), with optional **proxy SNP** support.
8
8
  - **Run two-sample MR** (multiple estimators + sensitivity analyses), with plotting helpers.
@@ -2,6 +2,15 @@
2
2
 
3
3
  This page documents the main statistical models implemented in `genal`. It is intentionally brief and focuses on what is implemented in the codebase.
4
4
 
5
+ ## Per-variant F-statistic (FSTAT)
6
+
7
+ Implementation: {py:func}`genal.geno_tools.fill_fstatistic`
8
+
9
+ `genal` computes a per-variant F-statistic (`FSTAT`) during preprocessing and after association testing. This statistic measures instrument strength for each variant.
10
+
11
+ - **Primary:** `FSTAT = (BETA / SE)²` when `BETA` and `SE` are available and `SE > 0`.
12
+ - **Fallback:** `FSTAT = χ²_isf(P, df=1)` (equivalent to `Z²` for a two-sided p-value) when `BETA/SE` are unavailable but `P` is present.
13
+
5
14
  ## MR harmonization
6
15
 
7
16
  MR workflows rely on aligning exposure and outcome effects to the same effect allele.
@@ -59,6 +59,7 @@ Key arguments you commonly tune:
59
59
  - `effect_column="OR"` forces log-transforming odds ratios into betas and adjusts SE accordingly.
60
60
  - `fill_snpids` / `fill_coordinates`: override the default logic if you want to force filling rsIDs from `CHR+POS` or vice-versa.
61
61
  - `keep_indel` / `keep_dups`: keep indels or duplicated IDs (generally you keep these `False` unless you have a reason).
62
  - `fill_f=True`: force recomputation of the F-statistic (`FSTAT`) column even if it already exists. By default, the `FSTAT` column is created if missing; if it already exists, only its missing values are filled.
62
63
 
63
64
  ## 3) Select independent instruments via LD clumping
64
65
 
@@ -237,6 +238,7 @@ What you typically tune / watch:
237
238
  - `covar`: covariate names (must be present in `pheno_df` and numeric; constant covariates are dropped).
238
239
  - `standardize=True`: for quantitative traits; set `False` if you want raw-scale effects.
239
240
  - Variant matching: if `CHR+POS` are present in `G.data`, genal will map to cohort SNP IDs before running PLINK (reduces ID mismatch losses).
241
+ - After updating `BETA`, `SE`, and `P`, the F-statistic (`FSTAT`) column is automatically recomputed to remain consistent with the updated estimates.
240
242
 
241
243
  ### Liftover between builds
242
244
 
@@ -34,6 +34,7 @@ from .geno_tools import (
34
34
  check_beta_column,
35
35
  check_p_column,
36
36
  fill_se_p,
37
+ fill_fstatistic,
37
38
  check_allele_column,
38
39
  check_snp_column,
39
40
  remove_na,
@@ -145,15 +146,9 @@ class Geno:
145
146
  # List to keep track of checks performed
146
147
  self.checks = CHECKS_DICT.copy()
147
148
 
148
- # Set the maximal amount of ram/cpu to be used by the methods and dask chunksize
149
- self.cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", default=os.cpu_count())) - 1
150
- non_hpc_ram_per_cpu = psutil.virtual_memory().total / (
151
- 1024**2 * self.cpus
152
- )
153
- ram_per_cpu = int(
154
- os.environ.get("SLURM_MEM_PER_CPU", default=non_hpc_ram_per_cpu)
155
- )
156
- self.ram = int(ram_per_cpu * self.cpus * 0.8)
149
+ # Set the maximal amount of ram/cpu to be used by the methods
150
+ self.cpus = os.cpu_count() - 1
151
+ self.ram = int(psutil.virtual_memory().total / 1024**2 * 0.8)
157
152
 
158
153
  create_tmp()
159
154
 
@@ -168,6 +163,7 @@ class Geno:
168
163
  keep_dups=None,
169
164
  fill_snpids=None,
170
165
  fill_coordinates=None,
166
+ fill_f=False,
171
167
  ):
172
168
  """
173
169
  Clean and preprocess the main dataframe of Single Nucleotide Polymorphisms (SNP) data.
@@ -187,6 +183,9 @@ class Geno:
187
183
  keep_dups (bool, optional): Determines if rows with duplicate SNP IDs should be kept. If None, defers to preprocessing value. Defaults to None.
188
184
  fill_snpids (bool, optional): Decides if the SNP (rsID) column should be created or replaced based on CHR/POS columns and a reference genome. If None, defers to preprocessing value. Defaults to None.
189
185
  fill_coordinates (bool, optional): Decides if CHR and/or POS should be created or replaced based on SNP column and a reference genome. If None, defers to preprocessing value. Defaults to None.
186
+ fill_f (bool, optional): If True, force recomputation/overwrite of the FSTAT column even if it
187
+ already exists. If False (default), FSTAT is created if missing, or only missing values
188
+ are filled if the column already exists.
190
189
 
191
190
  Note:
192
191
  If you pass a standard reference_panel name (e.g. "EUR_37"), it will be converted to "37".
@@ -256,8 +255,8 @@ class Geno:
256
255
  data = fill_ea_nea(data, self.get_reference_panel(reference_panel))
257
256
 
258
257
  # Convert effect column to Beta estimates if present
259
- if "BETA" in data.columns:
260
- check_beta_column(data, effect_column, preprocessing)
258
+ if "BETA" in data.columns and preprocessing in ['Fill', 'Fill_delete']:
259
+ check_beta_column(data, effect_column)
261
260
  self.checks["BETA"] = True
262
261
 
263
262
  # Ensure P column contains valid values
@@ -269,6 +268,12 @@ class Geno:
269
268
  if preprocessing in ['Fill', 'Fill_delete']:
270
269
  fill_se_p(data)
271
270
 
271
+ # Compute or fill the FSTAT column
272
+ # Under normal preprocessing: compute/create/fill FSTAT
273
+ # If preprocessing == "None": only compute FSTAT when fill_f=True
274
+ if preprocessing in ['Fill', 'Fill_delete'] or fill_f:
275
+ fill_fstatistic(data, overwrite=fill_f)
276
+
272
277
  # Process allele columns
273
278
  for allele_col in ["EA", "NEA"]:
274
279
  check_allele_condition = (allele_col in data.columns) and (
@@ -590,7 +595,7 @@ class Geno:
590
595
  if not self.checks.get("EA"):
591
596
  check_allele_column(data_prs, "EA", keep_indel=False)
592
597
  if not self.checks.get("BETA"):
593
- check_beta_column(data_prs, effect_column=None, preprocessing='Fill_delete')
598
+ check_beta_column(data_prs, effect_column=None)
594
599
 
595
600
  initial_rows = data_prs.shape[0]
596
601
  data_prs.dropna(subset=["SNP", "EA", "BETA"], inplace=True)
@@ -740,8 +745,8 @@ class Geno:
740
745
  to make results more interpretable. Default is True.
741
746
 
742
747
  Returns:
743
- None: Updates the BETA, SE, and P columns of the data attribute based on the results
744
- of the association tests.
748
+ None: Updates the BETA, SE, P, and FSTAT columns of the data attribute based on the
749
+ results of the association tests.
745
750
 
746
751
  Note:
747
752
  This method requires the phenotype to be set using the set_phenotype() function.
@@ -796,6 +801,9 @@ class Geno:
796
801
  if n_updated < n_original:
797
802
  print(f"{n_original - n_updated}({(n_original - n_updated)/n_original*100:.3f}%) SNPs have been removed.")
798
803
 
804
+ # Recompute FSTAT to be consistent with the updated BETA/SE/P values
805
+ fill_fstatistic(updated_data, overwrite=True)
806
+
799
807
  # Update the instance data
800
808
  self.data = updated_data
801
809
  return
@@ -5,7 +5,7 @@ from .geno_tools import Combine_Geno
5
5
  from .genes import filter_by_gene_func
6
6
  from .constants import CONFIG_DIR
7
7
 
8
- __version__ = "1.4.7"
8
+ __version__ = "1.4.9"
9
9
 
10
10
  config_path = os.path.join(CONFIG_DIR, "config.json")
11
11
 
@@ -29,8 +29,8 @@ def coloc_abf_func(data1, data2, trait1_type="quant", trait2_type="quant",
29
29
  """
30
30
 
31
31
  # Ensure that the BETA columns are preprocessed
32
- check_beta_column(data1, 'BETA', 'Fill')
33
- check_beta_column(data2, 'BETA', 'Fill')
32
+ check_beta_column(data1, 'BETA')
33
+ check_beta_column(data2, 'BETA')
34
34
 
35
35
  # Adjust EAF column names before merging in case one of the datasets does not have it
36
36
  if 'EAF' in data1.columns:
@@ -6,6 +6,8 @@ from concurrent.futures import ProcessPoolExecutor
6
6
  from .tools import check_bfiles, check_pfiles, setup_genetic_path, get_plink_path
7
7
 
8
8
 
9
+ MIN_RAM_PER_WORKER_MB = 3500 # Minimum RAM per PLINK process (conservative for large genotype files)
10
+
9
11
  ### ____________________
10
12
  ### PRS functions
11
13
  ### ____________________
@@ -126,7 +128,7 @@ def extract_snps_func(snp_list, name=None, path=None, ram=20000, cpus=4):
126
128
  path, filetype = setup_genetic_path(path)
127
129
 
128
130
  # Prepare the SNP list
129
- snp_list = snp_list.dropna()
131
+ snp_list = snp_list.dropna().drop_duplicates()
130
132
  snp_list_name = f"{name}_list.txt"
131
133
  snp_list_path = os.path.join("tmp_GENAL", snp_list_name)
132
134
  snp_list.to_csv(snp_list_path, sep=" ", index=False, header=None)
@@ -136,16 +138,29 @@ def extract_snps_func(snp_list, name=None, path=None, ram=20000, cpus=4):
136
138
  filetype_split = "split" if "$" in path else "combined"
137
139
 
138
140
  output_path = os.path.join("tmp_GENAL", f"{name}_allchr")
141
+
142
+ # Guard against empty SNP list (applies to both split and combined)
143
+ if nrow == 0:
144
+ print("The SNP list is empty after deduplication.")
145
+ return "FAILED"
146
+
139
147
  if filetype_split == "split":
140
- ram_estimate_per_cpu = nrow/(1.5*10**2)
141
- n_cpus = max(1, int(ram // ram_estimate_per_cpu))
142
- workers = min(n_cpus, cpus)
148
+ # Calculate workers based on memory budget (not SNP count)
149
+ max_workers_by_ram = max(1, int(ram // MIN_RAM_PER_WORKER_MB))
150
+ workers = max(1, min(max_workers_by_ram, cpus, 22)) # Cap at 22 chromosomes, min 1
151
+
152
+ # Allocate RAM per worker
153
+ per_worker_ram = int(ram // workers)
154
+
155
+ #print(f"Parallelizing extraction across {workers} workers with {per_worker_ram}MB RAM each")
156
+
143
157
  merge_command, bedlist_path = extract_snps_from_split_data(
144
- name, path, output_path, snp_list_path, filetype, workers=workers
158
+ name, path, output_path, snp_list_path, filetype,
159
+ workers=workers, per_worker_ram=per_worker_ram, ram=ram
145
160
  )
146
161
  handle_multiallelic_variants(name, merge_command, bedlist_path)
147
162
  else:
148
- extract_snps_from_combined_data(name, path, output_path, snp_list_path, filetype)
163
+ extract_snps_from_combined_data(name, path, output_path, snp_list_path, filetype, ram=ram)
149
164
 
150
165
  #Check that at least 1 variant has been extracted. If not, return "FAILED" to warn downstream functions (prs, association_test)
151
166
  log_path = output_path + ".log"
@@ -164,7 +179,7 @@ def extract_snps_func(snp_list, name=None, path=None, ram=20000, cpus=4):
164
179
  return output_path
165
180
 
166
181
 
167
- def extract_command_parallel(task_id, name, path, snp_list_path, filetype):
182
+ def extract_command_parallel(task_id, name, path, snp_list_path, filetype, per_worker_ram=4000):
168
183
  """
169
184
  Helper function to run SNP extraction in parallel for different chromosomes.
170
185
  Args:
@@ -173,8 +188,11 @@ def extract_command_parallel(task_id, name, path, snp_list_path, filetype):
173
188
  path (str): Path to the data set.
174
189
  snp_list_path (str): Path to the list of SNPs to extract.
175
190
  filetype (str): Type of genetic files ("bed" or "pgen")
191
+ per_worker_ram (int): RAM limit in MB for this PLINK process.
176
192
  Returns:
177
193
  int: Returns the task_id if no valid files are found.
194
+ dict: Returns error dict {'failed': True, 'chr': task_id, ...} if extraction fails.
195
+ None: Returns None on success.
178
196
  """
179
197
  input_path = path.replace("$", str(task_id))
180
198
 
@@ -185,65 +203,108 @@ def extract_command_parallel(task_id, name, path, snp_list_path, filetype):
185
203
  return task_id
186
204
 
187
205
  output_path = os.path.join("tmp_GENAL", f"{name}_extract_chr{task_id}")
188
-
206
+
189
207
  # Build command based on filetype
190
208
  base_cmd = f"{get_plink_path()}"
191
209
  if filetype == "bed":
192
210
  base_cmd += f" --bfile {input_path}"
193
211
  else: # pgen
194
212
  base_cmd += f" --pfile {input_path}"
195
-
196
- command = f"{base_cmd} --extract {snp_list_path} --rm-dup force-first --make-pgen --out {output_path}"
197
-
198
- subprocess.run(
213
+
214
+ command = f"{base_cmd} --extract {snp_list_path} --memory {per_worker_ram} --threads 1 --rm-dup force-first --make-pgen --out {output_path}"
215
+
216
+ result = subprocess.run(
199
217
  command, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
200
218
  )
201
219
 
220
+ # Check for failures and return diagnostic info (diagnostics are in .log file)
221
+ if result.returncode != 0:
222
+ return {'failed': True, 'chr': task_id, 'log': f"{output_path}.log", 'returncode': result.returncode}
202
223
 
203
- def create_bedlist(bedlist_path, output_name, not_found):
204
- """
205
- Creates a bedlist file for SNP extraction.
206
- Args:
207
- bedlist_path (str): Path to save the bedlist file.
208
- output_name (str): Base name for the output files.
209
- not_found (List[int]): List of chromosome numbers for which no files were found.
210
- """
211
- with open(bedlist_path, "w+") as bedlist_file:
212
- found = []
213
- for i in range(1, 23):
214
- if i in not_found:
215
- print(f"bed/bim/fam or pgen/pvar/psam files not found for chr{i}.")
216
- elif check_pfiles(f"{output_name}_chr{i}"):
217
- bedlist_file.write(f"{output_name}_chr{i}\n")
218
- found.append(i)
219
- print(f"SNPs extracted for chr{i}.")
220
- else:
221
- print(f"No SNPs extracted for chr{i}.")
222
- return found
224
+ # Also check if output files were created
225
+ if not check_pfiles(output_path):
226
+ return {'failed': True, 'chr': task_id, 'log': f"{output_path}.log", 'returncode': -1}
227
+
228
+ return None # Success
223
229
 
224
230
 
225
- def extract_snps_from_split_data(name, path, output_path, snp_list_path, filetype, workers=4):
231
+ def extract_snps_from_split_data(name, path, output_path, snp_list_path, filetype, workers=4, per_worker_ram=4000, ram=20000):
226
232
  """Extract SNPs from data split by chromosome."""
227
233
  print("Extracting SNPs for each chromosome...")
228
234
  num_tasks = 22
229
235
  partial_extract_command_parallel = partial(
230
- extract_command_parallel,
231
- name=name,
232
- path=path,
236
+ extract_command_parallel,
237
+ name=name,
238
+ path=path,
233
239
  snp_list_path=snp_list_path,
234
- filetype=filetype
240
+ filetype=filetype,
241
+ per_worker_ram=per_worker_ram
235
242
  ) # Wrapper function
243
+
244
+ # First attempt with calculated workers
245
+ results = []
236
246
  with ProcessPoolExecutor(max_workers=workers) as executor:
237
- not_found = list(
247
+ results = list(
238
248
  executor.map(partial_extract_command_parallel, range(1, num_tasks + 1))
239
249
  )
240
250
 
251
+ # Check for failures (non-None returns indicate errors)
252
+ failed_chrs = [r for r in results if r is not None and isinstance(r, dict) and r.get('failed')]
253
+ not_found = [r for r in results if r is not None and not isinstance(r, dict)]
254
+
255
+ # Retry failed chromosomes with reduced workers if any failures occurred
256
+ if failed_chrs and workers > 1:
257
+ print(f"{len(failed_chrs)} chromosome(s) failed. Retrying with reduced parallelization...")
258
+ retry_workers = max(1, workers // 2)
259
+ # Recalculate RAM per worker based on original total budget
260
+ total_ram_budget = per_worker_ram * workers
261
+ per_worker_ram_retry = int(total_ram_budget // retry_workers)
262
+
263
+ partial_retry = partial(
264
+ extract_command_parallel,
265
+ name=name,
266
+ path=path,
267
+ snp_list_path=snp_list_path,
268
+ filetype=filetype,
269
+ per_worker_ram=per_worker_ram_retry
270
+ )
271
+
272
+ failed_chr_ids = [r['chr'] for r in failed_chrs]
273
+ with ProcessPoolExecutor(max_workers=retry_workers) as executor:
274
+ retry_results = list(executor.map(partial_retry, failed_chr_ids))
275
+
276
+ # Update results - surface errors for persistent failures
277
+ for orig_id, retry_result in zip(failed_chr_ids, retry_results):
278
+ if retry_result is not None and isinstance(retry_result, dict) and retry_result.get('failed'):
279
+ # Still failed - surface the error
280
+ log_path = os.path.join("tmp_GENAL", f"{name}_extract_chr{orig_id}.log")
281
+ if os.path.exists(log_path):
282
+ print(f"Chr{orig_id} failed after retry. Check log: {log_path}")
283
+ try:
284
+ with open(log_path, 'r') as f:
285
+ lines = f.readlines()
286
+ print(f"Last 10 lines of log:\n{''.join(lines[-10:])}")
287
+ except Exception:
288
+ pass
289
+
241
290
  # Merge extracted SNPs from each chromosome
242
291
  bedlist_name = f"{name}_bedlist.txt"
243
292
  bedlist_path = os.path.join("tmp_GENAL", bedlist_name)
244
- found = create_bedlist(
245
- bedlist_path, os.path.join("tmp_GENAL", f"{name}_extract"), not_found
246
- )
293
+
294
+ # Create the bedlist file
295
+ output_name = os.path.join("tmp_GENAL", f"{name}_extract")
296
+ with open(bedlist_path, "w+") as bedlist_file:
297
+ found = []
298
+ for i in range(1, 23):
299
+ if i in not_found:
300
+ print(f"bed/bim/fam or pgen/pvar/psam files not found for chr{i}.")
301
+ elif check_pfiles(f"{output_name}_chr{i}"):
302
+ bedlist_file.write(f"{output_name}_chr{i}\n")
303
+ found.append(i)
304
+ print(f"SNPs extracted for chr{i}.")
305
+ else:
306
+ print(f"No SNPs extracted for chr{i}.")
307
+
247
308
  if len(found) == 0:
248
309
  raise Warning("No SNPs were extracted from any chromosome.")
249
310
 
@@ -255,7 +316,7 @@ def extract_snps_from_split_data(name, path, output_path, snp_list_path, filetyp
255
316
  return None, bedlist_path
256
317
 
257
318
  print("Merging SNPs extracted from each chromosome...")
258
- merge_command = f"{get_plink_path()} --pmerge-list {bedlist_path} pfile --out {output_path}"
319
+ merge_command = f"{get_plink_path()} --memory {ram} --pmerge-list {bedlist_path} pfile --out {output_path}"
259
320
  try:
260
321
  subprocess.run(
261
322
  merge_command, shell=True, capture_output=True, text=True, check=True
@@ -269,18 +330,18 @@ def extract_snps_from_split_data(name, path, output_path, snp_list_path, filetyp
269
330
  return merge_command, bedlist_path
270
331
 
271
332
 
272
- def extract_snps_from_combined_data(name, path, output_path, snp_list_path, filetype):
333
+ def extract_snps_from_combined_data(name, path, output_path, snp_list_path, filetype, ram=20000):
273
334
  """Extract SNPs from combined data."""
274
335
  print("Extracting SNPs...")
275
-
336
+
276
337
  # Build command based on filetype
277
338
  base_cmd = f"{get_plink_path()}"
278
339
  if filetype == "bed":
279
340
  base_cmd += f" --bfile {path}"
280
341
  else: # pgen
281
342
  base_cmd += f" --pfile {path}"
282
-
283
- extract_command = f"{base_cmd} --extract {snp_list_path} --rm-dup force-first --make-pgen --out {output_path}"
343
+
344
+ extract_command = f"{base_cmd} --memory {ram} --extract {snp_list_path} --rm-dup force-first --make-pgen --out {output_path}"
284
345
 
285
346
  subprocess.run(
286
347
  extract_command,
@@ -68,10 +68,21 @@ def check_allele_column(data, allele_col, keep_indel):
68
68
 
69
69
  def fill_se_p(data):
70
70
  """If either P or SE is missing but the other and BETA are present, fill it."""
71
+ # Ensure SE is numeric and non-negative
72
+ if ("SE" in data.columns):
73
+ data["SE"] = pd.to_numeric(data["SE"], errors="coerce")
74
+ data.loc[data["SE"] < 0, "SE"] = np.nan
75
+ n_missing = data["SE"].isna().sum()
76
+ if n_missing > 0:
77
+ print(
78
+ f"{n_missing}({n_missing/data.shape[0]*100:.3f}%) values in the SE column have been set to nan for being missing, negative or non-numeric."
79
+ )
71
80
  # If SE is missing
72
81
  if ("P" in data.columns) & ("BETA" in data.columns) & ("SE" not in data.columns):
73
- data["SE"] = np.where(
74
- data["P"] < 1, np.abs(data.BETA / st.norm.ppf(data.P / 2)), 0
82
+ data["SE"] = np.select(
83
+ [data["P"] == 0, data["P"] >= 1],
84
+ [0, np.nan],
85
+ default=np.abs(data.BETA / st.norm.ppf(data.P / 2)),
75
86
  )
76
87
  print("The SE (Standard Error) column has been created.")
77
88
  # If P is missing
@@ -83,6 +94,104 @@ def fill_se_p(data):
83
94
  return
84
95
 
85
96
 
97
def fill_fstatistic(data, overwrite=False):
    """
    Compute or fill the per-variant F-statistic (FSTAT) column.

    The F-statistic is computed as:
        - Primary: FSTAT = (BETA / SE)**2 when BETA and SE are available and SE >= 0.
          For SE = 0 (extremely significant variants), FSTAT is inf.
        - Fallback: FSTAT = chi2.isf(P, df=1) when BETA/SE are unavailable but P is
          present (equivalent to Z**2 for a two-sided p-value). P = 0 produces inf.

    Args:
        data (pd.DataFrame): SNP-level DataFrame. Modified in place.
        overwrite (bool): If False (default), only fill missing FSTAT values if the
            column exists; if it doesn't exist, create it. If True, recompute FSTAT
            for all rows where computable, overwriting existing values; rows that are
            not computable become NaN.

    Returns:
        None: Modifies data in place.

    Note:
        FSTAT is NOT added to STANDARD_COLUMNS to avoid row deletion due to missing
        values.
    """
    nrows = data.shape[0]
    column_created = False

    # Determine which rows need computation.
    if "FSTAT" not in data.columns:
        data["FSTAT"] = np.nan
        column_created = True
        rows_to_compute = pd.Series(True, index=data.index)
    elif overwrite:
        # Clear FSTAT first so non-computable rows become NaN (not stale values).
        data["FSTAT"] = np.nan
        rows_to_compute = pd.Series(True, index=data.index)
    else:
        # Only compute for rows with missing FSTAT.
        rows_to_compute = data["FSTAT"].isna()

    if not rows_to_compute.any():
        return

    # Track how many values are assigned and which routes were actually used,
    # so the log message reflects reality (previously, `method` could be unbound
    # when no route applied, or report "P-values" even when only BETA/SE was used).
    n_assigned = 0
    methods_used = []

    # Primary route: FSTAT = (BETA / SE)**2 (SE = 0 produces inf).
    beta_se_computable = pd.Series(False, index=data.index)
    if "BETA" in data.columns and "SE" in data.columns:
        beta_se_computable = (
            rows_to_compute
            & data["BETA"].notna()
            & data["SE"].notna()
            & (data["SE"] >= 0)
        )
        if beta_se_computable.any():
            data.loc[beta_se_computable, "FSTAT"] = (
                data.loc[beta_se_computable, "BETA"]
                / data.loc[beta_se_computable, "SE"]
            ) ** 2
            n_assigned += int(beta_se_computable.sum())
            methods_used.append("BETA/SE")

    # Fallback route: FSTAT = chi2.isf(P, df=1) for remaining rows where P is
    # present. P = 0 is allowed (produces inf for extremely significant variants).
    if "P" in data.columns:
        p_fallback_computable = (
            rows_to_compute
            & ~beta_se_computable
            & data["P"].notna()
            & (data["P"] >= 0)
            & (data["P"] <= 1)
        )
        if p_fallback_computable.any():
            data.loc[p_fallback_computable, "FSTAT"] = st.chi2.isf(
                data.loc[p_fallback_computable, "P"], df=1
            )
            n_assigned += int(p_fallback_computable.sum())
            methods_used.append("P-values")

    # Logging: `method` is always defined, even if no route was applicable.
    method = " and ".join(methods_used) if methods_used else "no available method"
    if column_created:
        print(
            f"The FSTAT (F-statistic) column has been created using {method}. "
            f"{n_assigned}({n_assigned/nrows*100:.3f}%) values computed."
        )
    elif overwrite:
        print(
            f"The FSTAT (F-statistic) column has been re-created using {method}. "
            f"{n_assigned}({n_assigned/nrows*100:.3f}%) values computed."
        )
    else:
        if n_assigned > 0:
            print(
                f"The FSTAT (F-statistic) column: {n_assigned}({n_assigned/nrows*100:.3f}%) "
                f"missing values have been filled using {method}."
            )

    return
86
195
  def check_p_column(data):
87
196
  """Verify that the P column contains numeric values in the range [0,1]. Set inappropriate values to NA."""
88
197
  nrows = data.shape[0]
@@ -96,14 +205,15 @@ def check_p_column(data):
96
205
  return
97
206
 
98
207
 
99
- def check_beta_column(data, effect_column, preprocessing):
208
+ def check_beta_column(data, effect_column):
100
209
  """
101
210
  If the BETA column is a column of odds ratios, log-transform it.
102
211
  If no effect_column argument is specified, determine if the BETA column are beta estimates or odds ratios.
103
212
  """
213
+ # Ensure the BETA column is numeric
214
+ data["BETA"] = pd.to_numeric(data["BETA"], errors="coerce")
215
+
104
216
  if effect_column is None:
105
- if preprocessing == 'None':
106
- return data
107
217
  median = data.BETA.median()
108
218
  has_negative = (data.BETA < 0).any()
109
219
 
@@ -126,7 +236,7 @@ def check_beta_column(data, effect_column, preprocessing):
126
236
  )
127
237
  if effect_column == "OR":
128
238
  data["BETA"] = np.log(data["BETA"].clip(lower=0.01))
129
- data.drop(columns="SE", errors="ignore", inplace=True)
239
+ data.drop(columns=["SE"], errors="ignore", inplace=True)
130
240
  print("The BETA column has been log-transformed to obtain Beta estimates.")
131
241
  return
132
242
 
@@ -51,6 +51,8 @@ def lift_data(
51
51
  # Prepare the data for lifting: handle missing values in CHR, POS columns
52
52
  nrows = data.shape[0]
53
53
  data.dropna(subset=["CHR", "POS"], inplace=True)
54
+ # Remove absurd positions
55
+ data.drop(data[data.POS >= 300_000_000].index, inplace=True)
54
56
  data.reset_index(drop=True, inplace=True)
55
57
  n_na = nrows - data.shape[0]
56
58
  if n_na:
@@ -2,7 +2,7 @@ import aiohttp
2
2
  import asyncio
3
3
  import numpy as np
4
4
  import nest_asyncio
5
- from tqdm.auto import tqdm
5
+ from tqdm import tqdm
6
6
 
7
7
  # Using nest_asyncio to allow execution in notebooks
8
8
  nest_asyncio.apply()
@@ -113,7 +113,11 @@ def create_tmp():
113
113
def delete_tmp():
    """Delete the tmp folder."""
    if not os.path.isdir("tmp_GENAL"):
        print("There is no tmp_GENAL folder to delete in the current directory.")
        return

    def _skip_vanished(func, path, exc_info):
        # Files removed concurrently (e.g. by another process) are harmless;
        # anything else is a real problem and must propagate.
        err = exc_info[1]
        if not isinstance(err, FileNotFoundError):
            raise err

    shutil.rmtree("tmp_GENAL", onerror=_skip_vanished)
    print("The tmp_GENAL folder has been successfully deleted.")
@@ -4,7 +4,7 @@ build-backend = "flit_core.buildapi"
4
4
 
5
5
  [project]
6
6
  name = "genal-python" # Updated name for PyPI
7
- version = "1.4.7"
7
+ version = "1.4.9"
8
8
  authors = [{name = "Cyprien Rivier", email = "riviercyprien@gmail.com"}]
9
9
  description = "A python toolkit for polygenic risk scoring and mendelian randomization."
10
10
  readme = "README.md"
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes