PyPI: gffkit 0.3 (tar.gz) → 0.4.0 (tar.gz)


Files changed (21)

{gffkit-0.3/src/gffkit.egg-info → gffkit-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: gffkit
-Version: 0.3
+Version: 0.4.0
 Summary: Region-aware GFF annotation integration toolkit
 Author: Qunjie Zhang
 License: MIT
@@ -26,11 +26,12 @@ License-File: LICENSE
 # gffkit
 `gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
-It combines three utilities:
+It combines four utilities:
 1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
 2. `complement`: complement/merge annotations, with optional region-swap mode.
 3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
+4. `rename-sort`: rename gene/transcript/child IDs with a prefix and sort the final GFF3.
 ## Installation
@@ -48,20 +49,25 @@ gffkit integrate \
   --annotation-a EviAnn.gff3 \
   --annotation-b ANNEVO.gff3 \
   --outdir gffkit_out \
-  --prefix sample
+  --prefix sample \
+  -t 8
 ```
 Outputs:
 - `gffkit_out/sample.suspicious.tsv`
 - `gffkit_out/sample.merged.gff3`
+- `gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3`
 - `gffkit_out/sample.final.withUTR.gff3`
+- `gffkit_out/sample.final.withUTR.gff3.id_map.tsv`
+In `integrate`, `--prefix sample` is also used for final ID renaming. Final gene IDs are written like `sample_C01g00001`, transcript IDs like `sample_C01g00001.t1`, and child IDs like `sample_C01g00001.t1.exon1`.
 ### Step-by-step usage
 ```bash
 # 1. Detect suspicious merged genes in Annotation A
-gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv
+gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv -t 8
 # 2. Use A as the global reference, but switch to B in suspicious regions
 gffkit complement \
@@ -69,10 +75,31 @@ gffkit complement \
   --add ANNEVO.gff3 \
   --swap_region_tsv suspicious.tsv \
   --swap_region_flank 100 \
-  --output merged.gff3
+  --output merged.gff3 \
+  -t 8
 # 3. Add UTR features
-gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.gff3
+gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.pre_rename.gff3
+# 4. Rename IDs, drop unplaced seqids, and sort the final GFF3
+gffkit rename-sort \
+  -i final.annotation.withUTR.pre_rename.gff3 \
+  -o final.annotation.withUTR.gff3 \
+  --prefix sample
+```
+### Merge three or more annotations
+Use repeated `--add` arguments. Files are merged in the order provided.
+```bash
+gffkit complement \
+  --ref EviAnn.gff3 \
+  --add ANNEVO.gff3 \
+  --add Helixer.gff3 \
+  --add PASA.gff3 \
+  --output merged.multi.gff3 \
+  -t 8
 ```
 ## Command overview
@@ -82,14 +109,50 @@ gffkit --help
 gffkit detect-bridge --help
 gffkit complement --help
 gffkit add-utr --help
+gffkit rename-sort --help
 gffkit integrate --help
 ```
+## Threads
+Version 0.3 and later add `-t/--threads`.
+- `detect-bridge` analyzes genes in parallel.
+- `complement` pre-parses multiple `--add` files in parallel, then merges them in the original command-line order.
+- `integrate` passes the thread count to the detect and complement steps.
+Example:
+```bash
+gffkit integrate --annotation-a EviAnn.gff3 --annotation-b ANNEVO.gff3 -t 16
+```
 ## Annotation integration strategy
 - Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
 - Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
 - UTR features are reconstructed after merging using an exon-minus-CDS strategy.
+- Version 0.4.0 and later run `rename-sort` as the final `integrate` step. The final GFF3 keeps chromosome-mounted records, removes unplaced/scaffold/contig records, sorts features, rewrites `ID`/`Parent`, and writes an ID map next to the output.
+- When multiple tools annotate the same gene locus, the GFF source column is combined with `|`, for example `EviAnn|ANNEVO`.
+## Rename and Sort
+Run this step independently when you already have a merged GFF3:
+```bash
+gffkit rename-sort \
+  -i merged.withUTR.gff3 \
+  -o sample.renamed.sorted.gff3 \
+  --prefix sample \
+  --digits 5 \
+  --keep-old-ids
+```
+This writes `sample.renamed.sorted.gff3` and `sample.renamed.sorted.gff3.id_map.tsv`.
+## Maintainer notes
+When command-line options or behavior changes, update this `README.md` in the versioned package directory before building and uploading to PyPI.
 ## License
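
The ID scheme described in the README text above (`sample_C01g00001`, `sample_C01g00001.t1`, `sample_C01g00001.t1.exon1`) can be sketched as three small formatting helpers. This is a minimal illustration, assuming the format stated in the `rename_sort_gff3.py` docstring later in this diff (`{species}_C{chromosome_number:02d}g{gene_number:05d}`). The function names `gene_id`, `transcript_id`, and `child_id` are hypothetical and are not part of gffkit.

```python
def gene_id(prefix: str, chrom_number: int, gene_number: int, digits: int = 5) -> str:
    """Build a gene ID: prefix, two-digit chromosome code, zero-padded gene counter."""
    return f"{prefix}_C{chrom_number:02d}g{gene_number:0{digits}d}"


def transcript_id(gid: str, index: int) -> str:
    """Transcripts are numbered within their gene: .t1, .t2, ..."""
    return f"{gid}.t{index}"


def child_id(tid: str, feature_type: str, index: int) -> str:
    """Child features are numbered per transcript and per feature type."""
    return f"{tid}.{feature_type}{index}"


if __name__ == "__main__":
    gid = gene_id("sample", chrom_number=1, gene_number=1)
    tid = transcript_id(gid, 1)
    print(gid)
    print(tid)
    print(child_id(tid, "exon", 1))
```

The `digits` parameter mirrors the `--digits` option shown in the README; widening it only changes the zero padding of the gene counter.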

gffkit-0.4.0/README.md ADDED

@@ -0,0 +1,134 @@
+# gffkit
+`gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
+It combines four utilities:
+1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
+2. `complement`: complement/merge annotations, with optional region-swap mode.
+3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
+4. `rename-sort`: rename gene/transcript/child IDs with a prefix and sort the final GFF3.
+## Installation
+```bash
+pip install gffkit
+```
+## Quick start
+### Full integration pipeline
+```bash
+gffkit integrate \
+  --annotation-a EviAnn.gff3 \
+  --annotation-b ANNEVO.gff3 \
+  --outdir gffkit_out \
+  --prefix sample \
+  -t 8
+```
+Outputs:
+- `gffkit_out/sample.suspicious.tsv`
+- `gffkit_out/sample.merged.gff3`
+- `gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3`
+- `gffkit_out/sample.final.withUTR.gff3`
+- `gffkit_out/sample.final.withUTR.gff3.id_map.tsv`
+In `integrate`, `--prefix sample` is also used for final ID renaming. Final gene IDs are written like `sample_C01g00001`, transcript IDs like `sample_C01g00001.t1`, and child IDs like `sample_C01g00001.t1.exon1`.
+### Step-by-step usage
+```bash
+# 1. Detect suspicious merged genes in Annotation A
+gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv -t 8
+# 2. Use A as the global reference, but switch to B in suspicious regions
+gffkit complement \
+  --ref EviAnn.gff3 \
+  --add ANNEVO.gff3 \
+  --swap_region_tsv suspicious.tsv \
+  --swap_region_flank 100 \
+  --output merged.gff3 \
+  -t 8
+# 3. Add UTR features
+gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.pre_rename.gff3
+# 4. Rename IDs, drop unplaced seqids, and sort the final GFF3
+gffkit rename-sort \
+  -i final.annotation.withUTR.pre_rename.gff3 \
+  -o final.annotation.withUTR.gff3 \
+  --prefix sample
+```
+### Merge three or more annotations
+Use repeated `--add` arguments. Files are merged in the order provided.
+```bash
+gffkit complement \
+  --ref EviAnn.gff3 \
+  --add ANNEVO.gff3 \
+  --add Helixer.gff3 \
+  --add PASA.gff3 \
+  --output merged.multi.gff3 \
+  -t 8
+```
+## Command overview
+```bash
+gffkit --help
+gffkit detect-bridge --help
+gffkit complement --help
+gffkit add-utr --help
+gffkit rename-sort --help
+gffkit integrate --help
+```
+## Threads
+Version 0.3 and later add `-t/--threads`.
+- `detect-bridge` analyzes genes in parallel.
+- `complement` pre-parses multiple `--add` files in parallel, then merges them in the original command-line order.
+- `integrate` passes the thread count to the detect and complement steps.
+Example:
+```bash
+gffkit integrate --annotation-a EviAnn.gff3 --annotation-b ANNEVO.gff3 -t 16
+```
+## Annotation integration strategy
+- Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
+- Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
+- UTR features are reconstructed after merging using an exon-minus-CDS strategy.
+- Version 0.4.0 and later run `rename-sort` as the final `integrate` step. The final GFF3 keeps chromosome-mounted records, removes unplaced/scaffold/contig records, sorts features, rewrites `ID`/`Parent`, and writes an ID map next to the output.
+- When multiple tools annotate the same gene locus, the GFF source column is combined with `|`, for example `EviAnn|ANNEVO`.
+## Rename and Sort
+Run this step independently when you already have a merged GFF3:
+```bash
+gffkit rename-sort \
+  -i merged.withUTR.gff3 \
+  -o sample.renamed.sorted.gff3 \
+  --prefix sample \
+  --digits 5 \
+  --keep-old-ids
+```
+This writes `sample.renamed.sorted.gff3` and `sample.renamed.sorted.gff3.id_map.tsv`.
+## Maintainer notes
+When command-line options or behavior changes, update this `README.md` in the versioned package directory before building and uploading to PyPI.
+## License
+MIT License.
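
The Threads section above says that `complement` pre-parses multiple `--add` files in parallel and then merges them in the original command-line order. The `complement` source is not part of this diff, so the following is only a sketch of one way to get that behavior. It assumes a thread pool and a caller-supplied `parse` function; `parse_in_order` is a hypothetical name, and gffkit may use a different mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def parse_in_order(paths: Sequence[str], parse: Callable[[str], T], threads: int = 1) -> List[T]:
    """Parse every path, possibly concurrently, and return results in input order."""
    if threads <= 1 or len(paths) <= 1:
        # Serial fallback: nothing to gain from a pool.
        return [parse(p) for p in paths]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map yields results in the order of `paths`,
        # regardless of which worker finishes first.
        return list(pool.map(parse, paths))


if __name__ == "__main__":
    files = ["ANNEVO.gff3", "Helixer.gff3", "PASA.gff3"]
    # Stand-in parser for the demo: just report the file name length.
    print(parse_in_order(files, lambda p: (p, len(p)), threads=3))
```

Because the result list follows the input order, a later merge step can apply the files sequentially and stay deterministic even when parsing times differ.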

{gffkit-0.3 → gffkit-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "gffkit"
-version = "0.3"
+version = "0.4.0"
 description = "Region-aware GFF annotation integration toolkit"
 readme = "README.md"
 requires-python = ">=3.8"
@@ -38,6 +38,7 @@ gffkit = "gffkit.main:main"
 gffkit-detect-bridge = "gffkit.detect_bridge_merged_genes:main"
 gffkit-complement = "gffkit.complement_annotations:main"
 gffkit-add-utr = "gffkit.add_utr:main"
+gffkit-rename-sort = "gffkit.rename_sort_gff3:main"
 [tool.setuptools]
 package-dir = {"" = "src"}
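
The new `gffkit-rename-sort = "gffkit.rename_sort_gff3:main"` line follows the usual `module:function` form of a console-script entry: the installed command imports the module and calls the named object. The sketch below shows that lookup in general terms. `resolve_entry_point` is a hypothetical helper written for illustration, and the demo uses standard-library targets rather than gffkit itself.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Resolve a 'package.module:attr' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj: Any = importlib.import_module(module_name)
    # Walk dotted attribute paths such as 'Class.method' if present.
    for part in filter(None, attr_path.split(".")):
        obj = getattr(obj, part)
    return obj


if __name__ == "__main__":
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"tool": "rename-sort"}))
```

With gffkit 0.4.0 installed, the same lookup applied to `"gffkit.rename_sort_gff3:main"` would presumably return the `main` function added in this release.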

{gffkit-0.3 → gffkit-0.4.0}/src/gffkit/__init__.py RENAMED

@@ -1,3 +1,3 @@
 """gffkit: region-aware GFF annotation integration utilities."""
-__version__ = "0.3"
+__version__ = "0.4.0"

{gffkit-0.3 → gffkit-0.4.0}/src/gffkit/main.py RENAMED

@@ -49,11 +49,23 @@ def cmd_add_utr(args: argparse.Namespace, extra: List[str]) -> int:
     return _run_legacy_main(mod.main, "gffkit add-utr", cli)
+def cmd_rename_sort(args: argparse.Namespace, extra: List[str]) -> int:
+    from . import rename_sort_gff3 as mod
+    cli = ["-i", args.input, "-o", args.output, "--prefix", args.prefix]
+    if args.digits is not None:
+        cli += ["--digits", str(args.digits)]
+    if args.keep_old_ids:
+        cli.append("--keep-old-ids")
+    cli += extra
+    return _run_legacy_main(mod.main, "gffkit rename-sort", cli)
 def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
-    """Run the full three-step annotation integration workflow."""
+    """Run the full annotation integration workflow."""
     from . import detect_bridge_merged_genes as detect_mod
     from . import complement_annotations as complement_mod
     from . import add_utr as utr_mod
+    from . import rename_sort_gff3 as rename_mod
     outdir = Path(args.outdir)
     outdir.mkdir(parents=True, exist_ok=True)
@@ -61,8 +73,9 @@ def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
     suspicious_tsv = Path(args.suspicious_tsv) if args.suspicious_tsv else outdir / f"{args.prefix}.suspicious.tsv"
     merged_gff = Path(args.merged_gff) if args.merged_gff else outdir / f"{args.prefix}.merged.gff3"
     final_gff = Path(args.output) if args.output else outdir / f"{args.prefix}.final.withUTR.gff3"
+    pre_rename_gff = final_gff.with_suffix(final_gff.suffix + ".pre_rename.gff3")
-    print("[gffkit] Step 1/3: detecting suspicious merged genes", file=sys.stderr)
+    print("[gffkit] Step 1/4: detecting suspicious merged genes", file=sys.stderr)
     detect_cli = [
         "-i", args.annotation_a,
         "-o", str(suspicious_tsv),
@@ -78,7 +91,7 @@ def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
     if ret != 0:
         return ret
-    print("[gffkit] Step 2/3: region-aware annotation merging", file=sys.stderr)
+    print("[gffkit] Step 2/4: region-aware annotation merging", file=sys.stderr)
     complement_cli = [
         "--ref", args.annotation_a,
         "--add", args.annotation_b,
@@ -92,18 +105,33 @@ def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
     if ret != 0:
         return ret
-    print("[gffkit] Step 3/3: adding UTR features", file=sys.stderr)
-    utr_cli = ["-i", str(merged_gff), "-o", str(final_gff), "--id-prefix", args.utr_id_prefix]
+    print("[gffkit] Step 3/4: adding UTR features", file=sys.stderr)
+    utr_cli = ["-i", str(merged_gff), "-o", str(pre_rename_gff), "--id-prefix", args.utr_id_prefix]
     if args.replace_existing_utrs:
         utr_cli.append("--replace-existing-utrs")
     ret = _run_legacy_main(utr_mod.main, "gffkit add-utr", utr_cli)
     if ret != 0:
         return ret
+    print("[gffkit] Step 4/4: renaming IDs and sorting GFF3", file=sys.stderr)
+    rename_cli = [
+        "-i", str(pre_rename_gff),
+        "-o", str(final_gff),
+        "--prefix", args.prefix,
+        "--digits", str(args.rename_digits),
+    ]
+    if args.keep_old_ids:
+        rename_cli.append("--keep-old-ids")
+    ret = _run_legacy_main(rename_mod.main, "gffkit rename-sort", rename_cli)
+    if ret != 0:
+        return ret
     print("[gffkit] Done", file=sys.stderr)
     print(f"[gffkit] suspicious TSV: {suspicious_tsv}", file=sys.stderr)
     print(f"[gffkit] merged GFF3:     {merged_gff}", file=sys.stderr)
+    print(f"[gffkit] pre-rename GFF3: {pre_rename_gff}", file=sys.stderr)
     print(f"[gffkit] final GFF3:      {final_gff}", file=sys.stderr)
+    print(f"[gffkit] ID map TSV:      {final_gff}.id_map.tsv", file=sys.stderr)
     return 0
@@ -146,12 +174,24 @@ def build_parser() -> argparse.ArgumentParser:
     p.add_argument("-o", "--output", required=True, help="Output GFF3 file.")
     p.set_defaults(handler=cmd_add_utr)
+    p = subparsers.add_parser(
+        "rename-sort",
+        help="Rename GFF3 IDs with a prefix, keep chromosome records, and sort features.",
+        description="Rename gene/transcript/child IDs, update Parent links, remove unplaced seqids, and sort GFF3.",
+    )
+    p.add_argument("-i", "--input", required=True, help="Input GFF3/GTF-like file; .gz is supported.")
+    p.add_argument("-o", "--output", required=True, help="Output renamed and sorted GFF3 file; .gz is supported.")
+    p.add_argument("--prefix", required=True, help="Prefix used in new gene IDs, e.g. sample_C01g00001.")
+    p.add_argument("--digits", type=int, default=5, help="Gene number width. Default: 5.")
+    p.add_argument("--keep-old-ids", action="store_true", help="Keep old ID/Parent as old_ID/old_Parent.")
+    p.set_defaults(handler=cmd_rename_sort)
     p = subparsers.add_parser(
         "integrate",
-        help="Run the full A/B region-aware integration workflow.",
+        help="Run the full A/B region-aware integration, UTR, rename, and sort workflow.",
         description=(
             "Detect suspicious merged-gene regions in Annotation A, use Annotation B as the local "
-            "primary reference in those regions, then add UTR features."
+            "primary reference in those regions, add UTR features, then rename and sort the final GFF3."
         ),
     )
     p.add_argument("--annotation-a", "--a", required=True, help="Annotation A: EviAnn/RNA-seq-supported GFF3.")
@@ -173,6 +213,8 @@ def build_parser() -> argparse.ArgumentParser:
     p.add_argument("--size-min", type=int, default=0, help="Minimum CDS size for non-overlapping supplementary roots.")
     p.add_argument("--replace-existing-utrs", action="store_true", help="Remove existing UTRs and recreate them.")
     p.add_argument("--utr-id-prefix", default="gffkit_utr_", help="Prefix for newly created UTR IDs.")
+    p.add_argument("--rename-digits", type=int, default=5, help="Gene number width used by final rename/sort step.")
+    p.add_argument("--keep-old-ids", action="store_true", help="Keep old ID/Parent as old_ID/old_Parent in the final GFF3.")
     p.set_defaults(handler=cmd_integrate)
     return parser
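
The output names that `cmd_integrate` derives in the hunks above can be reproduced in isolation. The sketch below copies the `with_suffix` expression and the `.id_map.tsv` suffix from this diff; the wrapper `derive_outputs` and its dictionary keys are hypothetical and exist only for the illustration.

```python
from pathlib import Path
from typing import Dict


def derive_outputs(outdir: str, prefix: str, output: str = "") -> Dict[str, Path]:
    """Mirror how the final, pre-rename, and ID-map paths relate to each other."""
    final_gff = Path(output) if output else Path(outdir) / f"{prefix}.final.withUTR.gff3"
    # with_suffix replaces only the last suffix, so re-adding it keeps the full name
    # and appends '.pre_rename.gff3' after it.
    pre_rename_gff = final_gff.with_suffix(final_gff.suffix + ".pre_rename.gff3")
    id_map = Path(f"{final_gff}.id_map.tsv")
    return {"final": final_gff, "pre_rename": pre_rename_gff, "id_map": id_map}


if __name__ == "__main__":
    for name, path in derive_outputs("gffkit_out", "sample").items():
        print(f"{name}: {path.as_posix()}")
```

One consequence of this expression is that a final path ending in `.gz` yields a pre-rename path ending in `.gz.pre_rename.gff3`, so the intermediate file name no longer carries a `.gz` extension. Whether that matters depends on how `add-utr` opens its output, which this diff does not show.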

gffkit-0.4.0/src/gffkit/rename_sort_gff3.py ADDED

@@ -0,0 +1,568 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+rename_sort_gff3_v3_chronly.py
+Version: 2026-06-18-chronly
+This version fixes sorting of un/scaffold/contig sequences:
+    chr01/chr1/1/01 are sorted as numbered chromosomes.
+    un001/scaffold01/contig01 are NOT parsed as chromosome 1; they are placed after chromosomes.
+Rename gene/transcript/child IDs in a merged GFF3 and sort it.
+Gene ID format:
+    {species}_C{chromosome_number:02d}g{gene_number:05d}
+Example:
+    EIL31_C01g00001
+Transcript ID format:
+    {gene_id}.t1, {gene_id}.t2 ...
+Child feature ID format:
+    {transcript_id}.{feature_type}{index}
+Example:
+    EIL31_C01g00001.t1.exon1
+    EIL31_C01g00001.t1.CDS1
+    EIL31_C01g00001.t1.five_prime_UTR1
+Main features:
+    - rewrite ID and Parent relationships
+    - remove genes/features on unplaced scaffolds/contigs before renaming
+    - keep most original attributes except old ID/Parent by default
+    - sort genes by chromosome and start coordinate
+    - sort transcripts and child features within each gene
+Usage:
+    python rename_sort_gff3.py -i input.gff3 -o output.renamed.sorted.gff3 --species EIL31
+Optional:
+    python rename_sort_gff3.py -i input.gff3 -o output.gff3 --species EIL31 --keep-old-ids
+"""
+import argparse
+import gzip
+import re
+import sys
+from collections import defaultdict, Counter
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional, Tuple
+TRANSCRIPT_TYPES = {
+    "mrna", "transcript", "lnc_rna", "ncrna", "rrna", "trna",
+    "snrna", "snorna", "mirna", "primary_transcript"
+}
+CHILD_RANK = {
+    "exon": 10,
+    "five_prime_utr": 20,
+    "utr": 25,
+    "cds": 30,
+    "three_prime_utr": 40,
+}
+@dataclass
+class Feature:
+    seqid: str
+    source: str
+    ftype: str
+    start: int
+    end: int
+    score: str
+    strand: str
+    phase: str
+    attrs: Dict[str, List[str]] = field(default_factory=dict)
+    line_no: int = 0
+    def id(self) -> Optional[str]:
+        vals = self.attrs.get("ID")
+        return vals[0] if vals else None
+    def parents(self) -> List[str]:
+        return self.attrs.get("Parent", [])
+    def to_line(self) -> str:
+        return "\t".join([
+            self.seqid, self.source, self.ftype, str(self.start), str(self.end),
+            self.score, self.strand, self.phase, format_attrs(self.attrs)
+        ])
+def open_text(path: str, mode: str = "rt"):
+    if path == "-":
+        return sys.stdin if "r" in mode else sys.stdout
+    if path.endswith(".gz"):
+        return gzip.open(path, mode)
+    return open(path, mode, encoding="utf-8")
+def parse_attrs(s: str) -> Dict[str, List[str]]:
+    attrs: Dict[str, List[str]] = {}
+    s = s.strip()
+    if not s or s == ".":
+        return attrs
+    for part in [x.strip() for x in s.rstrip(";").split(";") if x.strip()]:
+        if "=" in part:
+            k, v = part.split("=", 1)
+            attrs[k.strip()] = [x.strip() for x in v.split(",") if x.strip()]
+        else:
+            # GTF-like: gene_id "xxx"; transcript_id "yyy";
+            fields = part.split(None, 1)
+            if len(fields) == 2:
+                attrs[fields[0].strip()] = [fields[1].strip().strip('"')]
+    return attrs
+def format_attrs(attrs: Dict[str, List[str]]) -> str:
+    if not attrs:
+        return "."
+    preferred = [
+        "ID", "Parent", "Name",
+        "old_ID", "old_Parent",
+        "gene_id", "transcript_id",
+        "geneID", "gene_biotype", "Class", "Evidence",
+        "EvidenceProteinID", "EvidenceTranscriptID",
+        "Num_exons", "StartCodon", "StopCodon",
+    ]
+    keys = [k for k in preferred if k in attrs]
+    keys.extend(sorted(k for k in attrs if k not in keys))
+    out = []
+    for k in keys:
+        vals = attrs.get(k, [])
+        if vals:
+            out.append(f"{k}={','.join(vals)}")
+    return ";".join(out) if out else "."
+def read_gff(path: str) -> Tuple[List[str], List[Feature]]:
+    headers, features = [], []
+    with open_text(path, "rt") as fh:
+        for i, line in enumerate(fh, 1):
+            line = line.rstrip("\n")
+            if not line:
+                continue
+            if line.startswith("#"):
+                # Do not keep embedded FASTA sequence. This script is for annotation table only.
+                if line.startswith("##FASTA"):
+                    break
+                headers.append(line)
+                continue
+            cols = line.split("\t")
+            if len(cols) != 9:
+                print(f"[WARN] skip line {i}: not 9 columns", file=sys.stderr)
+                continue
+            try:
+                start, end = int(cols[3]), int(cols[4])
+            except ValueError:
+                print(f"[WARN] skip line {i}: start/end not integer", file=sys.stderr)
+                continue
+            features.append(Feature(
+                seqid=cols[0], source=cols[1], ftype=cols[2],
+                start=start, end=end, score=cols[5], strand=cols[6],
+                phase=cols[7], attrs=parse_attrs(cols[8]), line_no=i
+            ))
+    return headers, features
+def _numbered_chr_match(seqid: str) -> Optional[int]:
+    """
+    Return chromosome number only for true chromosome names.
+    Accepted as numbered chromosomes:
+        chr01, Chr01, chr1, 1, 01
+    NOT accepted as numbered chromosomes:
+        un001, scaffold01, contig_01, chr01_random
+    The important point is that names such as un001 must not be parsed as C01.
+    """
+    low = seqid.strip().lower()
+    # Pure numeric sequence names: 1, 01, 001
+    if re.fullmatch(r"0*[0-9]+", low):
+        return int(low)
+    # Standard chromosome names only: chr1, chr01, chromosome1, chromosome01
+    m = re.fullmatch(r"(?:chr|chromosome)0*([0-9]+)", low)
+    if m:
+        return int(m.group(1))
+    return None
+def _un_number(seqid: str) -> int:
+    """Return numeric suffix for un001-like names, only for sorting among un contigs."""
+    m = re.search(r"([0-9]+)$", seqid.strip())
+    return int(m.group(1)) if m else 10**9
+def chr_key(seqid: str) -> Tuple[int, int, str]:
+    """
+    Sorting priority:
+        0: numbered chromosomes, chr01 -> chr10
+        1: sex chromosomes, if present
+        2: organelle chromosomes, if present
+        9: un/scaffold/contig/other sequences after all chromosomes
+    """
+    s = seqid.strip()
+    low = s.lower()
+    n = _numbered_chr_match(s)
+    if n is not None:
+        return (0, n, s)
+    if low in {"chrx", "x"}:
+        return (1, 23, s)
+    if low in {"chry", "y"}:
+        return (1, 24, s)
+    if low in {"chrm", "mt", "m", "mitochondria"}:
+        return (2, 1, s)
+    if low in {"chrc", "pt", "chloroplast"}:
+        return (2, 2, s)
+    # Put un/scaffold/contig after numbered chromosomes.
+    # Sort un001, un002, ... naturally by their numeric suffix.
+    return (9, _un_number(s), s)
+def is_chromosome_seqid(seqid: str) -> bool:
+    """
+    Return True for chromosome-mounted sequence names recognized by chr_key().
+    Numbered chromosomes such as chr01/chr1/1/01 are kept. Unplaced names such
+    as un001/scaffold01/contig01 and unknown seqids are removed.
+    """
+    return chr_key(seqid)[0] in {0, 1, 2}
+def filter_chromosome_features(features: List[Feature]) -> Tuple[List[Feature], Counter]:
+    """Drop all records whose seqid is not recognized as chromosome-mounted."""
+    stats = Counter()
+    kept: List[Feature] = []
+    for f in features:
+        stats["features_total"] += 1
+        if f.ftype.lower() == "gene":
+            stats["genes_total"] += 1
+        if is_chromosome_seqid(f.seqid):
+            kept.append(f)
+            stats["features_kept"] += 1
+            if f.ftype.lower() == "gene":
+                stats["genes_kept"] += 1
+        else:
+            stats["features_removed"] += 1
+            stats[f"removed_seqid:{f.seqid}"] += 1
+            if f.ftype.lower() == "gene":
+                stats["genes_removed"] += 1
+    return kept, stats
+def chr_code(seqid: str) -> str:
+    """
+    chr01 -> C01
+    chr1  -> C01
+    1     -> C01
+    un001 -> Cun001
+    scaffold01 -> Cscaffold01
+    """
+    n = _numbered_chr_match(seqid)
+    if n is not None:
+        return f"C{n:02d}"
+    clean = re.sub(r"[^A-Za-z0-9]+", "", seqid.strip())
+    return f"C{clean}"
+def is_transcript_type(ftype: str) -> bool:
+    return ftype.lower() in TRANSCRIPT_TYPES
+def feature_sort_key(f: Feature) -> Tuple:
+    return (
+        chr_key(f.seqid),
+        f.start,
+        CHILD_RANK.get(f.ftype.lower(), 100),
+        f.end,
+        f.line_no,
+    )
+def update_id_attr(f: Feature, new_id: Optional[str], keep_old: bool):
+    old = f.id()
+    if old and keep_old and old != new_id:
+        f.attrs["old_ID"] = [old]
+    if new_id is None:
+        f.attrs.pop("ID", None)
+    else:
+        f.attrs["ID"] = [new_id]
+def update_parent_attr(f: Feature, new_parents: List[str], keep_old: bool):
+    old = f.parents()
+    if old and keep_old and old != new_parents:
+        f.attrs["old_Parent"] = old
+    if new_parents:
+        f.attrs["Parent"] = new_parents
+    else:
+        f.attrs.pop("Parent", None)
+def write_id_map(path: str, rows: List[Dict[str, str]]):
+    fields = [
+        "feature_type",
+        "old_id",
+        "new_id",
+        "old_parent",
+        "new_parent",
+        "seqid",
+        "start",
+        "end",
+        "strand",
+    ]
+    with open_text(path, "wt") as out:
+        print("\t".join(fields), file=out)
+        for row in rows:
+            print("\t".join(row.get(k, "") for k in fields), file=out)
+def main():
+    ap = argparse.ArgumentParser(
+        description="Keep chromosome-mounted records, rename gene IDs to species_Cxxgxxxxx format, and sort GFF3."
+    )
+    ap.add_argument("-i", "--input", required=True, help="Input GFF3/GTF-like file, .gz supported")
+    ap.add_argument("-o", "--output", required=True, help="Output GFF3 file, .gz supported")
+    ap.add_argument(
+        "--species", "--prefix",
+        dest="species",
+        required=True,
+        help="Species/accession prefix used for renamed IDs, e.g. EIL31"
+    )
+    ap.add_argument("--digits", type=int, default=5, help="Gene number width, default 5")
+    ap.add_argument("--keep-old-ids", action="store_true", help="Keep old ID/Parent as old_ID/old_Parent")
+    args = ap.parse_args()
+    headers, features = read_gff(args.input)
+    features, filter_stats = filter_chromosome_features(features)
+    id_to_feature = {}
+    for f in features:
+        fid = f.id()
+        if fid:
+            if fid in id_to_feature:
+                print(f"[WARN] duplicated ID found: {fid}", file=sys.stderr)
+            id_to_feature[fid] = f
+    genes = [f for f in features if f.ftype.lower() == "gene"]
+    genes_sorted = sorted(genes, key=lambda f: (chr_key(f.seqid), f.start, f.end, f.line_no))
+    old_gene_to_new = {}
+    gene_counter_by_chr = Counter()
+    for g in genes_sorted:
+        old_gid = g.id()
+        if not old_gid:
+            # give an internal key for rare gene records without ID
+            old_gid = f"__gene_without_ID_line_{g.line_no}"
+            g.attrs["ID"] = [old_gid]
+        ccode = chr_code(g.seqid)
+        gene_counter_by_chr[ccode] += 1
+        new_gid = f"{args.species}_{ccode}g{gene_counter_by_chr[ccode]:0{args.digits}d}"
+        old_gene_to_new[old_gid] = new_gid
+    # transcript old ID -> new ID
+    old_tx_to_new = {}
+    tx_by_gene = defaultdict(list)
+    for f in features:
+        if is_transcript_type(f.ftype):
+            parents = f.parents()
+            parent_gene_old = parents[0] if parents else None
+            if parent_gene_old in old_gene_to_new:
+                tx_by_gene[parent_gene_old].append(f)
+    for old_gid, txs in tx_by_gene.items():
+        txs_sorted = sorted(txs, key=lambda f: (f.start, f.end, f.line_no))
+        new_gid = old_gene_to_new[old_gid]
+        for i, tx in enumerate(txs_sorted, 1):
+            old_tid = tx.id()
+            if not old_tid:
+                old_tid = f"__tx_without_ID_line_{tx.line_no}"
+                tx.attrs["ID"] = [old_tid]
+            old_tx_to_new[old_tid] = f"{new_gid}.t{i}"
+    # Rename genes
+    id_map_rows = []
+    for g in genes:
+        old_gid = g.id()
+        new_gid = old_gene_to_new.get(old_gid)
+        if new_gid:
+            id_map_rows.append({
+                "feature_type": g.ftype,
+                "old_id": old_gid or "",
+                "new_id": new_gid,
+                "old_parent": "",
+                "new_parent": "",
+                "seqid": g.seqid,
+                "start": str(g.start),
+                "end": str(g.end),
+                "strand": g.strand,
+            })
+            update_id_attr(g, new_gid, args.keep_old_ids)
+            # Standardize common gene-level aliases
+            g.attrs["Name"] = [new_gid]
+            if "gene_id" in g.attrs:
+                g.attrs["gene_id"] = [new_gid]
+            if "geneID" in g.attrs:
+                g.attrs["geneID"] = [new_gid]
+    # Rename transcripts
+    for txs in tx_by_gene.values():
+        for tx in txs:
+            old_tid = tx.id()
+            new_tid = old_tx_to_new.get(old_tid)
+            old_parent = tx.parents()[0] if tx.parents() else None
+            new_parent = old_gene_to_new.get(old_parent)
+            if new_tid:
+                id_map_rows.append({
+                    "feature_type": tx.ftype,
+                    "old_id": old_tid or "",
+                    "new_id": new_tid,
+                    "old_parent": old_parent or "",
+                    "new_parent": new_parent or "",
+                    "seqid": tx.seqid,
+                    "start": str(tx.start),
+                    "end": str(tx.end),
+                    "strand": tx.strand,
+                })
+                update_id_attr(tx, new_tid, args.keep_old_ids)
+                tx.attrs["Name"] = [new_tid]
+                if "transcript_id" in tx.attrs:
+                    tx.attrs["transcript_id"] = [new_tid]
+            if new_parent:
+                update_parent_attr(tx, [new_parent], args.keep_old_ids)
+                if "gene_id" in tx.attrs:
+                    tx.attrs["gene_id"] = [new_parent]
+                if "geneID" in tx.attrs:
+                    tx.attrs["geneID"] = [new_parent]
+    # Rename child features whose Parent is a transcript
+    child_count_by_tx_type = Counter()
+    for f in sorted(features, key=feature_sort_key):
+        if f.ftype.lower() == "gene" or is_transcript_type(f.ftype):
+            continue
+        old_parents = f.parents()
+        new_parents = [old_tx_to_new[p] for p in old_parents if p in old_tx_to_new]
+        if new_parents:
+            update_parent_attr(f, new_parents, args.keep_old_ids)
+            # Most exon/CDS lines in your merged GFF have no ID; this creates stable IDs.
+            # If one child has multiple parents, use the first parent for its own ID.
+            p0 = new_parents[0]
+            key = (p0, f.ftype)
+            child_count_by_tx_type[key] += 1
+            safe_type = re.sub(r'[^A-Za-z0-9_]+', '_', f.ftype)
+            new_child_id = f"{p0}.{safe_type}{child_count_by_tx_type[key]}"
+            id_map_rows.append({
+                "feature_type": f.ftype,
+                "old_id": f.id() or "",
+                "new_id": new_child_id,
+                "old_parent": ",".join(old_parents),
+                "new_parent": ",".join(new_parents),
+                "seqid": f.seqid,
+                "start": str(f.start),
+                "end": str(f.end),
+                "strand": f.strand,
+            })
+            update_id_attr(f, new_child_id, args.keep_old_ids)
+            if "transcript_id" in f.attrs:
+                f.attrs["transcript_id"] = [p0]
+            # infer gene_id from transcript ID before .tN
+            gene_new = re.sub(r'\.t[0-9]+$', '', p0)
+            if "gene_id" in f.attrs:
+                f.attrs["gene_id"] = [gene_new]
+            if "geneID" in f.attrs:
+                f.attrs["geneID"] = [gene_new]
+    # Build output order: gene -> transcripts -> transcript children
+    genes_by_new_id = {}
+    txs_by_new_gene = defaultdict(list)
+    children_by_new_tx = defaultdict(list)
+    used_in_hierarchy = set()
+    for f in features:
+        if f.ftype.lower() == "gene" and f.id():
+            genes_by_new_id[f.id()] = f
+    for f in features:
+        if is_transcript_type(f.ftype) and f.parents():
+            txs_by_new_gene[f.parents()[0]].append(f)
+    for f in features:
+        if f.ftype.lower() == "gene" or is_transcript_type(f.ftype):
+            continue
+        for p in f.parents():
+            children_by_new_tx[p].append(f)
+    output_features = []
+    for g in sorted(genes, key=lambda f: (chr_key(f.seqid), f.start, f.end, f.id() or "", f.line_no)):
+        output_features.append(g)
+        used_in_hierarchy.add(id(g))
+        txs = sorted(txs_by_new_gene.get(g.id(), []), key=lambda f: (f.start, f.end, f.line_no))
+        for tx in txs:
+            output_features.append(tx)
+            used_in_hierarchy.add(id(tx))
+            kids = sorted(children_by_new_tx.get(tx.id(), []), key=feature_sort_key)
+            for k in kids:
+                output_features.append(k)
+                used_in_hierarchy.add(id(k))
+    # Keep any orphan/unrecognized features instead of silently dropping them
+    orphans = [f for f in features if id(f) not in used_in_hierarchy]
+    if orphans:
+        print(f"[WARN] {len(orphans)} orphan/unhierarchical features kept at end", file=sys.stderr)
+        output_features.extend(sorted(orphans, key=feature_sort_key))
+    with open_text(args.output, "wt") as out:
+        if not any(h.startswith("##gff-version") for h in headers):
+            print("##gff-version 3", file=out)
+        for h in headers:
+            print(h, file=out)
+        for f in output_features:
+            print(f.to_line(), file=out)
+    id_map_path = args.output + ".id_map.tsv"
+    write_id_map(id_map_path, id_map_rows)
+    removed_seqids = {
+        k.split(":", 1)[1]: v
+        for k, v in filter_stats.items()
+        if k.startswith("removed_seqid:")
+    }
+    top_removed_seqids = dict(sorted(
+        removed_seqids.items(),
+        key=lambda kv: (-kv[1], chr_key(kv[0]), kv[0])
+    )[:20])
+    print("[VERSION] rename_sort_gff3_v3_chronly.py 2026-06-18-chronly", file=sys.stderr)
+    print("[DONE] renamed and sorted GFF3 written to:", args.output, file=sys.stderr)
+    print("[DONE] ID map written to:", id_map_path, file=sys.stderr)
+    print("[INFO] genes kept:", len(genes), file=sys.stderr)
+    print("[INFO] genes removed from unplaced seqids:", filter_stats["genes_removed"], file=sys.stderr)
+    print("[INFO] features kept:", filter_stats["features_kept"], file=sys.stderr)
+    print("[INFO] features removed from unplaced seqids:", filter_stats["features_removed"], file=sys.stderr)
+    if top_removed_seqids:
+        print("[INFO] top removed seqids:", top_removed_seqids, file=sys.stderr)
+    print("[INFO] transcripts:", len(old_tx_to_new), file=sys.stderr)
+    print("[INFO] chromosome gene counts:", dict(gene_counter_by_chr), file=sys.stderr)
+if __name__ == "__main__":
+    main()
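
Note on the child-ID scheme in the hunk above: the numbering can be reproduced in isolation. The sketch below is a simplified stand-in, not the package's code. `assign_child_ids` and its tuple input are hypothetical names introduced for illustration; only the per-(parent, type) counter, the type-sanitizing regex, and the `.tN` stripping are taken from the diff.

```python
import re
from collections import Counter


def assign_child_ids(children):
    """Assign IDs to child features (hypothetical helper).

    children: iterable of (new_transcript_id, feature_type) pairs,
    assumed to be already sorted in output order.
    Returns a list of (new_child_id, inferred_gene_id) pairs.
    """
    counts = Counter()
    out = []
    for parent, ftype in children:
        # One counter per (parent transcript, feature type), as in the diff.
        counts[(parent, ftype)] += 1
        # Replace characters outside [A-Za-z0-9_] so the type is ID-safe.
        safe_type = re.sub(r"[^A-Za-z0-9_]+", "_", ftype)
        child_id = f"{parent}.{safe_type}{counts[(parent, ftype)]}"
        # The gene ID is the transcript ID without its trailing ".tN".
        gene_id = re.sub(r"\.t[0-9]+$", "", parent)
        out.append((child_id, gene_id))
    return out


demo = assign_child_ids([
    ("sample_C01g00001.t1", "exon"),
    ("sample_C01g00001.t1", "CDS"),
    ("sample_C01g00001.t1", "exon"),
])
for child_id, gene_id in demo:
    print(child_id, gene_id)
```

Because the counter is keyed on both parent and type, exons and CDS segments of the same transcript are numbered independently.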

{gffkit-0.3 → gffkit-0.4.0/src/gffkit.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: gffkit
-Version: 0.3
+Version: 0.4.0
 Summary: Region-aware GFF annotation integration toolkit
 Author: Qunjie Zhang
 License: MIT
@@ -26,11 +26,12 @@ License-File: LICENSE
 # gffkit
 `gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
-It combines three utilities:
+It combines four utilities:
 1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
 2. `complement`: complement/merge annotations, with optional region-swap mode.
 3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
+4. `rename-sort`: rename gene/transcript/child IDs with a prefix and sort the final GFF3.
 ## Installation
@@ -48,20 +49,25 @@ gffkit integrate \
   --annotation-a EviAnn.gff3 \
   --annotation-b ANNEVO.gff3 \
   --outdir gffkit_out \
-  --prefix sample
+  --prefix sample \
+  -t 8
 ```
 Outputs:
 - `gffkit_out/sample.suspicious.tsv`
 - `gffkit_out/sample.merged.gff3`
+- `gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3`
 - `gffkit_out/sample.final.withUTR.gff3`
+- `gffkit_out/sample.final.withUTR.gff3.id_map.tsv`
+In `integrate`, `--prefix sample` is also used for final ID renaming. Final gene IDs are written like `sample_C01g00001`, transcript IDs like `sample_C01g00001.t1`, and child IDs like `sample_C01g00001.t1.exon1`.
 ### Step-by-step usage
 ```bash
 # 1. Detect suspicious merged genes in Annotation A
-gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv
+gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv -t 8
 # 2. Use A as the global reference, but switch to B in suspicious regions
 gffkit complement \
@@ -69,10 +75,31 @@ gffkit complement \
   --add ANNEVO.gff3 \
   --swap_region_tsv suspicious.tsv \
   --swap_region_flank 100 \
-  --output merged.gff3
+  --output merged.gff3 \
+  -t 8
 # 3. Add UTR features
-gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.gff3
+gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.pre_rename.gff3
+# 4. Rename IDs, drop unplaced seqids, and sort the final GFF3
+gffkit rename-sort \
+  -i final.annotation.withUTR.pre_rename.gff3 \
+  -o final.annotation.withUTR.gff3 \
+  --prefix sample
+```
+### Merge three or more annotations
+Use repeated `--add` arguments. Files are merged in the order provided.
+```bash
+gffkit complement \
+  --ref EviAnn.gff3 \
+  --add ANNEVO.gff3 \
+  --add Helixer.gff3 \
+  --add PASA.gff3 \
+  --output merged.multi.gff3 \
+  -t 8
 ```
 ## Command overview
@@ -82,14 +109,50 @@ gffkit --help
 gffkit detect-bridge --help
 gffkit complement --help
 gffkit add-utr --help
+gffkit rename-sort --help
 gffkit integrate --help
 ```
+## Threads
+Versions 0.3 and later support `-t/--threads`.
+- `detect-bridge` analyzes genes in parallel.
+- `complement` pre-parses multiple `--add` files in parallel, then merges them in the original command-line order.
+- `integrate` passes the thread count to the detect and complement steps.
+Example:
+```bash
+gffkit integrate --annotation-a EviAnn.gff3 --annotation-b ANNEVO.gff3 -t 16
+```
 ## Annotation integration strategy
 - Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
 - Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
 - UTR features are reconstructed after merging using an exon-minus-CDS strategy.
+- Version 0.4.0 and later run `rename-sort` as the final `integrate` step. The final GFF3 keeps chromosome-mounted records, removes unplaced/scaffold/contig records, sorts features, rewrites `ID`/`Parent`, and writes an ID map next to the output.
+- When multiple tools annotate the same gene locus, the GFF source column is combined with `|`, for example `EviAnn|ANNEVO`.
+## Rename and Sort
+Run this step independently when you already have a merged GFF3:
+```bash
+gffkit rename-sort \
+  -i merged.withUTR.gff3 \
+  -o sample.renamed.sorted.gff3 \
+  --prefix sample \
+  --digits 5 \
+  --keep-old-ids
+```
+This writes `sample.renamed.sorted.gff3` and `sample.renamed.sorted.gff3.id_map.tsv`.
+## Maintainer notes
+When command-line options or behavior changes, update this `README.md` in the versioned package directory before building and uploading to PyPI.
 ## License

{gffkit-0.3 → gffkit-0.4.0}/src/gffkit.egg-info/SOURCES.txt RENAMED

@@ -8,9 +8,11 @@ src/gffkit/add_utr.py
 src/gffkit/complement_annotations.py
 src/gffkit/detect_bridge_merged_genes.py
 src/gffkit/main.py
+src/gffkit/rename_sort_gff3.py
 src/gffkit.egg-info/PKG-INFO
 src/gffkit.egg-info/SOURCES.txt
 src/gffkit.egg-info/dependency_links.txt
 src/gffkit.egg-info/entry_points.txt
 src/gffkit.egg-info/top_level.txt
-tests/test_complement_sources.py
+tests/test_complement_sources.py
+tests/test_rename_sort.py

{gffkit-0.3 → gffkit-0.4.0}/src/gffkit.egg-info/entry_points.txt RENAMED

@@ -3,3 +3,4 @@ gffkit = gffkit.main:main
 gffkit-add-utr = gffkit.add_utr:main
 gffkit-complement = gffkit.complement_annotations:main
 gffkit-detect-bridge = gffkit.detect_bridge_merged_genes:main
+gffkit-rename-sort = gffkit.rename_sort_gff3:main
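
Note on the new entry point: a console-script line of the form `name = module:attr` names a module and a callable inside it. The sketch below shows how such a `module:attr` string can be resolved with the standard library. `resolve_entry_point` is a hypothetical helper, and `json:dumps` stands in for `gffkit.rename_sort_gff3:main` so the example runs without gffkit installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'module:attr' string to the object it names (hypothetical helper)."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target; with gffkit installed the spec would be
# "gffkit.rename_sort_gff3:main" instead.
func = resolve_entry_point("json:dumps")
print(func({"ok": True}))
```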

gffkit-0.4.0/tests/test_rename_sort.py ADDED

@@ -0,0 +1,47 @@
+import sys
+from gffkit import rename_sort_gff3
+def test_rename_sort_uses_prefix_and_writes_id_map(tmp_path):
+    input_gff = tmp_path / "input.gff3"
+    output_gff = tmp_path / "sample.renamed.gff3"
+    input_gff.write_text(
+        "\n".join(
+            [
+                "##gff-version 3",
+                "chr1\tEviAnn|ANNEVO\tgene\t100\t300\t.\t+\t.\tID=old_gene",
+                "chr1\tEviAnn|ANNEVO\tmRNA\t100\t300\t.\t+\t.\tID=old_tx;Parent=old_gene",
+                "chr1\tEviAnn|ANNEVO\texon\t100\t300\t.\t+\t.\tParent=old_tx",
+                "scaffold01\tEviAnn\tgene\t1\t100\t.\t+\t.\tID=drop_me",
+                "",
+            ]
+        ),
+        encoding="utf-8",
+    )
+    old_argv = sys.argv[:]
+    sys.argv = [
+        "gffkit rename-sort",
+        "-i",
+        str(input_gff),
+        "-o",
+        str(output_gff),
+        "--prefix",
+        "sample",
+    ]
+    try:
+        rename_sort_gff3.main()
+    finally:
+        sys.argv = old_argv
+    text = output_gff.read_text(encoding="utf-8")
+    id_map = output_gff.with_name(output_gff.name + ".id_map.tsv")
+    assert "ID=sample_C01g00001" in text
+    assert "ID=sample_C01g00001.t1;Parent=sample_C01g00001" in text
+    assert "ID=sample_C01g00001.t1.exon1;Parent=sample_C01g00001.t1" in text
+    assert "drop_me" not in text
+    assert id_map.exists()
+    assert "old_gene\tsample_C01g00001" in id_map.read_text(encoding="utf-8")
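
Note on consuming the ID map: the test above only checks that `old_gene` and `sample_C01g00001` appear in adjacent tab-separated columns. The sketch below builds an old-to-new lookup from such a file. It assumes, without confirmation from this diff, that the TSV has a header row with `old_id` and `new_id` columns (the key names used in `id_map_rows`). `load_id_map` is a hypothetical helper, and the sample rows are made up to mirror the test.

```python
import csv
import io


def load_id_map(handle):
    """Return {old_id: new_id} from a tab-separated ID map (hypothetical helper).

    Assumes a header row containing 'old_id' and 'new_id' columns.
    Rows with an empty old_id (children that had no ID) are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["old_id"]: row["new_id"] for row in reader if row["old_id"]}


# In-memory stand-in for sample.renamed.gff3.id_map.tsv
sample = io.StringIO(
    "feature_type\told_id\tnew_id\n"
    "gene\told_gene\tsample_C01g00001\n"
    "mRNA\told_tx\tsample_C01g00001.t1\n"
    "exon\t\tsample_C01g00001.t1.exon1\n"
)
mapping = load_id_map(sample)
print(mapping["old_gene"])
```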

gffkit-0.3/README.md DELETED

@@ -1,71 +0,0 @@
-# gffkit
-`gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
-It combines three utilities:
-1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
-2. `complement`: complement/merge annotations, with optional region-swap mode.
-3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
-## Installation
-```bash
-pip install gffkit
-```
-## Quick start
-### Full integration pipeline
-```bash
-gffkit integrate \
-  --annotation-a EviAnn.gff3 \
-  --annotation-b ANNEVO.gff3 \
-  --outdir gffkit_out \
-  --prefix sample
-```
-Outputs:
-- `gffkit_out/sample.suspicious.tsv`
-- `gffkit_out/sample.merged.gff3`
-- `gffkit_out/sample.final.withUTR.gff3`
-### Step-by-step usage
-```bash
-# 1. Detect suspicious merged genes in Annotation A
-gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv
-# 2. Use A as the global reference, but switch to B in suspicious regions
-gffkit complement \
-  --ref EviAnn.gff3 \
-  --add ANNEVO.gff3 \
-  --swap_region_tsv suspicious.tsv \
-  --swap_region_flank 100 \
-  --output merged.gff3
-# 3. Add UTR features
-gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.gff3
-```
-## Command overview
-```bash
-gffkit --help
-gffkit detect-bridge --help
-gffkit complement --help
-gffkit add-utr --help
-gffkit integrate --help
-```
-## Annotation integration strategy
-- Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
-- Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
-- UTR features are reconstructed after merging using an exon-minus-CDS strategy.
-## License
-MIT License.