gffkit 0.3__tar.gz → 0.4.0__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: gffkit
3
- Version: 0.3
3
+ Version: 0.4.0
4
4
  Summary: Region-aware GFF annotation integration toolkit
5
5
  Author: Qunjie Zhang
6
6
  License: MIT
@@ -26,11 +26,12 @@ License-File: LICENSE
26
26
  # gffkit
27
27
 
28
28
  `gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
29
- It combines three utilities:
29
+ It combines four utilities:
30
30
 
31
31
  1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
32
32
  2. `complement`: complement/merge annotations, with optional region-swap mode.
33
33
  3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
34
+ 4. `rename-sort`: rename gene/transcript/child IDs with a prefix and sort the final GFF3.
34
35
 
35
36
  ## Installation
36
37
 
@@ -48,20 +49,25 @@ gffkit integrate \
48
49
  --annotation-a EviAnn.gff3 \
49
50
  --annotation-b ANNEVO.gff3 \
50
51
  --outdir gffkit_out \
51
- --prefix sample
52
+ --prefix sample \
53
+ -t 8
52
54
  ```
53
55
 
54
56
  Outputs:
55
57
 
56
58
  - `gffkit_out/sample.suspicious.tsv`
57
59
  - `gffkit_out/sample.merged.gff3`
60
+ - `gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3`
58
61
  - `gffkit_out/sample.final.withUTR.gff3`
62
+ - `gffkit_out/sample.final.withUTR.gff3.id_map.tsv`
63
+
64
+ In `integrate`, `--prefix sample` is also used for final ID renaming. Final gene IDs are written like `sample_C01g00001`, transcript IDs like `sample_C01g00001.t1`, and child IDs like `sample_C01g00001.t1.exon1`.
59
65
 
60
66
  ### Step-by-step usage
61
67
 
62
68
  ```bash
63
69
  # 1. Detect suspicious merged genes in Annotation A
64
- gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv
70
+ gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv -t 8
65
71
 
66
72
  # 2. Use A as the global reference, but switch to B in suspicious regions
67
73
  gffkit complement \
@@ -69,10 +75,31 @@ gffkit complement \
69
75
  --add ANNEVO.gff3 \
70
76
  --swap_region_tsv suspicious.tsv \
71
77
  --swap_region_flank 100 \
72
- --output merged.gff3
78
+ --output merged.gff3 \
79
+ -t 8
73
80
 
74
81
  # 3. Add UTR features
75
- gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.gff3
82
+ gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.pre_rename.gff3
83
+
84
+ # 4. Rename IDs, drop unplaced seqids, and sort the final GFF3
85
+ gffkit rename-sort \
86
+ -i final.annotation.withUTR.pre_rename.gff3 \
87
+ -o final.annotation.withUTR.gff3 \
88
+ --prefix sample
89
+ ```
90
+
91
+ ### Merge three or more annotations
92
+
93
+ Use repeated `--add` arguments. Files are merged in the order provided.
94
+
95
+ ```bash
96
+ gffkit complement \
97
+ --ref EviAnn.gff3 \
98
+ --add ANNEVO.gff3 \
99
+ --add Helixer.gff3 \
100
+ --add PASA.gff3 \
101
+ --output merged.multi.gff3 \
102
+ -t 8
76
103
  ```
77
104
 
78
105
  ## Command overview
@@ -82,14 +109,50 @@ gffkit --help
82
109
  gffkit detect-bridge --help
83
110
  gffkit complement --help
84
111
  gffkit add-utr --help
112
+ gffkit rename-sort --help
85
113
  gffkit integrate --help
86
114
  ```
87
115
 
116
+ ## Threads
117
+
118
+ The `-t/--threads` option is available in version 0.3 and later.
119
+
120
+ - `detect-bridge` analyzes genes in parallel.
121
+ - `complement` pre-parses multiple `--add` files in parallel, then merges them in the original command-line order.
122
+ - `integrate` passes the thread count to the detect and complement steps.
123
+
124
+ Example:
125
+
126
+ ```bash
127
+ gffkit integrate --annotation-a EviAnn.gff3 --annotation-b ANNEVO.gff3 -t 16
128
+ ```
129
+
88
130
  ## Annotation integration strategy
89
131
 
90
132
  - Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
91
133
  - Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
92
134
  - UTR features are reconstructed after merging using an exon-minus-CDS strategy.
135
+ - Version 0.4.0 and later run `rename-sort` as the final `integrate` step. The final GFF3 keeps chromosome-mounted records, removes unplaced/scaffold/contig records, sorts features, rewrites `ID`/`Parent`, and writes an ID map next to the output.
136
+ - When multiple tools annotate the same gene locus, the GFF source column is combined with `|`, for example `EviAnn|ANNEVO`.
137
+
138
+ ## Rename and sort
139
+
140
+ Run this step independently when you already have a merged GFF3:
141
+
142
+ ```bash
143
+ gffkit rename-sort \
144
+ -i merged.withUTR.gff3 \
145
+ -o sample.renamed.sorted.gff3 \
146
+ --prefix sample \
147
+ --digits 5 \
148
+ --keep-old-ids
149
+ ```
150
+
151
+ This writes `sample.renamed.sorted.gff3` and `sample.renamed.sorted.gff3.id_map.tsv`.
152
+
153
+ ## Maintainer notes
154
+
155
+ When command-line options or behavior changes, update this `README.md` in the versioned package directory before building and uploading to PyPI.
93
156
 
94
157
  ## License
95
158
 
gffkit-0.4.0/README.md ADDED
@@ -0,0 +1,134 @@
1
+ # gffkit
2
+
3
+ `gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
4
+ It combines four utilities:
5
+
6
+ 1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
7
+ 2. `complement`: complement/merge annotations, with optional region-swap mode.
8
+ 3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
9
+ 4. `rename-sort`: rename gene/transcript/child IDs with a prefix and sort the final GFF3.
10
+
11
+ ## Installation
12
+
13
+ ```bash
14
+ pip install gffkit
15
+ ```
16
+
17
+
18
+ ## Quick start
19
+
20
+ ### Full integration pipeline
21
+
22
+ ```bash
23
+ gffkit integrate \
24
+ --annotation-a EviAnn.gff3 \
25
+ --annotation-b ANNEVO.gff3 \
26
+ --outdir gffkit_out \
27
+ --prefix sample \
28
+ -t 8
29
+ ```
30
+
31
+ Outputs:
32
+
33
+ - `gffkit_out/sample.suspicious.tsv`
34
+ - `gffkit_out/sample.merged.gff3`
35
+ - `gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3`
36
+ - `gffkit_out/sample.final.withUTR.gff3`
37
+ - `gffkit_out/sample.final.withUTR.gff3.id_map.tsv`
38
+
39
+ In `integrate`, `--prefix sample` is also used for final ID renaming. Final gene IDs are written like `sample_C01g00001`, transcript IDs like `sample_C01g00001.t1`, and child IDs like `sample_C01g00001.t1.exon1`.
40
+
41
+ ### Step-by-step usage
42
+
43
+ ```bash
44
+ # 1. Detect suspicious merged genes in Annotation A
45
+ gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv -t 8
46
+
47
+ # 2. Use A as the global reference, but switch to B in suspicious regions
48
+ gffkit complement \
49
+ --ref EviAnn.gff3 \
50
+ --add ANNEVO.gff3 \
51
+ --swap_region_tsv suspicious.tsv \
52
+ --swap_region_flank 100 \
53
+ --output merged.gff3 \
54
+ -t 8
55
+
56
+ # 3. Add UTR features
57
+ gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.pre_rename.gff3
58
+
59
+ # 4. Rename IDs, drop unplaced seqids, and sort the final GFF3
60
+ gffkit rename-sort \
61
+ -i final.annotation.withUTR.pre_rename.gff3 \
62
+ -o final.annotation.withUTR.gff3 \
63
+ --prefix sample
64
+ ```
65
+
66
+ ### Merge three or more annotations
67
+
68
+ Use repeated `--add` arguments. Files are merged in the order provided.
69
+
70
+ ```bash
71
+ gffkit complement \
72
+ --ref EviAnn.gff3 \
73
+ --add ANNEVO.gff3 \
74
+ --add Helixer.gff3 \
75
+ --add PASA.gff3 \
76
+ --output merged.multi.gff3 \
77
+ -t 8
78
+ ```
79
+
80
+ ## Command overview
81
+
82
+ ```bash
83
+ gffkit --help
84
+ gffkit detect-bridge --help
85
+ gffkit complement --help
86
+ gffkit add-utr --help
87
+ gffkit rename-sort --help
88
+ gffkit integrate --help
89
+ ```
90
+
91
+ ## Threads
92
+
93
+ The `-t/--threads` option is available in version 0.3 and later.
94
+
95
+ - `detect-bridge` analyzes genes in parallel.
96
+ - `complement` pre-parses multiple `--add` files in parallel, then merges them in the original command-line order.
97
+ - `integrate` passes the thread count to the detect and complement steps.
98
+
99
+ Example:
100
+
101
+ ```bash
102
+ gffkit integrate --annotation-a EviAnn.gff3 --annotation-b ANNEVO.gff3 -t 16
103
+ ```
104
+
105
+ ## Annotation integration strategy
106
+
107
+ - Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
108
+ - Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
109
+ - UTR features are reconstructed after merging using an exon-minus-CDS strategy.
110
+ - Version 0.4.0 and later run `rename-sort` as the final `integrate` step. The final GFF3 keeps chromosome-mounted records, removes unplaced/scaffold/contig records, sorts features, rewrites `ID`/`Parent`, and writes an ID map next to the output.
111
+ - When multiple tools annotate the same gene locus, the GFF source column is combined with `|`, for example `EviAnn|ANNEVO`.
112
+
113
+ ## Rename and sort
114
+
115
+ Run this step independently when you already have a merged GFF3:
116
+
117
+ ```bash
118
+ gffkit rename-sort \
119
+ -i merged.withUTR.gff3 \
120
+ -o sample.renamed.sorted.gff3 \
121
+ --prefix sample \
122
+ --digits 5 \
123
+ --keep-old-ids
124
+ ```
125
+
126
+ This writes `sample.renamed.sorted.gff3` and `sample.renamed.sorted.gff3.id_map.tsv`.
127
+
128
+ ## Maintainer notes
129
+
130
+ When command-line options or behavior changes, update this `README.md` in the versioned package directory before building and uploading to PyPI.
131
+
132
+ ## License
133
+
134
+ MIT License.
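
The ID scheme described in the README (`sample_C01g00001`, `sample_C01g00001.t1`, `sample_C01g00001.t1.exon1`) can be illustrated with a small sketch. This is not the packaged implementation: the helper names `numbered_chromosome`, `gene_id`, `transcript_id`, and `child_id` are hypothetical, the two regular expressions in `rename_sort_gff3.py` are condensed into one here, and only numbered chromosomes are handled (sex and organelle seqids are left out).

```python
import re
from typing import Optional


def numbered_chromosome(seqid: str) -> Optional[int]:
    """Return the chromosome number for names like chr01, Chr1, chromosome2, 1 or 01.

    Names such as un001, scaffold01 or chr01_random return None, so they are
    never mistaken for chromosome 1.
    """
    low = seqid.strip().lower()
    m = re.fullmatch(r"(?:chr|chromosome)?0*([0-9]+)", low)
    return int(m.group(1)) if m else None


def gene_id(prefix: str, seqid: str, gene_number: int, digits: int = 5) -> str:
    """Format a gene ID as {prefix}_C{chromosome:02d}g{gene_number:0{digits}d}."""
    n = numbered_chromosome(seqid)
    if n is None:
        # Assumption for this sketch: anything that is not a numbered chromosome is rejected.
        raise ValueError(f"not a numbered chromosome: {seqid!r}")
    return f"{prefix}_C{n:02d}g{gene_number:0{digits}d}"


def transcript_id(gid: str, index: int) -> str:
    """Transcripts are numbered per gene: {gene_id}.t1, {gene_id}.t2, ..."""
    return f"{gid}.t{index}"


def child_id(tid: str, feature_type: str, index: int) -> str:
    """Child features are numbered per transcript and type: {transcript_id}.{type}{index}."""
    return f"{tid}.{feature_type}{index}"


if __name__ == "__main__":
    gid = gene_id("sample", "chr01", 1)
    tid = transcript_id(gid, 1)
    print(gid)                      # sample_C01g00001
    print(tid)                      # sample_C01g00001.t1
    print(child_id(tid, "exon", 1)) # sample_C01g00001.t1.exon1
```

The `digits` argument mirrors the `--digits` option: `gene_id("sample", "1", 7, digits=3)` gives `sample_C01g007`.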
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
4
 
5
5
  [project]
6
6
  name = "gffkit"
7
- version = "0.3"
7
+ version = "0.4.0"
8
8
  description = "Region-aware GFF annotation integration toolkit"
9
9
  readme = "README.md"
10
10
  requires-python = ">=3.8"
@@ -38,6 +38,7 @@ gffkit = "gffkit.main:main"
38
38
  gffkit-detect-bridge = "gffkit.detect_bridge_merged_genes:main"
39
39
  gffkit-complement = "gffkit.complement_annotations:main"
40
40
  gffkit-add-utr = "gffkit.add_utr:main"
41
+ gffkit-rename-sort = "gffkit.rename_sort_gff3:main"
41
42
 
42
43
  [tool.setuptools]
43
44
  package-dir = {"" = "src"}
@@ -1,3 +1,3 @@
1
1
  """gffkit: region-aware GFF annotation integration utilities."""
2
2
 
3
- __version__ = "0.3"
3
+ __version__ = "0.4.0"
@@ -49,11 +49,23 @@ def cmd_add_utr(args: argparse.Namespace, extra: List[str]) -> int:
49
49
  return _run_legacy_main(mod.main, "gffkit add-utr", cli)
50
50
 
51
51
 
52
+ def cmd_rename_sort(args: argparse.Namespace, extra: List[str]) -> int:
53
+ from . import rename_sort_gff3 as mod
54
+ cli = ["-i", args.input, "-o", args.output, "--prefix", args.prefix]
55
+ if args.digits is not None:
56
+ cli += ["--digits", str(args.digits)]
57
+ if args.keep_old_ids:
58
+ cli.append("--keep-old-ids")
59
+ cli += extra
60
+ return _run_legacy_main(mod.main, "gffkit rename-sort", cli)
61
+
62
+
52
63
  def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
53
- """Run the full three-step annotation integration workflow."""
64
+ """Run the full annotation integration workflow."""
54
65
  from . import detect_bridge_merged_genes as detect_mod
55
66
  from . import complement_annotations as complement_mod
56
67
  from . import add_utr as utr_mod
68
+ from . import rename_sort_gff3 as rename_mod
57
69
 
58
70
  outdir = Path(args.outdir)
59
71
  outdir.mkdir(parents=True, exist_ok=True)
@@ -61,8 +73,9 @@ def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
61
73
  suspicious_tsv = Path(args.suspicious_tsv) if args.suspicious_tsv else outdir / f"{args.prefix}.suspicious.tsv"
62
74
  merged_gff = Path(args.merged_gff) if args.merged_gff else outdir / f"{args.prefix}.merged.gff3"
63
75
  final_gff = Path(args.output) if args.output else outdir / f"{args.prefix}.final.withUTR.gff3"
76
+ pre_rename_gff = final_gff.with_suffix(final_gff.suffix + ".pre_rename.gff3")
64
77
 
65
- print("[gffkit] Step 1/3: detecting suspicious merged genes", file=sys.stderr)
78
+ print("[gffkit] Step 1/4: detecting suspicious merged genes", file=sys.stderr)
66
79
  detect_cli = [
67
80
  "-i", args.annotation_a,
68
81
  "-o", str(suspicious_tsv),
@@ -78,7 +91,7 @@ def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
78
91
  if ret != 0:
79
92
  return ret
80
93
 
81
- print("[gffkit] Step 2/3: region-aware annotation merging", file=sys.stderr)
94
+ print("[gffkit] Step 2/4: region-aware annotation merging", file=sys.stderr)
82
95
  complement_cli = [
83
96
  "--ref", args.annotation_a,
84
97
  "--add", args.annotation_b,
@@ -92,18 +105,33 @@ def cmd_integrate(args: argparse.Namespace, extra: List[str]) -> int:
92
105
  if ret != 0:
93
106
  return ret
94
107
 
95
- print("[gffkit] Step 3/3: adding UTR features", file=sys.stderr)
96
- utr_cli = ["-i", str(merged_gff), "-o", str(final_gff), "--id-prefix", args.utr_id_prefix]
108
+ print("[gffkit] Step 3/4: adding UTR features", file=sys.stderr)
109
+ utr_cli = ["-i", str(merged_gff), "-o", str(pre_rename_gff), "--id-prefix", args.utr_id_prefix]
97
110
  if args.replace_existing_utrs:
98
111
  utr_cli.append("--replace-existing-utrs")
99
112
  ret = _run_legacy_main(utr_mod.main, "gffkit add-utr", utr_cli)
100
113
  if ret != 0:
101
114
  return ret
102
115
 
116
+ print("[gffkit] Step 4/4: renaming IDs and sorting GFF3", file=sys.stderr)
117
+ rename_cli = [
118
+ "-i", str(pre_rename_gff),
119
+ "-o", str(final_gff),
120
+ "--prefix", args.prefix,
121
+ "--digits", str(args.rename_digits),
122
+ ]
123
+ if args.keep_old_ids:
124
+ rename_cli.append("--keep-old-ids")
125
+ ret = _run_legacy_main(rename_mod.main, "gffkit rename-sort", rename_cli)
126
+ if ret != 0:
127
+ return ret
128
+
103
129
  print("[gffkit] Done", file=sys.stderr)
104
130
  print(f"[gffkit] suspicious TSV: {suspicious_tsv}", file=sys.stderr)
105
131
  print(f"[gffkit] merged GFF3: {merged_gff}", file=sys.stderr)
132
+ print(f"[gffkit] pre-rename GFF3: {pre_rename_gff}", file=sys.stderr)
106
133
  print(f"[gffkit] final GFF3: {final_gff}", file=sys.stderr)
134
+ print(f"[gffkit] ID map TSV: {final_gff}.id_map.tsv", file=sys.stderr)
107
135
  return 0
108
136
 
109
137
 
@@ -146,12 +174,24 @@ def build_parser() -> argparse.ArgumentParser:
146
174
  p.add_argument("-o", "--output", required=True, help="Output GFF3 file.")
147
175
  p.set_defaults(handler=cmd_add_utr)
148
176
 
177
+ p = subparsers.add_parser(
178
+ "rename-sort",
179
+ help="Rename GFF3 IDs with a prefix, keep chromosome records, and sort features.",
180
+ description="Rename gene/transcript/child IDs, update Parent links, remove unplaced seqids, and sort GFF3.",
181
+ )
182
+ p.add_argument("-i", "--input", required=True, help="Input GFF3/GTF-like file; .gz is supported.")
183
+ p.add_argument("-o", "--output", required=True, help="Output renamed and sorted GFF3 file; .gz is supported.")
184
+ p.add_argument("--prefix", required=True, help="Prefix used in new gene IDs, e.g. sample_C01g00001.")
185
+ p.add_argument("--digits", type=int, default=5, help="Gene number width. Default: 5.")
186
+ p.add_argument("--keep-old-ids", action="store_true", help="Keep old ID/Parent as old_ID/old_Parent.")
187
+ p.set_defaults(handler=cmd_rename_sort)
188
+
149
189
  p = subparsers.add_parser(
150
190
  "integrate",
151
- help="Run the full A/B region-aware integration workflow.",
191
+ help="Run the full A/B region-aware integration, UTR, rename, and sort workflow.",
152
192
  description=(
153
193
  "Detect suspicious merged-gene regions in Annotation A, use Annotation B as the local "
154
- "primary reference in those regions, then add UTR features."
194
+ "primary reference in those regions, add UTR features, then rename and sort the final GFF3."
155
195
  ),
156
196
  )
157
197
  p.add_argument("--annotation-a", "--a", required=True, help="Annotation A: EviAnn/RNA-seq-supported GFF3.")
@@ -173,6 +213,8 @@ def build_parser() -> argparse.ArgumentParser:
173
213
  p.add_argument("--size-min", type=int, default=0, help="Minimum CDS size for non-overlapping supplementary roots.")
174
214
  p.add_argument("--replace-existing-utrs", action="store_true", help="Remove existing UTRs and recreate them.")
175
215
  p.add_argument("--utr-id-prefix", default="gffkit_utr_", help="Prefix for newly created UTR IDs.")
216
+ p.add_argument("--rename-digits", type=int, default=5, help="Gene number width used by final rename/sort step.")
217
+ p.add_argument("--keep-old-ids", action="store_true", help="Keep old ID/Parent as old_ID/old_Parent in the final GFF3.")
176
218
  p.set_defaults(handler=cmd_integrate)
177
219
 
178
220
  return parser
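
The `integrate` changes above derive the intermediate file name with `final_gff.with_suffix(final_gff.suffix + ".pre_rename.gff3")` and report the ID map as `{final_gff}.id_map.tsv`. A minimal sketch of that path arithmetic follows; the helper names `pre_rename_path` and `id_map_path` are hypothetical, and `PurePosixPath` is used only so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def pre_rename_path(final_gff: str) -> PurePosixPath:
    """Mirror the with_suffix() expression used in cmd_integrate."""
    final = PurePosixPath(final_gff)
    # with_suffix() replaces only the last suffix, so re-adding it keeps the full name.
    return final.with_suffix(final.suffix + ".pre_rename.gff3")


def id_map_path(final_gff: str) -> str:
    """The ID map path is the final GFF3 path with '.id_map.tsv' appended."""
    return f"{final_gff}.id_map.tsv"


if __name__ == "__main__":
    final = "gffkit_out/sample.final.withUTR.gff3"
    print(pre_rename_path(final))  # gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3
    print(id_map_path(final))      # gffkit_out/sample.final.withUTR.gff3.id_map.tsv
```

Under the same expression, a final output named `out.gff3.gz` yields `out.gff3.gz.pre_rename.gff3`, an intermediate name that no longer ends in `.gz`.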
@@ -0,0 +1,568 @@
1
+ #!/usr/bin/env python3
2
+ # -*- coding: utf-8 -*-
3
+
4
+ """
5
+ rename_sort_gff3_v3_chronly.py
6
+
7
+ Version: 2026-06-18-chronly
8
+
9
+ This version fixes sorting of un/scaffold/contig sequences:
10
+ chr01/chr1/1/01 are sorted as numbered chromosomes.
11
+ un001/scaffold01/contig01 are NOT parsed as chromosome 1; they are placed after chromosomes.
12
+
13
+ Rename gene/transcript/child IDs in a merged GFF3 and sort it.
14
+
15
+ Gene ID format:
16
+ {species}_C{chromosome_number:02d}g{gene_number:05d}
17
+ Example:
18
+ EIL31_C01g00001
19
+
20
+ Transcript ID format:
21
+ {gene_id}.t1, {gene_id}.t2 ...
22
+
23
+ Child feature ID format:
24
+ {transcript_id}.{feature_type}{index}
25
+ Example:
26
+ EIL31_C01g00001.t1.exon1
27
+ EIL31_C01g00001.t1.CDS1
28
+ EIL31_C01g00001.t1.five_prime_UTR1
29
+
30
+ Main features:
31
+ - rewrite ID and Parent relationships
32
+ - remove genes/features on unplaced scaffolds/contigs before renaming
33
+ - keep most original attributes except old ID/Parent by default
34
+ - sort genes by chromosome and start coordinate
35
+ - sort transcripts and child features within each gene
36
+
37
+ Usage:
38
+ python rename_sort_gff3.py -i input.gff3 -o output.renamed.sorted.gff3 --species EIL31
39
+
40
+ Optional:
41
+ python rename_sort_gff3.py -i input.gff3 -o output.gff3 --species EIL31 --keep-old-ids
42
+ """
43
+
44
+ import argparse
45
+ import gzip
46
+ import re
47
+ import sys
48
+ from collections import defaultdict, Counter
49
+ from dataclasses import dataclass, field
50
+ from typing import Dict, List, Optional, Tuple
51
+
52
+
53
+ TRANSCRIPT_TYPES = {
54
+ "mrna", "transcript", "lnc_rna", "ncrna", "rrna", "trna",
55
+ "snrna", "snorna", "mirna", "primary_transcript"
56
+ }
57
+
58
+ CHILD_RANK = {
59
+ "exon": 10,
60
+ "five_prime_utr": 20,
61
+ "utr": 25,
62
+ "cds": 30,
63
+ "three_prime_utr": 40,
64
+ }
65
+
66
+ @dataclass
67
+ class Feature:
68
+ seqid: str
69
+ source: str
70
+ ftype: str
71
+ start: int
72
+ end: int
73
+ score: str
74
+ strand: str
75
+ phase: str
76
+ attrs: Dict[str, List[str]] = field(default_factory=dict)
77
+ line_no: int = 0
78
+
79
+ def id(self) -> Optional[str]:
80
+ vals = self.attrs.get("ID")
81
+ return vals[0] if vals else None
82
+
83
+ def parents(self) -> List[str]:
84
+ return self.attrs.get("Parent", [])
85
+
86
+ def to_line(self) -> str:
87
+ return "\t".join([
88
+ self.seqid, self.source, self.ftype, str(self.start), str(self.end),
89
+ self.score, self.strand, self.phase, format_attrs(self.attrs)
90
+ ])
91
+
92
+
93
+ def open_text(path: str, mode: str = "rt"):
94
+ if path == "-":
95
+ return sys.stdin if "r" in mode else sys.stdout
96
+ if path.endswith(".gz"):
97
+ return gzip.open(path, mode, encoding="utf-8")
98
+ return open(path, mode, encoding="utf-8")
99
+
100
+
101
+ def parse_attrs(s: str) -> Dict[str, List[str]]:
102
+ attrs: Dict[str, List[str]] = {}
103
+ s = s.strip()
104
+ if not s or s == ".":
105
+ return attrs
106
+
107
+ for part in [x.strip() for x in s.rstrip(";").split(";") if x.strip()]:
108
+ if "=" in part:
109
+ k, v = part.split("=", 1)
110
+ attrs[k.strip()] = [x.strip() for x in v.split(",") if x.strip()]
111
+ else:
112
+ # GTF-like: gene_id "xxx"; transcript_id "yyy";
113
+ fields = part.split(None, 1)
114
+ if len(fields) == 2:
115
+ attrs[fields[0].strip()] = [fields[1].strip().strip('"')]
116
+ return attrs
117
+
118
+
119
+ def format_attrs(attrs: Dict[str, List[str]]) -> str:
120
+ if not attrs:
121
+ return "."
122
+ preferred = [
123
+ "ID", "Parent", "Name",
124
+ "old_ID", "old_Parent",
125
+ "gene_id", "transcript_id",
126
+ "geneID", "gene_biotype", "Class", "Evidence",
127
+ "EvidenceProteinID", "EvidenceTranscriptID",
128
+ "Num_exons", "StartCodon", "StopCodon",
129
+ ]
130
+ keys = [k for k in preferred if k in attrs]
131
+ keys.extend(sorted(k for k in attrs if k not in keys))
132
+ out = []
133
+ for k in keys:
134
+ vals = attrs.get(k, [])
135
+ if vals:
136
+ out.append(f"{k}={','.join(vals)}")
137
+ return ";".join(out) if out else "."
138
+
139
+
140
+ def read_gff(path: str) -> Tuple[List[str], List[Feature]]:
141
+ headers, features = [], []
142
+ with open_text(path, "rt") as fh:
143
+ for i, line in enumerate(fh, 1):
144
+ line = line.rstrip("\n")
145
+ if not line:
146
+ continue
147
+ if line.startswith("#"):
148
+ # Do not keep embedded FASTA sequence. This script is for annotation table only.
149
+ if line.startswith("##FASTA"):
150
+ break
151
+ headers.append(line)
152
+ continue
153
+ cols = line.split("\t")
154
+ if len(cols) != 9:
155
+ print(f"[WARN] skip line {i}: not 9 columns", file=sys.stderr)
156
+ continue
157
+ try:
158
+ start, end = int(cols[3]), int(cols[4])
159
+ except ValueError:
160
+ print(f"[WARN] skip line {i}: start/end not integer", file=sys.stderr)
161
+ continue
162
+ features.append(Feature(
163
+ seqid=cols[0], source=cols[1], ftype=cols[2],
164
+ start=start, end=end, score=cols[5], strand=cols[6],
165
+ phase=cols[7], attrs=parse_attrs(cols[8]), line_no=i
166
+ ))
167
+ return headers, features
168
+
169
+
170
+ def _numbered_chr_match(seqid: str) -> Optional[int]:
171
+ """
172
+ Return chromosome number only for true chromosome names.
173
+
174
+ Accepted as numbered chromosomes:
175
+ chr01, Chr01, chr1, 1, 01
176
+
177
+ NOT accepted as numbered chromosomes:
178
+ un001, scaffold01, contig_01, chr01_random
179
+
180
+ The important point is that names such as un001 must not be parsed as C01.
181
+ """
182
+ low = seqid.strip().lower()
183
+
184
+ # Pure numeric sequence names: 1, 01, 001
185
+ if re.fullmatch(r"0*[0-9]+", low):
186
+ return int(low)
187
+
188
+ # Standard chromosome names only: chr1, chr01, chromosome1, chromosome01
189
+ m = re.fullmatch(r"(?:chr|chromosome)0*([0-9]+)", low)
190
+ if m:
191
+ return int(m.group(1))
192
+
193
+ return None
194
+
195
+
196
+ def _un_number(seqid: str) -> int:
197
+ """Return numeric suffix for un001-like names, only for sorting among un contigs."""
198
+ m = re.search(r"([0-9]+)$", seqid.strip())
199
+ return int(m.group(1)) if m else 10**9
200
+
201
+
202
+ def chr_key(seqid: str) -> Tuple[int, int, str]:
203
+ """
204
+ Sorting priority:
205
+ 0: numbered chromosomes, chr01 -> chr10
206
+ 1: sex chromosomes, if present
207
+ 2: organelle chromosomes, if present
208
+ 9: un/scaffold/contig/other sequences after all chromosomes
209
+ """
210
+ s = seqid.strip()
211
+ low = s.lower()
212
+
213
+ n = _numbered_chr_match(s)
214
+ if n is not None:
215
+ return (0, n, s)
216
+
217
+ if low in {"chrx", "x"}:
218
+ return (1, 23, s)
219
+ if low in {"chry", "y"}:
220
+ return (1, 24, s)
221
+ if low in {"chrm", "mt", "m", "mitochondria"}:
222
+ return (2, 1, s)
223
+ if low in {"chrc", "pt", "chloroplast"}:
224
+ return (2, 2, s)
225
+
226
+ # Put un/scaffold/contig after numbered chromosomes.
227
+ # Sort un001, un002, ... naturally by their numeric suffix.
228
+ return (9, _un_number(s), s)
229
+
230
+
231
+ def is_chromosome_seqid(seqid: str) -> bool:
232
+ """
233
+ Return True for chromosome-mounted sequence names recognized by chr_key().
234
+
235
+ Numbered chromosomes (chr01/chr1/1/01), sex chromosomes and organelle seqids are kept. Unplaced names such
236
+ as un001/scaffold01/contig01 and unknown seqids are removed.
237
+ """
238
+ return chr_key(seqid)[0] in {0, 1, 2}
239
+
240
+
241
+ def filter_chromosome_features(features: List[Feature]) -> Tuple[List[Feature], Counter]:
242
+ """Drop all records whose seqid is not recognized as chromosome-mounted."""
243
+ stats = Counter()
244
+ kept: List[Feature] = []
245
+
246
+ for f in features:
247
+ stats["features_total"] += 1
248
+ if f.ftype.lower() == "gene":
249
+ stats["genes_total"] += 1
250
+
251
+ if is_chromosome_seqid(f.seqid):
252
+ kept.append(f)
253
+ stats["features_kept"] += 1
254
+ if f.ftype.lower() == "gene":
255
+ stats["genes_kept"] += 1
256
+ else:
257
+ stats["features_removed"] += 1
258
+ stats[f"removed_seqid:{f.seqid}"] += 1
259
+ if f.ftype.lower() == "gene":
260
+ stats["genes_removed"] += 1
261
+
262
+ return kept, stats
263
+
264
+
265
+ def chr_code(seqid: str) -> str:
266
+ """
267
+ chr01 -> C01
268
+ chr1 -> C01
269
+ 1 -> C01
270
+ un001 -> Cun001
271
+ scaffold01 -> Cscaffold01
272
+ """
273
+ n = _numbered_chr_match(seqid)
274
+ if n is not None:
275
+ return f"C{n:02d}"
276
+
277
+ clean = re.sub(r"[^A-Za-z0-9]+", "", seqid.strip())
278
+ return f"C{clean}"
279
+
280
+
281
+ def is_transcript_type(ftype: str) -> bool:
282
+ return ftype.lower() in TRANSCRIPT_TYPES
283
+
284
+
285
+ def feature_sort_key(f: Feature) -> Tuple:
286
+ return (
287
+ chr_key(f.seqid),
288
+ f.start,
289
+ CHILD_RANK.get(f.ftype.lower(), 100),
290
+ f.end,
291
+ f.line_no,
292
+ )
293
+
294
+
295
+ def update_id_attr(f: Feature, new_id: Optional[str], keep_old: bool):
296
+ old = f.id()
297
+ if old and keep_old and old != new_id:
298
+ f.attrs["old_ID"] = [old]
299
+ if new_id is None:
300
+ f.attrs.pop("ID", None)
301
+ else:
302
+ f.attrs["ID"] = [new_id]
303
+
304
+
305
+ def update_parent_attr(f: Feature, new_parents: List[str], keep_old: bool):
306
+ old = f.parents()
307
+ if old and keep_old and old != new_parents:
308
+ f.attrs["old_Parent"] = old
309
+ if new_parents:
310
+ f.attrs["Parent"] = new_parents
311
+ else:
312
+ f.attrs.pop("Parent", None)
313
+
314
+
315
+ def write_id_map(path: str, rows: List[Dict[str, str]]):
316
+ fields = [
317
+ "feature_type",
318
+ "old_id",
319
+ "new_id",
320
+ "old_parent",
321
+ "new_parent",
322
+ "seqid",
323
+ "start",
324
+ "end",
325
+ "strand",
326
+ ]
327
+ with open_text(path, "wt") as out:
328
+ print("\t".join(fields), file=out)
329
+ for row in rows:
330
+ print("\t".join(row.get(k, "") for k in fields), file=out)
331
+
332
+
333
+ def main():
334
+ ap = argparse.ArgumentParser(
335
+ description="Keep chromosome-mounted records, rename gene IDs to species_Cxxgxxxxx format, and sort GFF3."
336
+ )
337
+ ap.add_argument("-i", "--input", required=True, help="Input GFF3/GTF-like file, .gz supported")
338
+ ap.add_argument("-o", "--output", required=True, help="Output GFF3 file, .gz supported")
339
+ ap.add_argument(
340
+ "--species", "--prefix",
341
+ dest="species",
342
+ required=True,
343
+ help="Species/accession prefix used for renamed IDs, e.g. EIL31"
344
+ )
345
+ ap.add_argument("--digits", type=int, default=5, help="Gene number width, default 5")
346
+ ap.add_argument("--keep-old-ids", action="store_true", help="Keep old ID/Parent as old_ID/old_Parent")
347
+ args = ap.parse_args()
348
+
349
+ headers, features = read_gff(args.input)
350
+ features, filter_stats = filter_chromosome_features(features)
351
+
352
+ id_to_feature = {}
353
+ for f in features:
354
+ fid = f.id()
355
+ if fid:
356
+ if fid in id_to_feature:
357
+ print(f"[WARN] duplicated ID found: {fid}", file=sys.stderr)
358
+ id_to_feature[fid] = f
359
+
360
+ genes = [f for f in features if f.ftype.lower() == "gene"]
361
+ genes_sorted = sorted(genes, key=lambda f: (chr_key(f.seqid), f.start, f.end, f.line_no))
362
+
363
+ old_gene_to_new = {}
364
+ gene_counter_by_chr = Counter()
365
+
366
+ for g in genes_sorted:
367
+ old_gid = g.id()
368
+ if not old_gid:
369
+ # give an internal key for rare gene records without ID
370
+ old_gid = f"__gene_without_ID_line_{g.line_no}"
371
+ g.attrs["ID"] = [old_gid]
372
+
373
+ ccode = chr_code(g.seqid)
374
+ gene_counter_by_chr[ccode] += 1
375
+ new_gid = f"{args.species}_{ccode}g{gene_counter_by_chr[ccode]:0{args.digits}d}"
376
+ old_gene_to_new[old_gid] = new_gid
377
+
378
+ # transcript old ID -> new ID
379
+ old_tx_to_new = {}
380
+ tx_by_gene = defaultdict(list)
381
+
382
+ for f in features:
383
+ if is_transcript_type(f.ftype):
384
+ parents = f.parents()
385
+ parent_gene_old = parents[0] if parents else None
386
+ if parent_gene_old in old_gene_to_new:
387
+ tx_by_gene[parent_gene_old].append(f)
388
+
389
+ for old_gid, txs in tx_by_gene.items():
390
+ txs_sorted = sorted(txs, key=lambda f: (f.start, f.end, f.line_no))
391
+ new_gid = old_gene_to_new[old_gid]
392
+ for i, tx in enumerate(txs_sorted, 1):
393
+ old_tid = tx.id()
394
+ if not old_tid:
395
+ old_tid = f"__tx_without_ID_line_{tx.line_no}"
396
+ tx.attrs["ID"] = [old_tid]
397
+ old_tx_to_new[old_tid] = f"{new_gid}.t{i}"
398
+
399
+ # Rename genes
400
+ id_map_rows = []
401
+
402
+ for g in genes:
403
+ old_gid = g.id()
404
+ new_gid = old_gene_to_new.get(old_gid)
405
+ if new_gid:
406
+ id_map_rows.append({
407
+ "feature_type": g.ftype,
408
+ "old_id": old_gid or "",
409
+ "new_id": new_gid,
410
+ "old_parent": "",
411
+ "new_parent": "",
412
+ "seqid": g.seqid,
413
+ "start": str(g.start),
414
+ "end": str(g.end),
415
+ "strand": g.strand,
416
+ })
417
+ update_id_attr(g, new_gid, args.keep_old_ids)
418
+ # Standardize common gene-level aliases
419
+ g.attrs["Name"] = [new_gid]
420
+ if "gene_id" in g.attrs:
421
+ g.attrs["gene_id"] = [new_gid]
422
+ if "geneID" in g.attrs:
423
+ g.attrs["geneID"] = [new_gid]
424
+
425
+ # Rename transcripts
426
+ for txs in tx_by_gene.values():
427
+ for tx in txs:
428
+ old_tid = tx.id()
429
+ new_tid = old_tx_to_new.get(old_tid)
430
+ old_parent = tx.parents()[0] if tx.parents() else None
431
+ new_parent = old_gene_to_new.get(old_parent)
432
+ if new_tid:
433
+ id_map_rows.append({
434
+ "feature_type": tx.ftype,
435
+ "old_id": old_tid or "",
436
+ "new_id": new_tid,
437
+ "old_parent": old_parent or "",
438
+ "new_parent": new_parent or "",
439
+ "seqid": tx.seqid,
440
+ "start": str(tx.start),
441
+ "end": str(tx.end),
442
+ "strand": tx.strand,
443
+ })
444
+ update_id_attr(tx, new_tid, args.keep_old_ids)
445
+ tx.attrs["Name"] = [new_tid]
446
+ if "transcript_id" in tx.attrs:
447
+ tx.attrs["transcript_id"] = [new_tid]
448
+ if new_parent:
449
+ update_parent_attr(tx, [new_parent], args.keep_old_ids)
450
+ if "gene_id" in tx.attrs:
451
+ tx.attrs["gene_id"] = [new_parent]
452
+ if "geneID" in tx.attrs:
453
+ tx.attrs["geneID"] = [new_parent]
454
+
455
+ # Rename child features whose Parent is a transcript
456
+ child_count_by_tx_type = Counter()
457
+ for f in sorted(features, key=feature_sort_key):
458
+ if f.ftype.lower() == "gene" or is_transcript_type(f.ftype):
459
+ continue
460
+ old_parents = f.parents()
461
+ new_parents = [old_tx_to_new[p] for p in old_parents if p in old_tx_to_new]
462
+ if new_parents:
463
+ update_parent_attr(f, new_parents, args.keep_old_ids)
464
+
465
+ # Most exon/CDS lines in your merged GFF have no ID; this creates stable IDs.
466
+ # If one child has multiple parents, use the first parent for its own ID.
467
+ p0 = new_parents[0]
468
+ key = (p0, f.ftype)
469
+ child_count_by_tx_type[key] += 1
470
+ safe_type = re.sub(r'[^A-Za-z0-9_]+', '_', f.ftype)
471
+ new_child_id = f"{p0}.{safe_type}{child_count_by_tx_type[key]}"
472
+ id_map_rows.append({
473
+ "feature_type": f.ftype,
474
+ "old_id": f.id() or "",
475
+ "new_id": new_child_id,
476
+ "old_parent": ",".join(old_parents),
477
+ "new_parent": ",".join(new_parents),
478
+ "seqid": f.seqid,
479
+ "start": str(f.start),
480
+ "end": str(f.end),
481
+ "strand": f.strand,
482
+ })
483
+ update_id_attr(f, new_child_id, args.keep_old_ids)
484
+
485
+ if "transcript_id" in f.attrs:
486
+ f.attrs["transcript_id"] = [p0]
487
+ # infer gene_id from transcript ID before .tN
488
+ gene_new = re.sub(r'\.t[0-9]+$', '', p0)
489
+ if "gene_id" in f.attrs:
490
+ f.attrs["gene_id"] = [gene_new]
491
+ if "geneID" in f.attrs:
492
+ f.attrs["geneID"] = [gene_new]
493
+
494
+ # Build output order: gene -> transcripts -> transcript children
495
+ genes_by_new_id = {}
496
+ txs_by_new_gene = defaultdict(list)
497
+ children_by_new_tx = defaultdict(list)
498
+ used_in_hierarchy = set()
499
+
500
+ for f in features:
501
+ if f.ftype.lower() == "gene" and f.id():
502
+ genes_by_new_id[f.id()] = f
503
+
504
+ for f in features:
505
+ if is_transcript_type(f.ftype) and f.parents():
506
+ txs_by_new_gene[f.parents()[0]].append(f)
507
+
508
+ for f in features:
509
+ if f.ftype.lower() == "gene" or is_transcript_type(f.ftype):
510
+ continue
511
+ for p in f.parents():
512
+ children_by_new_tx[p].append(f)
513
+
514
+ output_features = []
515
+ for g in sorted(genes, key=lambda f: (chr_key(f.seqid), f.start, f.end, f.id() or "", f.line_no)):
516
+ output_features.append(g)
517
+ used_in_hierarchy.add(id(g))
518
+ txs = sorted(txs_by_new_gene.get(g.id(), []), key=lambda f: (f.start, f.end, f.line_no))
519
+ for tx in txs:
520
+ output_features.append(tx)
521
+ used_in_hierarchy.add(id(tx))
522
+ kids = sorted(children_by_new_tx.get(tx.id(), []), key=feature_sort_key)
523
+ for k in kids:
524
+ output_features.append(k)
525
+ used_in_hierarchy.add(id(k))
526
+
527
+ # Keep any orphan/unrecognized features instead of silently dropping them
528
+ orphans = [f for f in features if id(f) not in used_in_hierarchy]
529
+ if orphans:
530
+ print(f"[WARN] {len(orphans)} orphan/unhierarchical features kept at end", file=sys.stderr)
531
+ output_features.extend(sorted(orphans, key=feature_sort_key))
532
+
533
+ with open_text(args.output, "wt") as out:
534
+ if not any(h.startswith("##gff-version") for h in headers):
535
+ print("##gff-version 3", file=out)
536
+ for h in headers:
537
+ print(h, file=out)
538
+ for f in output_features:
539
+ print(f.to_line(), file=out)
540
+
541
+ id_map_path = args.output + ".id_map.tsv"
542
+ write_id_map(id_map_path, id_map_rows)
543
+
544
+ removed_seqids = {
545
+ k.split(":", 1)[1]: v
546
+ for k, v in filter_stats.items()
547
+ if k.startswith("removed_seqid:")
548
+ }
549
+ top_removed_seqids = dict(sorted(
550
+ removed_seqids.items(),
551
+ key=lambda kv: (-kv[1], chr_key(kv[0]), kv[0])
552
+ )[:20])
553
+
554
+ print("[VERSION] rename_sort_gff3_v3_chronly.py 2026-06-18-chronly", file=sys.stderr)
555
+ print("[DONE] renamed and sorted GFF3 written to:", args.output, file=sys.stderr)
556
+ print("[DONE] ID map written to:", id_map_path, file=sys.stderr)
557
+ print("[INFO] genes kept:", len(genes), file=sys.stderr)
558
+ print("[INFO] genes removed from unplaced seqids:", filter_stats["genes_removed"], file=sys.stderr)
559
+ print("[INFO] features kept:", filter_stats["features_kept"], file=sys.stderr)
560
+ print("[INFO] features removed from unplaced seqids:", filter_stats["features_removed"], file=sys.stderr)
561
+ if top_removed_seqids:
562
+ print("[INFO] top removed seqids:", top_removed_seqids, file=sys.stderr)
563
+ print("[INFO] transcripts:", len(old_tx_to_new), file=sys.stderr)
564
+ print("[INFO] chromosome gene counts:", dict(gene_counter_by_chr), file=sys.stderr)
565
+
566
+
567
+ if __name__ == "__main__":
568
+ main()
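The child-ID scheme used in the renaming loop above (a counter per transcript and feature type, plus a sanitized type name) can be illustrated in isolation. This is a minimal sketch and not the package's code: `number_children` and `gene_id_from_transcript` are hypothetical helper names, and the input is assumed to be sorted already in the desired order.

```python
import re
from collections import Counter


def number_children(children):
    """Assign IDs like '<transcript>.<type><n>' to (parent_id, feature_type) pairs.

    `children` is assumed to be pre-sorted; numbering follows input order.
    """
    counts = Counter()
    new_ids = []
    for parent_id, ftype in children:
        counts[(parent_id, ftype)] += 1
        # Replace anything outside [A-Za-z0-9_] so the type is safe inside an ID.
        safe_type = re.sub(r"[^A-Za-z0-9_]+", "_", ftype)
        new_ids.append(f"{parent_id}.{safe_type}{counts[(parent_id, ftype)]}")
    return new_ids


def gene_id_from_transcript(tx_id):
    """Strip a trailing '.tN' suffix to recover the gene ID."""
    return re.sub(r"\.t[0-9]+$", "", tx_id)


ids = number_children([
    ("sample_C01g00001.t1", "exon"),
    ("sample_C01g00001.t1", "CDS"),
    ("sample_C01g00001.t1", "exon"),
])
print(ids)
print(gene_id_from_transcript("sample_C01g00001.t1"))
```

Because the counter key includes the feature type, exons and CDS segments of the same transcript are numbered independently.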
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: gffkit
3
- Version: 0.3
3
+ Version: 0.4.0
4
4
  Summary: Region-aware GFF annotation integration toolkit
5
5
  Author: Qunjie Zhang
6
6
  License: MIT
@@ -26,11 +26,12 @@ License-File: LICENSE
26
26
  # gffkit
27
27
 
28
28
  `gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
29
- It combines three utilities:
29
+ It combines four utilities:
30
30
 
31
31
  1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
32
32
  2. `complement`: complement/merge annotations, with optional region-swap mode.
33
33
  3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
34
+ 4. `rename-sort`: rename gene/transcript/child IDs with a prefix and sort the final GFF3.
34
35
 
35
36
  ## Installation
36
37
 
@@ -48,20 +49,25 @@ gffkit integrate \
48
49
  --annotation-a EviAnn.gff3 \
49
50
  --annotation-b ANNEVO.gff3 \
50
51
  --outdir gffkit_out \
51
- --prefix sample
52
+ --prefix sample \
53
+ -t 8
52
54
  ```
53
55
 
54
56
  Outputs:
55
57
 
56
58
  - `gffkit_out/sample.suspicious.tsv`
57
59
  - `gffkit_out/sample.merged.gff3`
60
+ - `gffkit_out/sample.final.withUTR.gff3.pre_rename.gff3`
58
61
  - `gffkit_out/sample.final.withUTR.gff3`
62
+ - `gffkit_out/sample.final.withUTR.gff3.id_map.tsv`
63
+
64
+ In `integrate`, `--prefix sample` is also used for final ID renaming. Final gene IDs are written like `sample_C01g00001`, transcript IDs like `sample_C01g00001.t1`, and child IDs like `sample_C01g00001.t1.exon1`.
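The ID pattern described above can be reproduced with plain string formatting. The sketch below is an illustration only: `make_ids` is a hypothetical helper, and it assumes the chromosome number has already been extracted as an integer (how the tool derives `C01` from a seqid such as `chr1` is not shown here) and that the gene counter is zero-padded to `digits` places.

```python
def make_ids(prefix, chrom_number, gene_index, n_transcripts, digits=5):
    """Build a gene ID and its transcript IDs in the 'sample_C01g00001.t1' style."""
    # Two-digit chromosome number, then a zero-padded per-chromosome gene counter.
    gene_id = f"{prefix}_C{chrom_number:02d}g{gene_index:0{digits}d}"
    # Transcripts are numbered from 1 with a '.tN' suffix.
    tx_ids = [f"{gene_id}.t{i}" for i in range(1, n_transcripts + 1)]
    return gene_id, tx_ids


gene_id, tx_ids = make_ids("sample", chrom_number=1, gene_index=1, n_transcripts=2)
print(gene_id, tx_ids)
```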
59
65
 
60
66
  ### Step-by-step usage
61
67
 
62
68
  ```bash
63
69
  # 1. Detect suspicious merged genes in Annotation A
64
- gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv
70
+ gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv -t 8
65
71
 
66
72
  # 2. Use A as the global reference, but switch to B in suspicious regions
67
73
  gffkit complement \
@@ -69,10 +75,31 @@ gffkit complement \
69
75
  --add ANNEVO.gff3 \
70
76
  --swap_region_tsv suspicious.tsv \
71
77
  --swap_region_flank 100 \
72
- --output merged.gff3
78
+ --output merged.gff3 \
79
+ -t 8
73
80
 
74
81
  # 3. Add UTR features
75
- gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.gff3
82
+ gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.pre_rename.gff3
83
+
84
+ # 4. Rename IDs, drop unplaced seqids, and sort the final GFF3
85
+ gffkit rename-sort \
86
+ -i final.annotation.withUTR.pre_rename.gff3 \
87
+ -o final.annotation.withUTR.gff3 \
88
+ --prefix sample
89
+ ```
90
+
91
+ ### Merge three or more annotations
92
+
93
+ Use repeated `--add` arguments. Files are merged in the order provided.
94
+
95
+ ```bash
96
+ gffkit complement \
97
+ --ref EviAnn.gff3 \
98
+ --add ANNEVO.gff3 \
99
+ --add Helixer.gff3 \
100
+ --add PASA.gff3 \
101
+ --output merged.multi.gff3 \
102
+ -t 8
76
103
  ```
77
104
 
78
105
  ## Command overview
@@ -82,14 +109,50 @@ gffkit --help
82
109
  gffkit detect-bridge --help
83
110
  gffkit complement --help
84
111
  gffkit add-utr --help
112
+ gffkit rename-sort --help
85
113
  gffkit integrate --help
86
114
  ```
87
115
 
116
+ ## Threads
117
+
118
+ Versions 0.3 and later support `-t/--threads`.
119
+
120
+ - `detect-bridge` analyzes genes in parallel.
121
+ - `complement` pre-parses multiple `--add` files in parallel, then merges them in the original command-line order.
122
+ - `integrate` passes the thread count to the detect and complement steps.
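The "pre-parse in parallel, merge in command-line order" behavior described for `complement` can be sketched with the standard library. This is an assumption-laden illustration rather than the package's implementation: `parse_annotation` is a stand-in parser, `preparse_in_order` is a hypothetical name, and whether gffkit uses threads or processes internally is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_annotation(path):
    """Stand-in parser (hypothetical): a real one would read and index the file."""
    return {"path": path, "records": []}


def preparse_in_order(paths, threads=4):
    """Parse files concurrently while keeping results in the order given."""
    # Executor.map yields results in input order, regardless of completion order,
    # so a later sequential merge still sees files as listed on the command line.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(parse_annotation, paths))


parsed = preparse_in_order(["ANNEVO.gff3", "Helixer.gff3", "PASA.gff3"], threads=8)
print([p["path"] for p in parsed])
```

Keeping the merge itself sequential means the result does not depend on which file happens to finish parsing first.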
123
+
124
+ Example:
125
+
126
+ ```bash
127
+ gffkit integrate --annotation-a EviAnn.gff3 --annotation-b ANNEVO.gff3 -t 16
128
+ ```
129
+
88
130
  ## Annotation integration strategy
89
131
 
90
132
  - Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
91
133
  - Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
92
134
  - UTR features are reconstructed after merging using an exon-minus-CDS strategy.
135
+ - Versions 0.4.0 and later run `rename-sort` as the final `integrate` step. The final GFF3 keeps chromosome-mounted records, removes unplaced/scaffold/contig records, sorts features, rewrites `ID`/`Parent`, and writes an ID map next to the output.
136
+ - When multiple tools annotate the same gene locus, the GFF source column is combined with `|`, for example `EviAnn|ANNEVO`.
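One way to picture the combined source column is a small helper that joins tool names with `|`. This is a sketch under assumptions: `combine_sources` is a hypothetical function, and the de-duplication and first-seen ordering shown here are illustrative choices, not confirmed behavior of `complement`.

```python
def combine_sources(*sources):
    """Join GFF source names with '|', keeping first-seen order without repeats."""
    seen = []
    for source in sources:
        # A source may itself already be a combined value such as 'EviAnn|ANNEVO'.
        for name in source.split("|"):
            if name and name not in seen:
                seen.append(name)
    return "|".join(seen)


print(combine_sources("EviAnn", "ANNEVO"))
print(combine_sources("EviAnn|ANNEVO", "ANNEVO"))
```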
137
+
138
+ ## Rename and Sort
139
+
140
+ Run this step independently when you already have a merged GFF3:
141
+
142
+ ```bash
143
+ gffkit rename-sort \
144
+ -i merged.withUTR.gff3 \
145
+ -o sample.renamed.sorted.gff3 \
146
+ --prefix sample \
147
+ --digits 5 \
148
+ --keep-old-ids
149
+ ```
150
+
151
+ This writes `sample.renamed.sorted.gff3` and `sample.renamed.sorted.gff3.id_map.tsv`.
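The ID map can be used to translate old identifiers in downstream tables. The sketch below assumes the TSV starts with a header row containing `old_id` and `new_id` columns (these names match the row keys built in `rename_sort_gff3.py`, but the presence of a header line is an assumption); `load_id_map` is a hypothetical helper.

```python
import csv
import io


def load_id_map(handle):
    """Return {old_id: new_id} from a tab-separated ID map with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Features that had no ID in the input have an empty old_id; skip those.
    return {row["old_id"]: row["new_id"] for row in reader if row["old_id"]}


# In-memory stand-in for 'sample.renamed.sorted.gff3.id_map.tsv'.
example = io.StringIO(
    "feature_type\told_id\tnew_id\n"
    "gene\told_gene\tsample_C01g00001\n"
    "exon\t\tsample_C01g00001.t1.exon1\n"
)
mapping = load_id_map(example)
print(mapping)
```

For a real file, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.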
152
+
153
+ ## Maintainer notes
154
+
155
+ When command-line options or behavior change, update this `README.md` in the versioned package directory before building and uploading to PyPI.
93
156
 
94
157
  ## License
95
158
 
@@ -8,9 +8,11 @@ src/gffkit/add_utr.py
8
8
  src/gffkit/complement_annotations.py
9
9
  src/gffkit/detect_bridge_merged_genes.py
10
10
  src/gffkit/main.py
11
+ src/gffkit/rename_sort_gff3.py
11
12
  src/gffkit.egg-info/PKG-INFO
12
13
  src/gffkit.egg-info/SOURCES.txt
13
14
  src/gffkit.egg-info/dependency_links.txt
14
15
  src/gffkit.egg-info/entry_points.txt
15
16
  src/gffkit.egg-info/top_level.txt
16
- tests/test_complement_sources.py
17
+ tests/test_complement_sources.py
18
+ tests/test_rename_sort.py
@@ -3,3 +3,4 @@ gffkit = gffkit.main:main
3
3
  gffkit-add-utr = gffkit.add_utr:main
4
4
  gffkit-complement = gffkit.complement_annotations:main
5
5
  gffkit-detect-bridge = gffkit.detect_bridge_merged_genes:main
6
+ gffkit-rename-sort = gffkit.rename_sort_gff3:main
@@ -0,0 +1,47 @@
1
+ import sys
2
+
3
+ from gffkit import rename_sort_gff3
4
+
5
+
6
+ def test_rename_sort_uses_prefix_and_writes_id_map(tmp_path):
7
+ input_gff = tmp_path / "input.gff3"
8
+ output_gff = tmp_path / "sample.renamed.gff3"
9
+
10
+ input_gff.write_text(
11
+ "\n".join(
12
+ [
13
+ "##gff-version 3",
14
+ "chr1\tEviAnn|ANNEVO\tgene\t100\t300\t.\t+\t.\tID=old_gene",
15
+ "chr1\tEviAnn|ANNEVO\tmRNA\t100\t300\t.\t+\t.\tID=old_tx;Parent=old_gene",
16
+ "chr1\tEviAnn|ANNEVO\texon\t100\t300\t.\t+\t.\tParent=old_tx",
17
+ "scaffold01\tEviAnn\tgene\t1\t100\t.\t+\t.\tID=drop_me",
18
+ "",
19
+ ]
20
+ ),
21
+ encoding="utf-8",
22
+ )
23
+
24
+ old_argv = sys.argv[:]
25
+ sys.argv = [
26
+ "gffkit rename-sort",
27
+ "-i",
28
+ str(input_gff),
29
+ "-o",
30
+ str(output_gff),
31
+ "--prefix",
32
+ "sample",
33
+ ]
34
+ try:
35
+ rename_sort_gff3.main()
36
+ finally:
37
+ sys.argv = old_argv
38
+
39
+ text = output_gff.read_text(encoding="utf-8")
40
+ id_map = output_gff.with_name(output_gff.name + ".id_map.tsv")
41
+
42
+ assert "ID=sample_C01g00001" in text
43
+ assert "ID=sample_C01g00001.t1;Parent=sample_C01g00001" in text
44
+ assert "ID=sample_C01g00001.t1.exon1;Parent=sample_C01g00001.t1" in text
45
+ assert "drop_me" not in text
46
+ assert id_map.exists()
47
+ assert "old_gene\tsample_C01g00001" in id_map.read_text(encoding="utf-8")
gffkit-0.3/README.md DELETED
@@ -1,71 +0,0 @@
1
- # gffkit
2
-
3
- `gffkit` is a lightweight toolkit for region-aware GFF/GTF annotation integration.
4
- It combines three utilities:
5
-
6
- 1. `detect-bridge`: detect suspicious merged-gene artifacts caused by bridge transcripts.
7
- 2. `complement`: complement/merge annotations, with optional region-swap mode.
8
- 3. `add-utr`: reconstruct `five_prime_UTR` and `three_prime_UTR` features from exon/CDS coordinates.
9
-
10
- ## Installation
11
-
12
- ```bash
13
- pip install gffkit
14
- ```
15
-
16
-
17
- ## Quick start
18
-
19
- ### Full integration pipeline
20
-
21
- ```bash
22
- gffkit integrate \
23
- --annotation-a EviAnn.gff3 \
24
- --annotation-b ANNEVO.gff3 \
25
- --outdir gffkit_out \
26
- --prefix sample
27
- ```
28
-
29
- Outputs:
30
-
31
- - `gffkit_out/sample.suspicious.tsv`
32
- - `gffkit_out/sample.merged.gff3`
33
- - `gffkit_out/sample.final.withUTR.gff3`
34
-
35
- ### Step-by-step usage
36
-
37
- ```bash
38
- # 1. Detect suspicious merged genes in Annotation A
39
- gffkit detect-bridge -i EviAnn.gff3 -o suspicious.tsv
40
-
41
- # 2. Use A as the global reference, but switch to B in suspicious regions
42
- gffkit complement \
43
- --ref EviAnn.gff3 \
44
- --add ANNEVO.gff3 \
45
- --swap_region_tsv suspicious.tsv \
46
- --swap_region_flank 100 \
47
- --output merged.gff3
48
-
49
- # 3. Add UTR features
50
- gffkit add-utr -i merged.gff3 -o final.annotation.withUTR.gff3
51
- ```
52
-
53
- ## Command overview
54
-
55
- ```bash
56
- gffkit --help
57
- gffkit detect-bridge --help
58
- gffkit complement --help
59
- gffkit add-utr --help
60
- gffkit integrate --help
61
- ```
62
-
63
- ## Annotation integration strategy
64
-
65
- - Annotation A, for example EviAnn/RNA-seq-supported GFF, is used as the global primary reference.
66
- - Annotation B, for example ANNEVO/deep-learning GFF, is used as the local primary reference only in suspicious merged-gene regions.
67
- - UTR features are reconstructed after merging using an exon-minus-CDS strategy.
68
-
69
- ## License
70
-
71
- MIT License.
File without changes