PyPI - deeptaxa-rrna - Versions diffs - 1.0.0__tar.gz - Mend

deeptaxa-rrna 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (50)

deeptaxa_rrna-1.0.0/.gitattributes +1 -0
deeptaxa_rrna-1.0.0/.gitignore +182 -0
deeptaxa_rrna-1.0.0/LICENSE +21 -0
deeptaxa_rrna-1.0.0/PKG-INFO +365 -0
deeptaxa_rrna-1.0.0/README.md +324 -0
deeptaxa_rrna-1.0.0/conda-recipe/README.md +65 -0
deeptaxa_rrna-1.0.0/conda-recipe/meta.yaml +66 -0
deeptaxa_rrna-1.0.0/deeptaxa/__init__.py +35 -0
deeptaxa_rrna-1.0.0/deeptaxa/cli.py +491 -0
deeptaxa_rrna-1.0.0/deeptaxa/config.py +64 -0
deeptaxa_rrna-1.0.0/deeptaxa/dataset.py +320 -0
deeptaxa_rrna-1.0.0/deeptaxa/describe.py +212 -0
deeptaxa_rrna-1.0.0/deeptaxa/models/__init__.py +21 -0
deeptaxa_rrna-1.0.0/deeptaxa/models/bert.py +109 -0
deeptaxa_rrna-1.0.0/deeptaxa/models/cnn.py +170 -0
deeptaxa_rrna-1.0.0/deeptaxa/models/hybrid.py +166 -0
deeptaxa_rrna-1.0.0/deeptaxa/models/losses.py +69 -0
deeptaxa_rrna-1.0.0/deeptaxa/predict.py +669 -0
deeptaxa_rrna-1.0.0/deeptaxa/train.py +1133 -0
deeptaxa_rrna-1.0.0/deeptaxa/tune.py +247 -0
deeptaxa_rrna-1.0.0/deeptaxa/utils.py +85 -0
deeptaxa_rrna-1.0.0/deeptaxa_rrna.egg-info/PKG-INFO +365 -0
deeptaxa_rrna-1.0.0/deeptaxa_rrna.egg-info/SOURCES.txt +48 -0
deeptaxa_rrna-1.0.0/deeptaxa_rrna.egg-info/dependency_links.txt +1 -0
deeptaxa_rrna-1.0.0/deeptaxa_rrna.egg-info/entry_points.txt +2 -0
deeptaxa_rrna-1.0.0/deeptaxa_rrna.egg-info/requires.txt +9 -0
deeptaxa_rrna-1.0.0/deeptaxa_rrna.egg-info/top_level.txt +1 -0
deeptaxa_rrna-1.0.0/pyproject.toml +41 -0
deeptaxa_rrna-1.0.0/scripts/calibration_diagnosis.sh +60 -0
deeptaxa_rrna-1.0.0/scripts/calibration_sweep.sh +62 -0
deeptaxa_rrna-1.0.0/scripts/deeptaxa_workflow.sh +194 -0
deeptaxa_rrna-1.0.0/scripts/run_ablation.sh +120 -0
deeptaxa_rrna-1.0.0/scripts/run_amplicon_eval.sh +96 -0
deeptaxa_rrna-1.0.0/scripts/run_experiment.sh +641 -0
deeptaxa_rrna-1.0.0/scripts/run_similarity_eval.sh +113 -0
deeptaxa_rrna-1.0.0/scripts/sequence_similarity.py +218 -0
deeptaxa_rrna-1.0.0/scripts/similarity_curve.py +313 -0
deeptaxa_rrna-1.0.0/scripts/simulate_amplicons.py +267 -0
deeptaxa_rrna-1.0.0/setup.cfg +4 -0
deeptaxa_rrna-1.0.0/tutorials/.gitignore +6 -0
deeptaxa_rrna-1.0.0/tutorials/Makefile +19 -0
deeptaxa_rrna-1.0.0/tutorials/_quarto.yml +47 -0
deeptaxa_rrna-1.0.0/tutorials/analysis.qmd +930 -0
deeptaxa_rrna-1.0.0/tutorials/architecture.qmd +434 -0
deeptaxa_rrna-1.0.0/tutorials/custom.css +17 -0
deeptaxa_rrna-1.0.0/tutorials/index.qmd +37 -0
deeptaxa_rrna-1.0.0/tutorials/prediction.qmd +376 -0
deeptaxa_rrna-1.0.0/tutorials/references.bib +66 -0
deeptaxa_rrna-1.0.0/tutorials/render_tutorials.sh +98 -0
deeptaxa_rrna-1.0.0/tutorials/training.qmd +411 -0

deeptaxa_rrna-1.0.0/.gitattributes ADDED

@@ -0,0 +1 @@
+*.pt filter=lfs diff=lfs merge=lfs -text

deeptaxa_rrna-1.0.0/.gitignore ADDED

@@ -0,0 +1,182 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+# UV
+#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#uv.lock
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+# Ruff stuff:
+.ruff_cache/
+# PyPI configuration file
+.pypirc
+# DeepTaxa
+*.pt
+*.pth
+*.pkl
+.cache/huggingface/
+quarto-*.deb
+quarto-*.deb

deeptaxa_rrna-1.0.0/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2025 Systems Genomics Lab
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

deeptaxa_rrna-1.0.0/PKG-INFO ADDED

@@ -0,0 +1,365 @@
+Metadata-Version: 2.4
+Name: deeptaxa-rrna
+Version: 1.0.0
+Summary: A deep learning framework for hierarchical taxonomy classification of 16S rRNA gene sequences.
+Author-email: Khlood Ramadan <khlood.ramadan@aucegypt.edu>, Lobna Ghonaim <lobnaghonaim@aucegypt.edu>, Rana Salah <rana_salah@aucegypt.edu>, Ahmed Moustafa <amoustafa@aucegypt.edu>
+License: MIT License
+        Copyright (c) 2025 Systems Genomics Lab
+        Permission is hereby granted, free of charge, to any person obtaining a copy
+        of this software and associated documentation files (the "Software"), to deal
+        in the Software without restriction, including without limitation the rights
+        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+        copies of the Software, and to permit persons to whom the Software is
+        furnished to do so, subject to the following conditions:
+        The above copyright notice and this permission notice shall be included in all
+        copies or substantial portions of the Software.
+        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+        SOFTWARE.
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: torch
+Requires-Dist: numpy
+Requires-Dist: transformers
+Requires-Dist: pandas
+Requires-Dist: tqdm
+Requires-Dist: scikit-learn
+Requires-Dist: biopython
+Requires-Dist: h5py
+Requires-Dist: optuna
+Dynamic: license-file
+# DeepTaxa
+[![License](https://img.shields.io/github/license/systems-genomics-lab/deeptaxa)](LICENSE)
+[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https://huggingface.co/systems-genomics-lab/deeptaxa)
+[![Tutorials](https://img.shields.io/badge/Tutorials-GitHub%20Pages-green)](https://systems-genomics-lab.github.io/deeptaxa/)
+[![Last Commit](https://img.shields.io/github/last-commit/systems-genomics-lab/deeptaxa)](https://github.com/systems-genomics-lab/deeptaxa/commits/main)
+[![Issues](https://img.shields.io/github/issues/systems-genomics-lab/deeptaxa)](https://github.com/systems-genomics-lab/deeptaxa/issues)
+[![GitHub Stars](https://img.shields.io/github/stars/systems-genomics-lab/deeptaxa?style=social)](https://github.com/systems-genomics-lab/deeptaxa/stargazers)
+**DeepTaxa** is a deep learning framework for hierarchical taxonomic classification of 16S rRNA gene sequences. It classifies sequences into all seven taxonomic ranks (Domain through Species) in a single forward pass, achieving 92.96% species-level accuracy (3-seed mean) on the Greengenes2 2024.09 test set.
+---
+## Table of Contents
+1. [Performance](#performance)
+2. [Installation](#installation)
+3. [Quick Start](#quick-start)
+4. [Data and Pre-Trained Models](#data-and-pre-trained-models)
+5. [Training](#training)
+6. [Experimentation](#experimentation)
+7. [Scripts](#scripts)
+8. [Tutorials](#tutorials)
+9. [License](#license)
+10. [Citation](#citation)
+11. [Contact](#contact)
+12. [Acknowledgements](#acknowledgements)
+---
+## Performance
+The published HybridCNNBERT checkpoint achieves the following on 69,335 held-out test sequences from Greengenes2 2024.09 (3-seed mean across seeds 42, 123, 456):
+| Rank | Accuracy | F1 | ECE |
+|------|----------|-----|-----|
+| Domain | 99.98% | 99.98% | 0.0001 |
+| Phylum | 99.69% | 99.68% | 0.0023 |
+| Class | 99.63% | 99.59% | 0.0024 |
+| Order | 99.07% | 98.97% | 0.0056 |
+| Family | 98.61% | 98.41% | 0.0075 |
+| Genus | 96.90% | 96.48% | 0.0144 |
+| Species | 92.96% | 92.12% | 0.0242 |
+Cross-seed standard deviation is at most 0.0008 F1 at every rank (species std 0.0008 F1 / 0.07 percentage points accuracy), demonstrating high reproducibility.
+### Architecture
+| Component | Configuration |
+|-----------|--------------|
+| CNN | embed_dim=896, 256 filters, kernels [3, 5, 7], 1 conv layer |
+| BERT | 4 layers, 7 heads, hidden=896, FFN=3584, GELU, random init |
+| Fusion | Learnable alpha/beta weights + BERT residual connection |
+| Training | Cross-entropy loss, LR=5e-4, batch=64, dropout=0.20, 10 epochs |
+Three architectures are available:
+- **HybridCNNBERTClassifier** (default): Fuses CNN local motif features with BERT global context. Used for the published checkpoints.
+- **CNNClassifier**: Multi-kernel convolutional network only. Faster training, slightly lower species accuracy.
+- **BERTClassifier**: Transformer encoder only. On its own, a from-scratch transformer underperforms substantially at the species rank; provided mainly for ablation.
+### Pre-Trained Checkpoints
+Two checkpoints are hosted on [Hugging Face](https://huggingface.co/systems-genomics-lab/deeptaxa):
+| Checkpoint | Training data | Species accuracy | Parameters |
+|-----------|--------------|-----------------|------------|
+| `deeptaxa-full-length-v1.pt` | Full-length 16S (277,336 sequences, ~1,500 bp) | 92.96% (3-seed mean) | 76.4 M |
+| `deeptaxa-v3v4-v1.pt` | In-silico V3-V4 amplicons (~420 bp, 273,003 amplicons) | 87.55% (seed 42) | 75.8 M |
+Both checkpoints share the same compact architecture (the small parameter difference reflects smaller per-rank classifier heads on the V3-V4 model, which has a smaller species vocabulary: 8,347 vs 16,909). A `config.json` with full model metadata is also available.
+---
+## Installation
+DeepTaxa requires Python 3.10 or later. We recommend using a Conda environment:
+```bash
+git clone https://github.com/systems-genomics-lab/deeptaxa.git
+cd deeptaxa
+conda create --name deeptaxa_env python=3.10 -y
+conda activate deeptaxa_env
+pip install .
+deeptaxa --version
+```
+Dependencies (torch, transformers, pandas, numpy, scikit-learn, h5py, etc.) are specified in [`pyproject.toml`](pyproject.toml) and installed automatically.
+> **Note**: For GPU support, install a CUDA-compatible PyTorch build before running `pip install .`. See the [PyTorch installation guide](https://pytorch.org/get-started/locally/).
+---
+## Quick Start
+**Predict** with the pre-trained model (no training data needed):
+```bash
+# Download the checkpoint
+mkdir -p ../deeptaxa-data/models
+wget -P ../deeptaxa-data/models \
+  https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-full-length-v1.pt
+# Classify sequences
+deeptaxa predict \
+  --fasta-file your_sequences.fna \
+  --checkpoint ../deeptaxa-data/models/deeptaxa-full-length-v1.pt \
+  --output-dir ../deeptaxa-outputs/predictions
+```
+**Evaluate** against known labels (adds per-rank accuracy, F1, ECE to the output):
+```bash
+deeptaxa predict \
+  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_testing.fna.gz \
+  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_testing.tsv.gz \
+  --checkpoint ../deeptaxa-data/models/deeptaxa-full-length-v1.pt \
+  --output-dir ../deeptaxa-outputs/evaluation
+```
+**Inspect** a checkpoint:
+```bash
+deeptaxa describe \
+  --checkpoint ../deeptaxa-data/models/deeptaxa-full-length-v1.pt
+```
+> **Tip**: Run `deeptaxa train --help` or `deeptaxa predict --help` for a full list of options.
+---
+## Data and Pre-Trained Models
+Datasets and checkpoints are hosted on [Hugging Face](https://huggingface.co/systems-genomics-lab/deeptaxa). Store them in a sibling directory outside the codebase:
+```
+working_directory/
+├── deeptaxa/              # This repository
+├── deeptaxa-data/         # Datasets and checkpoints
+│   ├── greengenes/
+│   │   ├── gg_2024_09_training.fna.gz    (277,336 sequences, ~96 MB)
+│   │   ├── gg_2024_09_training.tsv.gz    (taxonomy labels, ~2.6 MB)
+│   │   ├── gg_2024_09_testing.fna.gz     (69,335 sequences, ~24 MB)
+│   │   └── gg_2024_09_testing.tsv.gz     (taxonomy labels, ~0.8 MB)
+│   └── models/
+│       ├── deeptaxa-full-length-v1.pt
+│       └── deeptaxa-v3v4-v1.pt
+└── deeptaxa-outputs/      # Training and prediction outputs
+```
+DeepTaxa uses the [Greengenes2](https://greengenes2.ucsd.edu/) database (2024.09 release), reformatted and hosted on [Hugging Face](https://huggingface.co/datasets/systems-genomics-lab/greengenes).
+### Download
+```bash
+# Dataset
+mkdir -p deeptaxa-data/greengenes && cd deeptaxa-data/greengenes
+for f in gg_2024_09_training.fna.gz gg_2024_09_training.tsv.gz \
+         gg_2024_09_testing.fna.gz gg_2024_09_testing.tsv.gz; do
+  wget https://huggingface.co/datasets/systems-genomics-lab/greengenes/resolve/main/$f
+done
+# Checkpoints
+mkdir -p ../models && cd ../models
+wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-full-length-v1.pt
+wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-v3v4-v1.pt
+wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/config.json
+```
+> **Note**: Checkpoint files use PyTorch's `pickle`-based serialization. Download them only from the official Hugging Face repository.
+---
+## Training
+All architecture hyperparameters default to the published (compact) configuration, so a minimal training command reproduces the published checkpoint:
+```bash
+deeptaxa train \
+  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
+  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
+  --model-type hybridcnnbert \
+  --output-dir ../deeptaxa-outputs/
+```
+Training takes approximately 1 h 20 m on an NVIDIA RTX 4090 (or 2 h 35 m on an NVIDIA A40) for 10 epochs.
+### Output
+Each training run produces:
+- `checkpoints/deeptaxa_<uuid>_epoch<N>.pt`: Model weights, optimizer state, scheduler state, and label encoders for each epoch.
+- `metrics/deeptaxa_<uuid>_epoch<N>.json`: Per-epoch validation loss, accuracy, F1, precision, and recall at each rank.
+- `deeptaxa_uuid.txt`: The unique run identifier.
+### Early Stopping
+To stop training when validation loss plateaus:
+```bash
+deeptaxa train \
+  --fasta-file ../deeptaxa-data/greengenes/gg_2024_09_training.fna.gz \
+  --taxonomy-file ../deeptaxa-data/greengenes/gg_2024_09_training.tsv.gz \
+  --model-type hybridcnnbert \
+  --epochs 20 \
+  --early-stopping-patience 3 \
+  --output-dir ../deeptaxa-outputs/
+```
+Setting `--early-stopping-patience 0` (the default) disables early stopping.
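Reviewer note (not part of the package diff): the README says only that training stops "when validation loss plateaus" and that a patience of 0 disables the check. The sketch below shows one common reading of a patience counter. The function name `stopping_epoch` and the strict-improvement criterion are assumptions for illustration; the actual rule in `deeptaxa/train.py` is not shown in this section and may differ (for example, it could use a minimum delta).

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Hypothetical helper: counts consecutive epochs without a strictly
    lower validation loss and stops once that count reaches `patience`.
    A patience of 0 (or less) disables early stopping.
    """
    if patience <= 0:
        return len(val_losses)

    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses, not results from the package.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
stop_with_patience = stopping_epoch(losses, patience=3)
stop_disabled = stopping_epoch(losses, patience=0)
```

With these made-up losses, a patience of 3 stops at epoch 6 and never sees the later improvement at epoch 7, which is the usual trade-off when choosing a small patience.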
+---
+## Experimentation
+The default configuration uses DNABERT-2 tokenization, cross-entropy loss, and uniform rank weighting. Each choice can be varied independently for ablation studies.
+### Encoding comparison
+```bash
+# Default: DNABERT-2 BPE tokenization
+deeptaxa train --model-type cnn --encoding dnabert ...
+# Ablation: one-hot nucleotide encoding (4-channel, no pretrained tokenizer)
+deeptaxa train --model-type cnn --encoding onehot ...
+```
+### Loss function comparison
+```bash
+# Default: cross-entropy
+deeptaxa train --model-type hybridcnnbert --loss-type cross_entropy ...
+# Ablation: focal loss (gamma=2.0)
+deeptaxa train --model-type hybridcnnbert --loss-type focal --focal-gamma 2.0 ...
+```
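Reviewer note (not part of the package diff): for readers comparing the two `--loss-type` options, the sketch below works through the standard focal loss definition, FL = -(1 - p_t)^gamma * log(p_t), for a single sample in plain Python. It is an illustration of the textbook formula only. Whether `deeptaxa/models/losses.py` adds class weighting, label smoothing, or a different reduction is not visible in this section, and the helper names are hypothetical.

```python
import math


def log_softmax_at(logits, target):
    """Log-probability of class `target` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = math.log(sum(math.exp(x - m) for x in logits))
    return (logits[target] - m) - log_norm


def focal_loss(logits, target, gamma=2.0):
    """Per-sample focal loss: -(1 - p_t)^gamma * log(p_t)."""
    log_p_t = log_softmax_at(logits, target)
    p_t = math.exp(log_p_t)
    return -((1.0 - p_t) ** gamma) * log_p_t


def cross_entropy(logits, target):
    """Cross-entropy is the gamma = 0 special case of focal loss."""
    return focal_loss(logits, target, gamma=0.0)


# A confidently correct sample is down-weighted far more than an uncertain one.
easy = [4.0, 0.0, 0.0]   # high probability on the target class 0
hard = [0.0, 0.0, 0.0]   # uniform probabilities
easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

The ratio of focal loss to cross-entropy equals (1 - p_t)^gamma, so with gamma = 2.0 the well-classified sample contributes a much smaller share of the loss than the uncertain one.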
+### Architecture comparison
+Train CNN-only, BERT-only, or the hybrid under the same data and hyperparameters using `--model-type cnn`, `--model-type bert`, or `--model-type hybridcnnbert`.
+### Calibration
+When `--taxonomy-file` is provided at prediction time, DeepTaxa computes Expected Calibration Error (ECE) alongside accuracy, F1, precision, recall, and AUC. ECE measures the gap between predicted confidence and observed accuracy across 10 equal-width bins. All metrics are saved to `metrics.json`.
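Reviewer note (not part of the package diff): the README describes ECE as the gap between predicted confidence and observed accuracy across 10 equal-width bins. A minimal sketch of that computation for one rank follows. The function name and the bin-edge convention (half-open bins, with a confidence of exactly 1.0 placed in the last bin) are assumptions; the implementation in `deeptaxa/predict.py` is not shown here and may handle edges differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: top-class probabilities in [0, 1], one per sequence.
    correct:     booleans, True where the predicted label was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; clamp so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += 1 if hit else 0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy inputs for illustration only, not data from the package.
toy_ece = expected_calibration_error(
    [0.95, 0.95, 0.55, 0.55], [True, True, True, False]
)
```

In the toy input each occupied bin has a 0.05 gap between mean confidence and accuracy, and each holds half of the samples, so the weighted gap is 0.05.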
+---
+## Scripts
+The `scripts/` directory contains reusable tools for common workflows:
+| Script | Purpose |
+|--------|---------|
+| `deeptaxa_workflow.sh` | End-to-end workflow: train, resume, describe, predict |
+| `run_experiment.sh` | Central experiment runner with logging and timing |
+| `run_ablation.sh` | Ablation study: architecture, encoding, and loss variants |
+| `run_amplicon_eval.sh` | Simulated amplicon evaluation (V3-V4, V4) |
+| `run_similarity_eval.sh` | Similarity-stratified evaluation using vsearch |
+| `calibration_diagnosis.sh` | A/B comparison of temperature configurations |
+| `calibration_sweep.sh` | Multi-configuration temperature sweep |
+| `simulate_amplicons.py` | Extract amplicon regions via in-silico PCR |
+| `sequence_similarity.py` | Compute train-test nearest-neighbor identity |
+---
+## Tutorials
+Interactive tutorials with executable code are published at [systems-genomics-lab.github.io/deeptaxa](https://systems-genomics-lab.github.io/deeptaxa/):
+- [Prediction](https://systems-genomics-lab.github.io/deeptaxa/prediction.html): Classify sequences with the pre-trained model
+- [Training](https://systems-genomics-lab.github.io/deeptaxa/training.html): Train from scratch on Greengenes2
+- [Analysis](https://systems-genomics-lab.github.io/deeptaxa/analysis.html): Evaluate performance, calibration, and error patterns
+- [Architecture](https://systems-genomics-lab.github.io/deeptaxa/architecture.html): Model internals and extensibility
+---
+## License
+- **Code and models**: [MIT License](LICENSE)
+- **Greengenes dataset**: [Modified BSD License](https://huggingface.co/datasets/systems-genomics-lab/greengenes)
+---
+## Citation
+If DeepTaxa contributes to your research, please cite our paper in *Bioinformatics Advances*: [https://doi.org/10.1093/bioadv/vbag166](https://doi.org/10.1093/bioadv/vbag166)
+```bibtex
+@article{salah2026deeptaxa,
+  title={{DeepTaxa}: A Hybrid {CNN}-{BERT} Framework for {16S} {rRNA} Taxonomic Classification},
+  author={Salah, Rana and AbdElaal, Khlood R. and Ghonaim, Lobna and Awe, Olaitan I. and Moustafa, Ahmed},
+  journal={Bioinformatics Advances},
+  year={2026},
+  doi={10.1093/bioadv/vbag166},
+  publisher={Oxford University Press}
+}
+```
+For the Greengenes dataset:
+```bibtex
+@article{mcdonald2024greengenes,
+  title={Greengenes2 unifies microbial data in a single reference tree},
+  author={McDonald, Daniel and Jiang, Yueyu and Balaban, Metin and others},
+  journal={Nature Biotechnology},
+  volume={42},
+  pages={715--718},
+  year={2024},
+  doi={10.1038/s41587-023-01845-1}
+}
+```
+---
+## Contact
+To report bugs, suggest features, or contribute code, open an issue on [GitHub](https://github.com/systems-genomics-lab/deeptaxa/issues).
+---
+## Acknowledgements
+- **[Ahmed A. El Hosseiny](https://github.com/ahmedelhosseiny)** and the High-Performance Computing Team of the [School of Sciences and Engineering](https://sse.aucegypt.edu/) at the [American University in Cairo](https://www.aucegypt.edu/) for GPU access that enabled this work.
+- **[Hugging Face](https://huggingface.co/)** for hosting datasets and models.