PyPI - RiboParser - Versions diffs - 0.2.2__tar.gz → 0.2.5__tar.gz - Mend

RiboParser 0.2.2__tar.gz → 0.2.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (163)

{riboparser-0.2.2 → riboparser-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: RiboParser
-Version: 0.2.2
+Version: 0.2.5
 Summary: A pipeline for ribosome profiling data analysis
 Author-email: Ren Shuchao <rensc0718@163.com>
 License-Expression: GPL-3.0-or-later
@@ -2433,8 +2433,276 @@ RIBO_AAA.png # Metaplot of AAA codon
 ```
-## 6. Other toolkits
-### 6.1 Data shuffling
+## 6. smORF identification
+### 6.1 Transcriptome-wide ORF scanning
+1. Explanation of `smorf_scanner`
+Scan all transcript sequences from scratch for potential ORFs, reporting both known and novel ORFs.
+```bash
+$ smorf_scanner -h
+usage: smorf_scanner
+[-h] -g GENOME -a ANNOTATION [-o OUT_PREFIX] [--orf-prefix ORF_PREFIX] [--start-codons START_CODONS]
+[--min-aa MIN_AA] [--max-aa MAX_AA] [--scan-strand {sense,antisense,both}] [--kozak-up KOZAK_UP]
+[--kozak-down KOZAK_DOWN] [-t THREADS] [--mark-overlap] [--remove-discarded] [--include-stop]
+Scan transcript-centric smORFs from genome FASTA and genePred annotation.
+options:
+  -h, --help            show this help message and exit
+  -g, --genome GENOME   Input genome FASTA file.
+  -a, --annotation ANNOTATION
+                        Input genePred annotation file.
+  -o, --out-prefix OUT_PREFIX
+                        Output prefix.
+  --orf-prefix ORF_PREFIX
+                        Prefix for ORF IDs.
+  --start-codons START_CODONS
+                        Comma-separated start codons, such as ATG,CTG,GTG,TTG.
+  --min-aa MIN_AA       Minimum ORF length in amino acids.
+  --max-aa MAX_AA       Maximum ORF length in amino acids.
+  --scan-strand {sense,antisense,both}
+                        Scan sense, antisense, or both strands.
+  --kozak-up KOZAK_UP   Number of upstream nucleotides for Kozak sequence.
+  --kozak-down KOZAK_DOWN
+                        Number of downstream nucleotides after start codon for Kozak sequence.
+  -t, --threads THREADS
+                        Number of worker processes for parallel ORF scanning.
+  --mark-overlap        Mark nested or overlapping ORFs.
+  --remove-discarded    Remove same-frame internal ORFs.
+  --include-stop        Keep stop codon symbol in peptide sequence.
+# example
+smorf_scanner \
+  --genome ../genome/GCF_mine_genomic.fna \
+  --annotation ../norm/mine.genepred \
+  --out-prefix mine \
+  --start-codons ATG \
+  --min-aa 8 \
+  --max-aa 10000 \
+  --scan-strand both \
+  --kozak-up 6 \
+  --kozak-down 6 \
+  --mark-overlap \
+  --threads 20
+```
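As a rough illustration of what transcript-centric ORF scanning involves, the Python sketch below finds complete ORFs on the sense strand of a single transcript sequence. It is not RiboParser's implementation: `scan_orfs` and its return format are hypothetical, the stop codon set assumes the standard genetic code, and only the ideas behind `--start-codons`, `--min-aa`, and `--max-aa` from the help text above are mirrored.

```python
# Illustrative sketch only; not the code used by smorf_scanner.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def scan_orfs(seq, start_codons=("ATG",), min_aa=8, max_aa=10000):
    """Return (start, end, aa_length) for complete sense-strand ORFs.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in start_codons:
            continue
        # Walk codon by codon in the same frame until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                aa_len = (pos - start) // 3  # residues, stop codon excluded
                if min_aa <= aa_len <= max_aa:
                    orfs.append((start, pos + 3, aa_len))
                break
    return orfs


if __name__ == "__main__":
    transcript = "CC" + "ATG" + "GCT" * 8 + "TAA" + "GG"
    print(scan_orfs(transcript))
```

Because every in-frame start codon is reported, nested same-frame ORFs that share a stop codon appear as separate entries in this sketch; ORFs without an in-frame stop codon are skipped.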
+2. Use `smorf_filter` to filter scanned ORFs
+The transcriptome-wide scan relies only on open reading frame positions, so it yields many false positives that need to be filtered out.
+This step can also be used to select specific smORF categories for study.
+```bash
+$ smorf_filter -h
+usage: smorf_filter
+[-h] -i INPUT [-o OUT_PREFIX] [--keep-start-codons KEEP_START_CODONS] [--min-aa MIN_AA] [--max-aa MAX_AA]
+[--keep-categories KEEP_CATEGORIES] [--remove-categories REMOVE_CATEGORIES] [--keep-antisense] [--keep-secondary]
+[--keep-partial] [--kozak-mode {none,annotated,builtin,pwm,sequence}]
+[--builtin-kozak {arabidopsis,drosophila,maize,plant,rice,terrestrial_plant,vertebrate,yeast}]
+[--kozak-pwm KOZAK_PWM] [--kozak-seq KOZAK_SEQ] [--annotated-categories ANNOTATED_CATEGORIES]
+[--min-annotated-kozak MIN_ANNOTATED_KOZAK]
+[--fallback-builtin-kozak {arabidopsis,drosophila,maize,plant,rice,terrestrial_plant,vertebrate,yeast}]
+[--no-kozak-fallback] [--min-kozak-pwm-score MIN_KOZAK_PWM_SCORE] [--export-kozak-pwm EXPORT_KOZAK_PWM]
+[--list-builtin-kozak]
+Filter ORF.message.txt using rule-based criteria and Kozak PWM scoring.
+options:
+  -h, --help            show this help message and exit
+  -i, --input INPUT     Input ORF.message.txt file.
+  -o, --out-prefix OUT_PREFIX
+                        Output prefix. (default: ORF.filtered).
+  --keep-start-codons KEEP_START_CODONS
+                        Comma-separated start codons to keep. (default: ATG,CTG,GTG,TTG).
+  --min-aa MIN_AA       Minimum ORF peptide length. (default: 8).
+  --max-aa MAX_AA       Maximum ORF peptide length. (default: 10000).
+  --keep-categories KEEP_CATEGORIES
+                        Comma-separated ORF categories to keep. (default:
+                        uORF,dORF,lncORF,iORF,emORF,overlap_uORF,overlap_dORF,other_ORF,annotated_ORF).
+  --remove-categories REMOVE_CATEGORIES
+                        Comma-separated ORF categories to remove. (default: same_frame_iORF,antisense_ORF).
+  --keep-antisense      Keep antisense ORFs. (default: False).
+  --keep-secondary      Keep secondary ORFs. (default: False).
+  --keep-partial        Keep incomplete ORFs. (default: False).
+  --kozak-mode {none,annotated,builtin,pwm,sequence}
+                        Kozak PWM mode. (default: annotated).
+  --builtin-kozak {arabidopsis,drosophila,maize,plant,rice,terrestrial_plant,vertebrate,yeast}
+                        Built-in Kozak PWM name. (default: plant).
+  --kozak-pwm KOZAK_PWM
+                        Custom Kozak PWM matrix file. (default: None).
+  --kozak-seq KOZAK_SEQ
+                        Aligned Kozak sequence file used to build PWM. (default: None).
+  --annotated-categories ANNOTATED_CATEGORIES
+                        Categories used to build annotated ORF Kozak PWM. (default: annotated_ORF).
+  --min-annotated-kozak MIN_ANNOTATED_KOZAK
+                        Minimum annotated ORF Kozak sequences required to build PWM. (default: 100).
+  --fallback-builtin-kozak {arabidopsis,drosophila,maize,plant,rice,terrestrial_plant,vertebrate,yeast}
+                        Fallback built-in Kozak PWM if annotated mode fails. (default: plant).
+  --no-kozak-fallback   Do not fallback to built-in PWM if annotated PWM construction fails. (default: False).
+  --min-kozak-pwm-score MIN_KOZAK_PWM_SCORE
+                        Minimum normalized Kozak PWM score. Range: 0-1. (default: 0.0).
+  --export-kozak-pwm EXPORT_KOZAK_PWM
+                        Export loaded or constructed Kozak PWM. (default: None).
+  --list-builtin-kozak  List built-in Kozak PWM models and exit.
+# example
+smorf_filter \
+ -i mine.message.txt \
+ -o mine.reliable \
+ --min-aa 8 \
+ --max-aa 10000 \
+ --kozak-mode annotated \
+ --keep-categories uORF,dORF,lncORF,overlap_uORF,overlap_dORF
+```
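The help text above mentions a normalized Kozak PWM score in the range 0-1. The Python sketch below shows one common way such a score can be computed: a log-odds position weight matrix built from aligned sequences, followed by min-max normalization. This is an assumption for illustration; the function names, the uniform background, the pseudocount, and the normalization are not taken from RiboParser and may differ from what `smorf_filter` actually does.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds PWM (uniform background) from equal-length sequences."""
    length = len(aligned_seqs[0])
    pwm = []
    for i in range(length):
        counts = {base: pseudocount for base in BASES}
        for seq in aligned_seqs:
            base = seq[i].upper()
            if base in counts:  # ignore N and other ambiguity codes
                counts[base] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return pwm


def normalized_score(pwm, seq):
    """Min-max normalize the summed log-odds score of `seq` to the range 0-1."""
    # Unknown bases receive the column minimum (a conservative assumption).
    raw = sum(col.get(base, min(col.values()))
              for col, base in zip(pwm, seq.upper()))
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    if highest == lowest:
        return 0.0
    return (raw - lowest) / (highest - lowest)


if __name__ == "__main__":
    # Toy aligned contexts, not real annotated Kozak sequences.
    pwm = build_pwm(["ACCATGG", "ACCATGG", "GCCATGA"])
    print(round(normalized_score(pwm, "ACCATGG"), 3))
```

With this kind of normalization, a threshold such as `--min-kozak-pwm-score 0.0` keeps every ORF, and raising it toward 1 keeps only contexts close to the consensus.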
+3. Check smORF translation with `smorf_evidence`
+In general, stably expressed smORFs can be sensitively captured by Ribo-seq, so Ribo-seq data can be used to validate smORFs and identify reliable ones. Multiple datasets are supported to detect smORFs that are consistently present over different time points.
+```bash
+$ smorf_evidence -h
+usage: smorf_evidence
+[-h] -i ORF_TABLE -o OUTPUT [--genepred GENEPRED] [--chrom-sizes CHROM_SIZES] [--density-list DENSITY_LIST]
+[--density-plus DENSITY_PLUS] [--density-minus DENSITY_MINUS] [--density DENSITY] [--sample SAMPLE]
+[--density-format {auto,wig,bedgraph}] [--coord-mode {0based-half-open,1based-closed}]
+[--post-stop-codons POST_STOP_CODONS] [--pseudocount PSEUDOCOUNT] [--min-rpf-sum MIN_RPF_SUM]
+[--min-covered-codon MIN_COVERED_CODON] [--min-coverage-ratio MIN_COVERAGE_RATIO]
+[--strong-periodicity STRONG_PERIODICITY] [--moderate-periodicity MODERATE_PERIODICITY]
+[--strong-start-pause STRONG_START_PAUSE] [--moderate-start-pause MODERATE_START_PAUSE]
+[--strong-stop-pause STRONG_STOP_PAUSE] [--moderate-stop-pause MODERATE_STOP_PAUSE]
+[--strong-release STRONG_RELEASE] [--moderate-release MODERATE_RELEASE]
+[--uniform-coverage-ratio UNIFORM_COVERAGE_RATIO] [--uniform-gini UNIFORM_GINI]
+[--uniform-max-to-mean UNIFORM_MAX_TO_MEAN] [--skewed-max-to-mean SKEWED_MAX_TO_MEAN]
+[--skewed-top-fraction SKEWED_TOP_FRACTION] [--disperse-coverage-ratio DISPERSE_COVERAGE_RATIO]
+[--progress-every PROGRESS_EVERY] [--keep-no-evidence]
+Evaluate smORF translation evidence using Ribo-seq P-site density.
+options:
+  -h, --help            show this help message and exit
+  -i, --orf-table ORF_TABLE
+                        Filtered smORF table from smorf_filter.
+  -o, --output OUTPUT   Output evidence table in TSV format.
+  --genepred GENEPRED   Optional genePred file for ORF exon blocks. (default: None)
+  --chrom-sizes CHROM_SIZES
+                        Optional two-column chromosome size file. If not provided, chromosome sizes will be inferred from density files.
+                        (default: None)
+  --density-list DENSITY_LIST
+                        TSV with columns: sample, strand, path, optional format. (default: None)
+  --density-plus DENSITY_PLUS
+                        Plus-strand P-site density file. (default: None)
+  --density-minus DENSITY_MINUS
+                        Minus-strand P-site density file. (default: None)
+  --density DENSITY     Unstranded P-site density file. (default: None)
+  --sample SAMPLE       Sample name for direct density input. (default: sample1)
+  --density-format {auto,wig,bedgraph}
+                        Density file format. (default: auto)
+  --coord-mode {0based-half-open,1based-closed}
+                        Coordinate mode for ORF table and genePred-like blocks. (default: 0based-half-open)
+  --post-stop-codons POST_STOP_CODONS
+                        Number of codons after stop codon used for release signal. (default: 10)
+  --pseudocount PSEUDOCOUNT
+                        Pseudocount for ratio calculation. (default: 0.1)
+  --min-rpf-sum MIN_RPF_SUM
+                        Minimum ORF-level RPF sum for evidence scoring. (default: 3.0)
+  --min-covered-codon MIN_COVERED_CODON
+                        Minimum covered codon count. (default: 2)
+  --min-coverage-ratio MIN_COVERAGE_RATIO
+                        Minimum nucleotide-level coverage ratio. (default: 0.1)
+  --strong-periodicity STRONG_PERIODICITY
+                        Frame-0 ratio threshold for strong periodicity. (default: 0.7)
+  --moderate-periodicity MODERATE_PERIODICITY
+                        Frame-0 ratio threshold for moderate periodicity. (default: 0.55)
+  --strong-start-pause STRONG_START_PAUSE
+                        Start pausing ratio threshold for strong signal. (default: 1.5)
+  --moderate-start-pause MODERATE_START_PAUSE
+                        Start pausing ratio threshold for moderate signal. (default: 1.2)
+  --strong-stop-pause STRONG_STOP_PAUSE
+                        Pre-stop pausing ratio threshold for strong signal. (default: 1.5)
+  --moderate-stop-pause MODERATE_STOP_PAUSE
+                        Pre-stop pausing ratio threshold for moderate signal. (default: 1.2)
+  --strong-release STRONG_RELEASE
+                        Release ratio threshold for strong signal. (default: 3.0)
+  --moderate-release MODERATE_RELEASE
+                        Release ratio threshold for moderate signal. (default: 1.5)
+  --uniform-coverage-ratio UNIFORM_COVERAGE_RATIO
+                        Coverage ratio threshold for Uniform shape. (default: 0.4)
+  --uniform-gini UNIFORM_GINI
+                        Gini threshold for Uniform shape. (default: 0.5)
+  --uniform-max-to-mean UNIFORM_MAX_TO_MEAN
+                        Max/mean threshold for Uniform shape. (default: 5.0)
+  --skewed-max-to-mean SKEWED_MAX_TO_MEAN
+                        Max/mean threshold for Skewed shape. (default: 10.0)
+  --skewed-top-fraction SKEWED_TOP_FRACTION
+                        Top 10 percent density fraction threshold for Skewed shape. (default: 0.7)
+  --disperse-coverage-ratio DISPERSE_COVERAGE_RATIO
+                        Coverage ratio threshold below which coverage is Disperse. (default: 0.2)
+  --progress-every PROGRESS_EVERY
+                        Print progress every N ORFs per chromosome. (default: 10000)
+  --keep-no-evidence    Keep ORFs with no RPF evidence in output. (default: False)
+# example
+smorf_evidence \
+  -i mine.reliable.passed.message.txt \
+  --genepred mine.genePred \
+  --density-list ribo.bedgraph.list \
+  -o mine.smorf.riboseq_evidence.txt
+# cat ribo.bedgraph.list
+# sample	strand	path	format
+# ribo1	+	/project/mine/ribo/bedgraph/ribo1_plus.rpf.bedgraph	bedgraph
+# ribo1	-	/project/mine/ribo/bedgraph/ribo1_minus.rpf.bedgraph	bedgraph
+# ribo2	+	/project/mine/ribo/bedgraph/ribo2_plus.rpf.bedgraph	bedgraph
+# ribo2	-	/project/mine/ribo/bedgraph/ribo2_minus.rpf.bedgraph	bedgraph
+# these bedGraph files are generated by `rpf_Bam2bw`
+```
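The periodicity thresholds in the help text above are described as frame-0 ratios (0.7 for strong, 0.55 for moderate). The Python sketch below shows how such a ratio could be derived from per-nucleotide P-site counts along a spliced ORF. It is a simplified illustration under stated assumptions: the function names and the `strong`/`moderate`/`weak` labels are hypothetical, the use of `>=` at the thresholds is a guess, and the real `smorf_evidence` combines this with other signals such as start and stop pausing.

```python
def frame_fractions(psite_counts):
    """Fraction of P-site signal in frames 0, 1, and 2.

    `psite_counts` holds one value per nucleotide of the spliced ORF,
    with index 0 at the first base of the start codon.
    """
    totals = [0.0, 0.0, 0.0]
    for offset, count in enumerate(psite_counts):
        totals[offset % 3] += count
    total = sum(totals)
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [value / total for value in totals]


def periodicity_label(psite_counts, strong=0.7, moderate=0.55):
    """Classify periodicity from the frame-0 ratio (labels are hypothetical)."""
    frame0 = frame_fractions(psite_counts)[0]
    if frame0 >= strong:
        return "strong"
    if frame0 >= moderate:
        return "moderate"
    return "weak"


if __name__ == "__main__":
    print(periodicity_label([8, 1, 1] * 5))
    print(periodicity_label([6, 2, 2] * 5))
    print(periodicity_label([1, 1, 1] * 5))
```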
+4. Use `smorf_integrate` to integrate all high-confidence smORFs
+Combine the results obtained from the Ribo-seq data of different samples into one comprehensive table.
+You can then filter this table and the annotation file from the first scanning step, and use RiboParser's rpf module for complete quality control and quantification analysis.
+```bash
+$ smorf_integrate -h
+usage: smorf_integrate
+[-h] -i INPUT [--output-matrix OUTPUT_MATRIX] [--output-integrated OUTPUT_INTEGRATED]
+[--capture-labels CAPTURE_LABELS] [--pass-labels PASS_LABELS] [--excellent-min-samples EXCELLENT_MIN_SAMPLES]
+Integrate long-format smORF Ribo-seq evidence into matrices and ORF-level summary tables.
+options:
+  -h, --help            show this help message and exit
+  -i, --input INPUT     Long-format smORF Ribo-seq evidence table generated by smorf_evidence.py.
+  --output-matrix OUTPUT_MATRIX
+                        Output ORF-by-sample matrix table for frame density and selected sample metrics. (default: None)
+  --output-integrated OUTPUT_INTEGRATED
+                        Output integrated ORF-level evidence table. (default: None)
+  --capture-labels CAPTURE_LABELS
+                        Translation evidence labels used to define captured samples. (default: LowConfidence,MediumConfidence,HighConfidence)
+  --pass-labels PASS_LABELS
+                        Translation evidence labels used to define reliable/pass samples. (default: MediumConfidence,HighConfidence)
+  --excellent-min-samples EXCELLENT_MIN_SAMPLES
+                        Minimum number of pass samples required to mark an ORF as Excellent. (default: 2)
+# example
+smorf_integrate \
+ -i mine.smorf.riboseq_evidence.txt \
+ --output-matrix mine.smorf.frame_density_matrix.txt \
+ --output-integrated mine.smorf.integrated_evidence.txt
+```
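The integration logic described by the options above (capture labels, pass labels, and a minimum number of pass samples for an Excellent ORF) can be sketched in Python as follows. This is only an illustration of the counting rule under stated assumptions: the `(orf_id, sample, label)` row layout, the `integrate` function, and the output keys are hypothetical and do not reflect the actual columns written by `smorf_evidence` or `smorf_integrate`.

```python
from collections import defaultdict

# Defaults copied from the help text above.
PASS_LABELS = {"MediumConfidence", "HighConfidence"}
CAPTURE_LABELS = PASS_LABELS | {"LowConfidence"}


def integrate(rows, excellent_min_samples=2):
    """Summarize long-format (orf_id, sample, label) rows per ORF."""
    captured = defaultdict(set)
    passed = defaultdict(set)
    for orf_id, sample, label in rows:
        if label in CAPTURE_LABELS:
            captured[orf_id].add(sample)
        if label in PASS_LABELS:
            passed[orf_id].add(sample)
    summary = {}
    for orf_id, samples in captured.items():
        n_pass = len(passed[orf_id])
        summary[orf_id] = {
            "captured_samples": len(samples),
            "pass_samples": n_pass,
            "excellent": n_pass >= excellent_min_samples,
        }
    return summary


if __name__ == "__main__":
    rows = [
        ("orf1", "ribo1", "HighConfidence"),
        ("orf1", "ribo2", "MediumConfidence"),
        ("orf2", "ribo1", "LowConfidence"),
    ]
    print(integrate(rows))
```

In this sketch, ORFs that are not captured in any sample are omitted from the summary.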
+## 7. Other toolkits
+### 7.1 Data shuffling
 Some analyses require randomly assigned data as a control,
  so a step is added here to reshuffle the RPFs density file.
@@ -2504,7 +2772,7 @@ RIBO_shuffle.txt # Shuffled RPFs density file
 ```
-### 6.2 Retrieve and format the gene density
+### 7.2 Retrieve and format the gene density
 In many cases, it is necessary to perform some additional operations on the gene set in the RPFs density file,
 such as filtering, RPM standardization, long and wide data format conversion, etc.
@@ -2582,7 +2850,7 @@ RIBO.log
 RIBO_retrieve.txt
 ```
-### 6.3 Filter the frame shifting genes
+### 7.3 Filter the frame shifting genes
 A frameshift in translation occurs when the ribosome shifts by one or more nucleotides in the mRNA sequence, causing a misreading of the codons.
 This results in a completely altered amino acid sequence downstream of the shift,
@@ -2649,9 +2917,9 @@ RIBO_SRR1944912_gene_frame_shift.txt
 ```
-## 7. one step for pipeline
+## 8. one step for pipeline
-### 7.0 Prepare the directories and design file for your project
+### 8.0 Prepare the directories and design file for your project
 1. create the directories to store the raw-data and results
@@ -2682,7 +2950,7 @@ SRR1944917      ncs2d_ribo_YPD
 ```
-### 7.1 run_step1.sh
+### 8.1 run_step1.sh
 This step is used for constructing the database, which is essential for the alignment of reads and subsequent analysis using `RiboParser`.
@@ -2694,7 +2962,7 @@ This step is suitable for most genome and gene annotation files derived from `NC
 $ nohup sh run_step1.sh &
 ```
-### 7.2 run_step2.sh
+### 8.2 run_step2.sh
 This step is used for analyzing `RNA-seq` data, including data cleaning,
 alignment, and expression quantification.
@@ -2706,7 +2974,7 @@ method used in your project!
 $ nohup sh run_step2.sh &
 ```
-### 7.3 run_step3.sh
+### 8.3 run_step3.sh
 This step is used for analyzing `Ribo-seq` data, including data cleaning,
 alignment, and expression quantification.
@@ -2718,7 +2986,7 @@ sequencing method used in your project!
 $ nohup sh run_step3.sh &
 ```
-### 7.4 run_step4.sh
+### 8.4 run_step4.sh
 This step is used for analyzing `RNA-seq` data, utilizing `RiboParser` to check the
 sequencing quality of the `RNA-seq` data and prepare formatted files for subsequent
@@ -2731,7 +2999,7 @@ modified according to the files defined for your project!
 $ nohup sh run_step4.sh &
 ```
-### 7.5 run_step5.sh
+### 8.5 run_step5.sh
 This step is used for analyzing `Ribo-seq` data, utilizing `RiboParser` to check the
 sequencing quality of the `Ribo-seq` data.
@@ -2744,7 +3012,7 @@ $ nohup sh run_step5.sh &
 ```
-## 8. Computational performance of the RiboParser
+## 9. Computational performance of the RiboParser
 We assessed the workflow on a CentOS 7 system using 12 threads, with RNA-seq and Ribo-seq data from three different species (S. cerevisiae, M. musculus, and H. sapiens).
 | | | | | | | | | | | |
@@ -2772,7 +3040,7 @@ Optimal Configuration
 - Storage: ≥ 512 GB NVMe SSD for rapid I/O and 2 TB HDD (SATA III)
-## 9. Contribution
+## 10. Contribution
 Thanks for all the open source tools used in the process.
@@ -2783,6 +3051,6 @@ Contribute to our open-source project by submitting questions and code.
 Contact `rensc0718@163.com` for more information.
-## 10. License
+## 11. License
 GPL License.
