DBPeaks 0.0.3__tar.gz
- dbpeaks-0.0.3/DBPeaks.egg-info/PKG-INFO +175 -0
- dbpeaks-0.0.3/DBPeaks.egg-info/SOURCES.txt +10 -0
- dbpeaks-0.0.3/DBPeaks.egg-info/dependency_links.txt +1 -0
- dbpeaks-0.0.3/DBPeaks.egg-info/entry_points.txt +2 -0
- dbpeaks-0.0.3/DBPeaks.egg-info/requires.txt +5 -0
- dbpeaks-0.0.3/DBPeaks.egg-info/top_level.txt +1 -0
- dbpeaks-0.0.3/DBPeaks.py +810 -0
- dbpeaks-0.0.3/LICENSE +201 -0
- dbpeaks-0.0.3/PKG-INFO +175 -0
- dbpeaks-0.0.3/README.md +151 -0
- dbpeaks-0.0.3/pyproject.toml +36 -0
- dbpeaks-0.0.3/setup.cfg +4 -0
|
@@ -0,0 +1,175 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: DBPeaks
|
|
3
|
+
Version: 0.0.3
|
|
4
|
+
Summary: A tool for identifying differentially bound peaks in CLIP/CRAC data
|
|
5
|
+
Author-email: Sander Granneman <Sander.Granneman@ed.ac.uk>
|
|
6
|
+
License-Expression: MIT
|
|
7
|
+
Classifier: Development Status :: 3 - Alpha
|
|
8
|
+
Classifier: Intended Audience :: Science/Research
|
|
9
|
+
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
|
|
10
|
+
Classifier: Programming Language :: Python :: 3
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.8
|
|
12
|
+
Classifier: Programming Language :: Python :: 3.9
|
|
13
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
14
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
15
|
+
Requires-Python: >=3.8
|
|
16
|
+
Description-Content-Type: text/markdown
|
|
17
|
+
License-File: LICENSE
|
|
18
|
+
Requires-Dist: pybedtools
|
|
19
|
+
Requires-Dist: numpy
|
|
20
|
+
Requires-Dist: pandas
|
|
21
|
+
Requires-Dist: pydeseq2==0.4.9
|
|
22
|
+
Requires-Dist: pyCRAC==1.5.2
|
|
23
|
+
Dynamic: license-file
|
|
24
|
+
|
|
25
|
+
# DBPeaks: Differential RNA-Binding Site Analysis Tool
|
|
26
|
+
|
|
27
|
+
## Contents
|
|
28
|
+
|
|
29
|
+
- [Introduction](#introduction)
|
|
30
|
+
|
|
31
|
+
- [Repo Contents](#repo-contents)
|
|
32
|
+
|
|
33
|
+
- [Features](#features)
|
|
34
|
+
|
|
35
|
+
- [System Requirements](#system-requirements)
|
|
36
|
+
|
|
37
|
+
- [Installation Guide](#installation-guide)
|
|
38
|
+
|
|
39
|
+
- [License](./LICENSE)
|
|
40
|
+
|
|
41
|
+
- [Citation](#citation)
|
|
42
|
+
|
|
43
|
+
- [Contact](#contact)
|
|
44
|
+
|
|
45
|
+
## Introduction
|
|
46
|
+
|
|
47
|
+
DBPeaks is a Python-based command-line tool for identifying and analysing differential RNA-binding sites in CLIP/CRAC datasets. It integrates several bioinformatics tools and methods to process sequencing data, identify peaks, and perform statistical analyses that detect significant differences in RNA binding across conditions. It first calls peaks in each individual file and then checks whether peaks are found in the same regions, that is, whether they have overlapping genome mapping coordinates. All overlapping peaks are then merged into a single peak interval, and for each interval the program calculates the total number of reads covering that genomic interval. DESeq2 is then used to determine whether the read counts for that interval differ significantly between the sample and control files.
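The merge-and-count step described above can be illustrated with a minimal pure-Python sketch. This is not the code DBPeaks itself runs. The function names, the tuple layout and the strand-specific grouping are assumptions made for illustration, and coordinates are taken to be 0-based and half-open.

```python
from collections import defaultdict


def merge_overlapping_peaks(peaks):
    """Merge overlapping peaks that share a chromosome and strand.

    peaks: iterable of (chrom, start, end, strand) tuples
           (hypothetical layout, 0-based half-open coordinates).
    """
    grouped = defaultdict(list)
    for chrom, start, end, strand in peaks:
        grouped[(chrom, strand)].append((start, end))

    merged = []
    for (chrom, strand), intervals in sorted(grouped.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:
                # The peak overlaps the current interval: extend it.
                cur_end = max(cur_end, end)
            else:
                # No overlap: close the current interval and start a new one.
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged


def count_reads(interval, reads):
    """Count reads overlapping one merged interval on the same strand."""
    chrom, start, end, strand = interval
    return sum(
        1
        for r_chrom, r_start, r_end, r_strand in reads
        if r_chrom == chrom and r_strand == strand
        and r_start < end and r_end > start
    )
```

Running `count_reads` for every merged interval in every BAM file would give an interval-by-sample count table of the kind that a DESeq2-style analysis takes as input.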
|
|
48
|
+
|
|
49
|
+
It requires CLIP/CRAC data BAM files as input as well as GTF and genome files for the model organism.
|
|
50
|
+
Make sure your GTF annotation file does not have any silly formatting mistakes, otherwise the program will not run.
|
|
51
|
+
Example genome files for yeast are available in this repository.
|
|
52
|
+
|
|
53
|
+
NOTE! DBPeaks was SPECIFICALLY designed to analyse CLIP/CRAC datasets from two different conditions or by comparing data from WT vs mutant RBPs. It was NOT designed to compare RBP CLIP datasets to control datasets that have substantially lower read counts (i.e. data from untagged strains or no UV cross-linking controls). Should you be stubborn and still decide to use DBPeaks for this purpose, you will get rubbish results!
|
|
54
|
+
|
|
55
|
+
It is really important that all BAM files have a good number of reads and that there is not a huge difference in read depth between the files. This will make DESeq2 much happier and will therefore improve the results.
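A quick way to check this before a run is to compare the mapped read counts of all input files. The sketch below assumes you have already obtained those counts yourself (for example with `samtools view -c`). The function names and the `max_ratio` cut-off of 5 are hypothetical and are not part of DBPeaks.

```python
def depth_ratio(read_counts):
    """Return the ratio between the deepest and the shallowest library.

    read_counts: dict mapping a BAM file name to its mapped read count.
    """
    counts = list(read_counts.values())
    if not counts or min(counts) <= 0:
        raise ValueError("every BAM file needs at least one mapped read")
    return max(counts) / min(counts)


def is_unbalanced(read_counts, max_ratio=5.0):
    """Flag a data set whose read depths differ by more than max_ratio.

    max_ratio is an illustrative cut-off, not a DBPeaks setting.
    """
    return depth_ratio(read_counts) > max_ratio
```

If `is_unbalanced` returns `True`, it may be worth inspecting the shallow libraries before interpreting any differential binding results.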
|
|
56
|
+
|
|
57
|
+
We have tried many different tools that do similar things. However, we were either not able to get them running on our servers or they were not able to detect clearly differentially bound (DB) peaks in our data. We have not benchmarked DBPeaks against existing tools, so we do not yet know how well it performs compared to the most popular peak calling methods. All I can say is that on OUR data, where we removed an RBP binding site in the genome, it performs better than existing tools such as MACS3 and Peakachu. DBPeaks was able to detect loss of binding in that single genomic location. The other tools we tested could not. DBPeaks is, however, slower than most existing tools. This is because it relies on pyCalculateFDRs from the pyCRAC package to call peaks. This script looks for peaks in each individual gene annotated in the genome and takes the read coverage of the gene into consideration. So this part is rather slow if you have many features annotated in your genome file.
|
|
58
|
+
|
|
59
|
+
DBPeaks uses multiple CPUs to process the data and has the added advantage that it can also use replicates.
|
|
60
|
+
|
|
61
|
+
## Repo Contents
|
|
62
|
+
- [DBPeaks](./DBPeaks.py)
|
|
63
|
+
- [License](./LICENSE)
|
|
64
|
+
|
|
65
|
+
## Features
|
|
66
|
+
|
|
67
|
+
Comprehensive Analysis Pipeline: From reading BAM files to statistical analysis with DESeq2.
|
|
68
|
+
Parallel Processing: Utilizes multiple CPUs to speed up the analysis.
|
|
69
|
+
Flexible Input Options: Supports various configurations and customizations through command-line options.
|
|
70
|
+
Integrated Peak Calling and Filtering: Includes functionality for peak detection, filtering based on reproducibility, and adjustment of peak widths.
|
|
71
|
+
Statistical Analysis: Incorporates DESeq2 for rigorous differential analysis.
|
|
72
|
+
|
|
73
|
+
## Installation Guide
|
|
74
|
+
|
|
75
|
+
### Prerequisites
|
|
76
|
+
|
|
77
|
+
Python 3.8 or higher
|
|
78
|
+
Dependencies: pybedtools, numpy, pandas, pydeseq2, pyCRAC, and others as listed in requirements.txt.
|
|
79
|
+
### Steps
|
|
80
|
+
|
|
81
|
+
Clone the repository:
|
|
82
|
+
|
|
83
|
+
```
|
|
84
|
+
git clone https://git.ecdf.ed.ac.uk/sgrannem/dbpeaks.git
|
|
85
|
+
cd dbpeaks
|
|
86
|
+
```
|
|
87
|
+
|
|
88
|
+
### Install required Python packages:
|
|
89
|
+
|
|
90
|
+
```
|
|
91
|
+
pip install -r requirements.txt
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
### Install DBPeaks:
|
|
95
|
+
|
|
96
|
+
```
|
|
97
|
+
cd dbpeaks
|
|
98
|
+
pip install -e . --user
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
### Running DBPeaks
|
|
102
|
+
|
|
103
|
+
DBPeaks is run from the command line. Here is a basic example to get you started:
|
|
104
|
+
|
|
105
|
+
python DBPeaks.py --samples path/to/sample1.bam path/to/sample2.bam --controls path/to/control1.bam path/to/control2.bam --gtf path/to/annotation.gtf --chromfile path/to/chrominfo.txt --jobname ExampleAnalysis
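If you want to launch runs from a Python script, for example to loop over several comparisons, the command above can be assembled as an argument list. This is only a sketch: `build_dbpeaks_command` is a hypothetical helper, and the default script name `DBPeaks.py` assumes the file listed under Repo Contents. It uses only the options documented in this README.

```python
import shlex


def build_dbpeaks_command(samples, controls, gtf, chromfile, jobname,
                          script="DBPeaks.py"):
    """Return the DBPeaks command line as a list of arguments."""
    if not samples or not controls:
        raise ValueError("at least one sample and one control BAM file are needed")
    return [
        "python", script,
        "--samples", *samples,
        "--controls", *controls,
        "--gtf", gtf,
        "--chromfile", chromfile,
        "--jobname", jobname,
    ]


cmd = build_dbpeaks_command(
    samples=["path/to/sample1.bam", "path/to/sample2.bam"],
    controls=["path/to/control1.bam", "path/to/control2.bam"],
    gtf="path/to/annotation.gtf",
    chromfile="path/to/chrominfo.txt",
    jobname="ExampleAnalysis",
)
# shlex.join gives a copy-pastable string for logging.
command_string = shlex.join(cmd)
```

The list can then be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids quoting problems with file paths that contain spaces.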
|
|
106
|
+
|
|
107
|
+
### Command-Line Options
|
|
108
|
+
|
|
109
|
+
--samples: Specify paths to the BAM files for the sample group.
|
|
110
|
+
--controls: Specify paths to the BAM files for the control group.
|
|
111
|
+
--gtf: Path to the GTF annotation file.
|
|
112
|
+
--chromfile: Location of the chromosome info file. This file should have two columns: first column is the names of the chromosomes, second column is length of the chromosomes.
|
|
113
|
+
--jobname: A name for the job to organize output files.
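Because a malformed chromosome info file is an easy mistake to make, it can help to validate it before starting a long run. The sketch below assumes the two columns are separated by whitespace (tab or space) and that blank lines and lines starting with `#` can be skipped. Both are assumptions, and `parse_chrom_lines` is a hypothetical helper rather than part of DBPeaks.

```python
def parse_chrom_lines(lines):
    """Parse 'name<whitespace>length' lines into a dict of chromosome lengths."""
    lengths = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, found {len(fields)}"
            )
        name, length = fields
        lengths[name] = int(length)  # raises ValueError if not an integer
    return lengths


# Illustrative input only; use the names and lengths of your own genome.
example = ["chrI\t230218\n", "chrII\t813184\n"]
chrom_lengths = parse_chrom_lines(example)
```

To check a real file, pass an open file handle: `with open("path/to/chrominfo.txt") as handle: parse_chrom_lines(handle)`.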
|
|
114
|
+
|
|
115
|
+
### Additional options for peak calling, filtering, and statistical thresholds can be viewed using the help option:
|
|
116
|
+
|
|
117
|
+
```
|
|
118
|
+
DBPeaks.py --help
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
### Peak calling settings:
|
|
122
|
+
-m 0.05, --minfdr 0.05 To set a minimal FDR threshold for filtering interval data. Default is 0.05
|
|
123
|
+
|
|
124
|
+
This is a setting that is used when running pyCalculateFDRs. If you end up getting a lot of peaks in your data,
|
|
125
|
+
it is recommended to change this threshold, let's say to 0.01 as this will reduce the number of significantly enriched peaks
|
|
126
|
+
in your data.
|
|
127
|
+
|
|
128
|
+
--padj 0.05 DESeq2 threshold for calling a peak DB. Default is 0.05
|
|
129
|
+
|
|
130
|
+
If you hardly get any DB peaks, then it may be worth slightly adjusting this threshold.
|
|
131
|
+
However, in this scenario, it may also be the case that your samples just have too much variability.
|
|
132
|
+
It may then be wise to do a PCA analysis on your data to see if replicates are indeed grouped together.
|
|
133
|
+
|
|
134
|
+
--min 5 To set the minimum read coverage for a region. Regions with coverage below this minimum will be ignored.
|
|
135
|
+
|
|
136
|
+
--blocks Add this flag if you want to consider reads with identical mapping coordinates once, regardless of sequence.
|
|
137
|
+
|
|
138
|
+
NOTE! This is a HUGELY important flag! Setting --blocks will remove any 'towers' in your data and collapse them into
|
|
139
|
+
one single interval. This can completely change the shape and height of the peak and the peak may no longer be detected.
|
|
140
|
+
However, if you suspect that your library is of low complexity and you see many of these blocks or towers in your genome browser, then I would recommend adding this flag as I have seen that this can improve the reliability of the final DESeq2 analyses.
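To make the effect of --blocks concrete, here is a minimal sketch of collapsing reads that share identical mapping coordinates. It illustrates the idea only and is not the implementation used by DBPeaks or pyCRAC. The tuple layout and the function name are assumptions.

```python
def collapse_blocks(reads):
    """Keep one entry per unique (chrom, start, end, strand), ignoring sequence.

    reads: iterable of (chrom, start, end, strand, sequence) tuples
           (hypothetical layout).
    """
    seen = set()
    collapsed = []
    for chrom, start, end, strand, _sequence in reads:
        key = (chrom, start, end, strand)
        if key not in seen:
            seen.add(key)
            collapsed.append(key)
    return collapsed


reads = [
    ("chrI", 100, 130, "+", "ACGT"),
    ("chrI", 100, 130, "+", "ACGA"),  # same coordinates, different sequence
    ("chrI", 100, 130, "+", "ACGT"),
    ("chrI", 105, 135, "+", "TTGA"),
]
unique_reads = collapse_blocks(reads)
```

In this example a 'tower' of three reads at one position contributes a single read after collapsing, which is why the peak height can drop sharply.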
|
|
141
|
+
|
|
142
|
+
--iterations 100 to set the number of iterations for randomization of read coordinates. Default=100
|
|
143
|
+
|
|
144
|
+
This is important for the peak calling analysis by the pyCalculateFDRs.py script.
|
|
145
|
+
|
|
146
|
+
|
|
147
|
+
-r 90, --rep 90 To set in what percentage of the replicates the peak should be detected. Default=100
|
|
148
|
+
|
|
149
|
+
Let's say you have three sample and three control BAM files and you set -r to 50. Peaks that are, for example, present in the sample files but absent in the control files will then also be considered. If, in this scenario, you set -r to 100, the tool will expect to find overlapping peaks at any given position in ALL samples! So you may miss peaks that were, for example, only present in your sample but not in the control!
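The arithmetic behind this example can be written down as a small sketch. It assumes the percentage is calculated over all input files (samples plus controls), which is how the example above reads. The function name is hypothetical and the actual DBPeaks logic may differ in detail.

```python
def passes_replicate_threshold(files_with_peak, total_files, rep_percent):
    """Return True if a peak was detected in at least rep_percent % of the files."""
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    return 100.0 * files_with_peak / total_files >= rep_percent


# A peak found in the three sample files but in none of the three controls:
kept_at_50 = passes_replicate_threshold(3, 6, rep_percent=50)
kept_at_100 = passes_replicate_threshold(3, 6, rep_percent=100)
```

With -r 50 the peak is kept (3 of 6 files is exactly 50%), whereas with -r 100 it is discarded because it is missing from the control files.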
|
|
150
|
+
|
|
151
|
+
--filter mean To filter the peaks in gtf files by a specific threshold. Options are mean, median or mean plus one standard deviation
|
|
152
|
+
(mean_plus_std) peak heights. Default is no filtering.
|
|
153
|
+
|
|
154
|
+
I would always recommend starting with no filtering. If you get too many DB peaks, then I would start with --filter mean and then --filter median.
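The three filter options can be thought of as three different height cut-offs. The sketch below shows one possible reading: it uses the population standard deviation and keeps peaks whose height is at or above the cut-off. Both choices are assumptions, as are the function names and the (name, height) tuple layout. DBPeaks may calculate this differently.

```python
import statistics


def height_threshold(heights, mode):
    """Return the peak-height cut-off for a given --filter mode."""
    if mode == "mean":
        return statistics.mean(heights)
    if mode == "median":
        return statistics.median(heights)
    if mode == "mean_plus_std":
        # Population standard deviation; an assumption for this sketch.
        return statistics.mean(heights) + statistics.pstdev(heights)
    raise ValueError(f"unknown filter mode: {mode}")


def filter_peaks(peaks, mode=None):
    """Keep (name, height) peaks at or above the cut-off; None means no filtering."""
    peaks = list(peaks)
    if mode is None:
        return peaks
    cutoff = height_threshold([height for _, height in peaks], mode)
    return [(name, height) for name, height in peaks if height >= cutoff]


peaks = [("a", 2), ("b", 4), ("c", 4), ("d", 10)]
kept_mean = filter_peaks(peaks, "mean")      # cut-off is 5
kept_median = filter_peaks(peaks, "median")  # cut-off is 4
```

Note that which filter is stricter depends on how the peak heights are distributed. In this toy example the mean cut-off is higher than the median cut-off because one tall peak pulls the mean up.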
|
|
155
|
+
|
|
156
|
+
--min_peak_width 20 To set the minimum width of a called peak. Default = 20
|
|
157
|
+
|
|
158
|
+
|
|
159
|
+
## Contributing to further improving DBPeaks
|
|
160
|
+
|
|
161
|
+
Contributions to DBPeaks are welcome! Please fork the repository and submit pull requests with your enhancements.
|
|
162
|
+
We will also be including some test data on the repository soon!
|
|
163
|
+
|
|
164
|
+
## License
|
|
165
|
+
|
|
166
|
+
This project is licensed under the Apache License - see the LICENSE file for details.
|
|
167
|
+
|
|
168
|
+
|
|
169
|
+
## Citation
|
|
170
|
+
|
|
171
|
+
DBPeaks was developed to analyse CRAC data for a manuscript that we are about to submit. This will be updated once the paper has been accepted or put on a preprint server.
|
|
172
|
+
|
|
173
|
+
## Contact
|
|
174
|
+
|
|
175
|
+
For support or to report issues, please contact Sander Granneman at Sander.Granneman@ed.ac.uk, University of Edinburgh.
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
|
|
@@ -0,0 +1 @@
|
|
|
1
|
+
DBPeaks
|