PyPI - offtracker - Versions diffs - 2.7.10__zip → 2.10.0__zip - Mend

offtracker 2.7.10zip → 2.10.0zip

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (33)

{offtracker-2.7.10/offtracker.egg-info → offtracker-2.10.0}/PKG-INFO RENAMED

@@ -1,10 +1,10 @@
 Metadata-Version: 2.1
 Name: offtracker
-Version: 2.7.10
+Version: 2.10.0
 Summary: Tracking-seq data analysis
 Home-page: https://github.com/Lan-lab/offtracker
 Author: Runda Xu
-Author-email: runda.xu@foxmail.com
+Author-email: xrd18@tsinghua.org.cn
 Requires-Python: >=3.6.0
 Description-Content-Type: text/markdown
 License-File: LICENSE.txt
@@ -22,9 +22,10 @@ OFF-TRACKER is an end to end pipeline of Tracking-seq data analysis for detectin
 ## Dependency
 ```bash
-# We recommend creating a new enviroment using mamba/conda to avoid compatibility problems
+# We recommend creating a new environment using mamba/conda to avoid compatibility problems
 # If you don't use mamba, just replace the code with conda
-mamba create -n offtracker -c bioconda blast snakemake pybedtools
+# Windows systems may not be compatible with pybedtools.
+mamba create -n offtracker -c bioconda blast snakemake pybedtools chromap
 ```
@@ -58,32 +59,69 @@ chromap -i -r /Your_Path_To_Reference/hg38_genome.fa \
 -o /Your_Path_To_Reference/hg38_genome.chromap.index
 # Generate candidate regions by sgRNA sequence (need once for each genome and sgRNA)
-# --name: the name of the sgRNA, which will be used in the following analysis
+# --name: a user-defined name of the sgRNA, which will be used in the following analysis.
 offtracker_candidates.py -t 8 -g hg38 \
 -r /Your_Path_To_Reference/hg38_genome.fa \
 -b /Your_Path_To_Reference/hg38_genome.blastdb \
 --name 'VEGFA2' --sgrna 'GACCCCCTCCACCCCGCCTC' --pam 'NGG' \
--o /Your_Path_To_Candidates
+-o /Your_Path_To_Candidates_Folder
 ```
+## Quality control and adapter trimming
+```bash
+# Generate snakemake config file for quality control and adapter trimming.
+offtracker_qc.py -t 4 \
+-f /Your_Path_To_Input_Folder \
+--subfolder 0
+cd /Your_Path_To_Input_Folder/Trimmed_data
+snakemake -np # dry run to check whether everything is alright
+nohup snakemake --cores 16 1>${outdir}/sm_qc.log 2>&1 &
+"""
+Set “--subfolder 0” if the file structure is like:
+| - Input_Folder
+  | - sample1_R1.fastq.gz
+  | - sample1_R2.fastq.gz
+  | - sample2_R1.fastq.gz
+  | - sample2_R2.fastq.gz
+Set “--subfolder 1” if the file structure is like:
+| - Input_Folder
+  | - Sample1_Folder
+    | - sample1_R1.fastq.gz
+    | - sample1_R2.fastq.gz
+  | - Sample2_Folder
+    | - sample2_R1.fastq.gz
+    | - sample2_R2.fastq.gz
+The script “offtracker_qc.py” will create a “Trimmed_data” folder under /Your_Path_To_Input_Folder.
+If “-o /Your_Path_To_Output” is set, the output will be redirected to /Your_Path_To_Output.
+"""
+```
 ## Strand-specific mapping of Tracking-seq data
 ```bash
-# Generate snakemake config file
-# --subfolder: If different samples are in seperate folders, set this to 1
-# if -o is not set, the output will be in the same folder as the fastq files
+# Generate snakemake config file for mapping
+# Results will be generated in /Your_Path_To_Output, if -o is not set, the output will be in the same folder as the fastq files
 offtracker_config.py -t 8 -g hg38 --blacklist hg38 \
 -r /Your_Path_To_Reference/hg38_genome.fa \
 -i /Your_Path_To_Reference/hg38_genome.chromap.index \
--f /Your_Path_To_Fastq \
+-f /Your_Path_To_Trimmed_Data \
 -o /Your_Path_To_Output \
 --subfolder 0
+# Warning: Do not contain "fastq" or "fq" in the folder name, otherwise the program may treat the folder as a fastq file
+# This problem may be fixed in the future
 # Run the snakemake program
 cd /Your_Path_To_Fastq
 snakemake -np # dry run
-nohup snakemake --cores 16 1>snakemake.log 2>snakemake.err &
+nohup snakemake --cores 16 1>sm_mapping.log 2>sm_mapping.err &
 ## about cores
 # --cores of snakemake must be larger than -t of offtracker_config.py
@@ -98,7 +136,7 @@ nohup snakemake --cores 16 1>snakemake.log 2>snakemake.err &
 ## Analyzing the genome-wide off-target sites
 ```bash
-# In this part, multiple samples in the same condition can be analyzed in a single run by pattern recogonization of sample names
+# In this part, multiple samples in the same condition can be analyzed in a single run by pattern recognition of sample names
 offtracker_analysis.py -g hg38 --name "VEGFA2" \
 --exp 'Cas9_VEGFA2' \
@@ -127,19 +165,18 @@ offtracker_plot.py --result Your_Offtracker_Result_CSV \
 --sgrna 'GACCCCCTCCACCCCGCCTC' --pam 'NGG'
 # The default output is a pdf file with Offtracker_result_{outname}.pdf
-# Change the suffix of the output file to change the format (e.g.: .png)
+# Assigning a specific output file with another suffix can change the format. e.g., "--output Offtracker_plot.png" will generate a png file.
 # The orange dash line indicates the empirical threshold of Track score = 2
 # Empirically, the off-target sites with Track score < 2 are less likely to be real off-target sites.
 ```
-## Note1
+## Note1, when not using hg38 or mm10
-The default setting only includes chr1-chr22, chrX, chrY, and chrM. Please make sure the reference genome contains "chr" at the beginning.
+The default setting only includes chr1-chr22, chrX, chrY, and chrM. (only suitable for human and mouse) \
+If you are using reference genomes without "chr" at the beginning, or want to analyze all chromosomes or other species, you can set "--ignore_chr" when running offtracker_config.py to skip chromosome filter.
-Currently, this software is only ready-to-use for mm10 and hg38. For any other genome, e.g., hg19, please add genome size file named "hg19.chrom.sizes" to .\offtracker\mapping and instal manually. Besides, add "--blacklist none" or "--blacklist Your_Blacklist" (e.g., ENCODE blacklist) when running offtracker_config.py, because we only provide blacklists for mm10 and hg38.
-If you have a requirement for species other than human/mouse, please post an issue.
+Currently, this software is only ready-to-use for mm10 and hg38. For any other genome, e.g., hg19, please add a genome size file named "hg19.chrom.sizes" to .\offtracker\utility. Besides, add "--blacklist none" or "--blacklist Your_Blacklist" (e.g., ENCODE blacklist) when running offtracker_config.py, because we only include blacklists for mm10 and hg38.
 ## Note2
@@ -172,6 +209,7 @@ These files can be visualized in genome browser like IGV:
 ![signal](https://github.com/Lan-lab/offtracker/blob/main/example_output/signals_example.png?raw=true)
+The signal (coverage) for each sample is normalized to 1e7/total_reads. As only reads mapping to chr6 were extracted in the example data, the signal range is much higher than that of the whole genome samples.
 ## Whole genome off-target analysis
@@ -183,7 +221,13 @@ After that, you can visualize the off-target sites with their genomic sequence (
 # Citation
+If you use Tracking-seq or OFF-TRACKER in your research, please cite the following paper:
+Zhu, M., Xu, R., Yuan, J., Wang, J. et al. Tracking-seq reveals the heterogeneity of off-target effects in CRISPR–Cas9-mediated genome editing. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02307-y
+The signal visualization of .bw file here was generated by the Integrative Genomics Viewer (IGV) software. The signal visualization in the Tracking-seq article above was generated by either IGV or pyGenomeTracks:
+Robinson, J., Thorvaldsdóttir, H., Winckler, W. et al. Integrative genomics viewer. Nat Biotechnol 29, 24–26 (2011). https://doi.org/10.1038/nbt.1754
+Lopez-Delisle L, Rabbani L, Wolff J, Bhardwaj V, Backofen R, Grüning B, Ramírez F, Manke T. pyGenomeTracks: reproducible plots for multivariate genomic data sets. Bioinformatics. 2020 Aug 3:btaa692. doi: 10.1093/bioinformatics/btaa692.

{offtracker-2.7.10 → offtracker-2.10.0}/README.md RENAMED

@@ -1,6 +1,6 @@
-# OFF-TRACKER
+# Offtracker
-OFF-TRACKER is an end to end pipeline of Tracking-seq data analysis for detecting off-target sites of any genome editing tools that generate double-strand breaks (DSBs) or single-strand breaks (SSBs).
+Offtracker is an end to end pipeline of Tracking-seq data analysis for detecting off-target sites of any genome editing tools that generate double-strand breaks (DSBs) or single-strand breaks (SSBs).
 ## System requirements
@@ -10,9 +10,10 @@ OFF-TRACKER is an end to end pipeline of Tracking-seq data analysis for detectin
 ## Dependency
 ```bash
-# We recommend creating a new enviroment using mamba/conda to avoid compatibility problems
+# We recommend creating a new environment using mamba/conda to avoid compatibility problems
 # If you don't use mamba, just replace the code with conda
-mamba create -n offtracker -c bioconda blast snakemake pybedtools
+# Windows systems may not be compatible with pybedtools.
+mamba create -n offtracker -c bioconda blast snakemake pybedtools chromap
 ```
@@ -46,32 +47,69 @@ chromap -i -r /Your_Path_To_Reference/hg38_genome.fa \
 -o /Your_Path_To_Reference/hg38_genome.chromap.index
 # Generate candidate regions by sgRNA sequence (need once for each genome and sgRNA)
-# --name: the name of the sgRNA, which will be used in the following analysis
+# --name: a user-defined name of the sgRNA, which will be used in the following analysis.
 offtracker_candidates.py -t 8 -g hg38 \
 -r /Your_Path_To_Reference/hg38_genome.fa \
 -b /Your_Path_To_Reference/hg38_genome.blastdb \
 --name 'VEGFA2' --sgrna 'GACCCCCTCCACCCCGCCTC' --pam 'NGG' \
--o /Your_Path_To_Candidates
+-o /Your_Path_To_Candidates_Folder
 ```
+## Quality control and adapter trimming
+```bash
+# Generate snakemake config file for quality control and adapter trimming.
+offtracker_qc.py -t 4 \
+-f /Your_Path_To_Input_Folder \
+--subfolder 0
+cd /Your_Path_To_Input_Folder/Trimmed_data
+snakemake -np # dry run to check whether everything is alright
+nohup snakemake --cores 16 1>${outdir}/sm_qc.log 2>&1 &
+"""
+Set “--subfolder 0” if the file structure is like:
+| - Input_Folder
+  | - sample1_R1.fastq.gz
+  | - sample1_R2.fastq.gz
+  | - sample2_R1.fastq.gz
+  | - sample2_R2.fastq.gz
+Set “--subfolder 1” if the file structure is like:
+| - Input_Folder
+  | - Sample1_Folder
+    | - sample1_R1.fastq.gz
+    | - sample1_R2.fastq.gz
+  | - Sample2_Folder
+    | - sample2_R1.fastq.gz
+    | - sample2_R2.fastq.gz
+The script “offtracker_qc.py” will create a “Trimmed_data” folder under /Your_Path_To_Input_Folder.
+If “-o /Your_Path_To_Output” is set, the output will be redirected to /Your_Path_To_Output.
+"""
+```
 ## Strand-specific mapping of Tracking-seq data
 ```bash
-# Generate snakemake config file
-# --subfolder: If different samples are in seperate folders, set this to 1
-# if -o is not set, the output will be in the same folder as the fastq files
+# Generate snakemake config file for mapping
+# Results will be generated in /Your_Path_To_Output, if -o is not set, the output will be in the same folder as the fastq files
 offtracker_config.py -t 8 -g hg38 --blacklist hg38 \
 -r /Your_Path_To_Reference/hg38_genome.fa \
 -i /Your_Path_To_Reference/hg38_genome.chromap.index \
--f /Your_Path_To_Fastq \
+-f /Your_Path_To_Trimmed_Data \
 -o /Your_Path_To_Output \
 --subfolder 0
+# Warning: Do not contain "fastq" or "fq" in the folder name, otherwise the program may treat the folder as a fastq file
+# This problem may be fixed in the future
 # Run the snakemake program
 cd /Your_Path_To_Fastq
 snakemake -np # dry run
-nohup snakemake --cores 16 1>snakemake.log 2>snakemake.err &
+nohup snakemake --cores 16 1>sm_mapping.log 2>sm_mapping.err &
 ## about cores
 # --cores of snakemake must be larger than -t of offtracker_config.py
@@ -86,7 +124,7 @@ nohup snakemake --cores 16 1>snakemake.log 2>snakemake.err &
 ## Analyzing the genome-wide off-target sites
 ```bash
-# In this part, multiple samples in the same condition can be analyzed in a single run by pattern recogonization of sample names
+# In this part, multiple samples in the same condition can be analyzed in a single run by pattern recognition of sample names
 offtracker_analysis.py -g hg38 --name "VEGFA2" \
 --exp 'Cas9_VEGFA2' \
@@ -115,19 +153,18 @@ offtracker_plot.py --result Your_Offtracker_Result_CSV \
 --sgrna 'GACCCCCTCCACCCCGCCTC' --pam 'NGG'
 # The default output is a pdf file with Offtracker_result_{outname}.pdf
-# Change the suffix of the output file to change the format (e.g.: .png)
+# Assigning a specific output file with another suffix can change the format. e.g., "--output Offtracker_plot.png" will generate a png file.
 # The orange dash line indicates the empirical threshold of Track score = 2
 # Empirically, the off-target sites with Track score < 2 are less likely to be real off-target sites.
 ```
-## Note1
+## Note1, when not using hg38 or mm10
-The default setting only includes chr1-chr22, chrX, chrY, and chrM. Please make sure the reference genome contains "chr" at the beginning.
+The default setting only includes chr1-chr22, chrX, chrY, and chrM. (only suitable for human and mouse) \
+If you are using reference genomes without "chr" at the beginning, or want to analyze all chromosomes or other species, you can set "--ignore_chr" when running offtracker_config.py to skip chromosome filter.
-Currently, this software is only ready-to-use for mm10 and hg38. For any other genome, e.g., hg19, please add genome size file named "hg19.chrom.sizes" to .\offtracker\mapping and instal manually. Besides, add "--blacklist none" or "--blacklist Your_Blacklist" (e.g., ENCODE blacklist) when running offtracker_config.py, because we only provide blacklists for mm10 and hg38.
-If you have a requirement for species other than human/mouse, please post an issue.
+Currently, this software is only ready-to-use for mm10 and hg38. For any other genome, e.g., hg19, please add a genome size file named "hg19.chrom.sizes" to .\offtracker\utility. Besides, add "--blacklist none" or "--blacklist Your_Blacklist" (e.g., ENCODE blacklist) when running offtracker_config.py, because we only include blacklists for mm10 and hg38.
 ## Note2
@@ -160,6 +197,7 @@ These files can be visualized in genome browser like IGV:
 ![signal](https://github.com/Lan-lab/offtracker/blob/main/example_output/signals_example.png?raw=true)
+The signal (coverage) for each sample is normalized to 1e7/total_reads. As only reads mapping to chr6 were extracted in the example data, the signal range is much higher than that of the whole genome samples.
 ## Whole genome off-target analysis
@@ -171,7 +209,13 @@ After that, you can visualize the off-target sites with their genomic sequence (
 # Citation
+If you use Tracking-seq or OFF-TRACKER in your research, please cite the following paper:
+Zhu, M., Xu, R., Yuan, J., Wang, J. et al. Tracking-seq reveals the heterogeneity of off-target effects in CRISPR–Cas9-mediated genome editing. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-024-02307-y
+The signal visualization of .bw file here was generated by the Integrative Genomics Viewer (IGV) software. The signal visualization in the Tracking-seq article above was generated by either IGV or pyGenomeTracks:
+Robinson, J., Thorvaldsdóttir, H., Winckler, W. et al. Integrative genomics viewer. Nat Biotechnol 29, 24–26 (2011). https://doi.org/10.1038/nbt.1754
+Lopez-Delisle L, Rabbani L, Wolff J, Bhardwaj V, Backofen R, Grüning B, Ramírez F, Manke T. pyGenomeTracks: reproducible plots for multivariate genomic data sets. Bioinformatics. 2020 Aug 3:btaa692. doi: 10.1093/bioinformatics/btaa692.
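
The normalization note added to the README above can be read as a per-sample scale factor of 1e7/total_reads. A minimal sketch of that arithmetic follows; the helper name `scale_factor` and the read counts are hypothetical and are not part of offtracker.

```python
def scale_factor(total_reads, target=1e7):
    """Multiplier that rescales raw coverage to `target` total reads (assumed reading of the README)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return target / total_reads

# Hypothetical read counts: a whole-genome sample and a chr6-only subset.
whole_genome = scale_factor(20_000_000)
chr6_subset = scale_factor(1_000_000)
print(whole_genome, chr6_subset)
```

Fewer total reads give a larger multiplier, which is consistent with the README's remark that the chr6-only example data show a much higher signal range than whole-genome samples.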

{offtracker-2.7.10 → offtracker-2.10.0}/offtracker/X_offplot.py RENAMED

@@ -12,6 +12,7 @@ dict_rc = {
 rcParams.update(dict_rc)
 # 2024.06.03. offtable: added a threshold dividing line; the default is None, and 2 is the commonly used value
 def offtable(offtargets, target_guide,  length_pam = 3,
                          col_seq='best_target', col_score='track_score', col_mismatch='mismatch', col_loc='target_location',
                          title=None, font='Arial', font_size=9,
@@ -28,12 +29,15 @@ def offtable(offtargets, target_guide,  length_pam = 3,
         '-': 'orange'
     }
     # If offtargets is a DataFrame, convert to list of dictionaries
     if isinstance(offtargets, pd.DataFrame):
         if threshold is not None:
             n_positive = sum(offtargets[col_score]>=threshold)
         offtargets = offtargets.to_dict(orient='records')
     # Configuration
     # title=None
     # font='Arial'
@@ -106,10 +110,16 @@ def offtable(offtargets, target_guide,  length_pam = 3,
             ax.text(x + box_size_x / 2, y + box_size_y / 2, "." if c == target_guide[i] else c, ha='center', va='center', family=font, fontsize=font_size, weight='bold')
         # Annotations for score, mismatches, and location coordinates
-        ax.text(x_offset + (len(target_guide) + 2) * box_size_x, y + box_size_y / 2, round(seq[col_score],2), ha='center', va='center', family=font, fontsize=font_size)
+        # 2025.06.05. Show negative scores in red
+        if seq[col_score]>0:
+            text_color = 'black'
+        else:
+            text_color = 'red'
+        ax.text(x_offset + (len(target_guide) + 2) * box_size_x, y + box_size_y / 2, round(seq[col_score],2), ha='center', va='center', family=font, fontsize=font_size, color=text_color)
         #ax.text(x_offset + (len(target_guide) + 7) * box_size_x, y + box_size_y / 2, "Target" if seq[col_mismatch] == 0 else seq[col_mismatch], ha='center', va='center', family=font, fontsize=font_size, color='red' if seq[col_mismatch] == 0 else 'black')
-        ax.text(x_offset + (len(target_guide) + 4) * box_size_x, y + box_size_y / 2, seq[col_loc], ha='left', va='center', family=font, fontsize=font_size)
+        ax.text(x_offset + (len(target_guide) + 4) * box_size_x, y + box_size_y / 2, seq[col_loc], ha='left', va='center', family=font, fontsize=font_size, color=text_color)
     # add a vertical line to indicate the PAM
     x_line = x_offset + (len(target_guide) - length_pam) * box_size_x
     y_start = y_offset # + box_size_y / 2
@@ -123,6 +133,7 @@ def offtable(offtargets, target_guide,  length_pam = 3,
         thresh_y = y_offset + (n_positive+1) * (box_size_y + box_gap) - box_gap*0.5
         ax.hlines(y=thresh_y, xmin=thresh_x_start, xmax=thresh_x_end, color='orange', linestyle='--')
     # Styling and save
     ax.set_xlim(0, width*1.1) # the location text is long, so extend the x-axis a little
     ax.set_ylim(height, 0)
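
The score-coloring change in `offtable` can be isolated as below. This is a sketch with a hypothetical helper name (`score_text_color`), not a function in the package. It mirrors the `> 0` test in the hunk above, so a score of exactly 0 is also drawn in red.

```python
def score_text_color(score):
    """Return the text color used for a row: black for positive scores, red otherwise."""
    return 'black' if score > 0 else 'red'

# Hypothetical scores for illustration only.
colors = [score_text_color(s) for s in (3.2, 0, -1.5)]
print(colors)
```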

{offtracker-2.7.10 → offtracker-2.10.0}/offtracker/X_sequence.py RENAMED

@@ -3,6 +3,7 @@ import math
 import pandas as pd
 from itertools import product
 import numpy as np
+import os, glob
 ambiguous_nt = {'A': ['A'],
                 'T': ['T'],
@@ -19,7 +20,7 @@ ambiguous_nt = {'A': ['A'],
                 'H': ['A', 'C', 'T'],
                 'D': ['A', 'G', 'T'],
                 'B': ['C', 'G', 'T'],
-                'N': ['A', 'T', 'C', 'G']}
+                'N': ['A', 'C', 'G', 'T']}
 def is_seq_valid(sequence, extra=True, ambiguous_nt=ambiguous_nt):
     if extra:
@@ -43,12 +44,24 @@ def possible_seq(sequence):
         raise KeyError(f'Unvalid character \'{valid_check}\' in sequence')
     return sequences
+# Handles degenerate (ambiguous) bases
+def get_base_score(base1, base2, exact_score=2, partial_match=2, mismatch_score=0.01):
+    base1 = ambiguous_nt[base1]
+    base2 = ambiguous_nt[base2]
+    if base1 == base2:
+        return exact_score
+    if list(np.union1d(base1,base2)) == base1 or list(np.union1d(base1,base2)) == base2:
+        # One is a subset of the other; lists in a different order compare unequal, so they must be sorted
+        return partial_match
+    return mismatch_score
 def complement(seq):
-    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N', '-':'-',
+    dict_complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N', '-':'-',
                   'M': 'K', 'R': 'Y', 'W': 'W', 'S': 'S', 'Y': 'R', 'K':'M',
                   'V': 'B', 'H': 'D', 'D': 'H', 'B': 'V'}
     bases = list(seq)
-    letters = [complement[base] for base in bases]
+    letters = [dict_complement[base] for base in bases]
     return ''.join(letters)
 def reverse(seq):
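
The subset test in `get_base_score` compares the sorted output of `np.union1d` with the lists stored in `ambiguous_nt`, which appears to be why the `'N'` entry was reordered to `['A', 'C', 'G', 'T']` in the same release. The sketch below swaps in Python sets to express the same idea without depending on list order. It is not the package's implementation: the name `base_score` and the reduced IUPAC table are assumptions for illustration.

```python
# Reduced IUPAC table for illustration; the package defines the full table.
AMBIGUOUS_NT = {
    'A': ['A'], 'C': ['C'], 'G': ['G'], 'T': ['T'],
    'R': ['A', 'G'], 'Y': ['C', 'T'],
    'N': ['A', 'C', 'G', 'T'],
}

def base_score(base1, base2, exact_score=2, partial_match=2, mismatch_score=0.01):
    """Score two (possibly ambiguous) bases using set comparison."""
    set1 = set(AMBIGUOUS_NT[base1])
    set2 = set(AMBIGUOUS_NT[base2])
    if set1 == set2:
        return exact_score
    # One code covers the other, e.g. 'A' is contained in 'R' or 'N'.
    if set1 <= set2 or set2 <= set1:
        return partial_match
    return mismatch_score

for pair in [('A', 'A'), ('A', 'N'), ('R', 'N'), ('R', 'Y'), ('A', 'T')]:
    print(pair, base_score(*pair))
```

Set comparison is insensitive to the order in which the bases are listed, so it would not require the table entries to be pre-sorted.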
@@ -100,14 +113,107 @@ def add_ID(df, chr_col=0, midpoint='cleavage_site'):#, midpoint='midpoint'):
 	df.loc[point_tail>=500,'ID_2'] = df[chr_col_name] + ':' + (point_head+1).astype(str)
 	return df
+def detect_fastq(folder, n_subfolder, NGS_type='paired-end'):
+    """
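+    Search the n-th level subdirectories of folder for all fastq/fastq.gz/fq/fq.gz files
+    paired-end mode : treat files ending in 2.fq/2.fastq as the R2 files of a pair and verify that the matching R1 files exist
+    single-end mode : treat all fastq/fastq.gz/fq/fq.gz files as single-end files
+    Avoid extra characters between "2." and fq/fastq (e.g. 2.trimmed.fq.gz): the characters in between are unpredictable, and a wildcard could easily match another "2." in the file name
+    Sample names should not contain dots. Use _ to separate features and - (not _) within a feature, e.g. sample_day-hour_type_batch_rep_1.fq.gz
+    Input
+    ----------
+    folder : root directory
+    n_subfolder : depth (n) of the subdirectories to search
+    Parameter
+    ----------
+    NGS_type : 'paired-end' or 'single-end'
+    Output
+    ----------
+    sample_names : detected sample names
+    files_R1 : full paths of the R1 files
+    files_R2 : full paths of the R2 files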
+    """
+    # import os, sys, glob
+    # import pandas as pd
+    if NGS_type == 'paired-end':
+        print('paired-end mode')
+        files_R2 = []
+        # Four file extensions are supported
+        # By personal habit, absolute paths are included
+        for fastq in ['*2.fq','*2.fastq','*2.fq.gz','*2.fastq.gz']:
+            fq_files = glob.glob( os.path.join(folder, n_subfolder*'*/', fastq ) )
+            print(f'{len(fq_files)} {fastq[2:]} samples detected')
+            files_R2.extend( fq_files )
+        #
+        if len(files_R2) > 0:
+            files_R2 = pd.Series(files_R2).sort_values().reset_index(drop=True)
+            # Split the file name
+            suffix = files_R2.str.extract('(\.fastq.*|\.fq.*)',expand=False)
+            prefix = files_R2.str.extract('(.*)(?:.fq|.fastq)',expand=False)
+            # Further split prefix into sample_dir (the real sample name) and nametype (a common suffix); five sample-name suffixes are supported
+            nametype = []
+            sample_dir = []
+            for a_prefix in prefix:
+                for a_type in ['_trimmed_2', '_2_val_2','_R2_val_2','_R2','_2']:
+                    len_type = len(a_type)
+                    if a_prefix[-len_type:] == a_type:
+                        nametype.append(a_type)
+                        sample_dir.append(a_prefix[:-len_type])
+                        break
+            assert len(nametype) == len(files_R2), 'The file name pattern is invaild!'
+            nametype = pd.Series(nametype)
+            sample_dir = pd.Series(sample_dir)
+            # For each R2 file, check that the matching R1 file exists
+            files_R1 = sample_dir + nametype.str.replace('2','1') + suffix
+            for i in range(len(files_R1)):
+                assert os.path.exists(files_R1[i]), f'{files_R1[i]} not found!'
+            sample_names = sample_dir.apply(os.path.basename)
+        else:
+            print('No paired-end samples detected!')
+            sample_names = 'no sample'
+            files_R1 = []
+    elif NGS_type == 'single-end':
+        print('single-end mode')
+        files_R1 = []
+        files_R2 = [] # placeholder
+        # Four file extensions are supported
+        # By personal habit, absolute paths are included
+        for fastq in ['*.fq','*.fastq','*.fq.gz','*.fastq.gz']:
+            fq_files = glob.glob( os.path.join(folder, n_subfolder*'*/', fastq ) )
+            print(f'{len(fq_files)} {fastq[1:]} samples detected')
+            files_R1.extend( fq_files )
+        files_R1 = pd.Series(files_R1).sort_values()
+        #
+        if len(files_R1) > 0:
+            # Split the file name
+            suffix = files_R1.str.extract('(\.fastq.*|\.fq.*)',expand=False)
+            prefix = files_R1.str.extract('(.*)(?:.fq|.fastq)',expand=False)
+            # In single-end mode, every prefix is treated as a sample name
+            sample_names = prefix.apply(os.path.basename)
+        else:
+            print('No single-end samples detected!')
+            sample_names = 'no sample'
+            files_R1 = []
+    return sample_names, files_R1, files_R2
 def sgRNA_alignment(a_key, sgRNA, seq, frag_len, DNA_matrix=None, mismatch_score = 0.01, return_align=False):
     from Bio import pairwise2
     import numpy as np
     if DNA_matrix is None:
-        DNA_matrix = {('A','A'): 2, ('A','T'):0.01, ('A','C'):0.01, ('A','G'):0.01, ('A','N'):0.01,
-                    ('T','T'): 2, ('T','A'):0.01, ('T','C'):0.01, ('T','G'):0.01, ('T','N'):0.01,
-                    ('G','G'): 2, ('G','A'):0.01, ('G','C'):0.01, ('G','T'):0.01, ('G','N'):0.01,
-                    ('C','C'): 2, ('C','A'):0.01, ('C','G'):0.01, ('C','T'):0.01, ('C','N'):0.01,
+        DNA_matrix = {('A','A'): 2, ('A','T'):0.01, ('A','C'):0.01, ('A','G'):0.01, ('A','N'):2,
+                    ('T','T'): 2, ('T','A'):0.01, ('T','C'):0.01, ('T','G'):0.01, ('T','N'):2,
+                    ('G','G'): 2, ('G','A'):0.01, ('G','C'):0.01, ('G','T'):0.01, ('G','N'):2,
+                    ('C','C'): 2, ('C','A'):0.01, ('C','G'):0.01, ('C','T'):0.01, ('C','N'):2,
                     ('N','N'): 2, ('N','C'):2, ('N','A'): 2, ('N','G'): 2, ('N','T'): 2}
     # a_key is the location chrA:X-Y returned by pybedtools; the X coordinate is shifted 1 bp to the left
     alignments = pairwise2.align.localds( sgRNA, seq, DNA_matrix, -2, -2, penalize_extend_when_opening=False)
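
The paired-end branch of `detect_fastq` added above derives each R1 path from its R2 path by stripping one of five known suffixes and replacing `2` with `1`. The sketch below isolates that step under stated assumptions: the helper names `split_r2_prefix` and `r1_prefix` are hypothetical, and the prefixes are plain strings rather than the pandas Series used in the package.

```python
# Suffixes are tried in this order, so longer patterns win over the bare '_2'.
R2_SUFFIXES = ['_trimmed_2', '_2_val_2', '_R2_val_2', '_R2', '_2']

def split_r2_prefix(prefix):
    """Split an R2 file prefix into (sample path, matched suffix)."""
    for suffix in R2_SUFFIXES:
        if prefix.endswith(suffix):
            return prefix[:-len(suffix)], suffix
    raise ValueError(f'Unrecognized R2 file name pattern: {prefix}')

def r1_prefix(prefix):
    """Build the expected R1 prefix from an R2 prefix."""
    sample, suffix = split_r2_prefix(prefix)
    # Every '2' in the suffix becomes '1', e.g. '_R2_val_2' -> '_R1_val_1'.
    return sample + suffix.replace('2', '1')

print(split_r2_prefix('/data/sample1_trimmed_2'))
print(r1_prefix('/data/sample1_R2_val_2'))
```

Because the replacement is applied only to the matched suffix, digits in the sample name itself (such as `sample2`) are left untouched.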

{offtracker-2.7.10 → offtracker-2.10.0}/offtracker/_version.py RENAMED

@@ -1,4 +1,4 @@
-__version__ = "2.7.10"
+__version__ = "2.10.0"
 # 2023.08.11. v1.1.0	adding a option for not normalizing the bw file
 # 2023.10.26. v1.9.0	prerelease for v2.0
 # 2023.10.27. v2.0.0	major update, not yet fine-tuned
@@ -27,4 +27,10 @@ __version__ = "2.7.10"
 # 2024.01.23. v2.7.7	Snakefile_offtracker: add --fixedStep to bigwigCompare for not merging neighbouring bins with equal values.
 # 2024.02.01. v2.7.8	gradually adding X_offplot.py features
 # 2024.06.02. v2.7.9	added offtracker_plot.py
-# 2024.06.03. v2.7.10	fixed bugs; offtable adds a dividing line at threshold = 2
+# 2024.06.03. v2.7.10	fixed bugs; offtable adds a dividing line at threshold = 2
+# 2024.06.04. v2.7.11	readme changes
+# 2024.11.19. v2.7.12	offtracker_candidates.py: added the --pam_location parameter to specify upstream or downstream, for non-Cas9 cases
+# 2025.04.25. v2.8.0	fixed a bug where offtracker candidates converted lowercase sequences to N
+# 2025.05.22. v2.9.0	refactored part of the code structure
+# 2025.06.05. v2.10.0	added the QC module. Records with negative scores are kept and shown in red when plotting. Added "--ignore_chr" to skip the common chr filter.

offtracker-2.10.0/offtracker/snakefile/Snakefile_QC.smk ADDED

@@ -0,0 +1,66 @@
+# Changelog:
+# 2022.05.04. v1.0:    initial working version, fastp + multiqc
+# 2024.01.17. v2.0:    restructured to match the X_NGS framework
+# Parameter list
+configfile: "config.yaml"
+### config['files_R1'] and config['files_R2'] are dicts
+# # fastq information
+_files_R1 = config['files_R1'] # dict, keyed by sample
+_files_R2 = config['files_R2'] # dict, keyed by sample
+# # input and output folders
+# config['input_dir']
+_output_dir = config["output_dir"]
+# # run parameters
+_thread = config['thread']
+# config['utility_dir']
+import os
+############################
+# conditional output_files #
+############################
+output_HT = expand( os.path.join(_output_dir,"{sample}_fastp.html"), sample=_files_R1)
+output_JS = expand( os.path.join(_output_dir,"{sample}_fastp.json"), sample=_files_R1)
+output_MQC = os.path.join(_output_dir,"MultiQC_Report_Raw.html")
+output_R1 = expand( os.path.join(_output_dir,"{sample}_trimmed_1.fq.gz"), sample=_files_R1) # iterating a dict yields its keys
+output_R2 = expand( os.path.join(_output_dir,"{sample}_trimmed_2.fq.gz"), sample=_files_R1)
+output_files = output_HT + output_JS + [output_MQC] + output_R1 + output_R2
+rule all:
+    input:
+        output_files
+#######################
+## fastp and multiQC ##
+#######################
+rule QCtrim:
+    input:
+        R1=lambda w: _files_R1[w.sample],
+        R2=lambda w: _files_R2[w.sample]
+    threads:
+        _thread
+    output:
+        R1=os.path.join(_output_dir,"{sample}_trimmed_1.fq.gz"),
+        R2=os.path.join(_output_dir,"{sample}_trimmed_2.fq.gz"),
+        HT=os.path.join(_output_dir,"{sample}_fastp.html"),
+        JS=os.path.join(_output_dir,"{sample}_fastp.json")
+    shell:
+        """
+        fastp -i {input.R1} -I {input.R2} -o {output.R1} -O {output.R2} \
+        -h {wildcards.sample}_fastp.html -j {wildcards.sample}_fastp.json \
+        --length_required 10 --thread {threads} --detect_adapter_for_pe --disable_quality_filtering
+        """
+rule multiqc:
+    input:
+        expand( os.path.join(_output_dir,"{sample}_fastp.html"), sample=_files_R1 )
+    threads:
+        _thread
+    output:
+        os.path.join(_output_dir,"MultiQC_Report_Raw.html")
+    shell:
+        "multiqc {_output_dir} -n MultiQC_Report_Raw --outdir {_output_dir}"
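
The target list that `rule all` requests in the Snakefile above can be reproduced in plain Python to see which files a QC run should leave behind. This is a sketch only: `qc_targets` is a hypothetical helper, it replaces Snakemake's `expand` with a list comprehension, and the sample names and output folder are made up.

```python
import os

def qc_targets(sample_names, output_dir):
    """List the files requested by `rule all`: four per sample plus one MultiQC report."""
    def per_sample(pattern):
        return [os.path.join(output_dir, pattern.format(sample=s)) for s in sample_names]

    return (
        per_sample("{sample}_fastp.html")
        + per_sample("{sample}_fastp.json")
        + [os.path.join(output_dir, "MultiQC_Report_Raw.html")]
        + per_sample("{sample}_trimmed_1.fq.gz")
        + per_sample("{sample}_trimmed_2.fq.gz")
    )

# Hypothetical samples and output folder.
targets = qc_targets(["sample1", "sample2"], "Trimmed_data")
for path in targets:
    print(path)
```

The `_trimmed_1` / `_trimmed_2` names produced here correspond to the `_trimmed_2` suffix that `detect_fastq` recognizes in paired-end mode.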
