PyPI - toulligqc - Versions diffs - 2.5.7__tar.gz → 2.7__tar.gz - Mend

toulligqc 2.5.7__tar.gz → 2.7__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

{toulligqc-2.5.7 → toulligqc-2.7}/PKG-INFO RENAMED

@@ -1,10 +1,10 @@
 Metadata-Version: 2.1
 Name: toulligqc
-Version: 2.5.7
+Version: 2.7
 Summary: A post sequencing QC tool for Oxford Nanopore sequencers
-Home-page: https://github.com/GenomicParisCentre/toulligQC
+Home-page: https://github.com/GenomiqueENS/toulligQC
 Author: Genomic Paris Centre team
-Author-email: toulligqc@biologie.ens.fr
+Author-email: toulligqc@bio.ens.psl.eu
 License: GPL V3
 Keywords: Nanopore MinION QC report
 Platform: ALL
@@ -15,7 +15,7 @@ Classifier: Intended Audience :: Science/Research
 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
 Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
 Classifier: License :: OSI Approved :: CEA CNRS Inria Logiciel Libre License, version 2.1 (CeCILL-2.1)
-Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
 Requires-Python: >=3.11.0
 License-File: LICENSE-CeCILL.txt
 License-File: LICENSE.txt

{toulligqc-2.5.7 → toulligqc-2.7}/README.md RENAMED

@@ -24,6 +24,7 @@ Support is availlable on [GitHub issue page](https://github.com/GenomicParisCent
   * 1.3 [Docker](#docker)
      *  [Docker image recovery](#docker-image-recovery)
      *  [Launching Docker image with docker run](#launching-Docker-image-with-docker-run)
+  * 1.4 [nf-core module](#nfcore-module)
 * 2.[Usage](#usage)
   * 2.1 [Command line](#command-line)
@@ -93,14 +94,25 @@ $ docker run -ti \
              -v /path/to/basecaller/sequencing/summary/file:/path/to/basecaller/sequencing/summary/file \
              -v /path/to/basecaller/sequencing/telemetry/file:/path/to/basecaller/telemetry/summary/file \
              -v /path/to/result/directory:/path/to/result/directory \
-             toulligqc:latest
+             genomicpariscentre/toulligqc:latest
 ```
+<a name="nfcore-module"></a>
+### 1.4 Using nf-core module
+ToulligQC is also available on nf-core as a module written in nextflow. To install nf-core on your system, please visit their website (<https://nf-co.re/docs/usage/introduction>).
+The following command line will install the latest version of the ToulligQC module:
+```bash
+$ nf-core modules install toulligqc
+```
 <a name="usage"></a>
 ## 2. Usage
 <a name="command-line"></a>
 ToulligQC is adapted to RNA-Seq along with DNA-Seq and it is compatible with 1D² runs.
-This QC tool supports only Guppy basecalling ouput files.
+This QC tool supports only Guppy and Dorado basecalling ouput files.
 It also needs a single FAST5 file (to catch the flowcell ID and the run date) if a telemetry file is not provided.
 Flow cells and kits version are retrieved using the telemetry file.
 ToulligQC can take barcoding samples by adding the barcode list as a command line option.
@@ -111,7 +123,7 @@ To do so, ToulligQC deals with different file formats: gz, tar.gz, bz2, tar.bz2
 This tool will produce a set of graphs, statistic file in plain text format and a HTML report.
-To run ToulligQC you need the Guppy basecaller output files : ```sequencing_summary.txt``` and ```sequencing_telemetry.js```. or ```FASTQ``` or ```BAM```
+To run ToulligQC you need the Guppy/ Dorado basecaller output files : ```sequencing_summary.txt``` and ```sequencing_telemetry.js```. or ```FASTQ``` or ```BAM```
 This can be compressed with gzip or bzip2.
 You can use your initial Fast5 ONT file too.
 ToulligQC can perform analyses on your data if the directory is organised as the following:
@@ -132,7 +144,7 @@ RUN_ID
     └── sequencing_1dsq_summary.txt
  ```
-For a barcoded run you can add the barcoding files generated by Guppy ```barcoding_summary_pass.txt``` and ```barcoding_summary_fail.txt``` to ToulligQC or a single file ```sequencing_summary_all.txt``` containing sequencing_summary and barcoding_summary information combined.
+For a barcoded run you can add the barcoding files generated by Guppy/ Dorado ```barcoding_summary_pass.txt``` and ```barcoding_summary_fail.txt``` to ToulligQC or a single file ```sequencing_summary_all.txt``` containing sequencing_summary and barcoding_summary information combined.
 For the barcode list to use in the command line options, ToulligQC handle the following naming schemes: BCXX, RBXX, NBXX and barcodeXX where XX is the number of the barcode.
 The barcode naming schemes are case insensitive.
@@ -156,14 +168,16 @@ This is a directory for 1D² analysis with barcoding files:
 General Options:
 ```
-usage: ToulligQC V2.2.1 -a SEQUENCING_SUMMARY_SOURCE [-t TELEMETRY_SOURCE]
-                        [--fastq -q FASTQ] [--bam -u BAM]
-                        [-f FAST5_SOURCE] [-n REPORT_NAME]
-                        [--output-directory OUTPUT] [-o HTML_REPORT_PATH]
-                        [--data-report-path DATA_REPORT_PATH]
-                        [--images-directory IMAGES_DIRECTORY]
-                        [-d SEQUENCING_SUMMARY_1DSQR_SOURCE] [-b]
-                        [-l BARCODES] [--quiet] [--force] [-h] [--version]
+usage: ToulligQC V2.6 [-a SEQUENCING_SUMMARY_SOURCE] [-t TELEMETRY_SOURCE]
+                      [-f FAST5_SOURCE] [-p POD5_SOURCE] [-q FASTQ] [-u BAM]
+                      [--thread THREAD] [--batch-size BATCH_SIZE] [--qscore-threshold THRESHOLD]
+                      [-n REPORT_NAME] [--output-directory OUTPUT] [-o HTML_REPORT_PATH]
+                      [--data-report-path DATA_REPORT_PATH]
+                      [--images-directory IMAGES_DIRECTORY]
+                      [-d SEQUENCING_SUMMARY_1DSQR_SOURCE]
+                      [-s SAMPLESHEET]
+                      [-b] [-l BARCODES]
+                      [--quiet] [--force] [-h] [--version]
 required arguments:
   -a SEQUENCING_SUMMARY_SOURCE, --sequencing-summary-source SEQUENCING_SUMMARY_SOURCE
@@ -175,6 +189,9 @@ required arguments:
   -f FAST5_SOURCE, --fast5-source FAST5_SOURCE
                         Fast5 file source (necessary if no telemetry file),
                         can also be in a tar.gz/tar.bz2 archive or a directory
+  -p POD5_SOURCE, --pod5-source POD5_SOURCE
+                        pod5 file source (necessary if no telemetry file),
+                        can also be in a tar.gz/tar.bz2 archive or a directory
   -q FASTQ, --fastq FASTQ
                         FASTQ file (necessary if no sequencing summary file),
                         can also be in a .gz archive
@@ -183,6 +200,8 @@ required arguments:
                         can also be a SAM format
 optional arguments:
+  -s SAMPLESHEET, --samplesheet SAMPLESHEET
+                        Samplesheet (.csv file) to fill out sample names in MinKNOW.
   -n REPORT_NAME, --report-name REPORT_NAME
                         Report name
   --output-directory OUTPUT
@@ -197,8 +216,9 @@ optional arguments:
                         Basecaller 1dsq summary source
   -b, --barcoding       Option for barcode usage
   -l BARCODES, --barcodes BARCODES
-                        Coma separated barcode list (e.g.
-                        BC05,RB09,NB01,barcode10)
+                        Comma-separated barcode list (e.g.,
+                        BC05,RB09,NB01,barcode10) or a range separated with ':' (e.g.,
+                        barcode01:barcode19)
   --thread THREAD       Number of threads for parsing FASTQ or BAM files (default: 2).
   --batch-size BATCH_SIZE Batch size for each threads (default: 500).
   --qscore-threshold THRESHOLD Q-score threshold to distinguish between passing filter and
@@ -213,7 +233,41 @@ optional arguments:
  * #### Examples
-Example with optional arguments:
+* Sequencing summary alone \
+Note that the fowcell ID and run date will be missing from report, found in telemetry file or single fast5 file
+```bash
+$ toulligqc --report-name summary_only \
+            --sequencing-summary-source /path/to/basecaller/output/sequencing_summary.txt \
+            --html-report-path /path/to/output/report.html
+```
+* Sequencing summary + telemetry file
+```bash
+$ toulligqc --report-name summary_plus_telemetry \
+            --telemetry-source /path/to/basecaller/output/sequencing_telemetry.js \
+            --sequencing-summary-source /path/to/basecaller/output/sequencing_summary.txt \
+            --html-report-path /path/to/output/report.html
+```
+* Telemetry file + fast5 files
+```bash
+$ toulligqc --report-name telemetry_plus_fast5 \
+            --telemetry-source /path/to/basecaller/output/sequencing_telemetry.js \
+            --fast5-source /path/to/basecaller/output/fast5_files.fast5.gz \
+            --html-report-path /path/to/output/report.html
+```
+* Fastq/ bam files only
+```bash
+$ toulligqc --report-name FAF0256 \
+            --fastq /path/to/basecaller/output/fastq_files.fq.gz \ # (replace with --bam)
+            --html-report-path /path/to/output/report.html
+```
+* Optional arguments for 1D² analysis
 ```bash
 $ toulligqc --report-name FAF0256 \
@@ -223,7 +277,7 @@ $ toulligqc --report-name FAF0256 \
             --html-report-path /path/to/output/report.html
 ```
-Example with optional arguments to deal with barcoded samples:
+* Optional arguments to deal with barcoded samples
 ```bash
 $ toulligqc --report-name FAF0256 \
@@ -271,7 +325,7 @@ $ toulligqc \
     --sequencing-summary-source sequencing_summary.txt \
     --sequencing-summary-source barcoding_summary_pass.txt \
     --sequencing-summary-source barcoding_summary_fail.txt \
-    --barcodes                  BC01,BC02,BC03,BC04,BC05,BC07 \
+    --barcodes                  BC01:BC07 \
     --output-directory          output
 ```
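Note on the `--barcodes` change above: the explicit list is replaced by a `start:end` range. The sketch below shows one way such a range could be expanded into individual barcode names. `expand_barcode_range` is a hypothetical helper written for illustration, not ToulligQC's own parser, and it assumes both endpoints share the same prefix and zero-padded width.

```python
import re


def expand_barcode_range(spec):
    """Expand 'BC01:BC07' into ['BC01', ..., 'BC07'] (hypothetical helper).

    A spec without ':' is treated as a comma-separated list.
    """
    if ':' not in spec:
        return [name.strip() for name in spec.split(',') if name.strip()]

    start, end = (part.strip() for part in spec.split(':', 1))
    pattern = re.compile(r'^([A-Za-z]+)(\d+)$')
    m_start, m_end = pattern.match(start), pattern.match(end)
    if not m_start or not m_end:
        raise ValueError("invalid barcode range: " + spec)
    # Naming schemes are case insensitive, so compare prefixes in lower case.
    if m_start.group(1).lower() != m_end.group(1).lower():
        raise ValueError("range endpoints use different prefixes: " + spec)

    prefix = m_start.group(1)
    width = len(m_start.group(2))  # keep the zero padding of the start name
    first, last = int(m_start.group(2)), int(m_end.group(2))
    if first > last:
        raise ValueError("range start is after range end: " + spec)
    return [f"{prefix}{i:0{width}d}" for i in range(first, last + 1)]
```

Under these assumptions, `BC01:BC07` expands to seven names including `BC06`, which the previous explicit list in this example omitted.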

{toulligqc-2.5.7 → toulligqc-2.7}/setup.py RENAMED

@@ -14,11 +14,11 @@ setup(
     long_description='See project website for more information.',
     # The project's main homepage.
-    url='https://github.com/GenomicParisCentre/toulligQC',
+    url='https://github.com/GenomiqueENS/toulligQC',
     # Author details
     author='Genomic Paris Centre team',
-    author_email='toulligqc@biologie.ens.fr',
+    author_email='toulligqc@bio.ens.psl.eu',
     license='GPL V3',
     platforms='ALL',
@@ -34,7 +34,7 @@ setup(
         'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
         'License :: OSI Approved :: CEA CNRS Inria Logiciel Libre License, version 2.1 (CeCILL-2.1)',
-        'Programming Language :: Python :: 3.11'
+        'Programming Language :: Python :: 3.12'
     ],
     keywords='Nanopore MinION QC report',
@@ -46,10 +46,10 @@ setup(
     include_package_data=True,
     python_requires='>=3.11.0',
-    install_requires=['matplotlib>=3.6.3',   'plotly==4.5.0', 'h5py>=3.7.0',
-                      'pandas>=1.5.3',       'numpy>=1.24.2',  'scipy>=1.10.1',
-                      'scikit-learn>=1.2.1', 'tqdm>=4.64.1',   'pysam>=0.21.0',
-                      'ezcharts==0.7.6'],
+    install_requires=['matplotlib>=3.6.3',   'plotly==5.15.0', 'h5py>=3.10.0',
+                      'pandas>=2.1.4',       'numpy>=1.26.4',  'scipy>=1.11.4',
+                      'scikit-learn>=1.4.1', 'tqdm>=4.66.2',   'pysam>=0.22.0',
+                      'pod5>=0.3.10', 'ezcharts==0.7.6'],
     entry_points={
         'console_scripts': [

{toulligqc-2.5.7 → toulligqc-2.7}/toulligqc/bam_extractor.py RENAMED

@@ -4,7 +4,7 @@ import numpy as np
 import pandas as pd
 import time
 import pysam
-from datetime import datetime
+from collections import defaultdict
 from toulligqc.extractor_common import log_task
 from toulligqc.extractor_common import describe_dict
 from toulligqc.extractor_common import set_result_value
@@ -15,6 +15,7 @@ from toulligqc.extractor_common import get_result_value
 from toulligqc.extractor_common import set_result_dict_telemetry_value
 from toulligqc.extractor_common import fill_series_dict
 from toulligqc.extractor_common import timeISO_to_float
+from toulligqc.extractor_common import extract_barcode_info
 from toulligqc.common_statistics import compute_NXX, compute_LXX, occupancy_channel, avg_qual
 from toulligqc.fastq_bam_common import multiprocessing_submit, extract_headerTag
 from toulligqc.fastq_bam_common import batch_iterator
@@ -24,13 +25,16 @@ from toulligqc import plotly_graph_generator as pgg
 class uBAM_Extractor:
     def __init__(self, config_dictionary):
-        self.config_file_dictionary = config_dictionary
+        self.config_dictionary = config_dictionary
         self.ubam = config_dictionary['bam'].split('\t')
         self.images_directory = config_dictionary['images_directory']
         self.threshold_Qscore = int(config_dictionary['threshold'])
         self.batch_size = int(config_dictionary['batch_size'])
         self.thread = int(config_dictionary['thread'])
         self.header = dict()
+        self.is_barcode = False
+        if config_dictionary['barcoding'] == 'True':
+            self.is_barcode = True
         if 'quiet' not in config_dictionary or config_dictionary['quiet'].lower() != 'true':
             self.quiet = False
         else:
@@ -53,12 +57,24 @@ class uBAM_Extractor:
         :return: Panda's Dataframe object
         """
         start_time = time.time()
-        self.dataframe_1d = self._load_uBAM_data()
-        if self.dataframe_1d.empty:
+        self.dataframe = self._load_uBAM_file()
+        if self.dataframe.empty:
             raise pd.errors.EmptyDataError("Dataframe is empty")
         self.dataframe_dict = {}
+        # Add missing categories
+        if 'barcode_arrangement' in self.dataframe.columns:
+            self.dataframe['barcode_arrangement'] = self.dataframe['barcode_arrangement'].cat.add_categories([0,
+                                                                                        'other barcodes',
+                                                                                        'passes_filtering'])
+        # Replace all NaN values by 0 to avoid data manipulation errors when columns are not the same length
+        self.dataframe = self.dataframe.fillna(0)
+        self.barcode_selection = self.config_dictionary['barcode_selection']
         log_task(self.quiet,
-                 'Load BAM file ({:,.2f} MB used)'.format(self.dataframe_1d.memory_usage(deep=True).sum()/1024/1024),
+                 'Load BAM file ({:,.2f} MB used)'.format(self.dataframe.memory_usage(deep=True).sum()/1024/1024),
                  start_time,
                  time.time())
@@ -70,7 +86,7 @@ class uBAM_Extractor:
         """
         check_result_values(self, result_dict)
         self.dataframe_dict.clear()
-        self.dataframe_1d.iloc[0:0]
+        self.dataframe.iloc[0:0]
     @staticmethod
@@ -79,7 +95,7 @@ class uBAM_Extractor:
         Get the name of the extractor.
         :return: the name of the extractor
         """
-        return 'ubam'
+        return 'uBAM'
     @staticmethod
@@ -100,14 +116,38 @@ class uBAM_Extractor:
         add_image_to_result(self.quiet, images, time.time(), pgg.read_count_histogram(result_dict, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.read_length_scatterplot(self.dataframe_dict, self.images_directory))
-        add_image_to_result(self.quiet, images, time.time(), pgg.yield_plot(self.dataframe_1d, self.images_directory))
+        add_image_to_result(self.quiet, images, time.time(), pgg.yield_plot(self.dataframe, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.read_quality_multiboxplot(self.dataframe_dict, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.allphred_score_frequency(self.dataframe_dict, self.images_directory))
-        add_image_to_result(self.quiet, images, time.time(), pgg.plot_performance(self.dataframe_1d, self.images_directory))
+        add_image_to_result(self.quiet, images, time.time(), pgg.plot_performance(self.dataframe, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.twod_density(self.dataframe_dict, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.sequence_length_over_time(self.dataframe_dict, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.phred_score_over_time(self.dataframe_dict, result_dict, self.images_directory))
         add_image_to_result(self.quiet, images, time.time(), pgg.speed_over_time(self.dataframe_dict, self.images_directory))
+        if self.is_barcode:
+            if "barcode_alias" in self.config_dictionary:
+                barcode_alias = self.config_dictionary['barcode_alias']
+            else:
+                barcode_alias = None
+            add_image_to_result(self.quiet, images, time.time(), pgg.barcode_percentage_pie_chart_pass(self.dataframe_dict,
+                                                                                                       self.barcode_selection,
+                                                                                                       self.images_directory,
+                                                                                                       barcode_alias))
+            read_fail = self.dataframe_dict["read.fail.barcoded"]
+            if not (len(read_fail) == 1 and read_fail["other barcodes"] == 0):
+                add_image_to_result(self.quiet, images, time.time(), pgg.barcode_percentage_pie_chart_fail(self.dataframe_dict,
+                                                                                                      self.barcode_selection,
+                                                                                                      self.images_directory,
+                                                                                                      barcode_alias))
+            add_image_to_result(self.quiet, images, time.time(), pgg.barcode_length_boxplot(self.dataframe_dict,
+                                                                                            self.images_directory,
+                                                                                            barcode_alias))
+            add_image_to_result(self.quiet, images, time.time(), pgg.barcoded_phred_score_frequency(self.dataframe_dict,
+                                                                                                    self.images_directory,
+                                                                                                    barcode_alias))
         return images
@@ -117,7 +157,7 @@ class uBAM_Extractor:
         :param result_dict:
         """
         start_time = time.time()
-        fill_series_dict(self.dataframe_dict, self.dataframe_1d)
+        fill_series_dict(self.dataframe_dict, self.dataframe)
         set_result_dict_telemetry_value(result_dict, "run.id", self.header["run_id"])
         set_result_dict_telemetry_value(result_dict, "sample.id", self.header["sample_id"])
@@ -128,11 +168,11 @@ class uBAM_Extractor:
         set_result_dict_telemetry_value(result_dict, "basecalling.date", self.header["run_date"])
         set_result_dict_telemetry_value(result_dict, "pass.threshold.qscore", str(self.threshold_Qscore))
-        set_result_value(self, result_dict, "read.count", len(self.dataframe_1d))
+        set_result_value(self, result_dict, "read.count", len(self.dataframe))
         set_result_value(self, result_dict, "read.pass.count",
-                         count_boolean_elements(self.dataframe_1d, 'passes_filtering', True))
+                         count_boolean_elements(self.dataframe, 'passes_filtering', True))
         set_result_value(self, result_dict, "read.fail.count",
-                         count_boolean_elements(self.dataframe_1d, 'passes_filtering', False))
+                         count_boolean_elements(self.dataframe, 'passes_filtering', False))
         total_reads = get_result_value(self, result_dict, "read.count")
         # Ratios
@@ -160,9 +200,9 @@ class uBAM_Extractor:
         set_result_value(self, result_dict, "n50", compute_NXX(self.dataframe_dict, 50))
         set_result_value(self, result_dict, "l50", compute_LXX(self.dataframe_dict, 50))
-        set_result_value(self, result_dict, "run.time", max(self.dataframe_1d['start_time']))
+        set_result_value(self, result_dict, "run.time", max(self.dataframe['start_time']))
         # Get channel occupancy statistics and store each value into result_dict
-        for index, value in occupancy_channel(self.dataframe_1d).items():
+        for index, value in occupancy_channel(self.dataframe).items():
             set_result_value(self,
                             result_dict, "channel.occupancy.statistics." + index, value)
@@ -180,7 +220,7 @@ class uBAM_Extractor:
                       "fail.reads.sequence.length")
         # Get Qscore statistics without count value and store them into result_dict
-        qscore_statistics = self.dataframe_1d['mean_qscore'].describe().drop(
+        qscore_statistics = self.dataframe['mean_qscore'].describe().drop(
             "count")
         for index, value in qscore_statistics.items():
@@ -190,11 +230,16 @@ class uBAM_Extractor:
         # Add statistics (without count) about read pass/fail qscore in the result_dict
         describe_dict(self, result_dict, self.dataframe_dict["pass.reads.mean.qscore"], "pass.reads.mean.qscore")
         describe_dict(self, result_dict, self.dataframe_dict["fail.reads.mean.qscore"], "fail.reads.mean.qscore")
+        if self.is_barcode:
+            extract_barcode_info(self, result_dict,
+                                 self.barcode_selection,
+                                 self.dataframe_dict,
+                                 self.dataframe)
         log_task(self.quiet, 'Extract info from uBAM file', start_time, time.time())
-    def _load_uBAM_data(self):
+    def _load_uBAM_file(self):
         """
         Load uBAM dataframe
         :return: a Pandas Dataframe object
@@ -205,23 +250,27 @@ class uBAM_Extractor:
                                                         uBAM_chunks,
                                                         n_process=self.thread,
                                                         pbar_update=self.batch_size)
-        uBAM_data = []
+        uBAM_df = []
         for _, f in enumerate(rst_futures):
-            uBAM_data.extend(f.result())
+            uBAM_df.extend(f.result())
         columns = ['sequence_length', 'mean_qscore', 'passes_filtering', 'start_time', 'channel', 'duration']
-        uBAM_data = pd.DataFrame(uBAM_data, columns=columns)
+        if self.is_barcode:
+            columns.append('barcode_arrangement')
-        uBAM_data['sequence_length'] = uBAM_data['sequence_length'].astype(np.uint32)
-        uBAM_data['mean_qscore'] = uBAM_data['mean_qscore'].astype(np.float32)
-        uBAM_data['passes_filtering'] = uBAM_data['passes_filtering'].astype(np.bool_ if is_numpy_1_24 else np.bool)
-        uBAM_data["start_time"] = uBAM_data["start_time"] - uBAM_data["start_time"].min()
-        uBAM_data['channel'] = uBAM_data['channel'].astype(np.int16)
-        uBAM_data['start_time'] = uBAM_data['start_time'].astype(np.float64)
-        uBAM_data['duration'] = uBAM_data['duration'].astype(np.float32)
-        return uBAM_data
+        uBAM_df = pd.DataFrame(uBAM_df, columns=columns)
+        uBAM_df['sequence_length'] = uBAM_df['sequence_length'].astype(np.uint32)
+        uBAM_df['mean_qscore'] = uBAM_df['mean_qscore'].astype(np.float32)
+        uBAM_df['passes_filtering'] = uBAM_df['passes_filtering'].astype(np.bool_ if is_numpy_1_24 else np.bool)
+        uBAM_df["start_time"] = uBAM_df["start_time"] - uBAM_df["start_time"].min()
+        uBAM_df['channel'] = uBAM_df['channel'].astype(np.int16)
+        uBAM_df['start_time'] = uBAM_df['start_time'].astype(np.float64)
+        uBAM_df['duration'] = uBAM_df['duration'].astype(np.float32)
+        if self.is_barcode:
+            uBAM_df['barcode_arrangement'] = uBAM_df['barcode_arrangement'].astype("category")
+        return uBAM_df
     def _uBAM_batch_reader(self, uBAM_chunk):
@@ -251,14 +300,6 @@ class uBAM_Extractor:
                 yield batch
-    def _timeISO_to_float(self, iso_datetime, format):
-        """
-        """
-        dt = datetime.strptime(iso_datetime, format)
-        unix_timestamp = dt.timestamp()
-        return unix_timestamp
     def _get_header(self):
         sam_file = pysam.AlignmentFile(self.ubam[0], "rb", check_sq=False)
         header = sam_file.header.to_dict()
@@ -299,4 +340,6 @@ class uBAM_Extractor:
             attributes.get('ch', '1'),  # Channel
             attributes.get('du', '1')  # Duration
         ]
+        if self.is_barcode:
+            data.append(attributes.get('BC', 'unclassified'))
         return data
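Note on the removed `_timeISO_to_float` method: the class already imports `timeISO_to_float` from `toulligqc.extractor_common`, so the private copy looks redundant. For reference, the removed body amounts to the conversion sketched below. The function name `time_iso_to_float` and the format string in the usage line are illustrative assumptions, not values taken from the package.

```python
from datetime import datetime


def time_iso_to_float(iso_datetime, fmt):
    """Parse a datetime string with the given strptime format and
    return it as a Unix timestamp (seconds since the epoch)."""
    return datetime.strptime(iso_datetime, fmt).timestamp()


# Illustrative call; including %z makes the result independent of the local timezone.
seconds = time_iso_to_float("1970-01-01T00:00:10+0000", "%Y-%m-%dT%H:%M:%S%z")
```

Without a `%z` offset in the format, `timestamp()` interprets the parsed value in the local timezone, so results can differ between machines.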

{toulligqc-2.5.7 → toulligqc-2.7}/toulligqc/common.py RENAMED

@@ -20,6 +20,8 @@
 import numpy as np
 from packaging import version
+import glob
+import os
 def is_numpy_1_24():
     """
@@ -27,6 +29,7 @@ def is_numpy_1_24():
     """
     return version.parse(np.__version__) >= version.parse("1.20")
 def format_duration(t):
     """
     Format a time duration
@@ -35,3 +38,28 @@ def format_duration(t):
     """
     return "{:,d}m{:2.2f}s".format(int(t // 60), t % 60)
+def set_result_dict_value(result_dict, key, tracking_id_dict, dict_key):
+    """
+    Set metadata values from Fast5 or pod5 dict to result_dict
+    """
+    value = ''
+    if dict_key in tracking_id_dict:
+        value = tracking_id_dict[dict_key]
+    result_dict[key] = value
+def find_file_in_directory(source_file, format):
+    """
+    Looking for a suitable Fast5 or Pod5 file in the source directory.
+    :return: The path to the first suitable file in the source directory
+    """
+    for ext in (format, 'tar.bz2', 'tar.gz'):
+        if glob.glob(source_file + '/*.' + ext):
+            files_found = os.listdir(source_file)
+            if len(files_found) > 0:
+                return source_file + files_found[0]
+    return None
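Note on `find_file_in_directory` as added above: the `glob` call is used only as a test, and the returned path is built from the first entry of `os.listdir`, which may be a file of any type and is concatenated without a path separator. If the intent is to return the first file with a matching extension, a variant along these lines would do so. This is a sketch under that assumption; `find_first_matching_file` is a hypothetical name, not part of the package.

```python
import glob
import os


def find_first_matching_file(source_dir, file_format):
    """Return the first file in source_dir whose extension is file_format,
    'tar.bz2' or 'tar.gz' (tried in that order), or None if none is found."""
    for ext in (file_format, 'tar.bz2', 'tar.gz'):
        # Sort so the result does not depend on directory listing order.
        matches = sorted(glob.glob(os.path.join(source_dir, '*.' + ext)))
        if matches:
            return matches[0]
    return None
```

Returning the glob match directly keeps the result consistent with the extension that was tested, and `os.path.join` works whether or not the directory argument ends with a separator.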

{toulligqc-2.5.7 → toulligqc-2.7}/toulligqc/extractor_common.py RENAMED

@@ -164,16 +164,24 @@ def extract_barcode_info(extractor, result_dict, barcode_selection, dataframe_di
     if "unclassified" not in barcode_selection:
         barcode_selection.append("unclassified")
+    # If the barcode_arrangement column contains a barcode kit id
+    mask = df['barcode_arrangement'].str.startswith(('SQK', 'VQK'))
+    if mask.any():
+        df['barcode_arrangement'] = df['barcode_arrangement'].astype(str)
+        df.loc[mask, 'barcode_arrangement'] = df.loc[mask, 'barcode_arrangement'].str.extract(r'[SV]QK-.+_(.+)$')[0]
     # Create keys barcode.arrangement, and read.pass/fail.barcode in dataframe_dict with all values of
     # column barcode_arrangement when reads are passed/failed
-    dataframe_dict["barcode.arrangement"] = df["barcode_arrangement"]
+    dataframe_dict["barcode.arrangement"] = df['barcode_arrangement']
     # Print warning message if a barcode is unknown
-    barcodes_found = set(dataframe_dict["barcode.arrangement"].unique())
+    barcodes_found = set(df["barcode_arrangement"].unique())
     for element in barcode_selection:
         if element not in barcodes_found and element != 'other barcodes':
-            sys.stderr.write("Warning: The barcode {} doesn't exist in input data\n".format(element))
+            sys.stderr.write("\033[93mWarning:\033[0m The barcode {} doesn't exist in input data\n".format(element))
     # Get barcodes frequency by Bases
     df_base_pass_barcode = series_cols_boolean_elements(df, ["barcode_arrangement",  "sequence_length"],
@@ -218,6 +226,7 @@ def extract_barcode_info(extractor, result_dict, barcode_selection, dataframe_di
                      (read_fail_barcoded_count / total_reads) * 100)
     # Replaces all rows with unused barcodes (ie not in barcode_selection) in column barcode_arrangement with the 'other' value
     df.loc[~df['barcode_arrangement'].isin(
         barcode_selection), 'barcode_arrangement'] = 'other barcodes'
@@ -447,6 +456,7 @@ def read_first_line_file(filename):
     except IOError:
         raise FileNotFoundError
 def set_result_dict_telemetry_value(result_dict, key, new_value):
     """
     """
@@ -461,4 +471,15 @@ def set_result_dict_telemetry_value(result_dict, key, new_value):
     if new_value is None:
         new_value = current_value
-    result_dict[final_key] = new_value
+    result_dict[final_key] = new_value
+def pd_read_sequencing_summary(file, cols, data_type):
+        try:
+            return pd.read_csv(file, sep="\t", usecols=cols,
+                            dtype=data_type)
+        except:
+            del data_type['passes_filtering']
+            cols.remove('passes_filtering')
+            return pd.read_csv(file, sep="\t", usecols=cols,
+                            dtype=data_type)
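Note on the kit-prefix handling added to `extract_barcode_info`: values starting with `SQK` or `VQK` are reduced to the part after the last underscore using the pattern `[SV]QK-.+_(.+)$`. The sketch below applies the same pattern to single strings with the standard `re` module. `strip_kit_prefix` is a hypothetical helper and the sample names in the usage lines are illustrative, not taken from the package.

```python
import re

# Same pattern as in the hunk: the greedy '.+' runs up to the last underscore.
KIT_PREFIX = re.compile(r'[SV]QK-.+_(.+)$')


def strip_kit_prefix(name):
    """Return the barcode name without a leading kit identifier."""
    if name.startswith(('SQK', 'VQK')):
        match = KIT_PREFIX.search(name)
        if match:
            return match.group(1)
    # Unlike pandas' str.extract, which yields NaN when the pattern does not
    # match, this sketch leaves such names unchanged.
    return name


short_name = strip_kit_prefix('SQK-NBD114-24_barcode01')
unchanged = strip_kit_prefix('unclassified')
```

One difference worth keeping in mind when reading the hunk: in the pandas version, a value that starts with a kit identifier but has no underscore would become NaN rather than keep its original text.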
