PyPI - nci-cidc-schemas - Versions diffs - 0.28.1__py2.py3-none-any.whl → 0.28.3__py2.py3-none-any.whl - Mend

nci-cidc-schemas 0.28.1 (py2.py3-none-any.whl) → 0.28.3 (py2.py3-none-any.whl)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (29)

cidc_schemas/__init__.py CHANGED

@@ -2,4 +2,4 @@
 __author__ = """NCI"""
 __email__ = "nci-cidc-tools-admin@mail.nih.gov"
-__version__ = "0.28.1"
+__version__ = "0.28.3"

cidc_schemas/ngs_pipeline_api/__init__.py ADDED

@@ -0,0 +1,29 @@
+# NOTE: this is copied from nci-cidc-ngs-pipeline-api==0.1.25, which is archived
+import os
+from json import load
+# __author__ = """NCI"""
+# __email__ = "nci-cidc-tools-admin@mail.nih.gov"
+# __version__ = "0.1.25"
+_API_ENDING = "_output_API.json"
+_BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+_SCHEMA_PATH = os.path.join(_BASE_DIR, "output_API.schema.json")
+try:
+    with open(_SCHEMA_PATH, "r", encoding="UTF") as f:
+        METASCHEMA = load(f)
+except Exception as e:  # pylint: disable=broad-except
+    raise Exception(f"Failed loading json {_SCHEMA_PATH}") from e
+OUTPUT_APIS = {}
+for dname, _, files in os.walk(_BASE_DIR):
+    for fname in files:
+        if fname.endswith(_API_ENDING):
+            analysis = fname[: -len(_API_ENDING)]
+            with open(os.path.join(dname, fname), "rb") as f:
+                OUTPUT_APIS[analysis] = load(f)
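
The discovery loop above can be tried outside the package. The following is a minimal sketch, assuming a throwaway directory that mimics the package layout; `discover_output_apis` and the temporary files are illustrative names, not part of `cidc_schemas`.

```python
import json
import os
import tempfile

API_ENDING = "_output_API.json"  # same suffix as _API_ENDING in the module


def discover_output_apis(base_dir):
    """Map analysis name -> parsed JSON for every *_output_API.json under base_dir."""
    apis = {}
    for dname, _, files in os.walk(base_dir):
        for fname in files:
            if fname.endswith(API_ENDING):
                # "rna_output_API.json" -> "rna"
                analysis = fname[: -len(API_ENDING)]
                with open(os.path.join(dname, fname), "rb") as f:
                    apis[analysis] = json.load(f)
    return apis


# Hypothetical layout: one subdirectory per analysis, as in the added files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("atacseq", "rna"):
        os.makedirs(os.path.join(tmp, name))
        path = os.path.join(tmp, name, name + API_ENDING)
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"cimac id": []}, f)
    found = discover_output_apis(tmp)

print(sorted(found))
```

The analysis name is derived purely from the file name, so two files with the same name in different subdirectories would overwrite each other in the resulting mapping.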

cidc_schemas/ngs_pipeline_api/atacseq/atacseq.md ADDED

@@ -0,0 +1,55 @@
+## CHIPS ATAC-seq Pipeline Description
+The CHIPS pipeline is designed to perform robust quality control and reproducible processing of the chromatin profile sequencing data derived from ChIP-seq, DNase-seq, and ATAC-seq. The CHIPS pipeline includes procedures such as read alignment, peak calling, motif finding, and putative target prediction. The inputs to the pipeline are FASTQ/BAM format DNA sequence read files. The analysis process is split into three main components: read alignment, quality control and downstream analysis. The pipeline itself is encoded in the workflow language [Snakemake](https://snakemake.readthedocs.io/) and executed in a conda environment using the Google Cloud Compute Engine.
+The main components of the CHIPS ATAC-seq pipeline are:
+* Read alignment
+* Quality control:
+    * Mapped reads
+    * Sample contamination from other species
+    * Evolutionary conservation
+    * Fraction of reads in peaks (FRiP) and PCR bottleneck coefficient (PBC)
+    * Overlap with union DNase hypersensitivity sites
+    * Number of high-quality peaks
+* Peak calling using [`MACS2`](https://github.com/macs3-project/MACS) and generation of a [bigWig](https://genome.ucsc.edu/goldenPath/help/bigWig.html) file for viewing in a genome browser
+* Downstream analysis:
+    * Peak annotation
+    * Putative target prediction using [Regulatory Potential](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1934-6)
+    * Motif enrichment using [`Homer`](http://homer.ucsd.edu/homer/motif/)
+### Workflow figure for the ATAC-seq pipeline
+![](https://raw.githubusercontent.com/CIMAC-CIDC/cidc-ngs-pipeline-api/master/cidc_ngs_pipeline_api/atacseq/imgs/atacseq.png)
+## Versions of Tools and Reference Files Used in CHIPS
+| Software         | Version | Source                | Notes               |
+|------------------|---------|-----------------------|---------------------|
+| snakemake        | 5.4.5   | bioconda              | Pipeline management |
+| samtools         | 1.10    | bioconda              |                     |
+| python           | 3.6.12  | conda-forge           |                     |
+| r                | 3.5.1   | conda-forge           |                     |
+| numpy            | 1.19.5  | conda                 |                     |
+| bwa              | 0.7.15  | bioconda              | Alignment           |
+| picard           | 2.18.4  | bioconda              | Mark duplicates     |
+| bedtools         | 2.27.1  | bioconda              |                     |
+| seqtk            | 1.3     | bioconda              |                     |
+| fastqc           | 0.11.9  | bioconda              |                     |
+| ggplot2          | 3.3.0   | conda-forge r         |                     |
+| reshape2         | 1.4.4   | conda-forge r         |                     |
+| git              | 2.26.0  | conda-forge           |                     |
+| perl             | 5.26.2  | conda-forge           |                     |
+| homer            | 4.11    | bioconda              | Motif analysis      |
+| weblogo          | 2.8.2   | bioconda              |                     |
+| seqLogo          | 1.50.0  | bioconda bioconductor |                     |
+| bedgraphtobigwig | 377     | bioconda ucsc         |                     |
+| bedsort          | 377     | bioconda ucsc         |                     |
+| seaborn          | 0.11.1  | conda-forge           |                     |
+| r.utils          | 2.9.2   | conda-forge r         |                     |
+| pybigwig         | 0.3.17  | bioconda              |                     |
+| cython           | 0.29.2  | conda                 |                     |
+| jinja2           | 2.11.2  | conda                 |                     |
+| macs2            | 2.2.7   | bioconda              | Peak calling        |
+| fastp            | 0.20.1  | bioconda              | Adaptor trimming    |
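
Two of the quality-control metrics listed in atacseq.md above are simple ratios. The sketch below uses the commonly cited definitions (FRiP as reads in peaks over mapped reads, PBC as locations with exactly one read over distinct locations); how the CHIPS pipeline itself computes them is not shown in this diff, so the function names and toy inputs are assumptions.

```python
from collections import Counter


def frip(reads_in_peaks, total_mapped_reads):
    """Fraction of reads in peaks: reads overlapping called peaks / all mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peaks / total_mapped_reads


def pbc(read_positions):
    """PCR bottleneck coefficient: locations with exactly one read / distinct locations."""
    counts = Counter(read_positions)
    if not counts:
        raise ValueError("no reads given")
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy input: (chromosome, start, strand) tuples standing in for aligned reads.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 250, "-"), ("chr2", 40, "+")]
print(frip(reads_in_peaks=3, total_mapped_reads=4))
print(pbc(reads))
```

A PBC close to 1 indicates few duplicated read locations, while a low value suggests a low-complexity library.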

cidc_schemas/ngs_pipeline_api/atacseq/atacseq_output_API.json ADDED

@@ -0,0 +1,39 @@
+{
+    "cimac id": [
+        {
+            "filter_group": "peaks/sorted_peaks",
+            "file_path_template": "analysis/peaks/{cimac id}.rep1/{cimac id}.rep1_sorted_peaks.bed",
+            "short_description": "Regular peak called by MACS2",
+            "long_description": "5th: integer score for display, calculated as int(-10*log10pvalue) or int(-10*log10qvalue) depending on whether -p (pvalue) or -q (qvalue) is used as the score cutoff; 7th: fold-change at peak summit; 8th: -log10pvalue at peak summit; 9th: -log10qvalue at peak summit; 10th: relative summit position to peak start. https://github.com/macs3-project/MACS",
+            "file_purpose": "Analysis view"
+        },
+        {
+            "filter_group": "peaks/sorted_summits",
+            "file_path_template": "analysis/peaks/{cimac id}.rep1/{cimac id}.rep1_sorted_summits.bed",
+            "short_description": "Peak summit called by MACS2",
+            "long_description": "MACS2-called location with the highest fragment pileup aka the summit",
+            "file_purpose": "Analysis view"
+        },
+        {
+            "filter_group": "peaks/sorted_narrowPeak",
+            "file_path_template": "analysis/peaks/{cimac id}.rep1/{cimac id}.rep1_sorted_peaks.narrowPeak",
+            "short_description": "narrowPeak called by MACS2",
+            "long_description": "MACS2-called peak locations, summits, p-, and q-values in BED6+4 format",
+            "file_purpose": "Analysis view"
+        },
+        {
+            "filter_group": "peaks/bigwig",
+            "file_path_template": "analysis/peaks/{cimac id}.rep1/{cimac id}.rep1_treat_pileup.bw",
+            "short_description": "bigwig file",
+            "long_description": "RPKM (reads per kilobase per million) normalized pile up bigwig file for visualization in IGV",
+            "file_purpose": "Analysis view"
+        },
+        {
+            "filter_group": "align/sorted_bam",
+            "file_path_template": "analysis/align/{cimac id}/{cimac id}.sorted.bam",
+            "short_description": "alignment file",
+            "long_description": "bwa-mem aligned sorted alignment file",
+            "file_purpose": "Source view"
+        }
+    ]
+}
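
Each `file_path_template` in the JSON above carries a `{cimac id}` placeholder. A minimal sketch of expanding one follows; the helper name and the sample ID `CTTTPP111.00` are hypothetical, and the real CLI may resolve templates differently.

```python
def expand_template(template, cimac_id):
    """Fill every "{cimac id}" placeholder in a file_path_template."""
    # Plain replacement is used because the placeholder name contains a space.
    return template.replace("{cimac id}", cimac_id)


# One entry copied from atacseq_output_API.json.
entry = {
    "filter_group": "peaks/bigwig",
    "file_path_template": "analysis/peaks/{cimac id}.rep1/{cimac id}.rep1_treat_pileup.bw",
}

path = expand_template(entry["file_path_template"], "CTTTPP111.00")  # hypothetical ID
print(path)
```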

cidc_schemas/ngs_pipeline_api/atacseq/imgs/atacseq.png ADDED

Binary file

cidc_schemas/ngs_pipeline_api/output_API.schema.json ADDED

@@ -0,0 +1,45 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "$id": "output_API.schema",
+  "title": "Pipeline output_API.JSON schema",
+  "type": "object",
+  "description": "Schema for pipeline's output_API.JSONs",
+  "properties": {
+    "id": {"type":"null"}
+  },
+  "additionalProperties": {
+    "type": "array",
+    "items": {
+      "type": "object",
+      "properties": {
+        "filter_group": {
+          "description": "Filter under which the file would appear during faceted search. It is the GCS-URI top-level hierarchy.",
+          "type": "string"
+        },
+        "file_path_template": {
+          "description": "Local file path template string, describes where the CLI expects a file to be located during upload.",
+          "type": "string"
+        },
+        "short_description": {
+          "description": "Short description, to appear on hover, not more than a sentence long.",
+          "type": "string"
+        },
+        "long_description": {
+          "description": "Long description, to appear in file documentation page, 3-5 sentences long.",
+          "type": "string"
+        },
+        "file_purpose": {
+          "description": "Assigns a tag for the file to show up in a particular file-browser view configuration.",
+          "type": "string",
+          "enum": [
+            "Source view",
+            "Analysis view",
+            "Clinical view",
+            "Miscellaneous"
+          ]
+        }
+      }
+    }
+  }
+}
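
To make the constraints in this metaschema concrete, here is a standard-library sketch that mirrors them by hand: `id` must be null, every other top-level key holds an array of objects, the five named fields must be strings when present, and `file_purpose` is restricted to the enum. It is an illustration only; `check_output_api` is a hypothetical helper, and a full JSON Schema validator would be the usual choice in practice.

```python
FILE_PURPOSES = {"Source view", "Analysis view", "Clinical view", "Miscellaneous"}
STRING_FIELDS = (
    "filter_group",
    "file_path_template",
    "short_description",
    "long_description",
    "file_purpose",
)


def check_output_api(doc):
    """Return a list of problems; an empty list means the document passes these checks."""
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    problems = []
    for key, entries in doc.items():
        if key == "id":
            # "id" is the only named property and must be null.
            if entries is not None:
                problems.append("'id' must be null")
            continue
        if not isinstance(entries, list):
            problems.append(f"{key!r} must be an array")
            continue
        for i, entry in enumerate(entries):
            if not isinstance(entry, dict):
                problems.append(f"{key!r}[{i}] must be an object")
                continue
            # No field is required by the schema, so only present fields are checked.
            for field in STRING_FIELDS:
                if field in entry and not isinstance(entry[field], str):
                    problems.append(f"{key!r}[{i}].{field} must be a string")
            purpose = entry.get("file_purpose")
            if isinstance(purpose, str) and purpose not in FILE_PURPOSES:
                problems.append(f"{key!r}[{i}].file_purpose {purpose!r} is not allowed")
    return problems


good = {"cimac id": [{"filter_group": "align/sorted_bam", "file_purpose": "Source view"}]}
bad = {"cimac id": [{"file_purpose": "Other"}]}
print(check_output_api(good))
print(check_output_api(bad))
```

Because the schema does not restrict extra keys on each entry, fields such as `"optional": true` in rna_output_API.json pass both the schema and this sketch.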

cidc_schemas/ngs_pipeline_api/rna/imgs/RIMA.png ADDED

Binary file

cidc_schemas/ngs_pipeline_api/rna/rna.md ADDED

@@ -0,0 +1,54 @@
+## RIMA (RNA-Seq IMmune Analysis) Pipeline Description
+Tumor RNA-seq has become an important technique for molecular profiling and
+immune characterization of tumors. RIMA (RNA-seq IMmune Analysis) performs
+integrative computational modeling of tumor microenvironment from bulk tumor
+RNA-seq data, which has the potential to offer essential insights to cancer
+immunology and immuno-oncology studies.
+The pre-processing module includes four main procedures:
+- Read mapping
+- Quality control
+- Gene quantification
+- Batch effect removal
+The downstream analysis includes seven modules:
+- Differential gene expression
+- Immune infiltration estimation
+- Immune repertoire estimation
+- Gene fusion
+- Immunotherapy response prediction
+- HLA prediction
+- Microbiome characterization
+RIMA uses a conda virtual environment for software compiling and the python
+Snakemake workflow management system for automatic batch processing.
+### Workflow figure of RIMA pipeline
+![](https://raw.githubusercontent.com/CIMAC-CIDC/cidc-ngs-pipeline-api/master/cidc_ngs_pipeline_api/rna/imgs/RIMA.png)
+### Versions of Tools used in RIMA
+| Software         | Version | Source                | Notes                             |
+|:-----------------|:--------|:----------------------|:----------------------------------|
+| conda            | 4.8.3   | bioconda              | Environment Management            |
+| Snakemake        | 5.5.4   | bioconda              | Pipeline Management               |
+| python           | 3.6.8   | conda-forge           | Scripting Language                |
+| numpy            | 1.19.1  | conda-forge           | python library                    |
+| perl             | 5.26.2  | conda-forge           | Scripting Language                |
+| r                | 3.5.1   | conda-forge           | Scripting Language                |
+| star             | 2.6.1   | bioconda              | Read Alignment                    |
+| star-fusion      | 1.5.0   | bioconda              | Fusion Transcripts                |
+| samtools         | 1.9     | bioconda              | Alignment Utility                 |
+| salmon           | 1.3.0   | bioconda              | Gene Quantification               |
+| picard           | 2.20.4  | bioconda              | Utility tool for aligned files    |
+| bedtools         | 2.26.0  | bioconda              | Utility tool for aligned files    |
+| bcftools         | 1.9     | bioconda              | BCF Manipulation                  |
+| rseqc            | 3.0.1   | bioconda              | Quality Check                     |
+| limma            | 3.42.0  | bioconductor          | Batch Removal                     |
+| deseq2           | 1.26.0  | bioconductor          | Differential Expression Analysis  |
+| MSISensor2       | 0.1     | git                   | Microsatellite Instability        |
+| TRUST4           | 1.0.0   | git                   | Immune repertoire analysis        |
+| arcasHLA         | 1.3.2   | git                   | HLA Typing                        |
+| TIDEpy           | 1.3.6   | git                   | Immune response prediction        |
+| centrifuge       | 1.0.4   | bioconda              | Microbiome                        |

cidc_schemas/ngs_pipeline_api/rna/rna_config.schema.json ADDED

@@ -0,0 +1,39 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "$id": "output_API.schema",
+  "title": "Pipeline output_API.JSON schema",
+  "type": "object",
+  "description": "Schema for pipeline's output_API.JSONs",
+  "properties": {
+    "metasheet": {
+      "type": "string",
+      "pattern": ".*\\.csv$"
+    },
+    "assembly": {
+      "type": "string",
+      "const": "hg38"
+    },
+    "ref": {
+      "type": "string",
+      "pattern": ".*\\.yaml$"
+    },
+    "mate": {
+      "type": "array"
+    },
+    "rsem": {
+      "type": "boolean"
+    },
+    "stranded": {
+      "type": "boolean"
+    },
+    "samples": {
+      "type": "object"
+    },
+    "runs": {
+      "type": "object"
+    }
+  }
+}
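
The property constraints in rna_config.schema.json can be read as a handful of per-key checks. The sketch below restates them with the standard library only; `check_rna_config` and the sample values are hypothetical, and this is not how the package itself validates configs.

```python
import re


def check_rna_config(config):
    """Return a list of problems found in a parsed RNA config mapping."""
    problems = []

    def expect(key, ok, message):
        # Every property is optional in the schema, so only present keys are checked.
        if key in config and not ok(config[key]):
            problems.append(f"{key}: {message}")

    def is_bool(value):
        return isinstance(value, bool)

    def is_object(value):
        return isinstance(value, dict)

    # JSON Schema "pattern" is an unanchored search, hence re.search.
    # Note: Python's "$" also matches just before a trailing newline.
    expect("metasheet",
           lambda v: isinstance(v, str) and re.search(r".*\.csv$", v) is not None,
           "must be a string ending in .csv")
    expect("assembly", lambda v: v == "hg38", 'must be "hg38"')
    expect("ref",
           lambda v: isinstance(v, str) and re.search(r".*\.yaml$", v) is not None,
           "must be a string ending in .yaml")
    expect("mate", lambda v: isinstance(v, list), "must be an array")
    expect("rsem", is_bool, "must be a boolean")
    expect("stranded", is_bool, "must be a boolean")
    expect("samples", is_object, "must be an object")
    expect("runs", is_object, "must be an object")
    return problems


# Hypothetical config: only the assembly value violates the schema.
config = {"metasheet": "metasheet.csv", "assembly": "hg19", "ref": "ref.yaml", "rsem": False}
print(check_rna_config(config))
```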

cidc_schemas/ngs_pipeline_api/rna/rna_output_API.json ADDED

@@ -0,0 +1,195 @@
+{
+    "cimac id": [
+        {
+            "filter_group": "",
+            "file_path_template": "analysis/{cimac id}_error.yaml",
+            "short_description": "yaml file that specifies error codes for files",
+            "long_description": "Explanation of all files which are expected to be empty due to a failed/missing module.",
+            "file_purpose": "Analysis view",
+            "optional": true
+        },
+        {
+            "filter_group": "alignment",
+            "file_path_template": "analysis/star/{cimac id}/{cimac id}.sorted.bam",
+            "short_description": "star alignment output",
+            "long_description": "Alignments in binary BAM format sorted by coordinate Aligned.",
+            "file_purpose": "Analysis view"
+        },
+        {
+            "filter_group": "alignment",
+            "file_path_template": "analysis/star/{cimac id}/{cimac id}.sorted.bam.bai",
+            "short_description": "sorted_bam_index file",
+            "long_description": "file sorted_bam_index file sorted_bam_index file sorted_bam_index file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "alignment",
+            "file_path_template": "analysis/star/{cimac id}/{cimac id}.sorted.bam.stat.txt",
+            "short_description": "sorted_bam_stat_txt file",
+            "long_description": "sorted_bam_stat_txt file sorted_bam_stat_txt sorted_bam_stat_txt",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+           "filter_group": "alignment",
+            "file_path_template": "analysis/star/{cimac id}/{cimac id}.transcriptome.bam",
+            "short_description": "transcriptome bam file",
+            "long_description": "file transcriptome_bam file transcriptome_bam file transcriptome_bam file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+           "filter_group": "alignment",
+            "file_path_template": "analysis/star/{cimac id}/{cimac id}.Chimeric.out.junction",
+            "short_description": "Chimeric junction output",
+            "long_description": "Chimeric junction output for fusion calling",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "quality",
+            "file_path_template": "analysis/rseqc/read_distrib/{cimac id}/{cimac id}.txt",
+            "short_description": "read distribution output",
+            "long_description": "file read_distrib file read_distrib file read_distrib file",
+            "file_purpose": "Clinical view"
+        },
+        {
+            "filter_group": "quality",
+            "file_path_template": "analysis/rseqc/tin_score/{cimac id}/{cimac id}.summary.txt",
+            "short_description": "tin_score_summary file",
+            "long_description": "file tin_score_summary file tin_score_summary file tin_score_summary file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "quality",
+            "file_path_template": "analysis/rseqc/tin_score/{cimac id}/{cimac id}.tin_score.txt",
+            "short_description": "tin score output",
+            "long_description": "file tin_score file tin_score file tin_score file",
+            "file_purpose": "Analysis view"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/{cimac id}.quant.sf",
+            "short_description": "quant_sf file",
+            "long_description": "file quant_sf file quant_sf file quant_sf file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/{cimac id}.transcriptome.bam.log",
+            "short_description": "transcriptome_bam_log file",
+            "long_description": "file transcriptome_bam_log file transcriptome_bam_log file transcriptome_bam_log file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/aux_info/ambig_info.tsv",
+            "short_description": "aux_info_ambig_info_tsv file",
+            "long_description": "file aux_info_ambig_info_tsv file aux_info_ambig_info_tsv file aux_info_ambig_info_tsv file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/aux_info/expected_bias.gz",
+            "short_description": "aux_info_expected_bias file",
+            "long_description": "file aux_info_expected_bias file aux_info_expected_bias file aux_info_expected_bias file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/aux_info/fld.gz",
+            "short_description": "aux_info_fld file",
+            "long_description": "Fragment length distribution file contains an approximation of the observed fragment length distribution",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/aux_info/meta_info.json",
+            "short_description": "aux_info_meta_info file",
+            "long_description": "meta information about the run, including stats such as the number of observed and mapped fragments, details of the bias modeling etc",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/aux_info/observed_bias.gz",
+            "short_description": "aux_info_observed_bias file",
+            "long_description": "file aux_info_observed_bias file aux_info_observed_bias file aux_info_observed_bias file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/aux_info/observed_bias_3p.gz",
+            "short_description": "aux_info_observed_bias_3p file",
+            "long_description": "file aux_info_observed_bias_3p file aux_info_observed_bias_3p file aux_info_observed_bias_3p file",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/cmd_info.json",
+            "short_description": "cmd_info file",
+            "long_description": "A file that records the main command line parameters with which Salmon was run",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "gene-quantification",
+            "file_path_template": "analysis/salmon/{cimac id}/logs/salmon_quant.log",
+            "short_description": "salmon_quant_log file",
+            "long_description": "file salmon_quant_log file salmon_quant_log file salmon_quant_log file",
+            "file_purpose": "Miscellaneous"
+        },
+	{
+            "filter_group": "microbiome",
+            "file_path_template": "analysis/microbiome/{cimac id}/{cimac id}_addSample_report.txt",
+            "short_description": "Centrifuge summary output file",
+            "long_description": "Centrifuge output file contains name of a genome, taxonomic ID and rank, and also the proportion of this genome normalized by its genomic length",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "immune-repertoire",
+            "file_path_template": "analysis/trust4/{cimac id}/{cimac id}_report.tsv",
+            "short_description": "TRUST4 final report file",
+            "long_description": "This report file focuses on CDR3 and is compatible with other repertoire analysis tools, such as VDJTools",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+            "filter_group": "fusion",
+            "file_path_template": "analysis/fusion/{cimac id}/{cimac id}.fusion_predictions.abridged_addSample.tsv",
+            "short_description": "fusion analysis report file",
+            "long_description": "this report file contains validated fusion gene pairs found in all samples including their gene expression",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+        	"filter_group": "MSI",
+            "file_path_template": "analysis/msisensor/single/{cimac id}/{cimac id}_msisensor.txt",
+            "short_description": "msisensor report file",
+            "long_description": "this report file contains msi score this report file contains msi score",
+            "file_purpose": "Miscellaneous"
+        },
+        {
+        	"filter_group": "HLA",
+            "file_path_template": "analysis/neoantigen/{cimac id}/{cimac id}.genotype.json",
+            "short_description": "arcasHLA report file",
+            "long_description": "this report file contains MHC class I&II HLA alleles",
+            "file_purpose": "Miscellaneous"
+        }
+    ]
+}

cidc_schemas/ngs_pipeline_api/tcr/imgs/TCRseq.png ADDED

Binary file

cidc_schemas/ngs_pipeline_api/tcr/tcr.md ADDED

@@ -0,0 +1,101 @@
+# TCR-seq CIMAC-CIDC Report Description
+### Summary
+An in-browser immune repertoire report, relying on VisualizIRR (Visualized Immune Repertoire Report), is used to make immune repertoire analysis results simple to navigate and understand for the end user on their local machine or on a server.
+The report contains analysis on the cohort and sample level in order to inform the end user of CDR3 distribution, top clonotypes, and V/J-gene usage.
+Additional information related to diversity and clonality is also included in an intracohort analysis module and is displayed on a boxplot that may be split by meta-information, when it is available.
+When multiple chains are available in the dataset, they are separated.
+Information in the report is displayed mostly through interactive plots which may be exported at a higher resolution.
+### Decompressing the report file
+The report can be downloaded from the CIDC portal in tar.gz compressed format. On a Mac, one can simply double-click the <report_name>.tar.gz file to decompress it. The file can also be decompressed in a terminal window using the tar command:
+```
+tar -xvzf <report_name>.tar.gz
+```
+On a Windows-based system, third-party software such as WinZip can be used to decompress the file.
+### Viewing Report on a Local Machine
+For security reasons, newer browsers may refuse to load local files referenced from an HTML page.
+If you can't view the display locally by simply opening the HTML file in a browser, you can run the following in your terminal to set up a simple server and view the report:
+```
+cd <VisualizIRR_directory>
+```
+Python 2:
+```
+python -m SimpleHTTPServer & python -m webbrowser -n "http://0.0.0.0:8000"
+```
+Python 3:
+```
+python3 -m http.server & python3 -m webbrowser -n "http://0.0.0.0:8000"
+```
+The report is viewable at http://0.0.0.0:8000, which should pop up automatically.
+### TCR-seq Workflow
+![](https://raw.githubusercontent.com/CIMAC-CIDC/cidc-ngs-pipeline-api/master/cidc_ngs_pipeline_api/tcr/imgs/TCRseq.png)
+### Adding metadata to the report
+**meta.csv** is required for intracohort analysis and is the only component of the final report that needs to be manually composed by the end-user.
+**meta.csv** can be found within the report folder in the **data** subdirectory.
+**meta.csv** should be a csv with the first column including sample names and remaining columns for different conditions.
+There are a few ways to enter your meta information.
+1. In order to have ordered sample condition groups, denote the categorical label of those groups in the header using '|' as the separator and use the corresponding numbers in the sample rows, starting with 0 for the first label (as demonstrated in Timepoint and Condition 1).
+2. You can also use the labels in the metasheet and not denote them in the header (as demonstrated in Age and Condition 2).
+3. Metadata should be converted to categorical bins if it isn't categorical already (as demonstrated in Age).
+4. To set up paired-sample analysis, meta.csv must contain two columns: 'Timepoint' and 'VisGroup'. 'Timepoint' should contain ordered labels as described above. 'VisGroup' indicates which samples should be paired. In the meta.csv template below, SampleName1 and SampleName3 are paired: SampleName1 was taken at Baseline and SampleName3 at Cycle 2. Patient samples from different timepoints can therefore be paired.
+meta.csv template:
+```
+sample,Timepoint|Baseline|Cycle 1|Cycle 2,Age,Condition 1|Group 0|Group 1|Group 2,Condition 2,VisGroup
+SampleName0,1,20-29,0,Aa,0
+SampleName1,0,40-49,2,Bb,1
+SampleName2,1,10-19,1,Cc,2
+SampleName3,2,40-49,2,Bb,1
+```
+### Video Tutorial
+The following video describes decompressing and opening the report as well as how to modify the meta.csv file.
+[![Watch the video](http://img.youtube.com/vi/vmDwjSrei0c/0.jpg)](https://www.youtube.com/watch?v=vmDwjSrei0c)
+### Information in Report
+* Cohort and sample level
+    * Segment Usage
+        * V gene and J gene usage
+        * Combined V and J gene usage
+    * CDR3 Info
+        * CDR3 length distribution
+        * Top clonotypes
+* Intracohort Analysis & Information Table
+    * Clonotypes Per Kilo-reads (CPK)
+    * Raw Diversity
+    * Entropy, 1/Entropy, Normalized Entropy
+    * Gini Coefficient
+    * Gini-Simpson Index
+    * Inverse Simpson Index
+    * Chao1 Index
+    * Clonal Proportionality Information
+    * Average CDR3 Length
+    * Unique CDR3 Count
+### Tools used
+The CIDC version of VisualizIRR uses the following R-based tools:
+* R version 3.5.2
+* data.table 1.12.8
+* immunarch 0.6.5
+* naturalsort 0.1.3
+* dplyr 1.0.0
+* rjson 0.2.20
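
The meta.csv convention described in tcr.md above (labels declared in the header with '|', numeric indexes in the rows, pairing via 'VisGroup') can be decoded with a few lines of Python. This is a sketch based only on the template shown in the document; `decode_meta` and `paired_samples` are hypothetical helpers, not part of VisualizIRR.

```python
import csv
import io

# The template from tcr.md, verbatim.
META_CSV = """sample,Timepoint|Baseline|Cycle 1|Cycle 2,Age,Condition 1|Group 0|Group 1|Group 2,Condition 2,VisGroup
SampleName0,1,20-29,0,Aa,0
SampleName1,0,40-49,2,Bb,1
SampleName2,1,10-19,1,Cc,2
SampleName3,2,40-49,2,Bb,1
"""


def decode_meta(text):
    """Turn index-coded columns into their labels; leave other columns unchanged."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = []
    for cell in header:
        # "Timepoint|Baseline|Cycle 1" -> name "Timepoint", labels ["Baseline", "Cycle 1"]
        name, *labels = cell.split("|")
        columns.append((name, labels))
    records = []
    for row in body:
        record = {}
        for (name, labels), value in zip(columns, row):
            record[name] = labels[int(value)] if labels else value
        records.append(record)
    return records


def paired_samples(records):
    """Group samples sharing a VisGroup value; keep only groups with more than one sample."""
    groups = {}
    for record in records:
        groups.setdefault(record["VisGroup"], []).append(
            (record["sample"], record["Timepoint"])
        )
    return {group: members for group, members in groups.items() if len(members) > 1}


records = decode_meta(META_CSV)
print(records[1])
print(paired_samples(records))
```

With the template data, only VisGroup 1 contains more than one sample, which matches the pairing of SampleName1 (Baseline) and SampleName3 (Cycle 2) described in the text.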

cidc_schemas/ngs_pipeline_api/wes/imgs/wes.png ADDED

Binary file

cidc_schemas/ngs_pipeline_api/wes/wes.md ADDED

@@ -0,0 +1,46 @@
+## Whole-exome sequencing (WES) analysis pipeline description
+The CIDC whole-exome sequencing (WES) pipeline aims to identify and characterize key immunotherapeutic features of tumor samples. WES implements the [Genome Analysis Toolkit](https://gatk.broadinstitute.org/hc/en-us) (GATK) best practices and identifies both somatic and germline variants using Sentieon's TNscope and Haplotyper algorithms, respectively. Somatic variants are annotated using the Variant Effect Predictor (VEP) software. As recommended in [Chen YC, Seifuddin F, et al. 2021](https://www.biorxiv.org/content/10.1101/2021.02.18.431906v1.full), the pipeline uses an ensemble of three callers (CNVkit, Sequenza, and FACETS) to characterize tumor copy number variation: the overlap of the CNV segments is used to generate a high-confidence consensus set. WES estimates tumor purity using two different software packages, Sequenza and FACETS, and also infers tumor clonal populations using PyClone-VI. The pipeline characterizes tumor HLA type (both class I and class II alleles) using HLA-HD, xHLA, and OptiType. WES also performs neoantigen prediction with pVACtools version 2.0.7, which incorporates NetMHCpan 4.1 and NetMHCIIpan 4.0 as neoantigen callers and IEDB 3.1.1 to predict peptide-MHC binding affinities and prioritize candidate epitopes. WES estimates tumor T-cell fraction using TcellExTRECT and tumor microsatellite instability using MSIsensor2.
+### WES Workflow
+![](https://raw.githubusercontent.com/CIMAC-CIDC/cidc-ngs-pipeline-api/master/cidc_ngs_pipeline_api/wes/imgs/wes.png)
+## Versions of Tools and Reference Files Used in WES
+| Reference/Software | Version | Source | Notes |
+|--|--|--|--|
+| Assembly | hg38 | GDC | modified to only include chr1-22,X,Y,M |
+| Sentieon TNscope | sentieon-genomics-202010.01 | Sentieon | Somatic variant caller |
+| Sentieon Haplotyper | sentieon-genomics-202010.01 | Sentieon | Germline variant caller |
+| VEP  | 91.3  | Ensembl  | Variant annotation |
+| CNVkit | 0.9.9 | bioconda | Copy number variation |
+| Sequenza | 2.1.2 | biobuilds  | Copy number variation; Tumor purity/ploidy |
+| Sequenza-utils | 2.1.9999b0 | pip  | Copy number variation; Tumor purity/ploidy |
+| Facets | 0.5.14  | bioconda | Copy number variation; Tumor purity/ploidy |
+| PyClone-VI  | 0.1.1  | pip | Tumor clonality |
+| HLA-HD | 1.4.0 | [website](https://www.genome.med.kyoto-u.ac.jp/HLA-HD/) | HLA typing |
+| xHLA | commit 34221ea | [github](https://github.com/humanlongevity/HLA) | HLA typing |
+| pVACtools  | 2.0.7 | pip | Neoantigen prediction |
+| IEDB | 3.1.1 | [website](https://downloads.iedb.org/tools/mhci/3.1.1/)| Epitope database |
+| TcellExTRECT | commit ec81143 | [github](https://github.com/McGranahanLab/TcellExTRECT)| Tumor T-cell fraction |
+| MSIsensor2 | v0.1  | [github](https://github.com/niu-lab/msisensor2.git) |  Microsatellite instability |
+---
+## ExACdb assembly compatibility issue
+### Issue:
+The version of vcf2maf (v1.6.18) included in the CIDC WES pipeline uses ExACdb. CIDC WES uses the hg38 gene model, whereas ExACdb is based on an hg19 reference. Some of the CIDC WES variants annotated by vcf2maf using ExACdb might therefore be inaccurate, because variant locations can differ between hg19 and hg38.
+### CIDC Portal Files Affected (WES only):
+The affected columns, listed below, appear in the output.twist.maf and output.twist.filtered.maf files.
+### MAF fields affected in these files:
+FILTER (though original FILTER flag is preserved; the ExAC_FILTER is added)
+ExAC_FILTER, ExAC_AF_Adj, ExAC_AC_AN_Adj, ExAC_AC_AN, ExAC_AC_AN_AFR, ExAC_AC_AN_AMR, ExAC_AC_AN_EAS, ExAC_AC_AN_FIN, ExAC_AC_AN_NFE, ExAC_AC_AN_OTH, ExAC_AC_AN_SAS
+Please see this [link](https://github.com/mskcc/vcf2maf/blob/47c4a18a15d5f93d4f3622615b3368448a74127d/docs/vep_maf_readme.txt) for more information about these fields.
+---
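
One way a downstream user might handle the ExAC-derived columns named in wes.md above is to drop them before analysis. The sketch below does that for tab-separated MAF text using only the standard library. The `#version` comment line and the sample row are assumptions for illustration, `drop_exac_columns` is a hypothetical helper, and the FILTER column is left in place because the document says the original flag is preserved.

```python
import csv
import io

# Field names copied from the "MAF fields affected" list in wes.md.
EXAC_FIELDS = {
    "ExAC_FILTER", "ExAC_AF_Adj", "ExAC_AC_AN_Adj", "ExAC_AC_AN",
    "ExAC_AC_AN_AFR", "ExAC_AC_AN_AMR", "ExAC_AC_AN_EAS", "ExAC_AC_AN_FIN",
    "ExAC_AC_AN_NFE", "ExAC_AC_AN_OTH", "ExAC_AC_AN_SAS",
}


def drop_exac_columns(maf_text):
    """Return MAF text with the ExAC-derived columns removed."""
    lines = maf_text.splitlines()
    comments = [line for line in lines if line.startswith("#")]
    data = [line for line in lines if line and not line.startswith("#")]

    out = io.StringIO()
    for comment in comments:
        out.write(comment + "\n")
    if not data:
        return out.getvalue()

    reader = csv.DictReader(data, delimiter="\t")
    kept = [name for name in reader.fieldnames if name not in EXAC_FIELDS]
    writer = csv.DictWriter(
        out, fieldnames=kept, delimiter="\t", extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-line MAF excerpt with a leading comment line.
sample_maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tFILTER\tExAC_FILTER\tExAC_AF_Adj\n"
    "TP53\tPASS\tPASS\t0.0001\n"
)
cleaned = drop_exac_columns(sample_maf)
print(cleaned)
```

Dropping the columns is a conservative choice; an alternative would be to re-annotate against an hg38-based resource, which is outside what this document describes.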
