PyPI - samplesheet-parser - Versions diffs - 0.1.0__tar.gz - Mend

samplesheet-parser 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30) hide show

samplesheet_parser-0.1.0/.github/workflows/ci.yml ADDED Viewed

@@ -0,0 +1,64 @@
+name: CI
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: pip install -e ".[dev]"
+      - name: Lint with ruff
+        run: ruff check samplesheet_parser/
+      - name: Type check with mypy
+        run: mypy samplesheet_parser/ --ignore-missing-imports
+        continue-on-error: true
+      - name: Run tests
+        run: pytest tests/ -v --cov=samplesheet_parser --cov-report=xml
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v4
+        if: matrix.python-version == '3.12'
+        with:
+          file: ./coverage.xml
+  publish:
+    needs: test
+    runs-on: ubuntu-latest
+    if: startsWith(github.ref, 'refs/tags/v')
+    permissions:
+      id-token: write
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Build package
+        run: |
+          pip install hatchling
+          python -m hatchling build
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1

samplesheet_parser-0.1.0/.gitignore ADDED Viewed

@@ -0,0 +1,40 @@
+BLOGPOST.md
+__pycache__/
+*.py[cod]
+*.pyo
+*.pyd
+.Python
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+# Testing
+.pytest_cache/
+.coverage
+coverage.xml
+htmlcov/
+.tox/
+# Type checking
+.mypy_cache/
+# Virtual environments
+.venv/
+venv/
+env/
+# Editor
+.vscode/
+.idea/
+*.swp
+*.swo
+# macOS
+.DS_Store
+# Backup files created by parser cleaning
+*.backup
+*.tmp

samplesheet_parser-0.1.0/LICENSE ADDED Viewed

@@ -0,0 +1,19 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+Copyright 2026 Chaitanya Kasaraneni
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.

samplesheet_parser-0.1.0/PKG-INFO ADDED Viewed

@@ -0,0 +1,384 @@
+Metadata-Version: 2.4
+Name: samplesheet-parser
+Version: 0.1.0
+Summary: Format-agnostic parser for Illumina SampleSheet.csv files — supports IEM V1 and BCLConvert V2
+Project-URL: Homepage, https://github.com/chaitanyakasaraneni/samplesheet-parser
+Project-URL: Documentation, https://illumina-samplesheet.readthedocs.io
+Project-URL: Repository, https://github.com/chaitanyakasaraneni/samplesheet-parser
+Project-URL: Issues, https://github.com/chaitanyakasaraneni/samplesheet-parser/issues
+Author-email: Chaitanya Kasaraneni <kc.kasaraneni@gmail.com>
+License:                                  Apache License
+                                   Version 2.0, January 2004
+                                http://www.apache.org/licenses/
+        TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+        Copyright 2026 Chaitanya Kasaraneni
+        Licensed under the Apache License, Version 2.0 (the "License");
+        you may not use this file except in compliance with the License.
+        You may obtain a copy of the License at
+            http://www.apache.org/licenses/LICENSE-2.0
+        Unless required by applicable law or agreed to in writing, software
+        distributed under the License is distributed on an "AS IS" BASIS,
+        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+        See the License for the specific language governing permissions and
+        limitations under the License.
+License-File: LICENSE
+Keywords: bcl2fastq,bclconvert,bioinformatics,demultiplexing,genomics,illumina,ngs,samplesheet,sequencing
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
+Classifier: Typing :: Typed
+Requires-Python: >=3.10
+Requires-Dist: loguru>=0.7
+Provides-Extra: dev
+Requires-Dist: black>=24.0; extra == 'dev'
+Requires-Dist: mypy>=1.8; extra == 'dev'
+Requires-Dist: pytest-cov>=4.1; extra == 'dev'
+Requires-Dist: pytest>=7.4; extra == 'dev'
+Requires-Dist: ruff>=0.3; extra == 'dev'
+Description-Content-Type: text/markdown
+# samplesheet-parser
+**Format-agnostic Python parser for Illumina `SampleSheet.csv` files.**
+Supports IEM V1 (bcl2fastq / NovaSeq 6000 era) and BCLConvert V2 (NovaSeq X series) with automatic format detection, index validation, and `OverrideCycles` / UMI decoding — no format hints required from the caller.
+[![PyPI version](https://badge.fury.io/py/samplesheet-parser.svg)](https://badge.fury.io/py/samplesheet-parser)
+[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
+[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Tests](https://github.com/chaitanyakasaraneni/samplesheet-parser/actions/workflows/ci.yml/badge.svg)](https://github.com/chaitanyakasaraneni/samplesheet-parser/actions)
+[![codecov](https://codecov.io/gh/chaitanyakasaraneni/samplesheet-parser/branch/main/graph/badge.svg)](https://codecov.io/gh/chaitanyakasaraneni/samplesheet-parser)
+---
+## The problem
+Labs running mixed instrument fleets — NovaSeq 6000 alongside NovaSeq X series — produce two structurally incompatible `SampleSheet.csv` formats:
+| | IEM V1 | BCLConvert V2 |
+|---|---|---|
+| Discriminator | `IEMFileVersion` in `[Header]` | `FileFormatVersion` in `[Header]` |
+| Data section | `[Data]` | `[BCLConvert_Data]` |
+| Settings section | `[Settings]` | `[BCLConvert_Settings]` |
+| Index columns | `index`, `index2` (lowercase) | `Index`, `Index2` (uppercase) |
+| Read cycles | Bare integers | Key-value (`Read1Cycles,151`) |
+| UMI encoding | Not supported | `OverrideCycles` string |
+| Used with | bcl2fastq | BCLConvert ≥ 3.x |
+Without a single parser, every pipeline component that reads a SampleSheet needs an `if v1 else v2` branch — or worse, the format is hardcoded and the wrong sheet is silently processed.
+---
+## Installation
+```bash
+pip install samplesheet-parser
+```
+Requires Python 3.10+. The only mandatory dependency is [`loguru`](https://github.com/Delgan/loguru).
+---
+## Quickstart
+### Auto-detect format (recommended)
+```python
+from samplesheet_parser import SampleSheetFactory
+factory = SampleSheetFactory()
+sheet = factory.create_parser("SampleSheet.csv", parse=True)
+print(factory.version)           # SampleSheetVersion.V1 or .V2
+print(sheet.index_type())        # "dual", "single", or "none"
+print(factory.get_umi_length())  # 0 if no UMI
+for sample in sheet.samples():
+    print(sample["sample_id"], sample["index"])
+```
+### Validate before demultiplexing
+```python
+from samplesheet_parser import SampleSheetFactory, SampleSheetValidator
+sheet = SampleSheetFactory().create_parser("SampleSheet.csv", parse=True)
+result = SampleSheetValidator().validate(sheet)
+print(result.summary())
+# PASS — 0 error(s), 1 warning(s)
+for err in result.errors:
+    print(err)
+# [ERROR] DUPLICATE_INDEX: Index 'ATTACTCG+TATAGCCT' appears more than once in lane 1.
+```
+### UMI extraction (V2 only)
+```python
+from samplesheet_parser import SampleSheetV2
+sheet = SampleSheetV2("SampleSheet.csv", parse=True)
+# OverrideCycles: Y151;I10U9;I10;Y151 → 9 bp UMI in Index1
+print(sheet.get_umi_length())        # 9
+rs = sheet.get_read_structure()
+print(rs.umi_location)               # "index2"
+print(rs.read_structure)             # {"read1_template": 151, "index2_length": 10, "index2_umi": 9, ...}
+```
+### Use parsers directly
+```python
+from samplesheet_parser import SampleSheetV1, SampleSheetV2
+# V1
+v1 = SampleSheetV1("SampleSheet_v1.csv", parse=True)
+print(v1.experiment_name)    # "240115_A01234_0042_AHJLG7DRXX"
+print(v1.instrument_type)    # "NovaSeq 6000"
+print(v1.adapter_read1)      # "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
+print(v1.adapter_read2)      # "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
+print(v1.reverse_complement) # 0
+print(v1.read_lengths)       # [151, 151]
+# V2
+v2 = SampleSheetV2("SampleSheet_v2.csv", parse=True)
+print(v2.instrument_platform)  # "NovaSeqXSeries"
+print(v2.software_version)     # "3.9.3"
+```
+---
+## Format detection logic
+The factory uses a three-step strategy — stopping as early as possible:
+```
+1. Scan [Header] for FileFormatVersion  → V2
+                  or IEMFileVersion     → V1
+2. If undetermined: scan full file for
+   [BCLConvert_Settings] or
+   [BCLConvert_Data]                    → V2
+3. Default                              → V1
+```
+The file is read only once. No second `open()`, no `seek()`.
+---
+## OverrideCycles decoding
+The V2 `OverrideCycles` string encodes the full read structure using single-letter type codes:
+| Code | Meaning |
+|---|---|
+| `Y` | Template (sequenced) bases |
+| `I` | Index bases |
+| `U` | UMI bases |
+| `N` | Masked / skipped bases |
+Segment order: `Read1 ; Index1 ; Index2 ; Read2`
+| OverrideCycles | UMI length | UMI location |
+|---|---|---|
+| `Y151;I10;I10;Y151` | 0 | — |
+| `Y151;I10U9;I10;Y151` | 9 bp | Index1 |
+| `U5Y146;I8;I8;U5Y146` | 5 bp | Read1 + Read2 |
+| `Y76;I8;Y76` | 0 | — (single-index) |
+---
+## Validation checks
+| Code | Level | Condition |
+|---|---|---|
+| `EMPTY_SAMPLES` | error | No samples found in data section |
+| `INVALID_INDEX_CHARS` | error | Index contains non-ACGTN characters |
+| `INDEX_TOO_LONG` | error | Index longer than 24 bp |
+| `DUPLICATE_INDEX` | error | Two samples share an index in the same lane |
+| `DUPLICATE_SAMPLE_ID` | error | Same `Sample_ID` appears twice in one lane |
+| `INDEX_TOO_SHORT` | warning | Index shorter than 6 bp |
+| `NO_ADAPTERS` | warning | No adapter sequences configured |
+| `ADAPTER_MISMATCH` | warning | Adapter is not a standard Illumina sequence |
+---
+## API reference
+### `SampleSheetFactory`
+```python
+factory = SampleSheetFactory()
+sheet = factory.create_parser(path, *, clean=True, experiment_id=None, parse=None)
+```
+| Attribute / Method | Returns | Description |
+|---|---|---|
+| `.create_parser(path, ...)` | `SampleSheetV1 \| SampleSheetV2` | Auto-detect format and return parser |
+| `.get_umi_length()` | `int` | UMI length from current parser |
+| `.version` | `SampleSheetVersion` | Detected version after `create_parser()` |
+### Shared interface — `SampleSheetV1` and `SampleSheetV2`
+| Method / Attribute | Returns | Description |
+|---|---|---|
+| `.parse(do_clean=True)` | `None` | Parse all sections |
+| `.samples()` | `list[dict]` | One record per unique sample |
+| `.index_type()` | `str` | `"dual"`, `"single"`, or `"none"` |
+| `.adapters` | `list[str]` | All configured adapter sequences |
+| `.experiment_name` | `str \| None` | Run or experiment name |
+| `.read_lengths` / `.reads` | `list[int]` / `dict` | Read cycle lengths |
+### V1-specific
+| Attribute | Type | Description |
+|---|---|---|
+| `.iem_version` | `str \| None` | e.g. `"5"` |
+| `.instrument_type` | `str \| None` | e.g. `"NovaSeq 6000"`, `"MiSeq"` |
+| `.application` | `str \| None` | e.g. `"FASTQ Only"` |
+| `.assay` | `str \| None` | Library prep kit name |
+| `.index_adapters` | `str \| None` | Illumina index set name |
+| `.chemistry` | `str \| None` | `"Amplicon"` = dual index, `"Default"` = single/no index |
+| `.adapter_read1` | `str` | Read 1 adapter (`Adapter` or `AdapterRead1` key) |
+| `.adapter_read2` | `str` | Read 2 adapter (`AdapterRead2` key) |
+| `.reverse_complement` | `int` | `0` = default, `1` = reverse-complement R2 (Nextera MP only) |
+| `.flowcell_id` | `str \| None` | Parsed from experiment ID run folder name |
+### V2-specific
+| Method / Attribute | Returns | Description |
+|---|---|---|
+| `.get_umi_length()` | `int` | UMI length from `OverrideCycles` |
+| `.get_read_structure()` | `ReadStructure` | Full decoded read structure |
+| `.instrument_platform` | `str \| None` | e.g. `"NovaSeqXSeries"` |
+| `.software_version` | `str \| None` | BCLConvert version string |
+| `.custom_fields` | `dict[str, set[str]]` | Non-standard fields by section |
+---
+## Example sample sheets
+The [`examples/sample_sheets/`](examples/sample_sheets/) directory contains ready-to-use reference sheets for every supported configuration:
+| File | Format | Instrument | UMI | Use case |
+|---|---|---|---|---|
+| `v1_dual_index.csv` | V1 | NovaSeq 6000 | No | Standard WGS, multi-lane |
+| `v1_single_index.csv` | V1 | NextSeq 500 | No | Small RNA |
+| `v1_multi_lane.csv` | V1 | NovaSeq 6000 | No | 4 lanes, mixed projects |
+| `v2_novaseq_x_dual_index.csv` | V2 | NovaSeq X | No | Standard PE150 |
+| `v2_with_index_umi.csv` | V2 | NovaSeq X | Index1 UMI (9 bp) | cfDNA / liquid biopsy |
+| `v2_with_read_umi.csv` | V2 | NovaSeq X | Read UMI (5 bp) | Duplex sequencing |
+| `v2_nextseq_single_index.csv` | V2 | NextSeq 1000/2000 | No | Amplicon panel |
+Run the demo to parse all of them:
+```bash
+python examples/parse_examples.py
+```
+---
+## V1 adapter key reference
+From the [Illumina IEM specification](https://knowledge.illumina.com/software/on-premises-software/software-on-premises-software-reference_material-list/000002204), the correct V1 `[Settings]` adapter keys are:
+```csv
+[Settings]
+ReverseComplement,0
+Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
+AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
+```
+- `Adapter` = Read 1 adapter (primary key per IEM spec)
+- `AdapterRead2` = Read 2 adapter (explicit separate key)
+- `AdapterRead1` = BCLConvert V1-mode alias for `Adapter` (also accepted)
+- `ReverseComplement,1` = only for Nextera Mate Pair libraries; `0` for everything else
+---
+## Project structure
+```
+samplesheet-parser/
+├── samplesheet_parser/
+│   ├── __init__.py          # Public API
+│   ├── factory.py           # SampleSheetFactory — auto-detection
+│   ├── enums.py             # SampleSheetVersion, IndexType, ...
+│   ├── validators.py        # SampleSheetValidator, ValidationResult
+│   └── parsers/
+│       ├── v1.py            # IEM V1 parser (bcl2fastq)
+│       └── v2.py            # BCLConvert V2 parser (NovaSeq X)
+├── tests/
+│   ├── conftest.py          # Shared fixtures
+│   ├── test_factory.py
+│   ├── test_parsers/
+│   │   ├── test_v1.py
+│   │   └── test_v2.py
+│   └── test_validators/
+│       └── test_validators.py
+├── examples/
+│   ├── parse_examples.py    # Demo script
+│   └── sample_sheets/       # Reference SampleSheet.csv files
+├── pyproject.toml
+├── LICENSE
+└── README.md
+```
+---
+## Development
+```bash
+git clone https://github.com/chaitanyakasaraneni/samplesheet-parser
+cd samplesheet-parser
+pip install -e ".[dev]"
+# Run tests
+pytest tests/ -v
+# Run example demo
+python examples/parse_examples.py
+```
+---
+## Citation
+If you use this library in a published pipeline or analysis, please cite:
+```bibtex
+@software{kasaraneni2026samplsheetparser,
+  author  = {Kasaraneni, Chaitanya},
+  title   = {samplesheet-parser: Format-agnostic parser for Illumina SampleSheet.csv},
+  year    = {2026},
+  url     = {https://github.com/chaitanyakasaraneni/samplesheet-parser},
+  version = {0.1.0}
+}
+```
+---
+## License
+Apache 2.0 — see [LICENSE](LICENSE).
+---
+## Related resources
+- [Illumina IEM V1 SampleSheet reference (Knowledge Article #2204)](https://knowledge.illumina.com/software/on-premises-software/software-on-premises-software-reference_material-list/000002204)
+- [BCLConvert Software Guide](https://support.illumina.com/sequencing/sequencing_software/bcl-convert.html)
+- [Upgrading from bcl2fastq to BCLConvert](https://knowledge.illumina.com/software/general/software-general-reference_material-list/000003710)