pywombat 1.0.0.tar.gz → 1.0.2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pywombat-1.0.2/.github/copilot-instructions.md +130 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/CHANGELOG.md +10 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/PKG-INFO +10 -7
- {pywombat-1.0.0 → pywombat-1.0.2}/README.md +9 -6
- {pywombat-1.0.0 → pywombat-1.0.2}/pyproject.toml +1 -1
- {pywombat-1.0.0 → pywombat-1.0.2}/src/pywombat/cli.py +20 -2
- {pywombat-1.0.0 → pywombat-1.0.2}/.github/workflows/publish.yml +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/.gitignore +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/.python-version +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/QUICKSTART.md +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/examples/README.md +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/examples/de_novo_mutations.yml +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/examples/rare_variants_high_impact.yml +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/src/pywombat/__init__.py +0 -0
- {pywombat-1.0.0 → pywombat-1.0.2}/uv.lock +0 -0

**.github/copilot-instructions.md** (new file)

````diff
@@ -0,0 +1,130 @@
+# PyWombat Development Guide
+
+## Project Overview
+
+PyWombat is a bioinformatics CLI tool that transforms bcftools tabulated TSV files into analysis-ready long-format data. It specializes in:
+- Expanding VCF INFO fields from `(null)` column into individual columns
+- Converting wide-format sample data to long-format (melting)
+- Extracting and calculating genotype metrics (GT, DP, GQ, AD, VAF)
+- De novo mutation (DNM) detection with sex-chromosome awareness
+- Trio/family analysis with pedigree-based parent genotype joining
+
+## Architecture
+
+### Single-File Monolith
+All functionality lives in [src/pywombat/cli.py](src/pywombat/cli.py) (~2100 lines). This is intentional for deployment simplicity with `uvx` (one-command installation-free execution).
+
+### Core Data Flow
+1. **Input**: bcftools tabulated TSV (wide format, one row per variant)
+2. **Transform Pipeline**:
+   - `format_bcftools_tsv_lazy()` → Expand INFO, melt samples, parse genotypes
+   - `apply_filters_lazy()` → Apply quality/expression filters OR DNM detection
+3. **Output**: Long-format TSV/Parquet (one row per variant-sample)
+
+### Key Functions
+- **`format_bcftools_tsv()`**: Eager transformation (collects data)
+- **`format_bcftools_tsv_lazy()`**: Streaming wrapper using Polars LazyFrame
+- **`apply_filters_lazy()`**: Conditional filter dispatcher (quality/expression OR DNM)
+- **`apply_de_novo_filter()`**: Complex vectorized DNM logic with PAR region handling
+- **`parse_impact_filter_expression()`**: Expression parser for flexible YAML-based filtering
+
````
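
The dispatch in `apply_filters_lazy()` described above comes down to one branch. A minimal sketch of that shape, assuming hypothetical config keys (`dnm`, `quality`, `min_dp`, `min_gq`); the real function in `cli.py` does considerably more:

```python
import polars as pl

def apply_filters_sketch(lf: pl.LazyFrame, cfg: dict) -> pl.LazyFrame:
    """Illustrative dispatcher: DNM detection and quality filtering are alternatives."""
    if "dnm" in cfg:
        # De novo branch: in pywombat this is delegated to apply_de_novo_filter().
        return lf.filter(pl.col("sample_gt").str.contains("1", literal=True))
    quality = cfg.get("quality", {})
    return lf.filter(
        (pl.col("sample_dp") >= quality.get("min_dp", 10))
        & (pl.col("sample_gq") >= quality.get("min_gq", 20))
    )
```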

````diff
+## Development Workflow
+
+### Setup & Running
+```bash
+# Install dependencies (uses uv package manager)
+uv sync
+
+# Run CLI locally
+uv run wombat input.tsv -o output
+
+# Production usage (no installation)
+uvx pywombat input.tsv -o output
+```
+
+### Testing
+- Test data: `tests/C0733-011-068.*` files (real variant data)
+- Compare script: [tests/compare_dnm_results.py](tests/compare_dnm_results.py) validates DNM detection
+- No formal test framework yet—manual validation against reference outputs in `tests/check/`
+
+## Critical Domain Logic
+
+### Genotype Parsing
+- Input format: `GT:DP:GQ:AD` (e.g., `0/1:25:99:10,15`)
+- Extracted fields:
+  - `sample_gt`: Genotype (0/1, 1/1, etc.)
+  - `sample_dp`: Total depth (25)
+  - `sample_gq`: Quality (99)
+  - `sample_ad`: **Second value** from AD (15 = alternate allele depth)
+  - `sample_vaf`: Calculated as `sample_ad / sample_dp` (0.6)
+
````
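
Those fields map naturally onto Polars string expressions. A self-contained sketch reproducing the `0/1:25:99:10,15` example above (illustrative only, with a made-up `sample_value` column name, not the code from `cli.py`):

```python
import polars as pl

# Parse one "GT:DP:GQ:AD" value per row into the fields listed above.
df = pl.DataFrame({"sample_value": ["0/1:25:99:10,15", "1/1:18:99:0,18"]})

parts = pl.col("sample_value").str.split(":")
df = df.with_columns(
    parts.list.get(0).alias("sample_gt"),
    parts.list.get(1).cast(pl.Int64).alias("sample_dp"),
    parts.list.get(2).cast(pl.Int64).alias("sample_gq"),
    # AD is "ref,alt"; keep the second value (alternate allele depth)
    parts.list.get(3).str.split(",").list.get(1).cast(pl.Int64).alias("sample_ad"),
).with_columns(
    (pl.col("sample_ad") / pl.col("sample_dp")).round(4).alias("sample_vaf")
)
print(df)  # first row: sample_ad = 15, sample_vaf = 0.6
```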

````diff
+### De Novo Mutation Detection
+Sex-chromosome-aware logic in `apply_de_novo_filter()`:
+- **Autosomes**: Both parents must have VAF < 2% (reference genotype)
+- **X chromosome (male proband)**: Mother must be reference; father is hemizygous (ignored)
+- **Y chromosome**: Only father checked (mother doesn't have Y)
+- **PAR regions**: Treated as autosomal (both parents checked)
+- **Hemizygous variants** (X/Y in males): Require VAF ≥ 85% (not ~50% like het)
+
+PAR regions defined in [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml):
+```yaml
+par_regions:
+  GRCh38:
+    PAR1: { chrom: "X", start: 10001, end: 2781479 }
+    PAR2: { chrom: "X", start: 155701383, end: 156030895 }
+```
+
````
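
Reduced to a single Polars predicate, the rules above look roughly like this sketch. The 2% parent and 85% hemizygous thresholds come from the guide; the column names (`chrom`, `sex`, `in_par`, `father_vaf`, `mother_vaf`, `sample_vaf`) and the 30% heterozygous floor are assumptions, and the real `apply_de_novo_filter()` is more involved:

```python
import polars as pl

HET_MIN_VAF = 0.30  # hypothetical; in pywombat this comes from the YAML config

father_ref = pl.col("father_vaf") < 0.02
mother_ref = pl.col("mother_vaf") < 0.02
autosomal_like = pl.col("in_par") | ~pl.col("chrom").is_in(["X", "Y"])
hemizygous = (
    (pl.col("sex") == "M") & pl.col("chrom").is_in(["X", "Y"]) & ~pl.col("in_par")
)

# Which parents must look reference-like depends on the chromosome class.
parent_check = (
    pl.when(autosomal_like)
    .then(father_ref & mother_ref)  # autosomes and PAR regions
    .when((pl.col("chrom") == "X") & (pl.col("sex") == "M"))
    .then(mother_ref)  # father is hemizygous, ignored
    .when(pl.col("chrom") == "Y")
    .then(father_ref)  # mother has no Y
    .otherwise(father_ref & mother_ref)
)

# Hemizygous calls must be near-homozygous; ordinary het calls sit near 50%.
proband_check = (
    pl.when(hemizygous)
    .then(pl.col("sample_vaf") >= 0.85)
    .otherwise(pl.col("sample_vaf") >= HET_MIN_VAF)
)

dnm_predicate = parent_check & proband_check  # usable as lf.filter(dnm_predicate)
```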

````diff
+## Conventions & Patterns
+
+### Polars Usage
+- **Lazy evaluation**: Use `scan_csv()` → `LazyFrame` → `sink_csv()` for streaming
+- **Streaming mode**: `collect(streaming=True)` for memory-efficient processing
+- **Schema overrides**: Force string type for sample IDs to prevent numeric inference:
+```python
+schema_overrides = {col: pl.Utf8 for col in ["sample", "FatherBarcode", ...]}
+```
+
````
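
Put together, a typical streaming invocation looks like the sketch below. The file paths are placeholders, the ID column names are the ones used in this guide, and `collect(streaming=True)` follows the guide's wording (the exact spelling varies across Polars versions):

```python
import polars as pl

# Force ID-like columns to strings at scan time, then stream the result to
# disk without materialising the full table in memory.
id_columns = ["sample", "FatherBarcode", "MotherBarcode"]
lf = pl.scan_csv(
    "input.tsv",  # placeholder path
    separator="\t",
    schema_overrides={col: pl.Utf8 for col in id_columns},
)

lf.sink_csv("output.tsv", separator="\t")  # fully streaming write

# Or, when a DataFrame is needed in memory:
df = pl.scan_csv("input.tsv", separator="\t").collect(streaming=True)
```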

````diff
+### YAML Configuration Files
+Filter configs in [examples/](examples/) define:
+- `quality`: Thresholds (DP, GQ, VAF ranges for het/hom)
+- `expression`: Polars-compatible filter expressions (see `parse_impact_filter_expression()`)
+- `dnm`: De novo detection parameters (parent thresholds, PAR regions)
+
+Example expression syntax:
+```yaml
+expression: "VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001"
+```
+Supports: `=`, `!=`, `<=`, `>=`, `<`, `>`, `&`, `|`, `()`, NULL, NaN
+
````
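
For orientation, the example expression above corresponds to roughly this hand-written Polars predicate. It is not the parser's output, and the cast assumes the gnomAD column arrives as text from INFO expansion:

```python
import polars as pl

# "VEP_IMPACT = HIGH & fafmax_faf95_max_genomes <= 0.001", written by hand.
# The parser additionally handles !=, <, >, |, parentheses, NULL and NaN.
predicate = (pl.col("VEP_IMPACT") == "HIGH") & (
    pl.col("fafmax_faf95_max_genomes").cast(pl.Float64) <= 0.001
)

# Applied to a lazy frame: filtered = lf.filter(predicate)
```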

````diff
+### Error Handling
+- Use `click.echo(..., err=True)` for verbose messages (stderr)
+- Raise `click.Abort()` instead of generic exceptions for CLI errors
+- Validate pedigree/config requirements early (e.g., DNM mode requires pedigree)
+
````
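
A toy command illustrating those three conventions (hypothetical options, not the real `wombat` CLI):

```python
import click

@click.command()
@click.option("--verbose", is_flag=True, help="Print progress messages to stderr.")
@click.option("--pedigree", type=click.Path(exists=True), default=None)
def toy(verbose, pedigree):
    """Toy command: stderr for chatter, click.Abort for fatal CLI errors."""
    if pedigree is None:
        # Validate requirements early, as the real CLI does for DNM mode.
        click.echo("Error: this mode requires a pedigree file (--pedigree).", err=True)
        raise click.Abort()  # exits with status 1, no Python traceback
    if verbose:
        click.echo(f"Using pedigree: {pedigree}", err=True)  # keeps stdout clean

if __name__ == "__main__":
    toy()
```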

````diff
+## Common Pitfalls
+
+1. **VAF Calculation**: Always use `sample_ad / sample_dp`, not first AD value
+2. **Genotype Filtering**: Use `str.contains("1")` not regex—handles 0/1, 1/0, 1/1, 1/2
+3. **Parent Joining**: Pedigree must have `sample_id`, `FatherBarcode`, `MotherBarcode` columns
+4. **Sex Normalization**: DNM filter accepts "1"/"2" or "M"/"F" for sex; normalizes to uppercase
+5. **Categorical Types**: Convert `#CHROM` from Categorical to Utf8 before string operations
+
````
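
Pitfalls 1, 2 and 5 in runnable form, with made-up data and column names taken from the guide:

```python
import polars as pl

df = pl.DataFrame(
    {
        "#CHROM": ["chr1", "chrX"],
        "sample_gt": ["1/0", "0/0"],
        "sample_dp": [20, 30],
        "sample_ad": [9, 0],  # already the second AD value (alternate depth)
    },
    schema_overrides={"#CHROM": pl.Categorical},
)

df = df.with_columns(
    # 5. Cast Categorical -> Utf8 before string operations on #CHROM
    pl.col("#CHROM").cast(pl.Utf8).str.replace("chr", "").alias("chrom_plain"),
    # 1. VAF = alternate allele depth / total depth
    (pl.col("sample_ad") / pl.col("sample_dp")).alias("sample_vaf"),
)

# 2. Substring test catches every alt-carrying genotype: 0/1, 1/0, 1/1, 1/2
carriers = df.filter(pl.col("sample_gt").str.contains("1", literal=True))
print(carriers)
```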

````diff
+## File References
+
+- Main CLI: [src/pywombat/cli.py](src/pywombat/cli.py)
+- Config examples: [examples/de_novo_mutations.yml](examples/de_novo_mutations.yml), [examples/rare_variants_high_impact.yml](examples/rare_variants_high_impact.yml)
+- Test data: [tests/C0733-011-068.pedigree.tsv](tests/C0733-011-068.pedigree.tsv)
+- Documentation: [README.md](README.md), [QUICKSTART.md](QUICKSTART.md)
+
+## Adding New Features
+
+When adding filters:
+1. Update YAML schema in config examples
+2. Extend `apply_filters_lazy()` or create specialized function (like `apply_de_novo_filter()`)
+3. Use lazy operations when possible; collect only if vectorization requires it
+4. Add verbose logging with `--verbose` flag support
+
+When modifying transforms:
+- Keep `format_bcftools_tsv()` and `format_bcftools_tsv_lazy()` in sync
+- Test with real data from `tests/` directory
+- Verify output with `compare_dnm_results.py` for DNM changes
````

**CHANGELOG.md**

````diff
@@ -5,6 +5,16 @@ All notable changes to PyWombat will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.0.1] - 2026-01-24
+
+### Added
+
+- **Boolean Flag Support in INFO Fields**: INFO field entries without `=` signs (e.g., `PASS`, `DB`, `SOMATIC`) are now extracted as boolean columns with `True`/`False` values instead of being ignored. This enables filtering on VCF flag fields.
+
+### Fixed
+
+- INFO field parsing now handles all field types correctly, including standalone boolean flags commonly used in VCF files.
+
 ## [1.0.0] - 2026-01-23
 
 First stable release of PyWombat! 🎉
````

**PKG-INFO**

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: pywombat
-Version: 1.0.0
+Version: 1.0.2
 Summary: A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support
 Project-URL: Homepage, https://github.com/bourgeron-lab/pywombat
 Project-URL: Repository, https://github.com/bourgeron-lab/pywombat
@@ -35,6 +35,7 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
 🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
 📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
 🎯 **Expression Filters**: Complex filtering with logical expressions
+🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
 ⚡ **Streaming Mode**: Memory-efficient processing of large files
 
 ---
@@ -77,7 +78,7 @@ uv run wombat input.tsv -o output
 
 PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
 
-1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) into separate columns
+1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) and boolean flags (e.g., `PASS`, `DB`) into separate columns
 2. **Melting sample columns**: Converts wide-format sample data into long format with one row per variant-sample combination
 3. **Extracting genotype data**: Parses `GT:DP:GQ:AD` format into separate columns with calculated VAF
 4. **Adding parent data**: Joins father/mother genotypes when pedigree is provided
@@ -88,20 +89,22 @@ PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
 **Input (Wide Format):**
 
 ```tsv
-#CHROM  POS  REF  ALT  (null)
-chr1    100  A    T    DP=30;AF=0.5;AC=2
+#CHROM  POS  REF  ALT  (null)                  Sample1:GT:DP:GQ:AD  Sample2:GT:DP:GQ:AD
+chr1    100  A    T    DP=30;AF=0.5;PASS;AC=2  0/1:15:99:5,10       1/1:18:99:0,18
 ```
 
 **Output (Long Format):**
 
 ```tsv
-#CHROM  POS  REF  ALT  AC  AF   DP  sample   sample_gt  sample_dp  sample_gq  sample_ad  sample_vaf
-chr1    100  A    T    2   0.5  30  Sample1  0/1        15         99         10         0.6667
-chr1    100  A    T    2   0.5  30  Sample2  1/1        18         99         18         1.0
+#CHROM  POS  REF  ALT  AC  AF   DP  PASS  sample   sample_gt  sample_dp  sample_gq  sample_ad  sample_vaf
+chr1    100  A    T    2   0.5  30  true  Sample1  0/1        15         99         10         0.6667
+chr1    100  A    T    2   0.5  30  true  Sample2  1/1        18         99         18         1.0
 ```
 
 **Generated Columns:**
 
+- INFO fields with `=`: Extracted as separate columns (e.g., `DP`, `AF`, `AC`)
+- INFO boolean flags: Extracted as True/False columns (e.g., `PASS`, `DB`, `SOMATIC`)
 - `sample`: Sample identifier
 - `sample_gt`: Genotype (e.g., 0/1, 1/1)
 - `sample_dp`: Read depth (total coverage)
````
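
The melting step shown in this example is essentially an unpivot from one row per variant to one row per variant-sample pair. A minimal standalone sketch, not pywombat's implementation (`unpivot` is the current Polars name; older releases call it `melt`):

```python
import polars as pl

wide = pl.DataFrame({
    "#CHROM": ["chr1"], "POS": [100], "REF": ["A"], "ALT": ["T"],
    "Sample1:GT:DP:GQ:AD": ["0/1:15:99:5,10"],
    "Sample2:GT:DP:GQ:AD": ["1/1:18:99:0,18"],
})

long = wide.unpivot(
    index=["#CHROM", "POS", "REF", "ALT"],
    on=["Sample1:GT:DP:GQ:AD", "Sample2:GT:DP:GQ:AD"],
    variable_name="sample",
    value_name="gt_raw",
).with_columns(
    # keep just the sample name from the original column header
    pl.col("sample").str.split(":").list.get(0)
)
print(long)  # two rows: one per variant-sample combination
```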

**README.md** (the same changes as in PKG-INFO, which embeds the README text)

````diff
@@ -13,6 +13,7 @@ A high-performance CLI tool for processing and filtering bcftools tabulated TSV
 🧬 **De Novo Detection**: Sex-chromosome-aware DNM identification
 📊 **Flexible Output**: TSV, compressed TSV, or Parquet formats
 🎯 **Expression Filters**: Complex filtering with logical expressions
+🏷️ **Boolean Flag Support**: INFO field flags (PASS, DB, etc.) extracted as True/False columns
 ⚡ **Streaming Mode**: Memory-efficient processing of large files
 
 ---
@@ -55,7 +56,7 @@ uv run wombat input.tsv -o output
 
 PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
 
-1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) into separate columns
+1. **Expanding the `(null)` INFO column**: Extracts all `NAME=value` fields (e.g., `DP=30;AF=0.5;AC=2`) and boolean flags (e.g., `PASS`, `DB`) into separate columns
 2. **Melting sample columns**: Converts wide-format sample data into long format with one row per variant-sample combination
 3. **Extracting genotype data**: Parses `GT:DP:GQ:AD` format into separate columns with calculated VAF
 4. **Adding parent data**: Joins father/mother genotypes when pedigree is provided
@@ -66,20 +67,22 @@ PyWombat transforms bcftools tabulated TSV files into analysis-ready formats by:
 **Input (Wide Format):**
 
 ```tsv
-#CHROM  POS  REF  ALT  (null)
-chr1    100  A    T    DP=30;AF=0.5;AC=2
+#CHROM  POS  REF  ALT  (null)                  Sample1:GT:DP:GQ:AD  Sample2:GT:DP:GQ:AD
+chr1    100  A    T    DP=30;AF=0.5;PASS;AC=2  0/1:15:99:5,10       1/1:18:99:0,18
 ```
 
 **Output (Long Format):**
 
 ```tsv
-#CHROM  POS  REF  ALT  AC  AF   DP  sample   sample_gt  sample_dp  sample_gq  sample_ad  sample_vaf
-chr1    100  A    T    2   0.5  30  Sample1  0/1        15         99         10         0.6667
-chr1    100  A    T    2   0.5  30  Sample2  1/1        18         99         18         1.0
+#CHROM  POS  REF  ALT  AC  AF   DP  PASS  sample   sample_gt  sample_dp  sample_gq  sample_ad  sample_vaf
+chr1    100  A    T    2   0.5  30  true  Sample1  0/1        15         99         10         0.6667
+chr1    100  A    T    2   0.5  30  true  Sample2  1/1        18         99         18         1.0
 ```
 
 **Generated Columns:**
 
+- INFO fields with `=`: Extracted as separate columns (e.g., `DP`, `AF`, `AC`)
+- INFO boolean flags: Extracted as True/False columns (e.g., `PASS`, `DB`, `SOMATIC`)
 - `sample`: Sample identifier
 - `sample_gt`: Genotype (e.g., 0/1, 1/1)
 - `sample_dp`: Read depth (total coverage)
````

**pyproject.toml**

````diff
@@ -1,6 +1,6 @@
 [project]
 name = "pywombat"
-version = "1.0.0"
+version = "1.0.2"
 description = "A CLI tool for processing and filtering bcftools tabulated TSV files with pedigree support"
 readme = "README.md"
 authors = [{ name = "Freddy Cliquet", email = "fcliquet@pasteur.fr" }]
````

**src/pywombat/cli.py**

````diff
@@ -1198,15 +1198,18 @@ def read_pedigree(pedigree_path: Path) -> pl.DataFrame:
     pedigree_df = df.select(select_cols)
 
     # Replace 0 and -9 with null (indicating no parent)
+    # Explicit cast to Utf8 ensures type is preserved even when all values become null
     pedigree_df = pedigree_df.with_columns(
         [
             pl.when(pl.col("father_id").cast(pl.Utf8).is_in(["0", "-9"]))
             .then(None)
             .otherwise(pl.col("father_id"))
+            .cast(pl.Utf8)
             .alias("father_id"),
             pl.when(pl.col("mother_id").cast(pl.Utf8).is_in(["0", "-9"]))
             .then(None)
             .otherwise(pl.col("mother_id"))
+            .cast(pl.Utf8)
             .alias("mother_id"),
         ]
     )
````
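
Context for the `.cast(pl.Utf8)` additions above: when a pedigree column is inferred as integers (parent IDs such as `0` and `-9`), the `when/then/otherwise` result keeps the integer dtype, which presumably breaks later string-keyed joins on parent IDs. A standalone illustration, not pywombat code:

```python
import polars as pl

# Both parent IDs mean "no parent"; the column was inferred as Int64.
ped = pl.DataFrame({"father_id": [0, -9]})

cleaned = ped.with_columns(
    pl.when(pl.col("father_id").cast(pl.Utf8).is_in(["0", "-9"]))
    .then(None)
    .otherwise(pl.col("father_id"))
    .cast(pl.Utf8)  # without this, the all-null result would stay Int64
    .alias("father_id")
)
print(cleaned.schema)  # father_id is now a string column, ready for joins
```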

````diff
@@ -1313,6 +1316,10 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
     This is a separate step that can be applied after filtering to avoid
     expensive annotation expansion on variants that will be filtered out.
 
+    Handles two types of INFO fields:
+    - Key-value pairs (e.g., "DP=30") -> extracted as string values
+    - Boolean flags (e.g., "PASS", "DB") -> created as True/False columns
+
     Args:
         df: DataFrame with (null) column
 
@@ -1324,9 +1331,10 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
         # Already expanded or missing - return as-is
         return df
 
-    # Extract all unique field names from the (null) column
+    # Extract all unique field names and flags from the (null) column
    null_values = df.select("(null)").to_series()
    all_fields = set()
+    all_flags = set()
 
    for value in null_values:
        if value and not (isinstance(value, float)):  # Skip null/NaN values
@@ -1335,8 +1343,10 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
                if "=" in pair:
                    field_name = pair.split("=", 1)[0]
                    all_fields.add(field_name)
+                elif pair.strip():  # Boolean flag (no '=')
+                    all_flags.add(pair.strip())
 
-    # Create expressions to extract each field
+    # Create expressions to extract each key-value field
    for field in sorted(all_fields):
        # Extract the field value from the (null) column
        # Pattern: extract value after "field=" and before ";" or end of string
@@ -1344,6 +1354,14 @@ def format_expand_annotations(df: pl.DataFrame) -> pl.DataFrame:
            pl.col("(null)").str.extract(f"{field}=([^;]+)").alias(field)
        )
 
+    # Create boolean columns for flags
+    for flag in sorted(all_flags):
+        # Check if flag appears in the (null) column (as whole word)
+        # Use regex to match flag as a separate field (not part of another field name)
+        df = df.with_columns(
+            pl.col("(null)").str.contains(f"(^|;){flag}(;|$)").alias(flag)
+        )
+
     # Drop the original (null) column
     df = df.drop("(null)")
 
````
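
As a quick standalone check of the `(^|;)flag(;|$)` pattern added above, applied to INFO-style strings (illustration only):

```python
import polars as pl

info = pl.DataFrame({"(null)": ["DP=30;AF=0.5;PASS;AC=2", "DP=12;AC=1", None]})

flag = "PASS"
out = info.with_columns(
    # True only when PASS occurs as a standalone field, not as part of another
    # field name or value (e.g. "PASSED=1" would not match).
    pl.col("(null)").str.contains(f"(^|;){flag}(;|$)").alias(flag)
)
print(out)
# The third row has a null INFO string, so the flag column is null there.
```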