PyPI - rdkit-cli - Versions diffs - 0.1.0__tar.gz → 0.3.0__tar.gz - Mend

rdkit-cli 0.1.0.tar.gz → 0.3.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (117)

rdkit_cli-0.3.0/CHANGELOG.md ADDED

@@ -0,0 +1,77 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.3.0] - 2026-01-10
+### Added
+- **info**: Quick molecule information from SMILES (formula, MW, LogP, TPSA, stereocenters, Lipinski violations, InChI/InChIKey)
+- **merge**: Combine multiple molecule files with optional deduplication and source tracking
+- **sascorer**: Calculate Synthetic Accessibility (SA) Score, Natural Product-likeness (NPC), and QED scores
+- **rgroup**: R-group decomposition around a core SMARTS pattern with labeled attachment points
+- **rings**: Ring system analysis - extract ring systems (fused, spiro, bridged) and analyze frequencies
+- **align**: 3D molecular alignment to a reference structure (MCS-based or Open3DAlign)
+- **rmsd**: RMSD calculations between 3D structures (compare to reference, pairwise matrix, conformer analysis)
+- **mmp**: Matched Molecular Pairs analysis - fragment molecules, find pairs, apply transformations
+- **protonate**: Protonation state enumeration at specified pH with neutralization option
+- **props**: Property column operations - add, rename, drop, keep columns in molecule files
+### Changed
+- Total command count increased from 19 to 29
+## [0.2.0] - 2026-01-06
+### Added
+- **stats**: Calculate dataset statistics (MolWt, LogP, TPSA, etc. with min/max/mean/median/stdev)
+- **split**: Split files into smaller chunks (by number of chunks or chunk size)
+- **sample**: Randomly sample molecules (by count or fraction, with reservoir sampling for large files)
+- **deduplicate**: Remove duplicate molecules (by SMILES, InChI, InChIKey, or scaffold)
+- **validate**: Validate molecular structures (valence, kekulization, stereo, element constraints)
+### Changed
+- Commands are now displayed in alphabetical order in help output
+- Total command count increased from 14 to 19
+## [0.1.0] - 2026-01-06
+### Added
+- Initial release with 14 command categories
+- **descriptors**: Compute molecular descriptors (200+ available)
+- **fingerprints**: Generate molecular fingerprints (morgan, maccs, rdkit, atompair, torsion, pattern)
+- **filter**: Filter molecules by substructure, properties, drug-likeness (Lipinski/Veber/Ghose), PAINS
+- **convert**: Convert between molecular file formats (CSV, TSV, SMI, SDF, Parquet)
+- **standardize**: Standardize and canonicalize molecules
+- **similarity**: Similarity search, matrix computation, and clustering
+- **conformers**: Generate and optimize 3D conformers
+- **reactions**: SMIRKS transformations and reaction enumeration
+- **scaffold**: Murcko scaffold extraction and decomposition
+- **enumerate**: Stereoisomer and tautomer enumeration
+- **fragment**: BRICS/RECAP fragmentation and functional group analysis
+- **diversity**: MaxMin diversity picking and diversity analysis
+- **mcs**: Maximum Common Substructure finding
+- **depict**: SVG/PNG molecular depictions (single, batch, grid)
+### Features
+- Multi-core parallel processing via ProcessPoolExecutor
+- Ninja-style progress display with speed and ETA
+- Support for multiple I/O formats (CSV, TSV, SMI, SDF, Parquet)
+- Automatic format detection from file extensions
+- Lazy imports for fast CLI startup (~0.08s)
+- Comprehensive test suite (182 tests)
+### Dependencies
+- rdkit>=2024.3.1
+- rich-argparse>=1.4.0
+- pandas>=2.0.0
+- pyarrow>=14.0.0
+- numpy>=1.24.0

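The 0.2.0 changelog entry above says `sample` uses reservoir sampling for large files. The package source is not shown in this diff, so the following is only a minimal sketch of the general technique (Algorithm R), not rdkit-cli's actual implementation. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick up to k items uniformly at random from a stream in one pass.

    Hypothetical helper for illustration; not taken from rdkit-cli.
    Only k items are held in memory, regardless of the stream length.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)  # private RNG so a seed gives repeatable output
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Because the stream is consumed once and never materialized, the same function works on a generator that yields rows from a file too large to load, which is presumably the motivation for the `--stream` flag shown in the README hunks.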
{rdkit_cli-0.1.0 → rdkit_cli-0.3.0}/PKG-INFO RENAMED

@@ -1,11 +1,11 @@
 Metadata-Version: 2.4
 Name: rdkit-cli
-Version: 0.1.0
+Version: 0.3.0
 Summary: A comprehensive CLI tool for RDKit cheminformatics operations
 Project-URL: Homepage, https://github.com/vitruves/rdkit-cli
 Project-URL: Repository, https://github.com/vitruves/rdkit-cli
 Project-URL: Issues, https://github.com/vitruves/rdkit-cli/issues
-Author: Vitruves
+Author: Johan HG Natter
 License-Expression: Apache-2.0
 License-File: LICENSE
 Keywords: cheminformatics,chemistry,cli,fingerprints,molecular-descriptors,rdkit
@@ -38,7 +38,7 @@ A comprehensive, high-performance CLI tool wrapping RDKit functionality for chem
 ## Features
-- **14 Command Categories**: descriptors, fingerprints, filter, convert, standardize, similarity, conformers, reactions, scaffold, enumerate, fragment, diversity, mcs, depict
+- **29 Command Categories**: align, conformers, convert, deduplicate, depict, descriptors, diversity, enumerate, filter, fingerprints, fragment, info, mcs, merge, mmp, props, protonate, reactions, rgroup, rings, rmsd, sample, sascorer, scaffold, similarity, split, standardize, stats, validate
 - **Multiple Input/Output Formats**: CSV, TSV, SMI, SDF, Parquet
 - **Parallel Processing**: Efficient multi-core support via ProcessPoolExecutor
 - **Ninja-style Progress**: Real-time progress display with speed and ETA
@@ -290,6 +290,247 @@ rdkit-cli depict batch -i molecules.csv -o images/ -f svg
 rdkit-cli depict grid -i molecules.csv -o grid.svg --mols-per-row 4
 ```
+### stats
+Calculate dataset statistics.
+```bash
+# Basic statistics
+rdkit-cli stats -i molecules.csv -o stats.json --format json
+# Specific properties
+rdkit-cli stats -i molecules.csv -p MolWt,LogP,TPSA
+# List available properties
+rdkit-cli stats -i molecules.csv --list-properties
+```
+### split
+Split files into smaller chunks.
+```bash
+# Split into N files
+rdkit-cli split -i large.csv -o chunks/ -c 10
+# Split by chunk size
+rdkit-cli split -i large.csv -o chunks/ -s 1000
+# With custom prefix
+rdkit-cli split -i large.csv -o chunks/ -c 5 --prefix molecules
+```
+### sample
+Randomly sample molecules.
+```bash
+# Sample by count
+rdkit-cli sample -i molecules.csv -o sample.csv -k 100 --seed 42
+# Sample by fraction
+rdkit-cli sample -i molecules.csv -o sample.csv -f 0.1
+# Memory-efficient streaming (reservoir sampling)
+rdkit-cli sample -i huge.csv -o sample.csv -k 1000 --stream
+```
+### deduplicate
+Remove duplicate molecules.
+```bash
+# Deduplicate by canonical SMILES (default)
+rdkit-cli deduplicate -i molecules.csv -o unique.csv
+# Deduplicate by InChIKey
+rdkit-cli deduplicate -i molecules.csv -o unique.csv -b inchikey
+# Deduplicate by scaffold
+rdkit-cli deduplicate -i molecules.csv -o unique.csv -b scaffold
+# Keep last occurrence instead of first
+rdkit-cli deduplicate -i molecules.csv -o unique.csv --keep last
+```
+### validate
+Validate molecular structures.
+```bash
+# Basic validation
+rdkit-cli validate -i molecules.csv -o validated.csv
+# Output only valid molecules
+rdkit-cli validate -i molecules.csv -o valid.csv --valid-only
+# With constraints
+rdkit-cli validate -i molecules.csv -o validated.csv \
+    --max-atoms 100 --max-rings 8
+# Check allowed elements
+rdkit-cli validate -i molecules.csv -o validated.csv \
+    --allowed-elements C,H,N,O,S,F,Cl
+# Check stereo and show summary
+rdkit-cli validate -i molecules.csv -o validated.csv \
+    --check-stereo --summary
+```
+### info
+Quick molecule information from SMILES.
+```bash
+# Basic info
+rdkit-cli info "CCO"
+# JSON output
+rdkit-cli info "c1ccccc1" --json
+# Shows: formula, MW, LogP, TPSA, stereocenters, Lipinski violations, InChI/InChIKey
+```
+### merge
+Combine multiple molecule files.
+```bash
+# Merge two files
+rdkit-cli merge -i file1.csv file2.csv -o merged.csv
+# Merge with deduplication
+rdkit-cli merge -i file1.csv file2.csv -o merged.csv --dedupe
+# Track source file
+rdkit-cli merge -i file1.csv file2.csv -o merged.csv --source-column source
+```
+### sascorer
+Calculate synthetic accessibility and drug-likeness scores.
+```bash
+# SA Score only (default)
+rdkit-cli sascorer -i molecules.csv -o scores.csv
+# Include QED score
+rdkit-cli sascorer -i molecules.csv -o scores.csv --qed
+# Include Natural Product-likeness score
+rdkit-cli sascorer -i molecules.csv -o scores.csv --npc
+# All scores
+rdkit-cli sascorer -i molecules.csv -o scores.csv --qed --npc
+```
+### rgroup
+R-group decomposition around a core structure.
+```bash
+# Decompose around benzene core
+rdkit-cli rgroup -i molecules.csv -o decomposed.csv --core "c1ccc([*:1])cc1"
+# Multiple attachment points
+rdkit-cli rgroup -i molecules.csv -o decomposed.csv \
+    --core "c1ccc([*:1])cc([*:2])1"
+```
+### rings
+Ring system analysis.
+```bash
+# Extract ring systems
+rdkit-cli rings extract -i molecules.csv -o rings.csv
+# Ring information (counts, sizes, aromaticity)
+rdkit-cli rings info -i molecules.csv -o ring_info.csv
+# Frequency analysis
+rdkit-cli rings frequency -i molecules.csv -o ring_freq.csv
+```
+### align
+3D molecular alignment.
+```bash
+# Align to reference structure (MCS-based)
+rdkit-cli align -i probes.sdf -o aligned.sdf -r reference.sdf
+# Open3DAlign method
+rdkit-cli align -i probes.sdf -o aligned.sdf -r reference.sdf --method o3a
+```
+### rmsd
+RMSD calculations between 3D structures.
+```bash
+# Compare to reference
+rdkit-cli rmsd compare -i molecules.sdf -o results.csv -r reference.sdf
+# Pairwise RMSD matrix
+rdkit-cli rmsd matrix -i molecules.sdf -o matrix.csv
+# Conformer RMSD analysis
+rdkit-cli rmsd conformers -i multi_conf.sdf -o conf_rmsd.csv
+```
+### mmp
+Matched Molecular Pairs analysis.
+```bash
+# Fragment molecules for MMP
+rdkit-cli mmp fragment -i molecules.csv -o fragments.csv
+# Find matched pairs
+rdkit-cli mmp pairs -i fragments.csv -o pairs.csv
+# Apply MMP transformation
+rdkit-cli mmp transform -i molecules.csv -o transformed.csv \
+    -t "[c:1][CH3]>>[c:1][NH2]"
+```
+### protonate
+Protonation state enumeration.
+```bash
+# Enumerate at physiological pH
+rdkit-cli protonate -i molecules.csv -o protonated.csv --ph 7.4
+# Neutralize charged molecules
+rdkit-cli protonate -i molecules.csv -o neutral.csv --neutralize
+# Enumerate all states
+rdkit-cli protonate -i molecules.csv -o states.csv --enumerate-all
+```
+### props
+Property column operations.
+```bash
+# Add a column
+rdkit-cli props add -i molecules.csv -o output.csv -c series -v "series_A"
+# Rename a column
+rdkit-cli props rename -i molecules.csv -o output.csv --from name --to mol_name
+# Drop columns
+rdkit-cli props drop -i molecules.csv -o output.csv -c col1,col2
+# Keep only specific columns
+rdkit-cli props keep -i molecules.csv -o output.csv -c smiles,name,MolWt
+# List columns
+rdkit-cli props list -i molecules.csv
+```
 ## Global Options
 | Option | Description |
@@ -319,19 +560,28 @@ rdkit-cli depict grid -i molecules.csv -o grid.svg --mols-per-row 4
 ### Cheminformatics Pipeline
 ```bash
-# 1. Standardize input molecules
-rdkit-cli standardize -i raw.csv -o std.csv --cleanup --neutralize
+# 1. Validate and filter input
+rdkit-cli validate -i raw.csv -o validated.csv --valid-only
+# 2. Deduplicate
+rdkit-cli deduplicate -i validated.csv -o unique.csv -b inchikey
+# 3. Standardize molecules
+rdkit-cli standardize -i unique.csv -o std.csv --cleanup --neutralize
-# 2. Filter by drug-likeness
+# 4. Filter by drug-likeness
 rdkit-cli filter druglike -i std.csv -o druglike.csv --rule lipinski
-# 3. Compute descriptors
+# 5. Compute descriptors
 rdkit-cli descriptors compute -i druglike.csv -o desc.csv -d MolWt,MolLogP,TPSA,HBD,HBA
-# 4. Select diverse subset
+# 6. Get dataset statistics
+rdkit-cli stats -i druglike.csv -o stats.json --format json
+# 7. Select diverse subset
 rdkit-cli diversity pick -i druglike.csv -o diverse.csv -k 500
-# 5. Generate depictions
+# 8. Generate depictions
 rdkit-cli depict grid -i diverse.csv -o library.svg --mols-per-row 10
 ```
@@ -358,6 +608,19 @@ rdkit-cli scaffold murcko -i library.csv -o scaffolds.csv
 rdkit-cli diversity analyze -i scaffolds.csv --smiles-column scaffold
 ```
+### Large Dataset Processing
+```bash
+# Sample from a huge dataset
+rdkit-cli sample -i huge_library.csv -o sample.csv -k 10000 --stream
+# Split for parallel processing
+rdkit-cli split -i library.csv -o batches/ -c 10
+# Process batches in parallel (using xargs)
+ls batches/*.csv | xargs -P 4 -I {} rdkit-cli descriptors compute -i {} -o {}.desc.csv -d MolWt,LogP
+```
 ## Development
 ```bash

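The `split` section in the hunk above offers two modes: a fixed number of chunks (`-c`) or a fixed chunk size (`-s`). The sketch below illustrates the difference on an in-memory list. It is an assumption-laden illustration: the function names are hypothetical, and how rdkit-cli itself distributes a remainder across chunks is not visible in this diff.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_size(rows: Sequence[T], size: int) -> List[List[T]]:
    """Cut rows into consecutive chunks of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(rows[i:i + size]) for i in range(0, len(rows), size)]


def split_by_count(rows: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Cut rows into exactly `n_chunks` consecutive chunks of near-equal length.

    Assumption: the first `len(rows) % n_chunks` chunks get one extra row.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(rows), n_chunks)
    chunks: List[List[T]] = []
    start = 0
    for idx in range(n_chunks):
        end = start + base + (1 if idx < extra else 0)
        chunks.append(list(rows[start:end]))
        start = end
    return chunks
```

With ten rows, `split_by_count(rows, 3)` yields chunks of 4, 3, and 3 rows, while `split_by_size(rows, 4)` yields chunks of 4, 4, and 2 rows.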
{rdkit_cli-0.1.0 → rdkit_cli-0.3.0}/README.md RENAMED

@@ -4,7 +4,7 @@ A comprehensive, high-performance CLI tool wrapping RDKit functionality for chem
 ## Features
-- **14 Command Categories**: descriptors, fingerprints, filter, convert, standardize, similarity, conformers, reactions, scaffold, enumerate, fragment, diversity, mcs, depict
+- **29 Command Categories**: align, conformers, convert, deduplicate, depict, descriptors, diversity, enumerate, filter, fingerprints, fragment, info, mcs, merge, mmp, props, protonate, reactions, rgroup, rings, rmsd, sample, sascorer, scaffold, similarity, split, standardize, stats, validate
 - **Multiple Input/Output Formats**: CSV, TSV, SMI, SDF, Parquet
 - **Parallel Processing**: Efficient multi-core support via ProcessPoolExecutor
 - **Ninja-style Progress**: Real-time progress display with speed and ETA
@@ -256,6 +256,247 @@ rdkit-cli depict batch -i molecules.csv -o images/ -f svg
 rdkit-cli depict grid -i molecules.csv -o grid.svg --mols-per-row 4
 ```
+### stats
+Calculate dataset statistics.
+```bash
+# Basic statistics
+rdkit-cli stats -i molecules.csv -o stats.json --format json
+# Specific properties
+rdkit-cli stats -i molecules.csv -p MolWt,LogP,TPSA
+# List available properties
+rdkit-cli stats -i molecules.csv --list-properties
+```
+### split
+Split files into smaller chunks.
+```bash
+# Split into N files
+rdkit-cli split -i large.csv -o chunks/ -c 10
+# Split by chunk size
+rdkit-cli split -i large.csv -o chunks/ -s 1000
+# With custom prefix
+rdkit-cli split -i large.csv -o chunks/ -c 5 --prefix molecules
+```
+### sample
+Randomly sample molecules.
+```bash
+# Sample by count
+rdkit-cli sample -i molecules.csv -o sample.csv -k 100 --seed 42
+# Sample by fraction
+rdkit-cli sample -i molecules.csv -o sample.csv -f 0.1
+# Memory-efficient streaming (reservoir sampling)
+rdkit-cli sample -i huge.csv -o sample.csv -k 1000 --stream
+```
+### deduplicate
+Remove duplicate molecules.
+```bash
+# Deduplicate by canonical SMILES (default)
+rdkit-cli deduplicate -i molecules.csv -o unique.csv
+# Deduplicate by InChIKey
+rdkit-cli deduplicate -i molecules.csv -o unique.csv -b inchikey
+# Deduplicate by scaffold
+rdkit-cli deduplicate -i molecules.csv -o unique.csv -b scaffold
+# Keep last occurrence instead of first
+rdkit-cli deduplicate -i molecules.csv -o unique.csv --keep last
+```
+### validate
+Validate molecular structures.
+```bash
+# Basic validation
+rdkit-cli validate -i molecules.csv -o validated.csv
+# Output only valid molecules
+rdkit-cli validate -i molecules.csv -o valid.csv --valid-only
+# With constraints
+rdkit-cli validate -i molecules.csv -o validated.csv \
+    --max-atoms 100 --max-rings 8
+# Check allowed elements
+rdkit-cli validate -i molecules.csv -o validated.csv \
+    --allowed-elements C,H,N,O,S,F,Cl
+# Check stereo and show summary
+rdkit-cli validate -i molecules.csv -o validated.csv \
+    --check-stereo --summary
+```
+### info
+Quick molecule information from SMILES.
+```bash
+# Basic info
+rdkit-cli info "CCO"
+# JSON output
+rdkit-cli info "c1ccccc1" --json
+# Shows: formula, MW, LogP, TPSA, stereocenters, Lipinski violations, InChI/InChIKey
+```
+### merge
+Combine multiple molecule files.
+```bash
+# Merge two files
+rdkit-cli merge -i file1.csv file2.csv -o merged.csv
+# Merge with deduplication
+rdkit-cli merge -i file1.csv file2.csv -o merged.csv --dedupe
+# Track source file
+rdkit-cli merge -i file1.csv file2.csv -o merged.csv --source-column source
+```
+### sascorer
+Calculate synthetic accessibility and drug-likeness scores.
+```bash
+# SA Score only (default)
+rdkit-cli sascorer -i molecules.csv -o scores.csv
+# Include QED score
+rdkit-cli sascorer -i molecules.csv -o scores.csv --qed
+# Include Natural Product-likeness score
+rdkit-cli sascorer -i molecules.csv -o scores.csv --npc
+# All scores
+rdkit-cli sascorer -i molecules.csv -o scores.csv --qed --npc
+```
+### rgroup
+R-group decomposition around a core structure.
+```bash
+# Decompose around benzene core
+rdkit-cli rgroup -i molecules.csv -o decomposed.csv --core "c1ccc([*:1])cc1"
+# Multiple attachment points
+rdkit-cli rgroup -i molecules.csv -o decomposed.csv \
+    --core "c1ccc([*:1])cc([*:2])1"
+```
+### rings
+Ring system analysis.
+```bash
+# Extract ring systems
+rdkit-cli rings extract -i molecules.csv -o rings.csv
+# Ring information (counts, sizes, aromaticity)
+rdkit-cli rings info -i molecules.csv -o ring_info.csv
+# Frequency analysis
+rdkit-cli rings frequency -i molecules.csv -o ring_freq.csv
+```
+### align
+3D molecular alignment.
+```bash
+# Align to reference structure (MCS-based)
+rdkit-cli align -i probes.sdf -o aligned.sdf -r reference.sdf
+# Open3DAlign method
+rdkit-cli align -i probes.sdf -o aligned.sdf -r reference.sdf --method o3a
+```
+### rmsd
+RMSD calculations between 3D structures.
+```bash
+# Compare to reference
+rdkit-cli rmsd compare -i molecules.sdf -o results.csv -r reference.sdf
+# Pairwise RMSD matrix
+rdkit-cli rmsd matrix -i molecules.sdf -o matrix.csv
+# Conformer RMSD analysis
+rdkit-cli rmsd conformers -i multi_conf.sdf -o conf_rmsd.csv
+```
+### mmp
+Matched Molecular Pairs analysis.
+```bash
+# Fragment molecules for MMP
+rdkit-cli mmp fragment -i molecules.csv -o fragments.csv
+# Find matched pairs
+rdkit-cli mmp pairs -i fragments.csv -o pairs.csv
+# Apply MMP transformation
+rdkit-cli mmp transform -i molecules.csv -o transformed.csv \
+    -t "[c:1][CH3]>>[c:1][NH2]"
+```
+### protonate
+Protonation state enumeration.
+```bash
+# Enumerate at physiological pH
+rdkit-cli protonate -i molecules.csv -o protonated.csv --ph 7.4
+# Neutralize charged molecules
+rdkit-cli protonate -i molecules.csv -o neutral.csv --neutralize
+# Enumerate all states
+rdkit-cli protonate -i molecules.csv -o states.csv --enumerate-all
+```
+### props
+Property column operations.
+```bash
+# Add a column
+rdkit-cli props add -i molecules.csv -o output.csv -c series -v "series_A"
+# Rename a column
+rdkit-cli props rename -i molecules.csv -o output.csv --from name --to mol_name
+# Drop columns
+rdkit-cli props drop -i molecules.csv -o output.csv -c col1,col2
+# Keep only specific columns
+rdkit-cli props keep -i molecules.csv -o output.csv -c smiles,name,MolWt
+# List columns
+rdkit-cli props list -i molecules.csv
+```
 ## Global Options
 | Option | Description |
@@ -285,19 +526,28 @@ rdkit-cli depict grid -i molecules.csv -o grid.svg --mols-per-row 4
 ### Cheminformatics Pipeline
 ```bash
-# 1. Standardize input molecules
-rdkit-cli standardize -i raw.csv -o std.csv --cleanup --neutralize
+# 1. Validate and filter input
+rdkit-cli validate -i raw.csv -o validated.csv --valid-only
+# 2. Deduplicate
+rdkit-cli deduplicate -i validated.csv -o unique.csv -b inchikey
+# 3. Standardize molecules
+rdkit-cli standardize -i unique.csv -o std.csv --cleanup --neutralize
-# 2. Filter by drug-likeness
+# 4. Filter by drug-likeness
 rdkit-cli filter druglike -i std.csv -o druglike.csv --rule lipinski
-# 3. Compute descriptors
+# 5. Compute descriptors
 rdkit-cli descriptors compute -i druglike.csv -o desc.csv -d MolWt,MolLogP,TPSA,HBD,HBA
-# 4. Select diverse subset
+# 6. Get dataset statistics
+rdkit-cli stats -i druglike.csv -o stats.json --format json
+# 7. Select diverse subset
 rdkit-cli diversity pick -i druglike.csv -o diverse.csv -k 500
-# 5. Generate depictions
+# 8. Generate depictions
 rdkit-cli depict grid -i diverse.csv -o library.svg --mols-per-row 10
 ```
@@ -324,6 +574,19 @@ rdkit-cli scaffold murcko -i library.csv -o scaffolds.csv
 rdkit-cli diversity analyze -i scaffolds.csv --smiles-column scaffold
 ```
+### Large Dataset Processing
+```bash
+# Sample from a huge dataset
+rdkit-cli sample -i huge_library.csv -o sample.csv -k 10000 --stream
+# Split for parallel processing
+rdkit-cli split -i library.csv -o batches/ -c 10
+# Process batches in parallel (using xargs)
+ls batches/*.csv | xargs -P 4 -I {} rdkit-cli descriptors compute -i {} -o {}.desc.csv -d MolWt,LogP
+```
 ## Development
 ```bash

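The `deduplicate` section in the README hunk above takes a key (`-b`, for example canonical SMILES, InChIKey, or scaffold) and a `--keep first|last` policy. As a rough sketch of that behavior, assuming each record has already been reduced to a comparable key, the logic might look like the following. The `deduplicate` function and its `key` callback are hypothetical; in the real tool the key would presumably be computed with RDKit, which is not used here.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def deduplicate(
    records: Iterable[T],
    key: Callable[[T], Hashable],
    keep: str = "first",
) -> List[T]:
    """Drop records whose key was already seen (hypothetical helper).

    keep="first" retains the earliest record per key, keep="last" the latest.
    Survivors are returned in their original input order.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    chosen: Dict[Hashable, Tuple[int, T]] = {}
    for idx, rec in enumerate(records):
        k = key(rec)
        # For "last", always overwrite; for "first", only store unseen keys.
        if keep == "last" or k not in chosen:
            chosen[k] = (idx, rec)
    # Sort by original position so output order follows the input.
    return [rec for _, rec in sorted(chosen.values(), key=lambda pair: pair[0])]
```

For example, with records `("a", 1), ("b", 2), ("a", 3)` keyed on the first element, `keep="first"` retains `("a", 1)` and `("b", 2)`, while `keep="last"` retains `("b", 2)` and `("a", 3)`.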