PyPI - chem_mat_database - Versions diffs - 1.6.0__tar.gz - Mend

chem_mat_database 1.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (169)

chem_mat_database-1.6.0/.claude/commands/changelog.md ADDED

@@ -0,0 +1,20 @@
+---
+description: Document a feature
+---
+You are a documentation writing expert specializing in clear and concise technical writing.
+## Context
+- The code implementation of the feature(s) to be documented
+- The current version of the Changelog located in the `CHANGELOG.rst` file
+## Your task
+Your task is to create a new entry in the `CHANGELOG.rst` file documenting the latest changes made to the project as requested by the user below.
+You will add the changes to the end of the file, appending it to the last version that is listed there - You will NOT create a new version entry.
+Key practices:
+- Entries are to be clear, concise and informative. They should provide enough context for users to understand the changes without being overly verbose.

chem_mat_database-1.6.0/.claude/commands/debug.md ADDED

@@ -0,0 +1,28 @@
+---
+description: Debug a feature
+---
+You are an expert at debugging specializing in root cause analysis.
+## Your task
+Execute the requested piece of code and debug any issues that arise by following the steps below:
+1. Execute the code in question.
+2. Carefully read any error messages and/or logs that are produced.
+3. Isolate the root cause of the issue by examining the code and error messages.
+4. Implement a minimal fix to resolve the issue.
+5. Verify the solution works by re-executing the code or writing specific test cases if they do not yet exist.
+Debugging process:
+- Form and test hypotheses
+- Add strategic debug logging if necessary
+- Inspect variable states
+- Review relevant documentation
+- Focus on fixing the underlying issues, not just the symptoms
+At the end of the process, provide the user with:
+- Root cause explanation
+- Evidence supporting your findings
+- Specific code fix
+- Prevention recommendations

chem_mat_database-1.6.0/.claude/commands/documentation.md ADDED

@@ -0,0 +1,20 @@
+---
+description: Document a feature
+---
+You are a documentation writing expert specializing in clear and concise technical writing.
+## Context
+- The code implementation of the feature(s) to be documented
+- The existing documentation style and structure in the `docs/` folder
+## Your task
+Create one or more new entries in the mkdocs documentation located in the `docs/` folder and update the `mkdocs.yml` file to include the new documentation.
+Key practices:
+- Carefully read multiple existing documentation files to match the style and tone of the existing project documentation. This specifically relates to the consistent usage of terminology, formatting, emojis and the level of detail in which features are explained.
+- Ensure the new documentation is clear, concise and easy to understand for users of varying expertise levels.
+- Use markdown formatting for headings, code blocks, lists, links and other elements as appropriate.
+- Include code examples and usage instructions where relevant.

chem_mat_database-1.6.0/.claude/commands/feature.md ADDED

@@ -0,0 +1,22 @@
+---
+description: Implement a feature
+---
+You are an experienced senior software engineer tasked with implementing a new feature.
+## Your task
+Implement the requested feature according to the explanation provided by the user below.
+While implementing follow these key practices:
+- Carefully read some of the existing modules to match the style of the existing project code.
+- Ensure the new code is clear, concise and easy to understand. This particularly implies the usage of code comments and docstrings where appropriate.
+- After implementing the feature, add unit tests to the `tests/` folder that verify the new functionality works as intended. Look up where they best fit in the existing test structure, then run them together with any other tests related to the feature to confirm they pass and that nothing else is broken.
+- Follow the existing project structure and conventions for file organization, naming, and coding style.
+After completing the implementation, provide the user with:
+- A brief summary of the changes made
+- The updated or new code files
+- The new or updated tests
+- Potential pitfalls or edge cases to be aware of
+- Suggestions for future improvements or extensions to the feature

chem_mat_database-1.6.0/.claude/settings.local.json ADDED

@@ -0,0 +1,20 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(find:*)",
+      "Bash(source:*)",
+      "Bash(cmmanage --help:*)",
+      "Bash(cmmanage metadata:*)",
+      "Bash(cmdata:*)",
+      "Bash(pytest:*)",
+      "Bash(cat:*)",
+      "Bash(python:*)",
+      "Bash(curl:*)",
+      "WebSearch",
+      "Bash(echo:*)",
+      "Bash(uv build:*)"
+    ],
+    "deny": [],
+    "ask": []
+  }
+}

chem_mat_database-1.6.0/.github/workflows/docs.yml ADDED

@@ -0,0 +1,31 @@
+name: docs
+on:
+  push:
+    branches:
+      - master
+      - main
+permissions:
+  contents: write
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Configure Git Credentials
+        run: |
+          git config user.name github-actions[bot]
+          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
+      - uses: actions/cache@v4
+        with:
+          key: mkdocs-material-${{ env.cache_id }}
+          path: .cache
+          restore-keys: |
+            mkdocs-material-${{ env.cache_id }}
+      - run: pip install mkdocs-material
+      - run: mkdocs gh-deploy --force
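Reviewer note: the workflow above derives `cache_id` from the `'+%V'` format, which is the two-digit ISO 8601 week number, so the cache key changes once per week and the `.cache` folder is rebuilt on that schedule. The following Python sketch mirrors that computation for illustration only. The function names `weekly_cache_id` and `cache_key` are hypothetical and not part of the package or of GitHub Actions.

```python
from datetime import datetime, timezone
from typing import Optional


def weekly_cache_id(now: Optional[datetime] = None) -> str:
    """
    Returns the zero-padded ISO 8601 week number, the same value that
    the ``%V`` format of the ``date`` command produces.

    :param now: point in time to evaluate, defaults to the current UTC time

    :returns: two-character string between "01" and "53"
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # isocalendar() returns (ISO year, ISO week, ISO weekday)
    return f"{now.isocalendar()[1]:02d}"


def cache_key(now: Optional[datetime] = None) -> str:
    """
    Builds a key with the same shape as the one in the workflow
    (hypothetical helper for illustration).

    :param now: point in time to evaluate, defaults to the current UTC time

    :returns: the cache key string
    """
    return f"mkdocs-material-{weekly_cache_id(now)}"


if __name__ == "__main__":
    print(cache_key())
```

Because the key only depends on the week number, all runs within the same ISO week share one cache entry, and the first run of a new week starts with a fresh one.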

chem_mat_database-1.6.0/.gitignore ADDED

@@ -0,0 +1,48 @@
+venv/*
+venv/
+.vscode/
+tests/artifacts/
+# Byte-compiled / optimized / DLL files
+*/__pycache__/*
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+.tox
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# Custom
+chem_mat_data/scripts/results/*
+site/*
+adrs/*
+*.gz
+*.csv
+*.mpack
+*.zip
+*.data
+*.backup
+**/results/*
+**/lightning_logs/*

chem_mat_database-1.6.0/.mcp.json ADDED

@@ -0,0 +1,18 @@
+{
+  "mcpServers": {
+    "web-search": {
+      "command": "node",
+      "args": [
+        "/home/jonas/Private/web-search-mcp/dist/index.js"
+      ],
+      "env": {}
+    },
+    "file-download": {
+      "command": "node",
+      "args": [
+        "/home/jonas/Private/mcp-file-downloader/download-server.js"
+      ],
+      "env": {}
+    }
+  }
+}

chem_mat_database-1.6.0/.python-version ADDED

@@ -0,0 +1 @@
+3.10

chem_mat_database-1.6.0/.zenodo.json ADDED

@@ -0,0 +1,39 @@
+{
+    "title": "ChemMatData: Unified Chemistry and Material Science Datasets for Graph Neural Networks",
+    "upload_type": "software",
+    "access_right": "open",
+    "license": "MIT",
+    "language": "eng",
+    "creators": [
+        {
+            "name": "Teufel, Jonas",
+            "orcid": "0000-0002-9228-9395",
+            "affiliation": "Karlsruhe Institute of Technology"
+        },
+        {
+            "name": "Zeller, Jana",
+            "affiliation": "ETH Zurich / Max Planck Institute for Intelligent Systems"
+        },
+        {
+            "name": "Singh, Mohit",
+            "affiliation": "Karlsruhe Institute of Technology"
+        }
+    ],
+    "keywords": [
+        "chemistry",
+        "materials science",
+        "dataset",
+        "machine learning",
+        "graph neural network",
+        "molecular property prediction",
+        "Python"
+    ],
+    "description": "A Python package providing unified access to chemistry and material science datasets for machine learning applications, specifically designed for training graph neural networks (GNNs). Provides simple CLI and API interfaces to download datasets in raw (CSV/pandas) or processed (graph) format.",
+    "related_identifiers": [
+        {
+            "identifier": "https://github.com/the16thpythonist/chem_mat_data",
+            "relation": "isSupplementTo",
+            "scheme": "url"
+        }
+    ]
+}

chem_mat_database-1.6.0/AGENTS.md ADDED

@@ -0,0 +1,118 @@
+# Guide for AI Agents
+## Overview
+This project implements the *Chemistry and Materials Database* (ChemMatData) which is supposed to be a database
+containing various datasets related to chemistry and materials science in a format which is accessible for
+the training of neural network models and specifically graph neural networks (GNNs). The project is designed to
+have an intuitive python interface for easy downloading and processing of the datasets as well as a
+command line interface (CLI) to interact with the remote database.
+## Project Structure
+- `/chem_mat_data`: Python source files
+    - `/chem_mat_data/examples`: Example scripts which demonstrate how to use the project
+    - `/chem_mat_data/scripts`: Pycomex experiment files which implement the conversion of the various datasets from their
+    native data formats into the common graph format used by the project.
+    - `/chem_mat_data/templates`: Jinja2 templates
+- `/tests`: Pytest unit tests which are named "tests_" plus the name of the source python file
+    - `/tests/assets`: Additional files etc. which are needed by some of the unittests
+    - `/tests/artifacts`: Temp folder in which the tests save their results
+## Documentation
+Type hints should be used wherever possible.
+Every function/method should be properly documented by a Docstring using the **ReStructuredText** documentation style.
+The docstrings should start with a brief summary of the function, followed by the parameters, as illustrated by this example:
+```python
+def multiply(a: float, b: float) -> float:
+    """
+    Returns the product of ``a`` and ``b``.
+    :param a: first float input value
+    :param b: second float input value
+    :returns: The float result
+    """
+    return a * b
+```
+## Computational Experiments
+This project uses the [pycomex](https://github.com/the16thpythonist/pycomex) micro-framework for implementing and executing computational experiments.
+A computational experiment can be defined according to the following example:
+```python
+import matplotlib.pyplot as plt
+from pycomex.functional.experiment import Experiment
+from pycomex.utils import folder_path, file_namespace
+# :param PARAM1:
+#       Description for the parameter...
+PARAM1: int = 100
+__DEBUG__ = True # enables/disables debug mode
+experiment = Experiment(
+    base_path=folder_path(__file__),
+    namespace=file_namespace(__file__),
+    glob=globals()
+)
+@experiment.hook('util_function')
+def util_function(e: Experiment, param: int):
+    return param ** 2
+@experiment
+def experiment(e: Experiment):
+    # automatically pushes to log file as well as to the stdout
+    # parameters are accessed as attributes of the Experiment object
+    e.log(f'this is a parameter value: {e.PARAM1}')
+    # Store values into the experiment automatically
+    value = e.PARAM1 * 2  # placeholder computation
+    e['value'] = value
+    # The e.path contains the path to the artifacts folder of the
+    # current experiment run.
+    e.log(f'artifacts path: {e.path}')
+    # Easily store figures to the artifacts folder
+    fig: plt.Figure = ...
+    e.commit_fig(fig, 'figure1.png')
+    # using hooks instead of plain functions
+    result = e.apply_hook(
+        'util_function',
+        param=20,
+    )
+experiment.run_if_main()
+```
+## Code Convention
+1. Functions with many parameters should be split like this:
+```python
+def function(arg1: List[int],
+             arg2: List[float],
+             **kwargs
+             ) -> float:
+    # ...
+```
+## Testing
+Unit tests live in the `/tests` folder and are run with `pytest` using this command:
+```bash
+pytest -q -m "not localonly"
+```
+## Pull Requests / Contributing
+Pull Requests should always start with a short summary of the changes and a list of the changed files.
+Additionally, a PR should contain a short summary of the test results.

chem_mat_database-1.6.0/CHANGELOG.md ADDED

@@ -0,0 +1,189 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/),
+and this project adheres to [Semantic Versioning](https://semver.org/).
+## [1.6.0] - 2026-03-26
+### Added
+- `.zenodo.json` for Zenodo-GitHub integration to enable DOI generation for releases.
+- `CHANGELOG.md` replacing the previous `HISTORY.rst` file, now following the
+  [Keep a Changelog](https://keepachangelog.com/) convention.
+- Dataset processing script for the DUD-E (Directory of Useful Decoys - Enhanced) multi-target
+  virtual screening benchmark with 102 protein targets and tri-state labels.
+- Additional MUV helper script (`create_graph_datasets__muv_.py`).
+### Changed
+- Dropped Python 3.8 support; minimum supported version is now Python 3.9.
+- Substantially reworked the MUV dataset processing script to support multi-target classification
+  with tri-state labels (active/decoy/no data) across 17 biological targets.
+- Rewrote the BBBP dataset processing script with proper documentation, metadata, and updated
+  data source handling.
+- Updated DUD-E compound count in `metadata.yml` (1,111,394 to 400,040).
+- Added missing source reference for the BBBP dataset in `metadata.yml`.
+### Removed
+- `HISTORY.rst` (replaced by `CHANGELOG.md`).
+## [1.5.0] - 2025-11-27
+### Added
+- New `cmdata stats` command which downloads a given dataset to compute common
+  statistics such as the distribution of elements, graph sizes, and the most common motifs.
+## [1.4.1] - 2025-11-10
+### Changed
+- Changed the way in which the `--help` string is being printed to be more informative.
+### Fixed
+- Changed the way in which the Nextcloud remote accesses the public endpoint which has changed for
+  the transition to NextcloudHub 10.0. This had been a breaking change which prevented accessing the
+  remote file share location at all.
+## [1.4.0] - 2025-10-06
+### Added
+- Implemented **StreamingDataset** architecture for memory-efficient access to large molecular datasets:
+  - `SmilesDataset`: Lazy-loading of SMILES strings from CSV files with minimal memory footprint.
+  - `XyzDataset`: Lazy-loading of 3D molecular structures from XYZ file bundles, supporting multiple
+    format parsers (default, qm9, hopv15).
+  - `GraphDataset`: On-the-fly conversion from raw molecular representations (SMILES or XYZ) to graph dicts.
+    - Automatic detection of dataset format (SMILES vs XYZ) for transparent handling.
+    - Sequential mode (`num_workers=0`) for optimal performance with typical molecules (~2000 mol/s).
+    - Parallel mode (`num_workers>0`) with multi-process architecture for complex molecules or custom processing.
+    - Deadlock-free producer-collector-worker design that maintains dataset order while enabling true CPU parallelism.
+  - `ShuffleDataset`: Approximate shuffling using fixed-size buffer for training with shuffled data
+    while maintaining low memory usage.
+- Comprehensive streaming datasets documentation (`docs/api_streaming_datasets.md`) covering:
+  - Motivation and use cases for streaming vs pre-processed datasets.
+  - Detailed usage examples for all streaming dataset classes.
+  - Performance considerations and when to use sequential vs parallel processing.
+  - Integration with deep learning frameworks and training workflows.
+  - Guidance on choosing between SMILES and XYZ formats.
+- Architecture Decision Record (`docs/architecture_decisions/004_streaming_datasets.md`) documenting:
+  - Design rationale for streaming architecture.
+  - Detailed explanation of parallel processing implementation and deadlock prevention.
+  - Trade-offs between streaming and pre-processed datasets.
+  - Auto-detection mechanism for SMILES vs XYZ datasets.
+- Comprehensive unit tests for streaming datasets:
+  - `tests/test_dataset.py`: Core functionality tests for SmilesDataset, XyzDataset, GraphDataset, and ShuffleDataset.
+  - `tests/test_xyz_dataset.py`: XYZ-specific functionality and format parser tests.
+  - `tests/test_xyz_bundle.py`: XYZ bundle file handling tests.
+  - `tests/test_dataset_benchmark.py`: Performance benchmarks for sequential vs parallel processing modes.
+### Changed
+- Updated existing tests (`test_docs.py`, `test_main.py`, `test_web.py`) to accommodate streaming dataset functionality.
+### Removed
+- Deprecated `tests/test_datasets.py` in favor of new dataset-specific test files.
+## [1.3.0] - 2025-09-12
+### Added
+- `HOPV15_exp` dataset which contains experimental values for organic photovoltaic materials.
+- Missing target descriptions for the QM9 dataset.
+- `melting_point` dataset which contains melting points for small organic molecules.
+### Changed
+- Raised the minimum required version of pycomex to `0.23.0` to support the most recent features
+  such as the caching system, which has also been implemented in the dataset processing scripts.
+- Changed the CLI logo displayed at the beginning of the help message to "CMDATA" in another ASCII font
+  and added a logo image in ANSI art.
+## [1.2.1] - 2025-09-22
+### Changed
+- Modified the syntax of type annotations so that the package is now compatible with
+  Python 3.9 through Python 3.12.
+- Using `nox` now for the testing sessions instead of `tox` due to the much faster uv backend
+  to create the virtual environments.
+## [1.2.0] - 2025-09-04
+### Added
+- `CLAUDE.md` file which contains information that can be used by AI agents such as
+  Claude to understand and work with the package.
+- `remote show` command which displays useful information for the currently registered
+  file share location such as the URL and additional parameters.
+- `remote diff` command which allows comparing the local and remote file share versions
+  of the metadata.yml file and prints the difference to the console.
+- `metadata diff` command in the manage CLI which allows comparing the local and remote
+  versions of the metadata.yml file.
+- `tadf` dataset associated with OLED design.
+### Changed
+- Default SSL verification in the `web.py` module set to `True` to avoid security issues
+  when downloading files from the internet.
+## [1.1.2] - 2025-09-01
+### Changed
+- Default template for the `config.toml` file to include commented out example values for the
+  Nextcloud remote file share configuration and to fix the default download location.
+## [1.1.1] - 2025-07-07
+### Added
+- `prettytable` as a dependency to create markdown tables in the documentation.
+- `docs` command group in the manage CLI to manage the documentation:
+  - `collect-datasets` which collects all datasets listed in the metadata.yml file and creates
+    a new markdown docs file with a table containing all those datasets.
+### Changed
+- `list` command now also prints the verbose name / short description of the datasets.
+## [1.1.0] - 2025-07-07
+### Added
+- `AGENTS.md` file which contains information that can be used by AI agents such as
+  ChatGPT Codex to understand and work with the package.
+- `manage.py` script which exposes an additional command line interface for the management
+  and maintenance of the database:
+  - `metadata` command group to interact with the local and remote version of the metadata.yml file.
+  - `dataset` command group used to trigger the creation and upload of the local datasets.
+- `remote` command group in the `cmdata` CLI:
+  - `upload` command to upload arbitrary files to the file share server.
+- New datasets:
+  - `skin_irritation`: binary classification dataset on skin irritation.
+  - `skin_sensitizers`: binary classification dataset on skin sensitization.
+  - `elanos_bp`: regression of boiling point.
+  - `elanos_vp`: regression of vapor pressure.
+## [1.0.0] - 2025-05-01
+- First official release of the package.
+## [0.2.0] - 2024-12-12
+### Added
+- `HISTORY.rst` to start a changelog of the changes for each version.
+- `DEVELOP.rst` which contains information about the development environment of the project.
+- `ruff.toml` file to configure the Ruff linter and code formatter.
+### Changed
+- Replaced the `tox.ini` with a `tox.toml` file.
+- Ported the `pyproject.toml` file from using Poetry to using `uv` and `hatchling` as
+  the build backend.
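Reviewer note: the 1.4.0 entry above describes `ShuffleDataset` as approximate shuffling with a fixed-size buffer. The package's own implementation is not part of this excerpt, so the following is only a minimal sketch of the general buffer-shuffle technique, under the assumption that the source is a plain Python iterable. The function name `buffered_shuffle` and its parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T],
                     buffer_size: int,
                     seed: Optional[int] = None,
                     ) -> Iterator[T]:
    """
    Yields the elements of ``items`` in an approximately shuffled order
    while holding at most ``buffer_size`` elements in memory.

    :param items: any iterable, consumed lazily
    :param buffer_size: number of elements kept in the buffer (at least 1)
    :param seed: optional seed for a reproducible order

    :returns: iterator over the same elements in a different order
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and store the new one in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Source exhausted: drain the remaining elements in random order.
    rng.shuffle(buffer)
    yield from buffer


if __name__ == "__main__":
    print(list(buffered_shuffle(range(20), buffer_size=5, seed=0)))
```

The trade-off is the one the changelog entry hints at: a larger buffer gives an order closer to a true shuffle, a smaller one uses less memory, and with a buffer of size 1 the original order is preserved.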