idc-index-data 22.0.3__tar.gz → 22.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35)
  1. idc_index_data-22.1.1/.github/copilot-instructions.md +170 -0
  2. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.github/workflows/external-indices.yml +1 -1
  3. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.gitignore +3 -0
  4. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/PKG-INFO +2 -4
  5. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/pyproject.toml +2 -4
  6. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/python/generate-indices.py +12 -4
  7. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/python/idc_index_data_manager.py +82 -18
  8. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.git_archival.txt +0 -0
  9. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.gitattributes +0 -0
  10. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.github/CONTRIBUTING.md +0 -0
  11. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.github/dependabot.yml +0 -0
  12. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.github/matchers/pylint.json +0 -0
  13. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.github/workflows/cd.yml +0 -0
  14. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.github/workflows/ci.yml +0 -0
  15. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.pre-commit-config.yaml +0 -0
  16. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/.readthedocs.yaml +0 -0
  17. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/CMakeLists.txt +0 -0
  18. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/LICENSE +0 -0
  19. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/README.md +0 -0
  20. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/assets/README.md +0 -0
  21. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/assets/clinical_index.sql +0 -0
  22. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/assets/sm_index.sql +0 -0
  23. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/assets/sm_instance_index.sql +0 -0
  24. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/docs/conf.py +0 -0
  25. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/docs/index.md +0 -0
  26. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/noxfile.py +0 -0
  27. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/python/update_idc_index_version.py +0 -0
  28. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/sql/analysis_results_index.sql +0 -0
  29. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/sql/collections_index.sql +0 -0
  30. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/sql/idc_index.sql +0 -0
  31. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/scripts/sql/prior_versions_index.sql +0 -0
  32. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/src/idc_index_data/__init__.py +0 -0
  33. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/src/idc_index_data/_version.pyi +0 -0
  34. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/src/idc_index_data/py.typed +0 -0
  35. {idc_index_data-22.0.3 → idc_index_data-22.1.1}/tests/test_package.py +0 -0
@@ -0,0 +1,170 @@
+ # GitHub Copilot Instructions for idc-index-data
+
+ ## Project Overview
+
+ `idc-index-data` is a Python package that bundles the index data for the NCI
+ Imaging Data Commons (IDC). The package provides Parquet files containing
+ metadata about imaging data hosted by IDC, intended to be used by the
+ `idc-index` Python package.
+
+ ## Technology Stack
+
+ - **Build System**: scikit-build-core with CMake
+ - **Package Manager**: pip
+ - **Python Versions**: 3.10, 3.11, 3.12
+ - **Testing**: pytest with pytest-cov
+ - **Task Runner**: nox
+ - **Linting**: ruff, pylint, mypy, pre-commit hooks
+ - **Documentation**: Sphinx with MyST parser and Furo theme
+ - **Data Processing**: pandas, pyarrow, Google Cloud BigQuery
+
+ ## Development Workflow
+
+ ### Setting Up Development Environment
+
+ ```bash
+ python3 -m venv .venv
+ source ./.venv/bin/activate
+ pip install -v -e .[dev]
+ pre-commit install
+ ```
+
+ ### Common Commands
+
+ - **Run all checks**: `nox` (runs lint, pylint, and tests by default)
+ - **Lint code**: `nox -s lint`
+ - **Run pylint**: `nox -s pylint`
+ - **Run tests**: `nox -s tests`
+ - **Build docs**: `nox -s docs`
+ - **Serve docs**: `nox -s docs -- --serve`
+ - **Build package**: `nox -s build`
+ - **Update IDC index version**: `nox -s bump -- <version>` (or leave off version
+   for latest)
+ - **Tag release**: `nox -s tag_release` (shows instructions)
+
+ ### Pre-commit Checks
+
+ Always run pre-commit before committing:
+
+ ```bash
+ pre-commit run --all-files
+ ```
+
+ ## Code Style and Conventions
+
+ ### Python Code Style
+
+ - **Import Statement**: All files must include
+   `from __future__ import annotations` at the top
+ - **Type Hints**: Use type hints throughout; strict type checking is enabled for
+   `idc_index_data.*` modules
+ - **Linting**: Follow ruff and pylint rules configured in `pyproject.toml`
+ - **Formatting**: Code is formatted with ruff formatter
+ - **Line Length**: Not strictly enforced but keep reasonable
+ - **Docstrings**: Use when appropriate, especially for public APIs
+
+ ### Key Ruff Rules
+
+ The project uses extensive ruff rules including:
+
+ - `B` - flake8-bugbear
+ - `I` - isort (import sorting)
+ - `ARG` - flake8-unused-arguments
+ - `UP` - pyupgrade
+ - `PTH` - flake8-use-pathlib (prefer pathlib over os.path)
+ - `NPY` - NumPy specific rules
+ - `PD` - pandas-vet
+
+ ### Type Checking
+
+ - Python 3.8 minimum target
+ - Strict mypy checking for package code
+ - Use `typing.TYPE_CHECKING` for import cycles
+
+ ## Project Structure
+
+ ```
+ idc-index-data/
+ ├── src/idc_index_data/     # Main package source
+ │   ├── __init__.py         # Package exports and file path lookups
+ │   └── _version.py         # Auto-generated version file
+ ├── scripts/                # Management scripts
+ │   ├── python/             # Python scripts for index management
+ │   └── sql/                # SQL queries for BigQuery
+ ├── tests/                  # Test files
+ │   └── test_package.py     # Package tests
+ ├── docs/                   # Sphinx documentation
+ ├── pyproject.toml          # Project configuration
+ ├── noxfile.py              # Nox session definitions
+ └── CMakeLists.txt          # Build configuration
+ ```
+
+ ## Important Considerations
+
+ ### Package Purpose
+
+ This package is a **data package** - it bundles index files (CSV and Parquet)
+ and provides file paths to locate them. It does not contain complex business
+ logic but rather serves as a data distribution mechanism.
+
+ ### Version Management
+
+ - Version is defined in `pyproject.toml`
+ - Use `nox -s bump` to update to new IDC index versions
+ - The version should match the IDC release version
+ - Always update both index files and test expectations when bumping version
+
+ ### Data Files
+
+ The package includes:
+
+ - `idc_index.csv.zip` - Compressed CSV index (optional)
+ - `idc_index.parquet` - Parquet format index
+ - `prior_versions_index.parquet` - Historical version index
+
+ ### Google Cloud Integration
+
+ - Some operations require Google Cloud credentials
+ - BigQuery is used to fetch latest index data
+ - Scripts need `GCP_PROJECT` and `GOOGLE_APPLICATION_CREDENTIALS` environment
+   variables
+
+ ### Testing
+
+ - Tests verify package installation and file accessibility
+ - Coverage reporting is configured but codecov upload is currently disabled
+ - Tests should work across platforms (Linux, macOS, Windows)
+
+ ## Release Process
+
+ 1. Update index version: `nox -s bump -- --commit <version>`
+ 2. Create PR: `gh pr create --fill`
+ 3. After merge, tag release: follow instructions from `nox -s tag_release`
+ 4. Push tag: `git push origin <version>`
+ 5. GitHub Actions will automatically build and publish to PyPI
+
+ ## CI/CD
+
+ - **Format check**: pre-commit hooks + pylint
+ - **Tests**: Run on Python 3.10 and 3.12 across Linux, macOS, and Windows
+ - **Publishing**: Automated through GitHub Actions on tagged releases
+
+ ## Additional Resources
+
+ - [Contributing Guide](.github/CONTRIBUTING.md)
+ - [Scientific Python Developer Guide](https://learn.scientific-python.org/development/)
+ - [IDC Homepage](https://imaging.datacommons.cancer.gov)
+ - [IDC Discourse Forum](https://discourse.canceridc.dev/)
+
+ ## When Making Changes
+
+ 1. **Always** run tests before and after changes: `nox -s tests`
+ 2. **Always** run linters: `nox -s lint`
+ 3. **Never** commit without running pre-commit checks
+ 4. **Prefer** pathlib over os.path for file operations
+ 5. **Use** type hints for all new code
+ 6. **Update** tests if changing package structure or exports
+ 7. **Follow** existing patterns in the codebase
+ 8. **Keep** changes minimal and focused
+ 9. **Document** any new public APIs
+ 10. **Test** across Python versions when changing core functionality
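The instructions file above describes `idc-index-data` as a thin data package whose `__init__.py` exposes file path lookups. As a rough sketch of what downstream consumption looks like under that description (the exported attribute name below is an assumption for illustration, not something confirmed by this diff):

```python
from __future__ import annotations  # required at module top per the style rules above

import pandas as pd

import idc_index_data

# Hypothetical attribute: __init__.py is documented above as exposing
# "file path lookups"; the exact exported name is an assumption here.
index_df = pd.read_parquet(idc_index_data.IDC_INDEX_PARQUET_FILEPATH)
print(len(index_df), "rows in the bundled IDC index")
```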
@@ -55,6 +55,6 @@ jobs:
          if: github.event_name == 'release' && github.event.action == 'published'
          uses: ncipollo/release-action@v1
          with:
-           artifacts: "*.parquet"
+           artifacts: "release_artifacts/*.parquet,release_artifacts/*.json"
            allowUpdates: true
            omitBodyDuringUpdate: true
@@ -159,3 +159,6 @@ Thumbs.db
 
  # gcp service account keys
  gha-creds-**.json
+
+ # Release artifacts directory
+ release_artifacts/
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: idc-index-data
- Version: 22.0.3
+ Version: 22.1.1
  Summary: ImagingDataCommons index to query and download data.
  Author-Email: Andrey Fedorov <andrey.fedorov@gmail.com>, Vamsi Thiriveedhi <vthiriveedhi@mgh.harvard.edu>, Jean-Christophe Fillion-Robin <jchris.fillionr@kitware.com>
  License: Copyright 2024 Andrey Fedorov
@@ -31,8 +31,6 @@ Classifier: Operating System :: OS Independent
  Classifier: Programming Language :: Python
  Classifier: Programming Language :: Python :: 3
  Classifier: Programming Language :: Python :: 3 :: Only
- Classifier: Programming Language :: Python :: 3.8
- Classifier: Programming Language :: Python :: 3.9
  Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
@@ -42,7 +40,7 @@ Project-URL: Homepage, https://github.com/ImagingDataCommons/idc-index-data
  Project-URL: Bug Tracker, https://github.com/ImagingDataCommons/idc-index-data/issues
  Project-URL: Discussions, https://discourse.canceridc.dev/
  Project-URL: Changelog, https://github.com/ImagingDataCommons/idc-index-data/releases
- Requires-Python: >=3.8
+ Requires-Python: >=3.10
  Provides-Extra: test
  Requires-Dist: pandas; extra == "test"
  Requires-Dist: pyarrow; extra == "test"
@@ -13,7 +13,7 @@ build-backend = "scikit_build_core.build"
 
  [project]
  name = "idc-index-data"
- version = "22.0.3"
+ version = "22.1.1"
  authors = [
    { name = "Andrey Fedorov", email = "andrey.fedorov@gmail.com" },
    { name = "Vamsi Thiriveedhi", email = "vthiriveedhi@mgh.harvard.edu" },
@@ -22,7 +22,7 @@ authors = [
  description = "ImagingDataCommons index to query and download data."
  readme = "README.md"
  license.file = "LICENSE"
- requires-python = ">=3.8"
+ requires-python = ">=3.10"
  classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Science/Research",
@@ -32,8 +32,6 @@ classifiers = [
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3 :: Only",
-   "Programming Language :: Python :: 3.8",
-   "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
@@ -12,6 +12,10 @@ def main():
      manager = IDCIndexDataManager(project_id=project_id)
      scripts_dir = Path(__file__).resolve().parent.parent
 
+     # Create dedicated output directory for release artifacts
+     output_dir = scripts_dir.parent / "release_artifacts"
+     output_dir.mkdir(parents=True, exist_ok=True)
+
      assets_dir = scripts_dir.parent / "assets"
 
      # Collecting all .sql files from sql_dir and assets_dir
@@ -19,8 +23,10 @@ def main():
 
      for file_name in sql_files:
          file_path = assets_dir / file_name
-         index_df, output_basename = manager.execute_sql_query(file_path)
-         index_df.to_parquet(f"{output_basename}.parquet")
+         index_df, output_basename, schema = manager.execute_sql_query(file_path)
+         parquet_file_path = output_dir / f"{output_basename}.parquet"
+         index_df.to_parquet(parquet_file_path)
+         manager.save_schema_to_json(schema, output_basename, output_dir)
 
      core_indices_dir = scripts_dir.parent / "scripts" / "sql"
 
@@ -28,8 +34,10 @@ def main():
 
      for file_name in sql_files:
          file_path = core_indices_dir / file_name
-         index_df, output_basename = manager.execute_sql_query(file_path)
-         index_df.to_parquet(f"{output_basename}.parquet")
+         index_df, output_basename, schema = manager.execute_sql_query(file_path)
+         parquet_file_path = output_dir / f"{output_basename}.parquet"
+         index_df.to_parquet(parquet_file_path)
+         manager.save_schema_to_json(schema, output_basename, output_dir)
 
 
  if __name__ == "__main__":
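With both loops updated, every SQL file processed by `generate-indices.py` now yields a `<basename>.parquet` and a `<basename>.json` pair under `release_artifacts/`, which is what the widened `release_artifacts/*.parquet,release_artifacts/*.json` glob in `external-indices.yml` picks up. A minimal sketch of pairing the two artifacts back up after a run (illustrative only; just the directory name and the `"fields"` key come from this diff):

```python
from __future__ import annotations

import json
from pathlib import Path

import pandas as pd

output_dir = Path("release_artifacts")
for parquet_path in sorted(output_dir.glob("*.parquet")):
    # Each index is written alongside a JSON schema with the same basename.
    schema_path = parquet_path.with_suffix(".json")
    index_df = pd.read_parquet(parquet_path)
    schema = json.loads(schema_path.read_text())
    print(parquet_path.stem, len(index_df), "rows,", len(schema["fields"]), "fields")
```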
@@ -1,5 +1,6 @@
  from __future__ import annotations
 
+ import json
  import logging
  import os
  from pathlib import Path
@@ -20,25 +21,69 @@ class IDCIndexDataManager:
          self.client = bigquery.Client(project=project_id)
          logger.debug("IDCIndexDataManager initialized with project ID: %s", project_id)
 
-     def execute_sql_query(self, file_path: str) -> tuple[pd.DataFrame, str]:
+     def execute_sql_query(
+         self, file_path: str
+     ) -> tuple[pd.DataFrame, str, list[bigquery.SchemaField]]:
          """
          Executes the SQL query in the specified file.
 
          Returns:
-             Tuple[pd.DataFrame, str]: A tuple containing the DataFrame with query results,
-             the output basename.
+             Tuple[pd.DataFrame, str, List[bigquery.SchemaField]]: A tuple containing
+             the DataFrame with query results, the output basename, and the BigQuery schema.
          """
          with Path(file_path).open("r") as file:
              sql_query = file.read()
-         index_df = self.client.query(sql_query).to_dataframe()
+         query_job_result = self.client.query(sql_query).result()
+         schema = query_job_result.schema  # Get schema from BigQuery QueryJob
+         index_df = query_job_result.to_dataframe()
          if "StudyDate" in index_df.columns:
              index_df["StudyDate"] = index_df["StudyDate"].astype(str)
          output_basename = Path(file_path).name.split(".")[0]
          logger.debug("Executed SQL query from file: %s", file_path)
-         return index_df, output_basename
+         return index_df, output_basename, schema
+
+     def save_schema_to_json(
+         self,
+         schema: list[bigquery.SchemaField],
+         output_basename: str,
+         output_dir: Path | None = None,
+     ) -> None:
+         """
+         Saves the BigQuery schema to a JSON file.
+
+         Args:
+             schema: List of BigQuery SchemaField objects from the query result
+             output_basename: The base name for the output file
+             output_dir: Optional directory path for the output file
+         """
+         # Convert BigQuery schema to JSON-serializable format
+         schema_dict = {
+             "fields": [
+                 {
+                     "name": field.name,
+                     "type": field.field_type,
+                     "mode": field.mode,
+                 }
+                 for field in schema
+             ]
+         }
+
+         # Save to JSON file
+         if output_dir:
+             output_dir.mkdir(parents=True, exist_ok=True)
+             json_file_path = output_dir / f"{output_basename}.json"
+         else:
+             json_file_path = Path(f"{output_basename}.json")
+
+         with json_file_path.open("w") as f:
+             json.dump(schema_dict, f, indent=2)
+         logger.debug("Created schema JSON file: %s", json_file_path)
 
      def generate_index_data_files(
-         self, generate_compressed_csv: bool = True, generate_parquet: bool = False
+         self,
+         generate_compressed_csv: bool = True,
+         generate_parquet: bool = False,
+         output_dir: Path | None = None,
      ) -> None:
          """
          Generates index-data files locally by executing queries against
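For reference, the document written by `save_schema_to_json` above is a single `fields` array with one `{name, type, mode}` entry per column. Expressed as a Python literal (the two example columns are hypothetical placeholders; the real contents depend on each query's BigQuery result schema):

```python
# Shape of <output_basename>.json; the field entries here are invented examples.
example_schema = {
    "fields": [
        {"name": "SeriesInstanceUID", "type": "STRING", "mode": "NULLABLE"},
        {"name": "StudyDate", "type": "DATE", "mode": "NULLABLE"},
    ]
}
```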
@@ -47,29 +92,48 @@ class IDCIndexDataManager:
          This method iterates over SQL files in the 'scripts/sql' directory,
          executing each query using :func:`execute_sql_query` and generating a DataFrame,
          'index_df'. The DataFrame is then saved as compressed CSV and/or Parquet file.
+
+         Args:
+             generate_compressed_csv: Whether to generate compressed CSV files
+             generate_parquet: Whether to generate Parquet files
+             output_dir: Optional directory path for the output files
          """
 
          scripts_dir = Path(__file__).parent.parent
          sql_dir = scripts_dir / "sql"
 
+         if output_dir:
+             output_dir.mkdir(parents=True, exist_ok=True)
+
          for file_name in Path.iterdir(sql_dir):
              if str(file_name).endswith(".sql"):
                  file_path = Path(sql_dir) / file_name
-                 index_df, output_basename = self.execute_sql_query(file_path)
+                 index_df, output_basename, schema = self.execute_sql_query(file_path)
                  logger.debug(
                      "Executed and processed SQL queries from file: %s", file_path
                  )
-                 if generate_compressed_csv:
-                     csv_file_name = f"{output_basename}.csv.zip"
-                     index_df.to_csv(
-                         csv_file_name, compression={"method": "zip"}, escapechar="\\"
-                     )
-                     logger.debug("Created CSV zip file: %s", csv_file_name)
-
-                 if generate_parquet:
-                     parquet_file_name = f"{output_basename}.parquet"
-                     index_df.to_parquet(parquet_file_name, compression="zstd")
-                     logger.debug("Created Parquet file: %s", parquet_file_name)
+                 if generate_compressed_csv:
+                     csv_file_path = (
+                         output_dir / f"{output_basename}.csv.zip"
+                         if output_dir
+                         else Path(f"{output_basename}.csv.zip")
+                     )
+                     index_df.to_csv(
+                         csv_file_path, compression={"method": "zip"}, escapechar="\\"
+                     )
+                     logger.debug("Created CSV zip file: %s", csv_file_path)
+
+                 if generate_parquet:
+                     parquet_file_path = (
+                         output_dir / f"{output_basename}.parquet"
+                         if output_dir
+                         else Path(f"{output_basename}.parquet")
+                     )
+                     index_df.to_parquet(parquet_file_path, compression="zstd")
+                     logger.debug("Created Parquet file: %s", parquet_file_path)
+
+                 # Save schema to JSON file
+                 self.save_schema_to_json(schema, output_basename, output_dir)
 
      def retrieve_latest_idc_release_version(self) -> int:
          """