iparq 0.3.0__tar.gz → 0.4.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (28)
  1. iparq-0.4.1/.github/FUNDING.yml +4 -0
  2. {iparq-0.3.0 → iparq-0.4.1}/.github/dependabot.yml +1 -1
  3. {iparq-0.3.0 → iparq-0.4.1}/.github/workflows/python-package.yml +12 -2
  4. iparq-0.4.1/.github/workflows/test.yml +37 -0
  5. {iparq-0.3.0 → iparq-0.4.1}/.gitignore +1 -0
  6. {iparq-0.3.0 → iparq-0.4.1}/PKG-INFO +20 -23
  7. {iparq-0.3.0 → iparq-0.4.1}/README.md +18 -22
  8. {iparq-0.3.0 → iparq-0.4.1}/pyproject.toml +9 -2
  9. iparq-0.4.1/src/iparq/__init__.py +1 -0
  10. {iparq-0.3.0 → iparq-0.4.1}/src/iparq/source.py +72 -0
  11. {iparq-0.3.0 → iparq-0.4.1}/tests/test_cli.py +28 -19
  12. iparq-0.4.1/uv.lock +923 -0
  13. iparq-0.3.0/src/iparq/__init__.py +0 -1
  14. iparq-0.3.0/uv.lock +0 -547
  15. {iparq-0.3.0 → iparq-0.4.1}/.github/copilot-instructions.md +0 -0
  16. {iparq-0.3.0 → iparq-0.4.1}/.github/workflows/copilot-setup-steps.yml +0 -0
  17. {iparq-0.3.0 → iparq-0.4.1}/.github/workflows/merge.yml +0 -0
  18. {iparq-0.3.0 → iparq-0.4.1}/.github/workflows/python-publish.yml +0 -0
  19. {iparq-0.3.0 → iparq-0.4.1}/.python-version +0 -0
  20. {iparq-0.3.0 → iparq-0.4.1}/.vscode/launch.json +0 -0
  21. {iparq-0.3.0 → iparq-0.4.1}/.vscode/settings.json +0 -0
  22. {iparq-0.3.0 → iparq-0.4.1}/CONTRIBUTING.md +0 -0
  23. {iparq-0.3.0 → iparq-0.4.1}/LICENSE +0 -0
  24. {iparq-0.3.0 → iparq-0.4.1}/dummy.parquet +0 -0
  25. {iparq-0.3.0 → iparq-0.4.1}/media/iparq.png +0 -0
  26. {iparq-0.3.0 → iparq-0.4.1}/src/iparq/py.typed +0 -0
  27. {iparq-0.3.0 → iparq-0.4.1}/tests/conftest.py +0 -0
  28. {iparq-0.3.0 → iparq-0.4.1}/tests/dummy.parquet +0 -0
@@ -0,0 +1,4 @@
+# These are supported funding model platforms
+
+github: [MiguelElGallo]
+
@@ -5,7 +5,7 @@
 
 version: 2
 updates:
-  - package-ecosystem: "pip" # See documentation for possible values
+  - package-ecosystem: "uv" # See documentation for possible values
     directory: "/" # Location of package manifests
     schedule:
       interval: "weekly"
@@ -45,6 +45,16 @@ jobs:
         uv run mypy . --config-file=../../pyproject.toml
     - name: Check formatting with black
       run: uvx black . --check --verbose
-    - name: Run Python tests
+    - name: Run Python tests with coverage
       if: runner.os != 'Windows'
-      run: uv run pytest -vv
+      run: uv run pytest -vv --cov=src/iparq --cov-report=xml --cov-report=term-missing
+
+    - name: Upload coverage to Codecov
+      if: runner.os != 'Windows'
+      uses: codecov/codecov-action@v5
+      with:
+        files: ./coverage.xml
+        fail_ci_if_error: false
+        verbose: true
+      env:
+        CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
@@ -0,0 +1,37 @@
+name: Run Tests
+
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+  workflow_dispatch:
+
+jobs:
+  test:
+    permissions:
+      contents: read
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest]
+        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Install dependencies
+        run: uv sync --all-extras
+
+      - name: Run tests
+        run: uv run pytest -vv
@@ -172,3 +172,4 @@ cython_debug/
 .github/.DS_Store
 yellow_tripdata_2024-01.parquet
 filter.parquet
+.DS_Store
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: iparq
-Version: 0.3.0
+Version: 0.4.1
 Summary: Display version compression and bloom filter information about a parquet file
 Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
 License-File: LICENSE
@@ -13,6 +13,7 @@ Provides-Extra: checks
 Requires-Dist: mypy>=1.14.1; extra == 'checks'
 Requires-Dist: ruff>=0.9.3; extra == 'checks'
 Provides-Extra: test
+Requires-Dist: pytest-cov>=4.0.0; extra == 'test'
 Requires-Dist: pytest>=7.0; extra == 'test'
 Description-Content-Type: text/markdown
 
@@ -24,6 +25,8 @@ Description-Content-Type: text/markdown
 
 [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
 
+[![codecov](https://codecov.io/gh/MiguelElGallo/iparq/branch/main/graph/badge.svg)](https://codecov.io/gh/MiguelElGallo/iparq)
+
 ![alt text](media/iparq.png)
 After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
 
@@ -134,31 +137,25 @@ iparq inspect important.parquet temp_*.parquet
 
 When inspecting multiple files, each file's results are displayed with a header showing the filename. The utility will read the metadata of each file and print the compression codecs used in the parquet files.
 
-## Example ouput - Bloom Filters
+## Example output - Bloom Filters
 
 ```log
 ParquetMetaModel(
-    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
-    num_columns=1,
-    num_rows=100000000,
-    num_row_groups=10,
-    format_version='1.0',
-    serialized_size=1196
+    created_by='parquet-cpp-arrow version 14.0.2',
+    num_columns=3,
+    num_rows=3,
+    num_row_groups=1,
+    format_version='2.6',
+    serialized_size=2223
 )
-                   Parquet Column Information
-┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
-┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
-┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ 0         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 1         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 2         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 3         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 4         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 5         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 6         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 7         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 8         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 9         │ r           │ 0     │ SNAPPY      │ ✅           │
-└───────────┴─────────────┴───────┴─────────────┴──────────────┘
+                            Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
+┃           ┃ Column     ┃       ┃           ┃ Bloom      ┃           ┃           ┃
+┃ Row Group ┃ Name       ┃ Index ┃ Compress… ┃ Filter     ┃ Min Value ┃ Max Value ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
+│ 0         │ one        │ 0     │ SNAPPY    │ ✅         │ -1.0      │ 2.5       │
+│ 0         │ two        │ 1     │ SNAPPY    │ ✅         │ bar       │ foo       │
+│ 0         │ three      │ 2     │ SNAPPY    │ ✅         │ False     │ True      │
+└───────────┴────────────┴───────┴───────────┴────────────┴───────────┴───────────┘
 Compression codecs: {'SNAPPY'}
 ```
@@ -6,6 +6,8 @@
 
 [![Upload Python Package](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml/badge.svg)](https://github.com/MiguelElGallo/iparq/actions/workflows/python-publish.yml)
 
+[![codecov](https://codecov.io/gh/MiguelElGallo/iparq/branch/main/graph/badge.svg)](https://codecov.io/gh/MiguelElGallo/iparq)
+
 ![alt text](media/iparq.png)
 After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
 
@@ -116,31 +118,25 @@ iparq inspect important.parquet temp_*.parquet
 
 When inspecting multiple files, each file's results are displayed with a header showing the filename. The utility will read the metadata of each file and print the compression codecs used in the parquet files.
 
-## Example ouput - Bloom Filters
+## Example output - Bloom Filters
 
 ```log
 ParquetMetaModel(
-    created_by='DuckDB version v1.2.1 (build 8e52ec4395)',
-    num_columns=1,
-    num_rows=100000000,
-    num_row_groups=10,
-    format_version='1.0',
-    serialized_size=1196
+    created_by='parquet-cpp-arrow version 14.0.2',
+    num_columns=3,
+    num_rows=3,
+    num_row_groups=1,
+    format_version='2.6',
+    serialized_size=2223
 )
-                   Parquet Column Information
-┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
-┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
-┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ 0         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 1         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 2         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 3         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 4         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 5         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 6         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 7         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 8         │ r           │ 0     │ SNAPPY      │ ✅           │
-│ 9         │ r           │ 0     │ SNAPPY      │ ✅           │
-└───────────┴─────────────┴───────┴─────────────┴──────────────┘
+                            Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
+┃           ┃ Column     ┃       ┃           ┃ Bloom      ┃           ┃           ┃
+┃ Row Group ┃ Name       ┃ Index ┃ Compress… ┃ Filter     ┃ Min Value ┃ Max Value ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
+│ 0         │ one        │ 0     │ SNAPPY    │ ✅         │ -1.0      │ 2.5       │
+│ 0         │ two        │ 1     │ SNAPPY    │ ✅         │ bar       │ foo       │
+│ 0         │ three      │ 2     │ SNAPPY    │ ✅         │ False     │ True      │
+└───────────┴────────────┴───────┴───────────┴────────────┴───────────┴───────────┘
 Compression codecs: {'SNAPPY'}
 ```
@@ -1,6 +1,6 @@
 [project]
 name = "iparq"
-version = "0.3.0"
+version = "0.4.1"
 description = "Display version compression and bloom filter information about a parquet file"
 readme = "README.md"
 authors = [
@@ -17,6 +17,7 @@ dependencies = [
 [project.optional-dependencies]
 test = [
     "pytest>=7.0",
+    "pytest-cov>=4.0.0",
 ]
 checks = [
     "mypy>=1.14.1",
@@ -38,4 +39,10 @@ testpaths = [
 
 [[tool.mypy.overrides]]
 module = ["pyarrow.*"]
-ignore_missing_imports = true
+ignore_missing_imports = true
+
+[dependency-groups]
+dev = [
+    "pytest>=8.4.1",
+    "pytest-cov>=4.0.0",
+]
@@ -0,0 +1 @@
+__version__ = "0.4.1"
@@ -54,6 +54,9 @@ class ColumnInfo(BaseModel):
         column_index (int): The index of the column.
         compression_type (str): The compression type used for the column.
         has_bloom_filter (bool): Whether the column has a bloom filter.
+        has_min_max (bool): Whether min/max statistics are available.
+        min_value (Optional[str]): The minimum value in the column (as string for display).
+        max_value (Optional[str]): The maximum value in the column (as string for display).
     """
 
     row_group: int
@@ -61,6 +64,9 @@ class ColumnInfo(BaseModel):
     column_index: int
     compression_type: str
     has_bloom_filter: Optional[bool] = False
+    has_min_max: Optional[bool] = False
+    min_value: Optional[str] = None
+    max_value: Optional[str] = None
 
 
 class ParquetColumnInfo(BaseModel):
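
Aside on the hunk above: the three new fields ride along through pydantic serialization, which is what the `--json` assertions near the end of this diff depend on. A minimal standalone sketch (assuming pydantic v2 for `model_dump_json`; the field list mirrors the hunk, the sample values mirror the bundled dummy.parquet, everything else is illustrative):

```python
from typing import Optional

from pydantic import BaseModel


class ColumnInfo(BaseModel):
    row_group: int
    column_name: str
    column_index: int
    compression_type: str
    has_bloom_filter: Optional[bool] = False
    # New in 0.4.1: min/max statistics, stored as display strings.
    has_min_max: Optional[bool] = False
    min_value: Optional[str] = None
    max_value: Optional[str] = None


# Values mirror the "one" column of the repo's dummy.parquet fixture.
col = ColumnInfo(
    row_group=0,
    column_name="one",
    column_index=0,
    compression_type="SNAPPY",
    has_bloom_filter=True,
    has_min_max=True,
    min_value="-1.0",
    max_value="2.5",
)
print(col.model_dump_json())  # the new keys appear in iparq's JSON output
```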
@@ -203,6 +209,59 @@ def print_bloom_filter_info(parquet_metadata, column_info: ParquetColumnInfo) ->
     )
 
 
+def print_min_max_statistics(parquet_metadata, column_info: ParquetColumnInfo) -> None:
+    """
+    Updates the column_info model with min/max statistics information.
+
+    Args:
+        parquet_metadata: The Parquet file metadata.
+        column_info: The ParquetColumnInfo model to update.
+    """
+    try:
+        num_row_groups = parquet_metadata.num_row_groups
+        num_columns = parquet_metadata.num_columns
+
+        for i in range(num_row_groups):
+            row_group = parquet_metadata.row_group(i)
+
+            for j in range(num_columns):
+                column_chunk = row_group.column(j)
+
+                # Find the corresponding column in our model
+                for col in column_info.columns:
+                    if col.row_group == i and col.column_index == j:
+                        # Check if this column has statistics
+                        if column_chunk.is_stats_set:
+                            stats = column_chunk.statistics
+                            col.has_min_max = stats.has_min_max
+
+                            if stats.has_min_max:
+                                # Convert values to string for display, handling potential None values
+                                try:
+                                    col.min_value = (
+                                        str(stats.min)
+                                        if stats.min is not None
+                                        else "null"
+                                    )
+                                    col.max_value = (
+                                        str(stats.max)
+                                        if stats.max is not None
+                                        else "null"
+                                    )
+                                except Exception:
+                                    # Fallback for complex types that might not stringify well
+                                    col.min_value = "<unable to display>"
+                                    col.max_value = "<unable to display>"
+                        else:
+                            col.has_min_max = False
+                        break
+    except Exception as e:
+        console.print(
+            f"Error while collecting min/max statistics: {e}",
+            style="blink bold red underline on white",
+        )
+
+
 def print_column_info_table(column_info: ParquetColumnInfo) -> None:
     """
     Prints the column information using a Rich table.
@@ -218,15 +277,27 @@ def print_column_info_table(column_info: ParquetColumnInfo) -> None:
     table.add_column("Index", justify="center")
     table.add_column("Compression", style="magenta")
     table.add_column("Bloom Filter", justify="center")
+    table.add_column("Min Value", style="yellow")
+    table.add_column("Max Value", style="yellow")
 
     # Add rows to the table
     for col in column_info.columns:
+        # Format min/max values for display
+        min_display = (
+            col.min_value if col.has_min_max and col.min_value is not None else "N/A"
+        )
+        max_display = (
+            col.max_value if col.has_min_max and col.max_value is not None else "N/A"
+        )
+
         table.add_row(
             str(col.row_group),
             col.column_name,
             str(col.column_index),
             col.compression_type,
             "✅" if col.has_bloom_filter else "❌",
+            min_display,
+            max_display,
         )
 
     # Print the table
@@ -290,6 +361,7 @@ def inspect_single_file(
     # Collect information
     print_compression_types(parquet_metadata, column_info)
     print_bloom_filter_info(parquet_metadata, column_info)
+    print_min_max_statistics(parquet_metadata, column_info)
 
     # Filter columns if requested
     if column_filter:
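
For readers skimming the source.py hunks: the new `print_min_max_statistics` pass boils down to PyArrow's per-column-chunk statistics API. A minimal sketch of the same lookup outside iparq (the file name is just the repo's bundled fixture; any Parquet file works):

```python
import pyarrow.parquet as pq

# Read only the footer metadata; no column data is loaded.
metadata = pq.ParquetFile("dummy.parquet").metadata
for i in range(metadata.num_row_groups):
    row_group = metadata.row_group(i)
    for j in range(metadata.num_columns):
        chunk = row_group.column(j)
        # Writers may omit statistics, so guard before dereferencing.
        if chunk.is_stats_set and chunk.statistics.has_min_max:
            stats = chunk.statistics
            print(f"rg={i} col={chunk.path_in_schema}: min={stats.min} max={stats.max}")
```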
@@ -17,25 +17,24 @@ def test_parquet_info():
 
     assert result.exit_code == 0
 
-    expected_output = """ParquetMetaModel(
-    created_by='parquet-cpp-arrow version 14.0.2',
-    num_columns=3,
-    num_rows=3,
-    num_row_groups=1,
-    format_version='2.6',
-    serialized_size=2223
-)
-                   Parquet Column Information
-┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
-┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom Filter ┃
-┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
-│ 0         │ one         │ 0     │ SNAPPY      │ ✅           │
-│ 0         │ two         │ 1     │ SNAPPY      │ ✅           │
-│ 0         │ three       │ 2     │ SNAPPY      │ ✅           │
-└───────────┴─────────────┴───────┴─────────────┴──────────────┘
-Compression codecs: {'SNAPPY'}"""
-
-    assert expected_output in result.stdout
+    # Check for key components instead of exact table format
+    assert "ParquetMetaModel" in result.stdout
+    assert "created_by='parquet-cpp-arrow version 14.0.2'" in result.stdout
+    assert "num_columns=3" in result.stdout
+    assert "num_rows=3" in result.stdout
+    assert "Parquet Column Information" in result.stdout
+    assert "Min Value" in result.stdout
+    assert (
+        "Value" in result.stdout
+    )  # This covers "Max Value" which is split across lines
+    assert "one" in result.stdout and "-1.0" in result.stdout and "2.5" in result.stdout
+    assert "two" in result.stdout and "bar" in result.stdout and "foo" in result.stdout
+    assert (
+        "three" in result.stdout
+        and "False" in result.stdout
+        and "True" in result.stdout
+    )
+    assert "Compression codecs: {'SNAPPY'}" in result.stdout
 
 
 def test_metadata_only_flag():
@@ -77,6 +76,16 @@ def test_json_output():
     assert "compression_codecs" in data
     assert data["metadata"]["num_columns"] == 3
 
+    # Check that min/max statistics are included
+    for column in data["columns"]:
+        assert "has_min_max" in column
+        assert "min_value" in column
+        assert "max_value" in column
+        # For our test data, all columns should have min/max stats
+        assert column["has_min_max"] is True
+        assert column["min_value"] is not None
+        assert column["max_value"] is not None
+
 
 def test_multiple_files():
     """Test that multiple files can be inspected in a single command."""