PyPI - iparq - Versions diffs - 0.4.1__tar.gz → 0.5.0__tar.gz - Mend

iparq 0.4.1__tar.gz → 0.5.0__tar.gz


Files changed (27)

{iparq-0.4.1 → iparq-0.5.0}/PKG-INFO +37 -14
{iparq-0.4.1 → iparq-0.5.0}/README.md +36 -13
{iparq-0.4.1 → iparq-0.5.0}/pyproject.toml +1 -1
{iparq-0.4.1 → iparq-0.5.0}/src/iparq/source.py +121 -7
{iparq-0.4.1 → iparq-0.5.0}/tests/test_cli.py +194 -6
iparq-0.5.0/uv.lock +1042 -0
iparq-0.4.1/uv.lock +0 -923
{iparq-0.4.1 → iparq-0.5.0}/.github/FUNDING.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/copilot-instructions.md +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/dependabot.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/workflows/copilot-setup-steps.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/workflows/merge.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/workflows/python-package.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/workflows/python-publish.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.github/workflows/test.yml +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.gitignore +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.python-version +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.vscode/launch.json +0 -0
{iparq-0.4.1 → iparq-0.5.0}/.vscode/settings.json +0 -0
{iparq-0.4.1 → iparq-0.5.0}/CONTRIBUTING.md +0 -0
{iparq-0.4.1 → iparq-0.5.0}/LICENSE +0 -0
{iparq-0.4.1 → iparq-0.5.0}/dummy.parquet +0 -0
{iparq-0.4.1 → iparq-0.5.0}/media/iparq.png +0 -0
{iparq-0.4.1 → iparq-0.5.0}/src/iparq/__init__.py +0 -0
{iparq-0.4.1 → iparq-0.5.0}/src/iparq/py.typed +0 -0
{iparq-0.4.1 → iparq-0.5.0}/tests/conftest.py +0 -0
{iparq-0.4.1 → iparq-0.5.0}/tests/dummy.parquet +0 -0

{iparq-0.4.1 → iparq-0.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: iparq
-Version: 0.4.1
+Version: 0.5.0
 Summary: Display version compression and bloom filter information about a parquet file
 Author-email: MiguelElGallo <miguel.zurcher@gmail.com>
 License-File: LICENSE
@@ -30,8 +30,12 @@ Description-Content-Type: text/markdown
 ![alt text](media/iparq.png)
 After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
-***New*** Bloom filters information: Displays if there are bloom filters.
-Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+## Features
+- **Bloom filters**: Displays if columns have bloom filters. Read more in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+- **Encryption detection**: Shows if columns are encrypted (🔒)
+- **Statistics exactness**: Indicates if min/max statistics are exact or approximate (PyArrow 22+)
+- **Compression ratios**: Optional display of column sizes and compression efficiency
 ## Installation
@@ -102,11 +106,12 @@ Options include:
 - `--format`, `-f`: Output format, either `rich` (default) or `json`
 - `--metadata-only`, `-m`: Show only file metadata without column details
 - `--column`, `-c`: Filter results to show only a specific column
+- `--sizes`, `-s`: Show column sizes and compression ratios
 ### Single File Examples:
 ```sh
-# Basic inspection
+# Basic inspection .
 iparq inspect yourfile.parquet
 # Output in JSON format
@@ -117,6 +122,9 @@ iparq inspect yourfile.parquet --metadata-only
 # Filter to show only a specific column
 iparq inspect yourfile.parquet --column column_name
+# Show column sizes and compression ratios
+iparq inspect yourfile.parquet --sizes
 ```
 ### Multiple Files and Glob Patterns:
@@ -137,7 +145,7 @@ iparq inspect important.parquet temp_*.parquet
 When inspecting multiple files, each file's results are displayed with a header showing the filename. The utility will read the metadata of each file and print the compression codecs used in the parquet files.
-## Example output - Bloom Filters
+## Example output
 ```log
 ParquetMetaModel(
@@ -148,14 +156,29 @@ ParquetMetaModel(
     format_version='2.6',
     serialized_size=2223
 )
-                            Parquet Column Information
-┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
-┃           ┃ Column     ┃       ┃           ┃   Bloom    ┃           ┃           ┃
-┃ Row Group ┃ Name       ┃ Index ┃ Compress… ┃   Filter   ┃ Min Value ┃ Max Value ┃
-┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
-│     0     │ one        │   0   │ SNAPPY    │     ✅     │ -1.0      │ 2.5       │
-│     0     │ two        │   1   │ SNAPPY    │     ✅     │ bar       │ foo       │
-│     0     │ three      │   2   │ SNAPPY    │     ✅     │ False     │ True      │
-└───────────┴────────────┴───────┴───────────┴────────────┴───────────┴───────────┘
+                                     Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
+┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom ┃ Encrypted ┃ Min Value ┃ Max Value ┃ Exact ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
+│     0     │ one         │   0   │ SNAPPY      │  ✅   │     —     │ -1.0      │ 2.5       │  N/A  │
+│     0     │ two         │   1   │ SNAPPY      │  ✅   │     —     │ bar       │ foo       │  N/A  │
+│     0     │ three       │   2   │ SNAPPY      │  ✅   │     —     │ False     │ True      │  N/A  │
+└───────────┴─────────────┴───────┴─────────────┴───────┴───────────┴───────────┴───────────┴───────┘
 Compression codecs: {'SNAPPY'}
 ```
+### With `--sizes` flag
+```log
+iparq inspect yourfile.parquet --sizes
+                                         Parquet Column Information
+┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
+┃  Row   ┃ Column  ┃       ┃        ┃       ┃         ┃ Min    ┃ Max     ┃       ┃        ┃        ┃       ┃
+┃ Group  ┃ Name    ┃ Index ┃ Compr… ┃ Bloom ┃ Encryp… ┃ Value  ┃ Value   ┃ Exact ┃ Values ┃ Compr… ┃ Ratio ┃
+┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
+│   0    │ one     │   0   │ SNAPPY │  ✅   │    —    │ -1.0   │ 2.5     │  N/A  │      3 │ 104.0B │  1.0x │
+│   0    │ two     │   1   │ SNAPPY │  ✅   │    —    │ bar    │ foo     │  N/A  │      3 │  80.0B │  0.9x │
+│   0    │ three   │   2   │ SNAPPY │  ✅   │    —    │ False  │ True    │  N/A  │      3 │  42.0B │  1.0x │
+└────────┴─────────┴───────┴────────┴───────┴─────────┴────────┴─────────┴───────┴────────┴────────┴───────┘
+```

{iparq-0.4.1 → iparq-0.5.0}/README.md RENAMED

@@ -11,8 +11,12 @@
 ![alt text](media/iparq.png)
 After reading [this blog](https://duckdb.org/2025/01/22/parquet-encodings.html), I began to wonder which Parquet version and compression methods the everyday tools we rely on actually use, only to find that there's no straightforward way to determine this. That curiosity and the difficulty of quickly discovering such details motivated me to create iparq (Information Parquet). My goal with iparq is to help users easily identify the specifics of the Parquet files generated by different engines, making it clear which features—like newer encodings or certain compression algorithms—the creator of the parquet is using.
-***New*** Bloom filters information: Displays if there are bloom filters.
-Read more about bloom filters in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+## Features
+- **Bloom filters**: Displays if columns have bloom filters. Read more in this [great article](https://duckdb.org/2025/03/07/parquet-bloom-filters-in-duckdb.html).
+- **Encryption detection**: Shows if columns are encrypted (🔒)
+- **Statistics exactness**: Indicates if min/max statistics are exact or approximate (PyArrow 22+)
+- **Compression ratios**: Optional display of column sizes and compression efficiency
 ## Installation
@@ -83,11 +87,12 @@ Options include:
 - `--format`, `-f`: Output format, either `rich` (default) or `json`
 - `--metadata-only`, `-m`: Show only file metadata without column details
 - `--column`, `-c`: Filter results to show only a specific column
+- `--sizes`, `-s`: Show column sizes and compression ratios
 ### Single File Examples:
 ```sh
-# Basic inspection
+# Basic inspection .
 iparq inspect yourfile.parquet
 # Output in JSON format
@@ -98,6 +103,9 @@ iparq inspect yourfile.parquet --metadata-only
 # Filter to show only a specific column
 iparq inspect yourfile.parquet --column column_name
+# Show column sizes and compression ratios
+iparq inspect yourfile.parquet --sizes
 ```
 ### Multiple Files and Glob Patterns:
@@ -118,7 +126,7 @@ iparq inspect important.parquet temp_*.parquet
 When inspecting multiple files, each file's results are displayed with a header showing the filename. The utility will read the metadata of each file and print the compression codecs used in the parquet files.
-## Example output - Bloom Filters
+## Example output
 ```log
 ParquetMetaModel(
@@ -129,14 +137,29 @@ ParquetMetaModel(
     format_version='2.6',
     serialized_size=2223
 )
-                            Parquet Column Information
-┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
-┃           ┃ Column     ┃       ┃           ┃   Bloom    ┃           ┃           ┃
-┃ Row Group ┃ Name       ┃ Index ┃ Compress… ┃   Filter   ┃ Min Value ┃ Max Value ┃
-┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
-│     0     │ one        │   0   │ SNAPPY    │     ✅     │ -1.0      │ 2.5       │
-│     0     │ two        │   1   │ SNAPPY    │     ✅     │ bar       │ foo       │
-│     0     │ three      │   2   │ SNAPPY    │     ✅     │ False     │ True      │
-└───────────┴────────────┴───────┴───────────┴────────────┴───────────┴───────────┘
+                                     Parquet Column Information
+┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
+┃ Row Group ┃ Column Name ┃ Index ┃ Compression ┃ Bloom ┃ Encrypted ┃ Min Value ┃ Max Value ┃ Exact ┃
+┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
+│     0     │ one         │   0   │ SNAPPY      │  ✅   │     —     │ -1.0      │ 2.5       │  N/A  │
+│     0     │ two         │   1   │ SNAPPY      │  ✅   │     —     │ bar       │ foo       │  N/A  │
+│     0     │ three       │   2   │ SNAPPY      │  ✅   │     —     │ False     │ True      │  N/A  │
+└───────────┴─────────────┴───────┴─────────────┴───────┴───────────┴───────────┴───────────┴───────┘
 Compression codecs: {'SNAPPY'}
 ```
+### With `--sizes` flag
+```log
+iparq inspect yourfile.parquet --sizes
+                                         Parquet Column Information
+┏━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
+┃  Row   ┃ Column  ┃       ┃        ┃       ┃         ┃ Min    ┃ Max     ┃       ┃        ┃        ┃       ┃
+┃ Group  ┃ Name    ┃ Index ┃ Compr… ┃ Bloom ┃ Encryp… ┃ Value  ┃ Value   ┃ Exact ┃ Values ┃ Compr… ┃ Ratio ┃
+┡━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
+│   0    │ one     │   0   │ SNAPPY │  ✅   │    —    │ -1.0   │ 2.5     │  N/A  │      3 │ 104.0B │  1.0x │
+│   0    │ two     │   1   │ SNAPPY │  ✅   │    —    │ bar    │ foo     │  N/A  │      3 │  80.0B │  0.9x │
+│   0    │ three   │   2   │ SNAPPY │  ✅   │    —    │ False  │ True    │  N/A  │      3 │  42.0B │  1.0x │
+└────────┴─────────┴───────┴────────┴───────┴─────────┴────────┴─────────┴───────┴────────┴────────┴───────┘
+```
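
According to the `source.py` changes in this diff, the `Ratio` column in the `--sizes` output is the uncompressed size divided by the compressed size. A value below 1.0x (as for column `two`) therefore means the compressed chunk is slightly larger than the raw data. A minimal standalone sketch of that arithmetic follows. The function name `compression_ratio` is hypothetical and not part of iparq, and the byte counts are illustrative rather than taken from the fixture file.

```python
from typing import Optional


def compression_ratio(uncompressed: Optional[int], compressed: Optional[int]) -> str:
    """Return 'N.Nx' (uncompressed / compressed), or 'N/A' if a size is missing or zero."""
    # Hypothetical helper: mirrors the truthiness check used in the diff,
    # so None and 0 both fall through to "N/A".
    if not uncompressed or not compressed:
        return "N/A"
    return f"{uncompressed / compressed:.1f}x"


print(compression_ratio(1024, 512))  # data halved by compression
print(compression_ratio(72, 80))     # compressed chunk larger than raw data
print(compression_ratio(None, 80))   # size not available
```

Returning a display string rather than a float keeps the missing-value case and the numeric case in one return type, which is convenient when the result goes straight into a table cell.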

{iparq-0.4.1 → iparq-0.5.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "iparq"
-version = "0.4.1"
+version = "0.5.0"
 description = "Display version compression and bloom filter information about a parquet file"
 readme = "README.md"
 authors = [

{iparq-0.4.1 → iparq-0.5.0}/src/iparq/source.py RENAMED

@@ -57,6 +57,12 @@ class ColumnInfo(BaseModel):
         has_min_max (bool): Whether min/max statistics are available.
         min_value (Optional[str]): The minimum value in the column (as string for display).
         max_value (Optional[str]): The maximum value in the column (as string for display).
+        is_min_exact (Optional[bool]): Whether the min value is exact (PyArrow 22+).
+        is_max_exact (Optional[bool]): Whether the max value is exact (PyArrow 22+).
+        is_encrypted (Optional[bool]): Whether the column is encrypted.
+        num_values (Optional[int]): Number of values in this column chunk.
+        total_compressed_size (Optional[int]): Total compressed size in bytes.
+        total_uncompressed_size (Optional[int]): Total uncompressed size in bytes.
     """
     row_group: int
@@ -67,6 +73,12 @@ class ColumnInfo(BaseModel):
     has_min_max: Optional[bool] = False
     min_value: Optional[str] = None
     max_value: Optional[str] = None
+    is_min_exact: Optional[bool] = None
+    is_max_exact: Optional[bool] = None
+    is_encrypted: Optional[bool] = None
+    num_values: Optional[int] = None
+    total_compressed_size: Optional[int] = None
+    total_uncompressed_size: Optional[int] = None
 class ParquetColumnInfo(BaseModel):
@@ -158,6 +170,28 @@ def print_compression_types(parquet_metadata, column_info: ParquetColumnInfo) ->
                 compression = column_chunk.compression
                 column_name = parquet_metadata.schema.names[j]
+                # Get additional column chunk metadata
+                num_values = (
+                    column_chunk.num_values
+                    if hasattr(column_chunk, "num_values")
+                    else None
+                )
+                total_compressed = (
+                    column_chunk.total_compressed_size
+                    if hasattr(column_chunk, "total_compressed_size")
+                    else None
+                )
+                total_uncompressed = (
+                    column_chunk.total_uncompressed_size
+                    if hasattr(column_chunk, "total_uncompressed_size")
+                    else None
+                )
+                is_encrypted = (
+                    column_chunk.is_crypto_metadata_set()
+                    if hasattr(column_chunk, "is_crypto_metadata_set")
+                    else None
+                )
                 # Create or update column info
                 column_info.columns.append(
                     ColumnInfo(
@@ -165,6 +199,10 @@ def print_compression_types(parquet_metadata, column_info: ParquetColumnInfo) ->
                         column_name=column_name,
                         column_index=j,
                         compression_type=compression,
+                        num_values=num_values,
+                        total_compressed_size=total_compressed,
+                        total_uncompressed_size=total_uncompressed,
+                        is_encrypted=is_encrypted,
                     )
                 )
     except Exception as e:
@@ -252,6 +290,16 @@ def print_min_max_statistics(parquet_metadata, column_info: ParquetColumnInfo) -
                                     # Fallback for complex types that might not stringify well
                                     col.min_value = "<unable to display>"
                                     col.max_value = "<unable to display>"
+                                # PyArrow 22+ feature: check if min/max values are exact
+                                # This helps users understand if statistics can be trusted for query optimization
+                                try:
+                                    if hasattr(stats, "is_min_value_exact"):
+                                        col.is_min_exact = stats.is_min_value_exact
+                                    if hasattr(stats, "is_max_value_exact"):
+                                        col.is_max_exact = stats.is_max_value_exact
+                                except Exception:
+                                    pass  # Not available in older PyArrow versions
                         else:
                             col.has_min_max = False
                         break
@@ -262,12 +310,27 @@ def print_min_max_statistics(parquet_metadata, column_info: ParquetColumnInfo) -
         )
-def print_column_info_table(column_info: ParquetColumnInfo) -> None:
+def format_size(size_bytes: Optional[int]) -> str:
+    """Format bytes into human-readable size."""
+    if size_bytes is None:
+        return "N/A"
+    size: float = float(size_bytes)
+    for unit in ["B", "KB", "MB", "GB"]:
+        if abs(size) < 1024.0:
+            return f"{size:.1f}{unit}"
+        size /= 1024.0
+    return f"{size:.1f}TB"
+def print_column_info_table(
+    column_info: ParquetColumnInfo, show_sizes: bool = False
+) -> None:
     """
     Prints the column information using a Rich table.
     Args:
         column_info: The ParquetColumnInfo model to display.
+        show_sizes: Whether to show compressed/uncompressed size columns.
     """
     table = Table(title="Parquet Column Information")
@@ -276,9 +339,18 @@ def print_column_info_table(column_info: ParquetColumnInfo) -> None:
     table.add_column("Column Name", style="green")
     table.add_column("Index", justify="center")
     table.add_column("Compression", style="magenta")
-    table.add_column("Bloom Filter", justify="center")
+    table.add_column("Bloom", justify="center")
+    table.add_column("Encrypted", justify="center")
     table.add_column("Min Value", style="yellow")
     table.add_column("Max Value", style="yellow")
+    table.add_column(
+        "Exact", justify="center", style="dim"
+    )  # Shows if min/max are exact
+    if show_sizes:
+        table.add_column("Values", justify="right")
+        table.add_column("Compressed", justify="right", style="blue")
+        table.add_column("Ratio", justify="right", style="blue")
     # Add rows to the table
     for col in column_info.columns:
@@ -290,15 +362,48 @@ def print_column_info_table(column_info: ParquetColumnInfo) -> None:
             col.max_value if col.has_min_max and col.max_value is not None else "N/A"
         )
-        table.add_row(
+        # Format exactness indicator (PyArrow 22+ feature)
+        exact_display = "N/A"
+        if col.is_min_exact is not None and col.is_max_exact is not None:
+            if col.is_min_exact and col.is_max_exact:
+                exact_display = "✅"
+            elif col.is_min_exact or col.is_max_exact:
+                exact_display = "~"  # Partially exact
+            else:
+                exact_display = "❌"
+        # Format encryption status
+        encrypted_display = "🔒" if col.is_encrypted else "—"
+        row_data = [
             str(col.row_group),
             col.column_name,
             str(col.column_index),
             col.compression_type,
             "✅" if col.has_bloom_filter else "❌",
+            encrypted_display,
             min_display,
             max_display,
-        )
+            exact_display,
+        ]
+        if show_sizes:
+            # Calculate compression ratio
+            ratio = "N/A"
+            if col.total_compressed_size and col.total_uncompressed_size:
+                ratio = (
+                    f"{col.total_uncompressed_size / col.total_compressed_size:.1f}x"
+                )
+            row_data.extend(
+                [
+                    str(col.num_values) if col.num_values else "N/A",
+                    format_size(col.total_compressed_size),
+                    ratio,
+                ]
+            )
+        table.add_row(*row_data)
     # Print the table
     console.print(table)
@@ -331,6 +436,7 @@ def inspect_single_file(
     format: OutputFormat,
     metadata_only: bool,
     column_filter: Optional[str],
+    show_sizes: bool = False,
 ) -> None:
     """
     Inspect a single Parquet file and display its metadata, compression settings, and bloom filter information.
@@ -339,7 +445,7 @@ def inspect_single_file(
         Exception: If the file cannot be processed.
     """
     try:
-        (parquet_metadata, compression) = read_parquet_metadata(filename)
+        parquet_metadata, compression = read_parquet_metadata(filename)
     except FileNotFoundError:
         raise Exception(f"Cannot open: {filename}.")
     except Exception as e:
@@ -382,7 +488,7 @@ def inspect_single_file(
         # Print column details if not metadata only
         if not metadata_only:
-            print_column_info_table(column_info)
+            print_column_info_table(column_info, show_sizes=show_sizes)
             console.print(f"Compression codecs: {compression}")
@@ -404,6 +510,12 @@ def inspect(
     column_filter: Optional[str] = typer.Option(
         None, "--column", "-c", help="Filter results to show only specific column"
     ),
+    show_sizes: bool = typer.Option(
+        False,
+        "--sizes",
+        "-s",
+        help="Show column sizes and compression ratios",
+    ),
 ):
     """
     Inspect Parquet files and display their metadata, compression settings, and bloom filter information.
@@ -436,7 +548,9 @@ def inspect(
             console.print("─" * (len(filename) + 6))
         try:
-            inspect_single_file(filename, format, metadata_only, column_filter)
+            inspect_single_file(
+                filename, format, metadata_only, column_filter, show_sizes
+            )
         except Exception as e:
             console.print(f"Error processing {filename}: {e}", style="red")
             continue
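
The `Exact` column added in `print_column_info_table` folds two optional flags into a single cell. As a reading aid, the same decision table can be written as a small pure function. `exactness_label` is a hypothetical name used for illustration; only the branch logic is taken from the diff.

```python
from typing import Optional


def exactness_label(is_min_exact: Optional[bool], is_max_exact: Optional[bool]) -> str:
    """Map the two exactness flags to the symbol shown in the table."""
    # If either flag is unknown (for example, the attribute is not exposed),
    # no claim is made about exactness.
    if is_min_exact is None or is_max_exact is None:
        return "N/A"
    if is_min_exact and is_max_exact:
        return "✅"
    if is_min_exact or is_max_exact:
        return "~"  # only one of min/max is exact
    return "❌"


for flags in [(None, None), (True, True), (True, False), (False, False)]:
    print(flags, exactness_label(*flags))
```

Keeping `None` distinct from `False` is what lets the table show "N/A" for files or library versions that do not report exactness, instead of wrongly marking the statistics as inexact.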

{iparq-0.4.1 → iparq-0.5.0}/tests/test_cli.py RENAMED

@@ -3,7 +3,14 @@ from pathlib import Path
 from typer.testing import CliRunner
-from iparq.source import app
+from iparq.source import (
+    ColumnInfo,
+    ParquetColumnInfo,
+    ParquetMetaModel,
+    app,
+    format_size,
+    output_json,
+)
 # Define path to test fixtures
 FIXTURES_DIR = Path(__file__).parent
@@ -23,10 +30,7 @@ def test_parquet_info():
     assert "num_columns=3" in result.stdout
     assert "num_rows=3" in result.stdout
     assert "Parquet Column Information" in result.stdout
-    assert "Min Value" in result.stdout
-    assert (
-        "Value" in result.stdout
-    )  # This covers "Max Value" which is split across lines
+    # Check for data values (these are more reliable than table headers which may be truncated)
     assert "one" in result.stdout and "-1.0" in result.stdout and "2.5" in result.stdout
     assert "two" in result.stdout and "bar" in result.stdout and "foo" in result.stdout
     assert (
@@ -34,7 +38,7 @@ def test_parquet_info():
         and "False" in result.stdout
         and "True" in result.stdout
     )
-    assert "Compression codecs: {'SNAPPY'}" in result.stdout
+    assert "SNAPPY" in result.stdout
 def test_metadata_only_flag():
@@ -171,3 +175,187 @@ def test_error_handling_with_multiple_files():
     # Should show error for bad file
     assert "Error processing" in result.stdout
     assert "nonexistent.parquet" in result.stdout
+def test_sizes_flag():
+    """Test that the --sizes flag displays column size information."""
+    runner = CliRunner()
+    result = runner.invoke(app, ["inspect", "--sizes", str(fixture_path)])
+    assert result.exit_code == 0
+    assert "ParquetMetaModel" in result.stdout
+    # Check for size-related output (Values, compressed size, ratio)
+    # The actual values depend on the test file
+def test_sizes_flag_with_json():
+    """Test that --sizes flag works with JSON output and includes size fields."""
+    runner = CliRunner()
+    result = runner.invoke(
+        app, ["inspect", "--format", "json", "--sizes", str(fixture_path)]
+    )
+    assert result.exit_code == 0
+    data = json.loads(result.stdout)
+    # Check that size fields are present in columns
+    for column in data["columns"]:
+        assert "num_values" in column
+        assert "total_compressed_size" in column
+        assert "total_uncompressed_size" in column
+def test_format_size_bytes():
+    """Test format_size function with bytes."""
+    assert format_size(100) == "100.0B"
+    assert format_size(0) == "0.0B"
+    assert format_size(None) == "N/A"
+def test_format_size_kilobytes():
+    """Test format_size function with kilobytes."""
+    assert format_size(1024) == "1.0KB"
+    assert format_size(2048) == "2.0KB"
+def test_format_size_megabytes():
+    """Test format_size function with megabytes."""
+    assert format_size(1024 * 1024) == "1.0MB"
+    assert format_size(5 * 1024 * 1024) == "5.0MB"
+def test_format_size_gigabytes():
+    """Test format_size function with gigabytes."""
+    assert format_size(1024 * 1024 * 1024) == "1.0GB"
+def test_format_size_terabytes():
+    """Test format_size function with terabytes."""
+    assert format_size(1024 * 1024 * 1024 * 1024) == "1.0TB"
+def test_column_info_model():
+    """Test ColumnInfo model with new fields."""
+    col = ColumnInfo(
+        row_group=0,
+        column_name="test_col",
+        column_index=0,
+        compression_type="SNAPPY",
+        has_bloom_filter=True,
+        has_min_max=True,
+        min_value="1",
+        max_value="100",
+        is_min_exact=True,
+        is_max_exact=True,
+        is_encrypted=False,
+        num_values=1000,
+        total_compressed_size=512,
+        total_uncompressed_size=1024,
+    )
+    assert col.is_min_exact is True
+    assert col.is_max_exact is True
+    assert col.is_encrypted is False
+    assert col.num_values == 1000
+    assert col.total_compressed_size == 512
+    assert col.total_uncompressed_size == 1024
+def test_column_info_model_defaults():
+    """Test ColumnInfo model with default values for new fields."""
+    col = ColumnInfo(
+        row_group=0,
+        column_name="test_col",
+        column_index=0,
+        compression_type="SNAPPY",
+    )
+    assert col.is_min_exact is None
+    assert col.is_max_exact is None
+    assert col.is_encrypted is None
+    assert col.num_values is None
+def test_output_json_function():
+    """Test the output_json function directly."""
+    import io
+    import sys
+    meta = ParquetMetaModel(
+        created_by="test",
+        num_columns=2,
+        num_rows=100,
+        num_row_groups=1,
+        format_version="2.6",
+        serialized_size=1000,
+    )
+    columns = ParquetColumnInfo(
+        columns=[
+            ColumnInfo(
+                row_group=0,
+                column_name="col1",
+                column_index=0,
+                compression_type="ZSTD",
+                has_min_max=True,
+                min_value="0",
+                max_value="99",
+                is_min_exact=True,
+                is_max_exact=False,
+                num_values=100,
+                total_compressed_size=256,
+                total_uncompressed_size=512,
+            )
+        ]
+    )
+    compression_codecs = {"ZSTD"}
+    # Capture stdout
+    captured = io.StringIO()
+    old_stdout = sys.stdout
+    sys.stdout = captured
+    try:
+        output_json(meta, columns, compression_codecs)
+    finally:
+        sys.stdout = old_stdout
+    output = captured.getvalue()
+    data = json.loads(output)
+    assert data["metadata"]["num_columns"] == 2
+    assert data["columns"][0]["is_min_exact"] is True
+    assert data["columns"][0]["is_max_exact"] is False
+    assert "ZSTD" in data["compression_codecs"]
+def test_column_filter_no_match():
+    """Test filtering by a column name that doesn't exist."""
+    runner = CliRunner()
+    result = runner.invoke(
+        app, ["inspect", "--column", "nonexistent_column", str(fixture_path)]
+    )
+    assert result.exit_code == 0
+    assert "No columns match the filter" in result.stdout
+def test_nonexistent_file():
+    """Test error handling for non-existent file."""
+    runner = CliRunner()
+    result = runner.invoke(app, ["inspect", "totally_fake_file.parquet"])
+    assert result.exit_code == 0  # CLI should handle error gracefully
+    assert "Error" in result.stdout
+def test_default_command():
+    """Test that the empty command name works as default."""
+    runner = CliRunner()
+    # The app has both @app.command(name="") and @app.command(name="inspect")
+    # So 'inspect' is required but maps to the same function
+    result = runner.invoke(app, ["inspect", str(fixture_path)])
+    assert result.exit_code == 0
+    assert "ParquetMetaModel" in result.stdout
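
`test_output_json_function` captures output by swapping `sys.stdout` by hand inside a `try`/`finally`. The standard library's `contextlib.redirect_stdout` performs the same save-and-restore as a context manager. The sketch below uses a stand-in `emit_json` function (hypothetical, not iparq's `output_json`). Whether the pattern works unchanged with iparq depends on how `output_json` writes its output, which this diff does not show.

```python
import io
import json
from contextlib import redirect_stdout


def emit_json(payload: dict) -> None:
    """Stand-in for a function that prints JSON to stdout (hypothetical)."""
    print(json.dumps(payload))


def capture_json(payload: dict) -> dict:
    """Run emit_json with stdout redirected and parse what it printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):  # stdout is restored on exit, even on error
        emit_json(payload)
    return json.loads(buffer.getvalue())


data = capture_json({"compression_codecs": ["ZSTD"]})
print(data["compression_codecs"])
```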
